Hi again,

On 26/03/15 11:19, Andrea Schweer wrote:
> Another gap in my code is incremental exports. At the moment, the
> export part of my code dumps all of the data. I think it would be nice
> for back-up purposes to be able to specify a start date from which to
> export, so that people can export eg monthly and back up this data. I
> left this out for now because a) I need to get on with upgrading my
> repositories and b) I wanted my code to be general enough to deal with
> both the usage stats and the authority data. The stats core has an
> obviously usable date field (time). The authority core doesn't really
> -- there is a creation date and a last-modified date; presumably for
> back-up purposes the last-modified date would be more useful. Again,
> thoughts / code welcome.
What do people think: how would you use an incremental export option (and would you use one at all)?

Thinking with my sysadmin hat on, I suppose it would be nice to be able to run this in "dump data from the last day/week/month" mode, rather than having to specify a date range -- much easier to put into a crontab if the arguments don't change. Any comments on this?

Incremental exports could also help with the problem raised in the IRC meeting last night: potential data loss in the fixing-the-data use case. Documents added between the start of the export (when the code determines which documents to include) and the start of the import (when the core is wiped) are lost. With an incremental export mode, you could export the bulk of the data as a one-off (and if that takes 12 hours like it did for Hardy, not a problem), then do another incremental export from the time you ran the first one (and, worst case, another incremental one after that). That way you'd minimise the gap.

For the fixing-the-data use case, it _may_ be possible to run the export as-is, then delete-by-query all documents without a uid, then run the import without the --clear flag. That should re-index all documents without duplicating any of them, but I think I tried this at some point and Solr was too clever (it didn't re-index the documents when nothing had changed). Might be worth a try (not by me, I'm about to go home for the day).

Adding the commands to the launcher.xml file could address my concerns around the time field. The time field could be passed into my code as a parameter, and we would just hard-code the correct one in launcher.xml. That would still need to be changed when the schema changes, though.

Sorry if this is a bit ramble-y; I think I'm at a point with this code where input from others would be really helpful.
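And for the launcher.xml idea, something along these lines, following the existing command/step/argument convention in that file. The command name, class, and argument names here are all placeholders, not the actual implementation:

```xml
<!-- Hypothetical launcher.xml entry: the date field is hard-coded
     here as an argument instead of inside the export code.
     Command name, class and argument names are placeholders. -->
<command>
    <name>solr-export-statistics</name>
    <description>Export data from the statistics solr core</description>
    <step>
        <class>org.dspace.util.SolrImportExport</class>
        <argument>--date-field</argument>
        <argument>time</argument>
    </step>
</command>
```

A second command entry could do the same for the authority core with its last-modified field, which would keep my code general while still only needing a launcher.xml edit when a schema changes.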
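To make the crontab idea concrete, here's a minimal sketch of how a fixed-argument "last day/week/month" mode could build the Solr filter using date math, so the crontab entry never needs a changing date range. The `time` field name is from the stats schema; the helper function and period names are made up for illustration:

```python
# Sketch: crontab-friendly incremental window via Solr date math.
# "time" is the stats core's date field; everything else is hypothetical.

PERIODS = {"day": "NOW-1DAY", "week": "NOW-7DAYS", "month": "NOW-1MONTH"}

def incremental_filter(period, field="time"):
    """Build a Solr query/fq value covering the last day, week or month."""
    if period not in PERIODS:
        raise ValueError("period must be one of: " + ", ".join(PERIODS))
    # Solr resolves the date math server-side at query time,
    # so the same literal string works in every cron run.
    return "%s:[%s TO NOW]" % (field, PERIODS[period])

print(incremental_filter("month"))  # -> time:[NOW-1MONTH TO NOW]
```

Because Solr evaluates `NOW-1MONTH` itself at query time, the export command line stays identical from run to run, which is exactly what makes it easy to put in a crontab. For the authority core, `field="lastModified"` (or whatever the schema actually calls it) could be passed instead.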
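For the delete-by-query step, the standard Solr idiom for "documents where a field is unset" is a negated open range. A small sketch of the update-handler payload (the field name `uid` is from the discussion above; the core URL in the comment is an assumption):

```python
# Sketch: build the Solr delete-by-query payload that removes all
# documents *without* a given field, e.g. before re-importing
# without --clear. The "uid" field name follows the discussion;
# the core URL below is a guess for illustration.

def delete_missing_field_body(field="uid"):
    """XML body for Solr's update handler: delete docs where `field` is unset."""
    # "-field:[* TO *]" matches documents that have no value for the field
    return "<delete><query>-%s:[* TO *]</query></delete>" % field

# You would POST this to the core's update handler and commit, roughly:
#   curl 'http://localhost:8080/solr/statistics/update?commit=true' \
#        -H 'Content-Type: text/xml' \
#        -d '<delete><query>-uid:[* TO *]</query></delete>'
print(delete_missing_field_body())
```

Whether the subsequent import actually re-indexes unchanged documents is the open question from above; this only shows the deletion half.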
cheers,
Andrea

--
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand

_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel