Hi again,

On 26/03/15 11:19, Andrea Schweer wrote:
> Another gap in my code is incremental exports. At the moment, the 
> export part of my code dumps all of the data. I think it would be nice 
> for back-up purposes to be able to specify a start date from which to 
> export, so that people can export eg monthly and back up this data. I 
> left this out for now because a) I need to get on with upgrading my 
> repositories and b) I wanted my code to be general enough to deal with 
> both the usage stats and the authority data. The stats core has an 
> obviously usable date field (time). The authority core doesn't really 
> -- there is a creation date and a last-modified date; presumably for 
> back-up purposes the last-modified date would be more useful. Again, 
> thoughts / code welcome.

What do people think: how would you use an incremental export option 
(and would you use one at all)? Thinking with my sysadmin hat on, I 
suppose it would be nice to be able to run this in a "dump data from 
the last day/week/month" mode, rather than having to specify a date 
range -- it's much easier to put into a crontab if the arguments never 
change. Any comments on this?
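
To make that concrete, here's a rough sketch of what I have in mind 
(the class and method names are made up and it's untested; only the 
SolrJ calls and the Solr date-math syntax are real):

    import org.apache.solr.client.solrj.SolrQuery;

    public class IncrementalExportQuery {
        // Sketch only: build an "everything from the last N days"
        // query against the given date field ("time" for the stats
        // core). Using Solr date math means the arguments in the
        // crontab line never have to change.
        public static SolrQuery lastDays(String dateField, int days) {
            SolrQuery query = new SolrQuery("*:*");
            // NOW-7DAYS/DAY rounds down to midnight, so repeated
            // runs use stable day boundaries.
            query.addFilterQuery(
                    dateField + ":[NOW-" + days + "DAYS/DAY TO NOW]");
            return query;
        }
    }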

Incremental exports could also help with the problem raised in the IRC 
meeting last night: potential data loss in the fixing-the-data use 
case. Documents added between the start of the export (when the code 
determines which documents to include) and the start of the import 
(when the core is wiped) are lost. With an incremental export mode, 
you could export the bulk of the data as a one-off (and if that takes 
12 hours, as it did for Hardy, not a problem), then do another 
incremental export from the time you started the first one (and, worst 
case, another incremental pass after that). That way you'd minimise 
the gap.
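
In code, the two-pass idea might look something like this (again just 
a sketch with made-up names; the point is that the bulk export's start 
time becomes the lower bound of the second pass):

    import java.time.Instant;
    import org.apache.solr.client.solrj.SolrQuery;

    public class TwoPassExport {
        public static void main(String[] args) {
            // Sketch only: remember when the bulk export started --
            // it could take hours, and anything added in the meantime
            // is what the second pass has to pick up.
            Instant bulkStart = Instant.now();
            // ... run the full export here ...

            // Second pass: only documents with a timestamp at or
            // after the bulk export's start. Instant.toString() gives
            // the ISO-8601 form Solr expects, e.g.
            // 2015-03-26T11:19:00Z.
            SolrQuery incremental = new SolrQuery("*:*");
            incremental.addFilterQuery("time:[" + bulkStart + " TO *]");
            // ... export just the documents matching this query ...
        }
    }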

For the fixing-the-data use case, it _may_ be possible to run the 
export as-is, then delete-by-query all documents without a uid, then 
run the import without the --clear flag. That should re-index all 
documents without duplicating any of them, but I think I tried this at 
some point and Solr was too clever (it didn't re-index documents when 
nothing had changed). Might be worth a try (not by me, I'm about to go 
home for the day).
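
If anyone does want to try it, the delete-by-query step would be 
something along these lines (untested sketch; the core URL is just an 
example):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class DeleteDocsWithoutUid {
        public static void main(String[] args) throws Exception {
            // Sketch only -- point this at the statistics core.
            HttpSolrServer solr = new HttpSolrServer(
                    "http://localhost:8080/solr/statistics");
            // "*:* -uid:[* TO *]" matches every document that has no
            // uid field at all.
            solr.deleteByQuery("*:* -uid:[* TO *]");
            solr.commit();
            solr.shutdown();
        }
    }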

Adding the commands to the launcher.xml file could address my concern 
about the time field: the field name could be passed into my code as a 
parameter, with the correct one for each core hard-coded in 
launcher.xml. That would still need updating whenever the schema 
changes, though.
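
Something like this in launcher.xml, say (the command name, class, and 
argument names are all made up for illustration; only the launcher.xml 
structure itself is real):

    <command>
        <name>solr-export-statistics</name>
        <description>Export the statistics core, optionally from a
            start date</description>
        <step passuserargs="true">
            <class>org.dspace.util.SolrImportExport</class>
            <!-- hard-code each core's date field here, per command -->
            <argument>--date-field</argument>
            <argument>time</argument>
        </step>
    </command>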

Sorry if this is a bit rambly; I think I'm at a point with this code 
where input from others would be really helpful.

cheers,
Andrea

-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand

