I wonder how hard the "incremental export" would be to implement. If 
it's really not that complex, it seems like it'd be a quick win for 
doing Solr Stats backups in general.

If I want to ensure my Solr Stats are "safe" in DSpace 5, my only real 
option is to back them up via a CSV export (and a full dump is the only 
option right now). Since my stats will only ever grow over time, that 
full CSV export will take longer and longer to run -- to the point 
where it may no longer be feasible as an 'overnight backup'.
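
For what it's worth, I'd *guess* the query side of an incremental dump 
is simple -- something like this (an untested sketch, using the 'time' 
field Andrea mentions and Solr date math so the arguments never have 
to change):

    # Untested sketch: pull only the last day's stats documents as CSV.
    # -g stops curl from treating the [...] range as a URL glob;
    # real code would page through results instead of a huge rows= value.
    curl -g 'http://localhost:8080/solr/statistics/select?q=time:[NOW-1DAY+TO+NOW]&wt=csv&rows=1000000' \
        > stats-$(date +%F).csv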

So, I see a possible need for incremental exports even apart from the 
issue we pointed out in IRC -- the potential data loss while 
re-indexing via an export & re-import.
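
(On Andrea's suggested workaround below for the fixing-the-data case -- 
export as-is, delete-by-query everything without a uid, then re-import 
without --clear -- I believe the delete step would be something along 
these lines; untested, and assuming the field really is called 'uid':)

    # Untested: remove every document that lacks a uid field.
    # '*:*' plus a negative clause is the safe form for delete-by-query.
    curl 'http://localhost:8080/solr/statistics/update?commit=true' \
        -H 'Content-Type: text/xml' \
        --data-binary '<delete><query>*:* -uid:[* TO *]</query></delete>'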

But the question is whether incremental export is a "quick win" or 
whether it requires larger changes (I admit I'm not well versed in the 
Solr APIs/queries involved).
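
And if the tool grew the "dump data from the last day/week/month" mode 
Andrea describes, the cron side becomes trivial. A hypothetical 
crontab entry (the command name and flags here are made up, purely to 
illustrate):

    # Hypothetical: nightly incremental stats export at 01:30.
    # (note that % must be escaped as \% inside a crontab)
    30 1 * * * [dspace]/bin/dspace stats-export --from NOW-1DAY --csv /backups/stats-$(date +\%F).csv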

As a sidenote: with regards to the Authority index, it seems the data 
in that index can be *repopulated* from the existing metadata in the 
database (using ./dspace index-authority). So that index may not 
suffer from the same problems as the Stats one (though I haven't tried 
it -- just going by the docs):

https://wiki.duraspace.org/display/DSDOC5x/ORCID+Integration#ORCIDIntegration-Importingexistingauthors&keepingtheindexuptodate
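
i.e., worst case, recovery for that core should just be a matter of:

    # Re-populate the authority index from the metadata already in the
    # database (per the ORCID integration docs linked above)
    [dspace]/bin/dspace index-authority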

- Tim

On 3/25/2015 10:52 PM, Andrea Schweer wrote:
> Hi again,
>
> On 26/03/15 11:19, Andrea Schweer wrote:
>> Another gap in my code is incremental exports. At the moment, the
>> export part of my code dumps all of the data. I think it would be nice
>> for back-up purposes to be able to specify a start date from which to
>> export, so that people can export e.g. monthly and back up this data. I
>> left this out for now because a) I need to get on with upgrading my
>> repositories and b) I wanted my code to be general enough to deal with
>> both the usage stats and the authority data. The stats core has an
>> obviously usable date field (time). The authority core doesn't really
>> -- there is a creation date and a last-modified date; presumably for
>> back-up purposes the last-modified date would be more useful. Again,
>> thoughts / code welcome.
>
> What do people think -- how would you use an incremental export option
> (and would you use one at all)? Thinking with my sysadmin hat on, I
> suppose it would be nice to be able to run this in "dump data from the
> last day/week/month" mode, rather than having to specify a date range --
> much easier to put this into a crontab if the arguments don't change.
> Any comments on this?
>
> Incremental exports could help with the problem raised in the IRC
> meeting last night: potential data loss in the fixing-the-data use case.
> Documents added between the start of the export (when the code
> determines which documents to include) and the start of the import (when
> the core is wiped) are deleted. With an incremental export mode, you
> could export the bulk of the data as a one-off (and if that takes 12
> hours like for Hardy, not a problem), then do another incremental update
> from the time you ran the first export (and worst case, another
> incremental one after that). That way you'd minimise the gap.
>
> For the fixing-the-data use case, it _may_ be possible to run the export
> as-is, then delete-by-query all documents without a uid, then run the
> import without the --clear flag. That should re-index all documents
> without duplicating any of them, but I think I tried this at some point
> and Solr was too clever (it didn't re-index the documents when nothing
> had changed). Might be worth a try (not by me, I'm about to go home for
> the day).
>
> Adding the commands to the launcher.xml file could address my concerns
> around the time field. The time field could be passed into my code as a
> parameter, and we just hard-code the correct one in launcher.xml. That
> would still need to be changed when the schema changes though.
>
> Sorry if this is a bit ramble-y, I think I'm at a point with this code
> where input from others would be really helpful.
>
> cheers,
> Andrea
>
