Hi,

On 12/03/15 09:11, Mark H. Wood wrote:
> Several recent issues (DS-2337, DS-2487, and perhaps DS-2488) suggest
> that we should step back and take a long look at how we are using the
> Solr 'statistics' core.

I agree with Mark: we need to at least make sure that we keep the data 
safe across upgrades. Just a note: even the dspace.log files don't help 
100%, since they don't contain information that is now stored in the 
solr statistics (referer, user agent), and some of the derived 
information may change over time (geo / DNS lookups of IP addresses). 
Usage statistics are not the primary purpose of a repository, but my 
repository managers at least have made it very clear that this data is 
important to them (in terms of promotion of the repository etc).

> I think we need to give some more thought to how we can readily
> preserve usage records over DSpace upgrades and system failures.

Let's also not forget that the authority core might be in a similar 
situation at some point down the track. When enabled, it is the main 
data source for authority data, if I understand things correctly. The 
DSpace authority key, in that case, holds an id that is only meaningful 
in the context of the authority solr core. So you would not lose the 
disambiguation, at least, but you would lose the link(s) to external 
authority sources.

I'm assuming that the ElasticSearch statistics are affected by a similar 
issue - using a mechanism not designed as a primary data source to 
actually be the primary data source. But I haven't looked at the 
ElasticSearch stats at all, so I may well be wrong on this.

OAI and discovery are fine; they just hold a copy of data from elsewhere, 
and there is no problem with blowing away these cores and re-creating 
them from the source data.

> I should admit here that I am skeptical of using Solr as the
> statistics store *at all*, however well it works most of the time.
> But it is not my purpose in this note to advocate for something
> different.

I'm not sure I have a solution either, other than perhaps a clear 
statement from the committers to keep the data safe. At a minimum, that 
means every pull request needs to be examined for changes to the solr 
schema; if it contains any, it needs to come with an upgrade path or 
the PR can't be merged.

We could also put some resources into improving the existing 
import/export functionality for the statistics so that no data gets lost 
during those processes. This would allow people to back up their 
statistics data regularly or at least before upgrades. We'd need 
something similar for the authority core.
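
For illustration, a very basic backup could page through a core with 
plain start/rows queries and write each document out as one line of 
JSON. The sketch below is an assumption-laden example, not the existing 
DSpace export tool: the function names are invented, the core URL has 
to be supplied by the caller, and sorting on the "time" field assumes 
the statistics core has such a field. Only stored fields come back from 
Solr, and start/rows paging gets slow on very large cores.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def export_docs(fetch_page, rows=1000):
    """Yield every document, fetching one page at a time via start/rows."""
    start = 0
    while True:
        docs = fetch_page(start, rows)
        if not docs:
            break
        yield from docs
        start += len(docs)


def solr_fetcher(core_url, sort="time asc"):
    """Return a fetch_page function that queries a core's /select handler.

    core_url and the sort field are assumptions; adjust for the local setup.
    """
    def fetch_page(start, rows):
        params = urlencode({"q": "*:*", "wt": "json", "sort": sort,
                            "start": start, "rows": rows})
        with urlopen(f"{core_url}/select?{params}") as resp:
            return json.load(resp)["response"]["docs"]
    return fetch_page


def dump_jsonl(docs, out):
    """Write one JSON document per line to an open text file; return the count."""
    count = 0
    for doc in docs:
        out.write(json.dumps(doc, sort_keys=True) + "\n")
        count += 1
    return count
```

Keeping the paging separate from the HTTP call means the same loop 
could be pointed at the authority core, or tested without a running 
Solr by passing in a fake fetch_page.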

And/or we could put some resources into generic solr reindexing code. We 
have a first cut from Terry Brady at Georgetown, linked from 
https://jira.duraspace.org/browse/DS-2489 (to add uids to the statistics 
core). I've taken his approach and made it a little more generic; Hardy 
is testing it at the moment but it looks like it won't quite get us 
there. It's linked from https://jira.duraspace.org/browse/DS-2486

cheers,
Andrea

-- 
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand

