Hi,

On 12/03/15 09:11, Mark H. Wood wrote:
> Several recent issues (DS-2337, DS-2487, and perhaps DS-2488) suggest
> that we should step back and take a long look at how we are using the
> Solr 'statistics' core.

I agree with Mark, we need to at least make sure that we keep the data safe across upgrades.

Just a note, even the dspace.log files are not helping 100% since they don't contain information that is now stored in the solr statistics (referer, user agent) and some of the derived information may change over time (geo / DNS lookups of IP addresses).

Usage statistics are not the primary purpose of a repository, but my repository managers at least have made it very clear that this data is important to them (in terms of promotion of the repository etc).

> I think we need to give some more thought to how we can readily
> preserve usage records over DSpace upgrades and system failures.

Let's also not forget that the authority core might be in a similar situation at some point down the track. When enabled, it is the main data source for authority data, if I understand things correctly. The DSpace authority key, in that case, holds an id that is only meaningful in the context of the authority solr core. So you would not lose the disambiguation, at least, but you would lose the link(s) to external authority sources.

I'm assuming that the ElasticSearch statistics are affected by a similar issue - using a mechanism not designed as a primary data source to actually be the primary data source. But I haven't looked at the ElasticSearch stats at all, so I may well be wrong on this.

OAI and discovery are fine, they just hold a copy of data from elsewhere and there is no problem with blowing away these cores and re-creating them from the source data.

> I should admit here that I am skeptical of using Solr as the
> statistics store *at all*, however well it works most of the time.
> But it is not my purpose in this note to advocate for something
> different.

I'm not sure I have a solution either, other than perhaps a clear statement from the committers to keep the data safe.
At the minimum, it will mean that every pull request will need to be examined for changes to the solr schema and if there is one, it needs to come with an upgrade path or the PR can't be merged.

We could also put some resources into improving the existing import/export functionality for the statistics so that no data gets lost during those processes. This would allow people to back up their statistics data regularly or at least before upgrades. We'd need something similar for the authority core.

And/or we could put some resources into generic solr reindexing code. We have a first cut from Terry Brady at Georgetown, linked from https://jira.duraspace.org/browse/DS-2489 (to add uids to the statistics core). I've taken his approach and made it a little more generic; Hardy is testing it at the moment but it looks like it won't quite get us there. It's linked from https://jira.duraspace.org/browse/DS-2486

cheers,
Andrea

--
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand

_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel