Re: [Dspace-devel] We need to think a bit more about how we use the 'statistics' Solr core

Peter Dietz Thu, 12 Mar 2015 11:39:03 -0700

ES is equally guilty of being a statistics data source, since it stores the
original/raw data. Statistics complicate DSpace's role in preserving assets:
stats are a value-add, not a core repository function. But since repo
managers enjoy statistics, we have to offer them. I would, however, like to
offload the role of stats to a third party such as Google Analytics.


Back to the relevant discussion. Both SOLR and ES prefer to be just
indexes, something that you could rebuild if necessary. If you have all of
your dspace.log files you could potentially rebuild, but it's very
laborious. I've considered having an alternative log file,
logs/usage-stats.<date>.log, similar to the output of
stats-log-exporter|converter and the input of stats-log-importer. That file
would then be the source of record, and the stats engines could rebuild
from it. Currently more information is stored in the stats engines than
gets logged to dspace.log (useragent, hostname, ...).
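
A minimal sketch of the "log file as source of record" idea, assuming one
JSON object per line and one file per day. The JSON-lines layout, the field
names and the helper names (log_path, append_event, replay) are all
hypothetical; this is not the format used by the existing stats-log tools.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_path(log_dir, when):
    # One file per day, following the logs/usage-stats.<date>.log naming idea.
    return Path(log_dir) / f"usage-stats.{when:%Y-%m-%d}.log"


def append_event(log_dir, event, when=None):
    # Append one usage event as a single JSON line (format is an assumption).
    when = when or datetime.now(timezone.utc)
    record = dict(event, time=when.isoformat())
    path = log_path(log_dir, when)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return path


def replay(log_dir):
    # Yield every logged event in file-name (date) order, so that a stats
    # engine could be rebuilt by feeding these records back into it.
    for path in sorted(Path(log_dir).glob("usage-stats.*.log")):
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)
```

Because everything the stats engine stores (user agent, hostname, and so on)
would be written to the file first, the index becomes disposable: a rebuild
is a loop over replay() that re-posts each record.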

I've added the ability for SOLR to export its data to csv:
https://github.com/DSpace/DSpace/commit/f57619d726c07535ce786a3f79e9c39d56fd9031
So, potentially, one could run that regularly to have backup data points...
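
For illustration only, here is a sketch of what a regular month-by-month
backup could look like when done directly against Solr's standard select
handler with wt=csv. It is not the code from the commit above. The core URL,
the use of a date field named "time" and the helper names are assumptions.

```python
from urllib.parse import urlencode


def month_range(year, month):
    # Half-open range [start, end) in Solr's UTC date syntax.
    ny, nm = (year + 1, 1) if month == 12 else (year, month + 1)
    return (f"{year:04d}-{month:02d}-01T00:00:00Z",
            f"{ny:04d}-{nm:02d}-01T00:00:00Z")


def export_url(core_url, year, month, rows=10000, start=0):
    # Build a select URL that returns one month of records as CSV.
    # "time" is assumed to be the date field of the statistics core.
    begin, end = month_range(year, month)
    params = {
        "q": "*:*",
        "fq": f"time:[{begin} TO {end}}}",  # inclusive start, exclusive end
        "sort": "time asc",                 # stable order for paging
        "wt": "csv",
        "rows": rows,
        "start": start,
    }
    return f"{core_url.rstrip('/')}/select?{urlencode(params)}"
```

A cron job could fetch export_url(...) for the previous month and store the
response next to the other backups; paging with start/rows keeps each
request small on large cores.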

________________
Peter Dietz
Longsight
www.longsight.com
pe...@longsight.com
p: 740-599-5005 x809

On Wed, Mar 11, 2015 at 6:11 PM, Andrea Schweer <schw...@waikato.ac.nz>
wrote:

> Hi,
>
> On 12/03/15 09:11, Mark H. Wood wrote:
> > Several recent issues (DS-2337, DS-2487, and perhaps DS-2488) suggest
> > that we should step back and take a long look at how we are using the
> > Solr 'statistics' core.
>
> I agree with Mark, we need to at least make sure that we keep the data
> safe across upgrades. Just a note, even the dspace.log files are not
> helping 100% since they don't contain information that is now stored in
> the solr statistics (referer, user agent) and some of the derived
> information may change over time (geo / DNS lookups of IP addresses).
> Usage statistics are not the primary purpose of a repository, but my
> repository managers at least have made it very clear that this data is
> important to them (in terms of promotion of the repository etc).
>
> > I think we need to give some more thought to how we can readily
> > preserve usage records over DSpace upgrades and system failures.
>
> Let's also not forget that the authority core might be in a similar
> situation at some point down the track. When enabled, it is the main
> data source for authority data, if I understand things correctly. The
> DSpace authority key, in that case, holds an id that is only meaningful
> in the context of the authority solr core. So you would not lose the
> disambiguation, at least, but you would lose the link(s) to external
> authority sources.
>
> I'm assuming that the ElasticSearch statistics are affected by a similar
> issue - using a mechanism not designed as a primary data source to
> actually be the primary data source. But I haven't looked at the
> ElasticSearch stats at all, so I may well be wrong on this.
>
> OAI and discovery are fine, they just hold a copy of data from elsewhere
> and there is no problem with blowing away these cores and re-creating
> them from the source data.
>
> > I should admit here that I am skeptical of using Solr as the
> > statistics store *at all*, however well it works most of the time.
> > But it is not my purpose in this note to advocate for something
> > different.
>
> I'm not sure I have a solution either, other than perhaps a clear
> statement from the committers to keep the data safe. At the minimum, it
> will mean that every pull request will need to be examined for changes
> to the solr schema and if there is one, it needs to come with an upgrade
> path or the PR can't be merged.
>
> We could also put some resources into improving the existing
> import/export functionality for the statistics so that no data gets lost
> during those processes. This would allow people to back up their
> statistics data regularly or at least before upgrades. We'd need
> something similar for the authority core.
>
> And/or we could put some resources into generic solr reindexing code. We
> have a first cut from Terry Brady at Georgetown, linked from
> https://jira.duraspace.org/browse/DS-2489 (to add uids to the statistics
> core). I've taken his approach and made it a little more generic; Hardy
> is testing it at the moment but it looks like it won't quite get us
> there. It's linked from https://jira.duraspace.org/browse/DS-2486
>
> cheers,
> Andrea
>
> --
> Dr Andrea Schweer
> IRR Technical Specialist, ITS Information Systems
> The University of Waikato, Hamilton, New Zealand
>
>
>
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming The Go Parallel Website,
> sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for
> all
> things parallel software development, from weekly thought leadership blogs
> to
> news, videos, case studies, tutorials and more. Take a look and join the
> conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> Dspace-devel mailing list
> Dspace-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-devel
>

