Re: [Dspace-devel] We need to think a bit more about how we use the 'statistics' Solr core

Brian Freels-Stendel Thu, 12 Mar 2015 11:48:00 -0700

I've always been leery of statistics like those DSpace keeps.  They're more 
akin to a library patron picking a book off a shelf and setting it on a table, 
rather than actually using it for anything.  (I know, getting rid of them would 
bring masses of pitchfork-toting authors to all our doors.)  The metric 
academia was more concerned with (before the Web) was citation counts.  Google 
Scholar has such a thing that might be useful, depending; I haven't dug deeply 
enough to know how complete it might be.

http://scholar.google.com/intl/en-US/scholar/citations.html

Just food for thought.

B--

From: Peter Dietz [mailto:pe...@longsight.com]
Sent: Thursday, March 12, 2015 12:35 PM
To: Andrea Schweer
Cc: DSpace Developers
Subject: Re: [Dspace-devel] We need to think a bit more about how we use the 
'statistics' Solr core

ES is equally guilty of being a statistics data source, since it stores the 
original/raw data. So statistics complicate DSpace's role in preserving assets, 
since stats are a value-add and not a core repository function. But since repo 
managers enjoy statistics, we can't not offer them. I would, however, like to 
offload the role of stats to a third party, such as Google Analytics.

Back to the relevant discussion. Both SOLR and ES prefer to be just indexes, 
something that you could rebuild if necessary. If you have all the dspace.log 
files you could potentially rebuild, but it's very laborious. I've considered 
having an alternative log file, logs/usage-stats.<date>.log, that is similar to 
the output of stats-log-exporter|convertor and the input of stats-log-importer. 
That file would then be the source of record, and the stats engines could 
rebuild from it. Currently, more information is stored in the stats engines 
than gets logged to dspace.log (useragent, hostname, ...).
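
A minimal sketch of the "log file as source of record" idea, assuming one JSON 
object per line. The field names (time, id, ip, userAgent, referer) and both 
function names are hypothetical and are not the format used by the existing 
stats-log tools; the point is only that an append-only file carrying the extra 
fields can be replayed to rebuild an index.

```python
import json
from collections import Counter


def format_event(timestamp, object_id, ip, user_agent, referer):
    """Serialize one usage event as a single JSON line (hypothetical format)."""
    return json.dumps(
        {
            "time": timestamp,
            "id": object_id,
            "ip": ip,
            "userAgent": user_agent,
            "referer": referer,
        },
        sort_keys=True,
    )


def replay(lines):
    """Rebuild a per-object hit count by replaying the log lines in order.

    A real rebuild would re-post each event to the stats engine instead of
    counting; the counter stands in for that step.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log file
        event = json.loads(line)
        counts[event["id"]] += 1
    return counts
```

Because each line is self-contained, such a file could be rotated by date and 
backed up like any other log, and the index could be dropped and rebuilt from 
it at any time.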

I've added the ability for SOLR to export its data to csv: 
https://github.com/DSpace/DSpace/commit/f57619d726c07535ce786a3f79e9c39d56fd9031
So, potentially, one could run that regularly to have backup data points...

________________
Peter Dietz
Longsight
www.longsight.com
pe...@longsight.com
p: 740-599-5005 x809

On Wed, Mar 11, 2015 at 6:11 PM, Andrea Schweer 
<schw...@waikato.ac.nz> wrote:
Hi,

On 12/03/15 09:11, Mark H. Wood wrote:
> Several recent issues (DS-2337, DS-2487, and perhaps DS-2488) suggest
> that we should step back and take a long look at how we are using the
> Solr 'statistics' core.

I agree with Mark, we need to at least make sure that we keep the data
safe across upgrades. Just a note, even the dspace.log files are not
helping 100% since they don't contain information that is now stored in
the solr statistics (referer, user agent) and some of the derived
information may change over time (geo / DNS lookups of IP addresses).
Usage statistics are not the primary purpose of a repository, but my
repository managers at least have made it very clear that this data is
important to them (in terms of promotion of the repository etc).

> I think we need to give some more thought to how we can readily
> preserve usage records over DSpace upgrades and system failures.

Let's also not forget that the authority core might be in a similar
situation at some point down the track. When enabled, it is the main
data source for authority data, if I understand things correctly. The
DSpace authority key, in that case, holds an id that is only meaningful
in the context of the authority solr core. So you would not lose the
disambiguation, at least, but you would lose the link(s) to external
authority sources.

I'm assuming that the ElasticSearch statistics are affected by a similar
issue - using a mechanism not designed as a primary data source to
actually be the primary data source. But I haven't looked at the
ElasticSearch stats at all, so I may well be wrong on this.

OAI and discovery are fine, they just hold a copy of data from elsewhere
and there is no problem with blowing away these cores and re-creating
them from the source data.

> I should admit here that I am skeptical of using Solr as the
> statistics store *at all*, however well it works most of the time.
> But it is not my purpose in this note to advocate for something
> different.

I'm not sure I have a solution either, other than perhaps a clear
statement from the committers to keep the data safe. At the minimum, it
will mean that every pull request will need to be examined for changes
to the solr schema and if there is one, it needs to come with an upgrade
path or the PR can't be merged.
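
As a rough illustration of what such a check could look like, here is a 
minimal sketch that compares the <field> definitions of two versions of a 
Solr schema.xml and reports what was added, removed or changed. The function 
names are hypothetical, dynamic fields and field types are ignored, and this 
is not an existing DSpace tool.

```python
import xml.etree.ElementTree as ET


def field_defs(schema_xml):
    """Map field name -> attributes for every <field> element in a schema string."""
    root = ET.fromstring(schema_xml)
    return {f.get("name"): dict(f.attrib) for f in root.iter("field")}


def schema_changes(old_xml, new_xml):
    """Report fields added, removed, or redefined between two schema versions."""
    old, new = field_defs(old_xml), field_defs(new_xml)
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "changed": sorted(name for name in shared if old[name] != new[name]),
    }
```

Any non-empty result would be a signal that the pull request needs to come 
with an upgrade path for existing data.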

We could also put some resources into improving the existing
import/export functionality for the statistics so that no data gets lost
during those processes. This would allow people to back up their
statistics data regularly or at least before upgrades. We'd need
something similar for the authority core.

And/or we could put some resources into generic solr reindexing code. We
have a first cut from Terry Brady at Georgetown, linked from
https://jira.duraspace.org/browse/DS-2489 (to add uids to the statistics
core). I've taken his approach and made it a little more generic; Hardy
is testing it at the moment but it looks like it won't quite get us
there. It's linked from https://jira.duraspace.org/browse/DS-2486

cheers,
Andrea

--
Dr Andrea Schweer
IRR Technical Specialist, ITS Information Systems
The University of Waikato, Hamilton, New Zealand

------------------------------------------------------------------------------
Dive into the World of Parallel Programming The Go Parallel Website, sponsored
by Intel and developed in partnership with Slashdot Media, is your hub for all
things parallel software development, from weekly thought leadership blogs to
news, videos, case studies, tutorials and more. Take a look and join the
conversation now. http://goparallel.sourceforge.net/
_______________________________________________
Dspace-devel mailing list
Dspace-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-devel
