On Tue, Mar 24, 2015 at 11:34:44AM -0400, Peter Dietz wrote:
> Also. What are people thinking would be a safe preservation location for
> usage events? i.e. for people concerned about resources.

What I've been thinking is duplicated DVD-ROMs in fire-insulated
storage, right alongside the content backups.

For *really* long-term statistics, I might argue for computing a few
aggregations that would be useful for further statistical processing,
and only keeping, say, a five-year moving window of actual event
data.  We probably need some advice from people who design long-term
statistical studies, to find out what sorts of aggregates would be
most useful.
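
For instance (purely a sketch: I'm assuming here that events have been
archived one per line as tab-separated timestamp / action / object-type /
object-ID / client-address records, a format taken up below), a monthly
per-item view rollup could be as simple as:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    /**
     * Roll a flat usage-event log up into per-item monthly view counts.
     * Expects one event per line:  ISO-8601 timestamp, action, object
     * type, object ID, client address, separated by tabs.
     */
    public class MonthlyViewRollup {
        public static void main(String[] args) throws IOException {
            // month -> (item ID -> view count), months kept sorted
            Map<String, Map<String, Integer>> rollup =
                    new TreeMap<String, Map<String, Integer>>();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split("\t");
                if (f.length < 4 || !"VIEW".equals(f[1])) {
                    continue;  // count only well-formed view events
                }
                String month = f[0].substring(0, 7);  // e.g. "2015-03"
                Map<String, Integer> counts = rollup.get(month);
                if (counts == null) {
                    counts = new HashMap<String, Integer>();
                    rollup.put(month, counts);
                }
                Integer n = counts.get(f[3]);
                counts.put(f[3], n == null ? 1 : n + 1);
            }
            in.close();
            for (Map.Entry<String, Map<String, Integer>> m : rollup.entrySet()) {
                for (Map.Entry<String, Integer> c : m.getValue().entrySet()) {
                    System.out.println(m.getKey() + "\t" + c.getKey()
                            + "\t" + c.getValue());
                }
            }
        }
    }

An aggregate like that stays tiny next to the raw events, which is what
makes a moving window over the event data workable.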

> Could it be feasible to export all Solr usage event data to
> log/usage-event.<date>.log, and then have all new real-time usage
> events from now on written to usage-event.log?  Then, when we need to
> populate a new statistics engine, we could populate it by indexing
> the usage-event logs.

Sounds nice.  The tricky thing with exporting from Solr is if you have
fields that are indexed but not stored.  How do you recover those
data?

The best approach with Solr (or ES), I think, is to keep raw records
in some simple format somewhere else, and only use the index engine as
a cache.  W.r.t. DSpace: write an event consumer that produces simple
flat files, and add it to the consumer list, so that the indexer and
your event archival files are fed in parallel.
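
A minimal sketch of such a consumer, assuming the DSpace 5-era
usage-event listener API (org.dspace.usage.AbstractUsageEventListener);
the class name, file path, and record layout are my own inventions:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Date;

    import org.apache.log4j.Logger;
    import org.dspace.services.model.Event;
    import org.dspace.usage.AbstractUsageEventListener;
    import org.dspace.usage.UsageEvent;

    /**
     * Archives each usage event as one tab-separated line in a flat
     * file, independent of the Solr statistics core.
     */
    public class FlatFileUsageEventListener extends AbstractUsageEventListener {
        private static final Logger log =
                Logger.getLogger(FlatFileUsageEventListener.class);

        // A real implementation would rotate this file daily and keep
        // it open between events; a fixed path keeps the sketch short.
        private static final String LOG_PATH = "/dspace/log/usage-event.log";

        @Override
        public void receiveEvent(Event event) {
            if (!(event instanceof UsageEvent)) {
                return;  // only usage events are archived here
            }
            UsageEvent ue = (UsageEvent) event;
            PrintWriter out = null;
            try {
                out = new PrintWriter(new FileWriter(LOG_PATH, true));
                // timestamp, action, object type, object ID, client address
                out.printf("%1$tFT%1$tT\t%2$s\t%3$d\t%4$d\t%5$s%n",
                        new Date(),
                        ue.getAction().name(),
                        ue.getObject().getType(),
                        ue.getObject().getID(),
                        ue.getRequest().getRemoteAddr());
            } catch (IOException e) {
                // Archival failure must not break the request being served.
                log.error("Could not archive usage event", e);
            } finally {
                if (out != null) {
                    out.close();
                }
            }
        }
    }

Registered in dspace.cfg next to SolrLoggerUsageEventListener (under
the plugin.sequence.org.dspace.usage.AbstractUsageEventListener key),
it would see the same event stream as the Solr indexer, in parallel.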

If I haven't kept any external records that could be used to
reconstitute the cached data, then I have a problem:  when the cache
doesn't contain the actual values for some fields, there is simply no
way to recover all of the data.

> From my perspective, we've got 200GB+ of solr/ES indexes across instances,
> plus memory and CPU and ES instances, and it would be nice to outsource
> this work. Especially if GA is free.

The thing I always remember about Google's freebies is that they tend
to disappear as Google loses interest, having learned what they wanted
to know.  They're kind enough to share their tools with us, but the
design and expected lifetime of those tools are tuned to Google's
needs.

Statistical analysis is a creative enterprise.  The more decisions we
make up front, the more processing we do up front, the more clever
ideas we forestall.  I think DSpace should capture events and do as
little beyond that as is consistent with privacy and with the
practical issues of
storage and access.  Then people who want to learn things can load
them into OOCalc or R or what-have-you and crunch to their hearts'
content.

Rather than one-size-fits-all built-in statistical data products, our
UI(s) should have a few empty spots:  Here there be Statistics.  Plug
in whatever your site thinks important.  DSpace could ship with some
plain-Jane example plugins good for casual use.
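
To make that concrete (again only a sketch; nothing like this exists
today, and every name here is hypothetical), the plug-in contract
could be as small as:

    import org.dspace.content.DSpaceObject;

    /**
     * A named empty spot in the UI that a site fills with whatever
     * statistic it thinks important.
     */
    public interface StatisticsPanel {
        /** Which UI slot this panel fills, e.g. "item-page-stats". */
        String getSlotName();

        /** Render this panel's statistic for one object as markup. */
        String render(DSpaceObject subject);
    }

Sites would register implementations through the usual plugin
mechanism, and the UI would simply iterate over whatever is configured
for each slot.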

Here, we've been asked to export some sort of usage data to be
gathered together with other services' equivalents and combined in a
sort of usage portal independent of any single product.  What's
interesting to us is not what DSpace is doing or what ContentDM is
doing, but what the organization is doing.

-- 
Mark H. Wood
Lead Technology Analyst

University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
