On Tue, Mar 24, 2015 at 11:34:44AM -0400, Peter Dietz wrote:
> Also, what are people thinking would be a safe preservation location for
> usage events? I.e., for people concerned about resources.
What I've been thinking is duplicated DVD-ROMs in fire-insulated storage, right alongside the content backups. For *really* long-term statistics, I might argue for computing a few aggregations that would be useful for further statistical processing, and keeping only, say, a five-year moving window of actual event data. We probably need some advice from people who design long-term statistical studies, to find out what sorts of aggregates would be most useful.

> Could it be feasible to export all SOLR usage event data to
> log/usage-event.<date>.log, and then have all new real-time usage
> events from now on written to usage-event.log? Then, when we need to
> populate a new statistics engine, we could populate it by indexing
> usage-event.logs.

Sounds nice. The tricky thing with exporting from Solr is if you have fields that are indexed but not stored: how do you recover those data? The best approach with Solr (or ES), I think, is to keep the raw records in some simple format somewhere else, and use the index engine only as a cache. W.r.t. DSpace: write an event consumer that produces simple flat files, and add it to the consumer list, so that the indexer and your event archive files are fed in parallel (a minimal sketch appears below my signature). If I haven't kept any external records that could be used to reconstitute the cached data, then I have a problem, and if the cache doesn't contain the actual data for some fields then there is just no way to recover all of the data.

> From my perspective, we've got 200GB+ of solr/ES indexes across instances,
> plus memory and CPU and ES instances, and it would be nice to outsource
> this work. Especially if GA is free.

The thing I always remember about Google's freebies is that they tend to disappear as Google loses interest, having learned what they wanted to know. They're kind enough to share their tools with us, but the design and expected lifetime of those tools are tuned to Google's needs.

Statistical analysis is a creative enterprise. The more decisions we make up front, and the more processing we do up front, the more clever ideas we forestall. I think DSpace should capture events and do as little more as is consistent with privacy and with practical issues of storage and access. Then people who want to learn things can load the events into OOCalc or R or what-have-you and crunch to their hearts' content. Rather than one-size-fits-all built-in statistical data products, our UI(s) should have a few empty spots: Here There Be Statistics. Plug in what your site thinks important. DSpace could ship with some plain-Jane example plugins good for casual use.

Here we've been asked to export some sort of usage data to be gathered together with other services' equivalents and combined in a sort of usage portal independent of any single product. What's interesting to us is not what DSpace is doing or what ContentDM is doing, but what the organization is doing.

-- 
Mark H. Wood
Lead Technology Analyst
University Library
Indiana University - Purdue University Indianapolis
755 W. Michigan Street
Indianapolis, IN 46202
317-274-0749
www.ulib.iupui.edu
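P.S. To make the flat-file consumer idea concrete, here is a minimal, untested sketch against the org.dspace.event.Consumer interface. The package and class names, the file-naming scheme, and the dspace.cfg entries in the header comment are placeholders I've made up for illustration, and Event.toString() stands in for whatever stable, documented record format a real archiver ought to emit.

// FlatFileEventConsumer.java -- a rough sketch only; names are placeholders.
//
// Registration would follow the usual consumer pattern in dspace.cfg,
// something like:
//
//   event.consumer.flatfile.class = org.example.FlatFileEventConsumer
//   event.consumer.flatfile.filters = All+All
//   event.dispatcher.default.consumers = <existing consumers>, flatfile

package org.example;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.Writer;

import org.dspace.core.Context;
import org.dspace.event.Consumer;
import org.dspace.event.Event;

public class FlatFileEventConsumer implements Consumer
{
    private Writer archive;

    @Override
    public void initialize() throws Exception
    {
        // One file per day keeps the archive easy to rotate and back up.
        // The name is fixed when the consumer starts, so real code would
        // also roll the file over at midnight.
        String name = String.format("usage-event.%1$tY-%1$tm-%1$td.log",
                System.currentTimeMillis());
        archive = new BufferedWriter(new FileWriter(name, true)); // append
    }

    @Override
    public void consume(Context ctx, Event event) throws Exception
    {
        // Write the raw event as one line of plain text.
        archive.write(event.toString());
        archive.write('\n');
    }

    @Override
    public void end(Context ctx) throws Exception
    {
        archive.flush(); // push each batch of events to disk
    }

    @Override
    public void finish(Context ctx) throws Exception
    {
        archive.close();
    }
}

Registered alongside the indexing consumer, the dispatcher would feed both the same events in parallel, and the flat files could later be replayed to populate whatever statistics engine comes next.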