This points out a problem that I think we (and many other contemporary
projects) have all over the place:  our application is expected to grow
steadily and without limit, yet we assume over and over again that
the problem is small and bounded.

There is no way around it:  if your repository is large and busy,
sooner or later you will be disappointed by the performance of ad-hoc
queries no matter how many resources you throw at them.

One answer to this is to depend less on ad-hoc queries.  Do you have
some "usual questions" to be answered over and over?  Do you really
need up-to-the-second answers?  Would it be good enough to run
periodic reports and accumulate them?  Some other machine with SPSS or
R or whatever can grind cases all night, if need be, and leave your
monthly abstract waiting in your inbox the next day.  (I want to find
the time to extend DSpace to facilitate this.)  If the periodic
abstractions are saved in raw form before rendering, they become cheap
inputs to longer-range reports.  There are *far* more efficient
methods than those presently provided for extracting information from
vast quantities of data.

Once periodic statistical products are available, they can be simply
fetched over and over again and slotted into DSpace pages to provide
tolerably up-to-date views of activity quickly and cheaply.  We just
don't do that yet.

Once periodic statistical products are available, we don't have to
keep twenty years of event data in Solr; we can purge old cases to
dead storage and combine precalculated summaries with live statistics
over only the latest events to keep the numbers fresh without having
responsiveness suffer more and more over time.  We just don't do that
yet.

Once we have a well-designed way to get cases out of DSpace for use
with other tools, we can produce as many streams as we wish, selected
any way that makes sense.  We can cheaply provide custom-tailored data
products to individual contributors and other consumers for their own
analysis.  We just don't do that yet.
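Such a stream need not be fancy: a selection predicate plus a serializer covers a lot of custom-tailored products. A rough sketch (the case fields and CSV choice are illustrative, not a DSpace interface):

```python
# Hypothetical sketch: stream selected cases out of the repository as
# CSV for analysis in R, SPSS, or whatever the consumer prefers.
import csv
import io

def export_cases(cases, selected, fields):
    """cases: iterable of dicts; selected: predicate choosing which
    cases to emit; fields: the columns the consumer asked for."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for case in cases:
        if selected(case):
            writer.writerow(case)
    return buf.getvalue()
```

One generic export path, parameterized by selection and field list, is how you get "as many streams as we wish" without writing a new extractor each time.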

There's still an important place for ad-hoc query, but how often would
something less expensive do just as well?  ALL cases are historical;
they're not going to change.  We only need to recalculate when we
change our view of the cases.

-- 
Mark H. Wood, Lead System Programmer   [email protected]
Asking whether markets are efficient is like asking whether people are smart.

_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech