+1 to what Mark Wood says.

An additional (parallel) thought -- when I was at U of Illinois, we ran 
into similar scalability issues with one of the older statistics 
"add-ons" we were using (the one initially built by U of Rochester that 
stored stats in the DSpace Database).  The way we got around it was the 
following:

We made a deliberate decision to aggregate our data & actively purge older 
"event data". This resulted in an *immediate* scalability improvement. 
To better explain...

Essentially this older U of Rochester stats engine worked similarly to the 
new Solr Statistics engine, except that it used the DB instead of Solr. 
So, it tracked each statistical "event", including IP address, what the 
event was, etc. Over time the stats queries became rather expensive as 
the tables grew and grew. The tables were also full of IP address info 
that we really didn't need to keep around forever, along with information 
about old web spiders that we really didn't care about.  (As you can 
tell, this is all very parallel to the current Solr Statistics issues.)

So, as I said, we aggregated things. We decided to only keep IP 
addresses/full statistical events for a period of *one month*.  After 
that, all non-spider hits were aggregated/totaled into a "monthly 
totals" table (we threw out anything that was a web spider -- as that 
data was not useful and just made tables larger & queries more complex).
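To sketch the idea (the table and column names here are made up for 
illustration -- the actual U of Rochester schema was different), the 
monthly roll-up amounted to something like:

```python
import sqlite3

# Hypothetical schema, just to illustrate the aggregate-then-purge idea.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE stats_event (
        ip TEXT, event_type TEXT, item_id INTEGER,
        event_month TEXT,          -- e.g. '2011-09'
        is_spider INTEGER          -- 1 = known web spider
    );
    CREATE TABLE monthly_totals (
        month TEXT, event_type TEXT, item_id INTEGER, total INTEGER
    );
""")

def roll_up_month(month):
    """Aggregate one month's raw events into monthly totals,
    dropping spider hits, then purge the raw rows (IPs included)."""
    db.execute("""
        INSERT INTO monthly_totals (month, event_type, item_id, total)
        SELECT event_month, event_type, item_id, COUNT(*)
        FROM stats_event
        WHERE event_month = ? AND is_spider = 0
        GROUP BY event_month, event_type, item_id
    """, (month,))
    db.execute("DELETE FROM stats_event WHERE event_month = ?", (month,))
    db.commit()
```

After the roll-up, the raw event table only ever holds the current 
retention window, and the spider hits never make it into the totals at all.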

Although I don't think we went this far at U of Illinois, you could do a 
secondary aggregation and then aggregate/total stats again at a *yearly* 
level.
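That second-stage roll-up is even simpler, since it only ever reads the 
(already small) monthly totals. A rough sketch, assuming the monthly 
totals are keyed by 'YYYY-MM':

```python
from collections import Counter

def yearly_totals(monthly):
    """Collapse {'YYYY-MM': {item_id: hits}} monthly totals into
    {'YYYY': {item_id: hits}} -- a second-stage aggregation."""
    yearly = {}
    for month, per_item in monthly.items():
        year = month[:4]
        yearly.setdefault(year, Counter()).update(per_item)
    return {year: dict(counts) for year, counts in yearly.items()}
```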

The idea here is that you make conscious decisions around what 
information is important and aggregate it. Stuff that is not important 
to keep forever (e.g. exact IP addresses for all hits, information from 
known-spiders) can just be discarded during the aggregation process. 
The aggregation also simplifies larger queries (especially ones for 
yearly/monthly info, as you no longer need to perform complex 
calculations -- it's just a simple lookup).

If we brought this same sort of idea forward into Solr, I think you'd be 
less likely to encounter such performance issues. We'd only keep around 
full event details for a limited period of time (a month / 6 months), 
after which we'd discard information that isn't necessary to generate 
the reports & aggregate everything else.
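Again, I haven't tried this, but the purge side of it might start as 
something like the following -- building a Solr delete-by-query for 
events past the retention window. (I'm assuming the event timestamp 
lives in a field named 'time', as in the DSpace statistics core; 
double-check your schema before trusting this.)

```python
from datetime import datetime, timedelta

def purge_query(retain_days=180, now=None):
    """Build a Solr delete-by-query string that would drop statistics
    events older than the retention window.  The 'time' field name is
    an assumption about the statistics core schema."""
    now = now or datetime.utcnow()
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    return "time:[* TO %s]" % cutoff

# The resulting query could then be sent to Solr's update handler as
# <delete><query>...</query></delete> -- but only AFTER the aggregated
# totals for those months have been written somewhere durable.
```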

Just an idea -- I've never tried this before with the Solr Statistics 
engine. But, a Solr savvy person could likely figure out a way to 
implement this for the benefit of all of us.

- Tim

On 10/18/2011 7:52 AM, Mark H. Wood wrote:
> This points out a problem that I think we (and many other contemporary
> projects) have all over the place:  our application is expected to grow
> steadily  and without limit, yet we assume over and over again that
> the problem is small and bounded.
>
> There is no way around it:  if your repository is large and busy,
> sooner or later you will be disappointed by the performance of ad-hoc
> queries no matter how many resources you throw at them.
>
> One answer to this is to depend less on ad-hoc queries.  Do you have
> some "usual questions" to be answered over and over?  Do you really
> need up-to-the-second answers?  Would it be good enough to run
> periodic reports and accumulate them?  Some other machine with SPSS or
> R or whatever can grind cases all night, if need be, and leave your
> monthly abstract waiting in your inbox the next day.  (I want to find
> the time to extend DSpace to facilitate this.)  If the periodic
> abstractions are saved in raw form before rendering, they become cheap
> inputs to longer-range reports.  There are *far* more efficient
> methods than those presently provided for extracting information from
> vast quantities of data.
>
> Once periodic statistical products are available, they can be simply
> fetched over and over again and slotted into DSpace pages to provide
> tolerably up-to-date views of activity quickly and cheaply.  We just
> don't do that yet.
>
> Once periodic statistical products are available, we don't have to
> keep twenty years of event data in Solr; we can purge old cases to
> dead storage and combine precalculated summaries with live statistics
> over only the latest events to keep the numbers fresh without having
> responsiveness suffer more and more over time.  We just don't do that
> yet.
>
> Once we have a well-designed way to get cases out of DSpace for use
> with other tools, we can produce as many streams as we wish, selected
> any way that makes sense.  We can cheaply provide custom-tailored data
> products to individual contributors and other consumers for their own
> analysis.  We just don't do that yet.
>
> There's still an important place for ad-hoc query, but how often would
> something less expensive do just as well?  ALL cases are historical;
> they're not going to change.  We only need to recalculate when we
> change our view of the cases.
>
>
>
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
>
>
>
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
