+1 to what Mark Wood says. An additional (parallel) thought -- when I was at U of Illinois, we ran into similar scalability issues with one of the older statistics "add-ons" we were using (the one initially built by U of Rochester, which stored stats in the DSpace database). The way we got around it was the following:
We made a deliberate decision to aggregate our data and actively purge older "event data". This resulted in an *immediate* increase in scalability.

To better explain: this older U of Rochester stats engine worked much like the new Solr Statistics engine, except that it used the database instead of Solr. So, it tracked each statistical "event", including the IP address, what the event was, etc. Over time the stats queries became rather expensive as the tables grew and grew. The tables were also full of IP address info that we really didn't need to keep around forever, along with hits from old web spiders that we really didn't care about. (As you can tell, this is all very parallel to the current Solr Statistics issues.)

So, as I said, we aggregated things. We decided to keep IP addresses / full statistical events for a period of only *one month*. After that, all non-spider hits were totaled into a "monthly totals" table (we threw out anything that was a web spider, as that data was not useful and just made the tables larger and the queries more complex). Although I don't think we went this far at U of Illinois, you could do a secondary aggregation and total the stats again at a *yearly* level.

The idea here is that you make conscious decisions about what information is important and aggregate it. Anything that is not important to keep forever (e.g. exact IP addresses for all hits, information from known spiders) can simply be discarded during the aggregation process. The aggregation also simplifies larger queries (especially ones for yearly/monthly info, as you no longer need to perform complex calculations -- it's just a simple lookup).

If we brought this same sort of idea forward into Solr, I think you'd be less likely to encounter such performance issues.
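To make the scheme concrete, here's a rough sketch of the aggregate-and-purge step in Python. This is purely illustrative -- the real add-on worked with database tables, and all field names and data here are invented:

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical raw event records, as a DB-backed stats engine might store them.
events = [
    {"time": datetime(2011, 9, 3), "ip": "10.0.0.1", "item": "item-1", "is_spider": False},
    {"time": datetime(2011, 9, 5), "ip": "66.249.0.9", "item": "item-1", "is_spider": True},
    {"time": datetime(2011, 10, 15), "ip": "10.0.0.2", "item": "item-2", "is_spider": False},
]

def aggregate_and_purge(events, now, retention=timedelta(days=30)):
    """Roll events older than `retention` into monthly totals and drop them.

    Spider hits are discarded entirely; recent events keep full detail.
    """
    monthly_totals = Counter()  # (year, month, item) -> hit count
    recent = []
    cutoff = now - retention
    for ev in events:
        if ev["time"] >= cutoff:
            recent.append(ev)         # keep full detail (incl. IP) for one month
        elif not ev["is_spider"]:
            key = (ev["time"].year, ev["time"].month, ev["item"])
            monthly_totals[key] += 1  # aggregate; IP and spider info discarded
    return monthly_totals, recent

totals, recent = aggregate_and_purge(events, now=datetime(2011, 10, 18))
```

The same pass could feed a second, yearly rollup if you wanted the secondary aggregation mentioned above.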
We'd only keep full event details for a limited period of time (a month, or six months), after which we'd discard information that is not necessary to generate the reports and aggregate everything else.

Just an idea -- I've never tried this before with the Solr Statistics engine. But a Solr-savvy person could likely figure out a way to implement this for the benefit of all of us.

- Tim

On 10/18/2011 7:52 AM, Mark H. Wood wrote:
> This points out a problem that I think we (and many other contemporary
> projects) have all over the place: our application is expected to grow
> steadily and without limit, yet we assume over and over again that
> the problem is small and bounded.
>
> There is no way around it: if your repository is large and busy,
> sooner or later you will be disappointed by the performance of ad-hoc
> queries no matter how many resources you throw at them.
>
> One answer to this is to depend less on ad-hoc queries. Do you have
> some "usual questions" to be answered over and over? Do you really
> need up-to-the-second answers? Would it be good enough to run
> periodic reports and accumulate them? Some other machine with SPSS or
> R or whatever can grind cases all night, if need be, and leave your
> monthly abstract waiting in your inbox the next day. (I want to find
> the time to extend DSpace to facilitate this.) If the periodic
> abstractions are saved in raw form before rendering, they become cheap
> inputs to longer-range reports. There are *far* more efficient
> methods than those presently provided for extracting information from
> vast quantities of data.
>
> Once periodic statistical products are available, they can be simply
> fetched over and over again and slotted into DSpace pages to provide
> tolerably up-to-date views of activity quickly and cheaply. We just
> don't do that yet.
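As a starting point for that Solr-savvy person, the two pieces might look roughly like this: a delete-by-query string to purge old events, and facet parameters that ask Solr itself for monthly, non-spider totals. I haven't tried this; the field names (`time`, `isBot`) are assumptions based on the statistics schema, and the helpers are purely illustrative:

```python
def purge_query(retention="6MONTHS"):
    """Solr delete-by-query string for statistics events older than `retention`.

    Uses Solr date math (NOW-6MONTHS); the `time` field name is an assumption.
    """
    return "time:[* TO NOW-%s]" % retention

def monthly_totals_params():
    """Facet parameters asking Solr for monthly non-spider totals, which a
    periodic job could store in a small summary table or core."""
    return {
        "q": "-isBot:true",  # drop spider hits, as we did in the DB version
        "rows": "0",         # we only want the facet counts, not the events
        "facet": "true",
        "facet.date": "time",
        "facet.date.start": "NOW/MONTH-12MONTHS",
        "facet.date.end": "NOW/MONTH",
        "facet.date.gap": "+1MONTH",
    }
```

The aggregation would need to run (and its output be saved) before the purge, of course.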
>
> Once periodic statistical products are available, we don't have to
> keep twenty years of event data in Solr; we can purge old cases to
> dead storage and combine precalculated summaries with live statistics
> over only the latest events to keep the numbers fresh without having
> responsiveness suffer more and more over time. We just don't do that
> yet.
>
> Once we have a well-designed way to get cases out of DSpace for use
> with other tools, we can produce as many streams as we wish, selected
> any way that makes sense. We can cheaply provide custom-tailored data
> products to individual contributors and other consumers for their own
> analysis. We just don't do that yet.
>
> There's still an important place for ad-hoc query, but how often would
> something less expensive do just as well? ALL cases are historical;
> they're not going to change. We only need to recalculate when we
> change our view of the cases.
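Mark's point about combining precalculated summaries with live statistics over only the latest events can be sketched in a few lines (all data here is invented for illustration):

```python
from datetime import datetime

# Hypothetical precomputed monthly summaries (what a periodic report job saved),
# plus a handful of "live" events from the current, still-open month.
monthly_summary = {(2011, 8): 1500, (2011, 9): 1720}  # (year, month) -> hits
live_events = [datetime(2011, 10, 2), datetime(2011, 10, 17)]

def total_hits(summary, live_events):
    """Historical totals come from cheap precomputed summaries; only the
    small tail of recent events is counted on the fly."""
    return sum(summary.values()) + len(live_events)

total = total_hits(monthly_summary, live_events)
```

The expensive part (scanning old events) happens once, in the periodic job; the per-request cost stays bounded by the size of the live tail.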
_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech

