Hi Jesús,

We've run into Solr statistics performance problems as well. You've posted
that you have a very large Solr index, and unfortunately Solr performance
degrades as the index grows. We don't allow non-administrators to view
statistics for a collection/community/item on production because it slows
the system down too much. However, when we need to provide a report, we copy
the Solr index to another computer, such as a workstation, and view the
statistics locally. A local computer with a lot of memory will run Solr
fine, but a busy server does not run Solr that well on top of everything
else it is doing.

If you want to be able to present reports on your production system, I
think the only thing you can throw at the problem is resources, perhaps by
adding an additional server just to host Solr, similar to how you might have
an additional server just to host MySQL or PostgreSQL. My co-worker and I
have been wondering about swapping out the dspace-stats implementation for a
different engine, that is, removing Solr and using something beefier such as
ElasticSearch, but we haven't implemented anything.

As some others have mentioned, you might be able to figure out how to
get Google Analytics to track all of the hits to your items, communities,
collections, and bitstreams. In that case, you could then query the Google
Analytics API for this information.

Finally, something to "anonymize" the Solr statistics information would be a
good thing. We currently store the IP address of every visitor for every
single request to every resource. Assuming we had a good grip on robots, I
think we could aggregate this to record just the number of hits to a given
resource per hour. After aggregating and pruning, you might end up with a
much smaller Solr index: instead of tens of millions of records, perhaps
just hundreds of thousands. I think one should consult the COUNTER project
before altering one's statistics, though.
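
To make the aggregation idea concrete, here is a minimal Python sketch, not
something we have implemented. It assumes the raw usage events have already
been exported from Solr as dictionaries; the field names used here
(resource_type, resource_id, time, ip, is_bot) are hypothetical and would
need to be mapped to whatever your statistics schema actually uses. It drops
the IP address, skips events flagged as robots, and counts hits per resource
per hour.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp layout of the exported events (hypothetical).
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def aggregate_hourly(events):
    """Collapse per-request events into hit counts per (type, id, hour).

    The IP address is never copied into the result, so the aggregated
    records no longer identify individual visitors.
    """
    counts = Counter()
    for event in events:
        # Skip requests already flagged as robots (assumed boolean field).
        if event.get("is_bot"):
            continue
        timestamp = datetime.strptime(event["time"], TIME_FORMAT)
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.strftime("%Y-%m-%dT%H:00:00Z")
        counts[(event["resource_type"], event["resource_id"], hour)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"resource_type": "item", "resource_id": 10,
         "time": "2011-10-17T06:00:05Z", "ip": "192.0.2.1", "is_bot": False},
        {"resource_type": "item", "resource_id": 10,
         "time": "2011-10-17T06:59:59Z", "ip": "192.0.2.2", "is_bot": False},
        {"resource_type": "item", "resource_id": 10,
         "time": "2011-10-17T07:00:00Z", "ip": "192.0.2.1", "is_bot": False},
        {"resource_type": "item", "resource_id": 10,
         "time": "2011-10-17T07:10:00Z", "ip": "192.0.2.9", "is_bot": True},
    ]
    for key, hits in sorted(aggregate_hourly(sample).items()):
        print(key, hits)
```

Each (resource, hour) pair becomes a single record, so the size of the
aggregated data depends on how many resources are visited per hour rather
than on the number of individual requests.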



Peter Dietz



2011/10/17 Richard Rodgers <rrodg...@mit.edu>

> Hi Jesús:
>
> A lot of statistics work has been done for DSpace over time, but each
> project focuses on different sets of requirements:
> does the data need to appear in the UI, does it offer real-time
> availability (just to name two of the strengths of the SOLR-based system)?
>
> One example of an alternative is
> https://wiki.duraspace.org/display/DSPACE/StatisticsAddOn, though I don't
> know if this has been maintained against versions newer than DSpace 1.6.2.
>
> We run an entirely off-line, monthly reporting system using a database
> designed to accommodate a set of internal administrative requirements,
> where statistics are delivered as a spreadsheet, but that might
> not fulfill your requirements.
>
> The tech list archives and the wiki are good places to start, but you
> could also post your use case(s) to the list and see if any existing
> work better meets your needs.
>
> Hope this helps,
>
> Richard R
>
>
> On Oct 17, 2011, at 6:00 AM, Jesús Martín García wrote:
>
> Hi!
>
> I've been wondering if there is some kind of alternative to Solr
> statistics, because of the high RAM load it places on our system (514
> million records), which is not easy to scale and is very, very slow. So,
> has anyone done some work on an alternative?
>
> Thanks in advance,
>
> Regards,
>
> Jesús
>
> --
> .......................................................................
>       __
>     /   /       Jesús Martín García
> C E / S / C A   Tècnic de Projectes
>   /__ /         Centre de Serveis Científics i Acadèmics de Catalunya
>
> Gran Capità, 2-4 (Edifici Nexus) · 08034 Barcelona
> T. 93 551 6213 · F. 93 205 6979 · jmar...@cesca.cat
> .......................................................................
>
>
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
> _______________________________________________
> DSpace-tech mailing list
> DSpace-tech@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
>