Hi Peter,

Thank you for your ideas on this problem.

I don't think the auto-commit can be the problem. There is some data, 
just not enough. I looked at one of those problem periods and picked 
three stats entries that directly follow one another in Solr
(e.g. 2014-04-04T08:53:38.027Z,  2014-04-04T08:53:38.112Z, 
2014-04-04T08:56:17.714Z), then compared them with the ElasticSearch 
stats. ES has 4 additional entries between the first and the second and 
67 additional ones between the second and the third entry. There could 
have been at most one commit between the three Solr entries. (Auto 
commit is set to the default 15 min in our case.)
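
For anyone who wants to repeat this kind of comparison, here is a minimal
sketch in Python. It assumes the timestamps have already been exported from
both stores as ISO strings in the format shown above; the function names
parse_ts and missing_between are hypothetical and not part of DSpace, Solr
or ElasticSearch.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime


def parse_ts(ts):
    # Assumes timestamps look like 2014-04-04T08:53:38.027Z
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def missing_between(solr_ts, es_ts):
    """For each pair of consecutive Solr timestamps, count the ES
    timestamps that fall strictly between them."""
    solr = sorted(parse_ts(t) for t in solr_ts)
    es = sorted(parse_ts(t) for t in es_ts)
    gaps = []
    for earlier, later in zip(solr, solr[1:]):
        lo = bisect_right(es, earlier)  # index of first ES entry after `earlier`
        hi = bisect_left(es, later)     # index of first ES entry at or after `later`
        gaps.append((earlier, later, max(0, hi - lo)))
    return gaps
```

Each returned tuple holds the two neighbouring Solr timestamps and the
number of ES entries recorded between them, so a non-zero count marks
events that ES saw and Solr did not.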

There were no Tomcat restarts in that time frame, and the period of 
data loss lasted three days that time.

There was no data loss over the weekend, so I still have no log files 
from a problem period.

Solr is running on one machine only and users can't access it directly. 
The Solr STATUS confirms that there are no deletions in the statistics core. 
If the index were corrupt I would not expect Solr to sort itself out on 
its own. Solr and ES use the same code to detect robots and there are 
still plenty of them in ES data. The load on the machine was not high 
enough to trigger any nagios errors.

I really have no idea what is happening.

We are working on the upgrade to DSpace 4 but are not there yet.
It is a pretty irritating problem: we partially justify our 
existence as a service by showing that the community is using us. So 
knowing that there are gaps in the stats data is a problem.

Best regards,
Anja



On 12/04/2014 02:20, Peter Dietz wrote:
> Hi Anja,
>
> One idea I have is that with solr, for performance reasons, we have an
> auto-commit process where UsageEvents don't write/commit/persist into
> SOLR until the commit gets triggered, so they live only in memory until
> triggered to write.
>
> ...so... If these periods had a higher than normal, or perhaps even
> normal occurrence of tomcat restarts, then perhaps pending documents are
> never written, thus lost, upon restart.
>
> Perhaps in the servlet container shutdown process, we could add
> something to have it signal for dspace/solr to write/save/flush/persist
> the documents before shutdown.
>
> Off the top of my head I don't recall how I've written to the elastic
> search API, but I'm assuming I never made these auto-commit / bulk /
> batch submit changes since I never encountered performance issues with
> elastic search. I'm guessing one UsageEvent equals one commit to Elastic
> Search, so no data loss on shutdown.
>
> This is just my guess of what could be happening. I suppose there could
> be other explanations too, such as corrupt solr index, but I would guess
> that would lose a greater amount of data. Another guess would be a
> server migration that didn't sync all data properly... An unguarded solr
> index on which a mischievous user ran a delete query... It's possibly
> possible that solr and elastic search dspace-stats could have slightly
> different robot rule processing (unlikely), so if your usage baseline
> was entirely robots, then GoogleBot taking a few days off from crawling
> you could cause a valley...
>
> Stats is tricky, part of me wishes I just leveraged Google analytics for
> everything, just to have one less system to manage. However I do like
> the flexibility when you build it yourself.
>
> On Apr 11, 2014 9:54 AM, "Anja Le Blanc" <[email protected]
> <mailto:[email protected]>> wrote:
>
>     Hello All,
>
>     (We are running on DSpace 1.8.2)
>
>     I was looking at our stats data for the last year and a half and I
>     noticed periodic drops in views/downloads which are inconsistent with
>     the overall usage pattern. (I did not filter out bots for that
>     exercise.) Numbers dropped for 1 to 5 days to below 10 and even to 0
>     sometimes (from an average of about 5000 per day). I counted about 8
>     such events since Jan 2013. (There are possibly more which don't stand
>     out as much.) Our DSpace was always running and being monitored during
>     that period.
>
>     In our set-up we record stats in both Solr and ElasticSearch (at least
>     we have done for the last half year). The data for ElasticSearch do not
>     show drops for the days where Solr has data gaps. ElasticSearch stats
>     recording is triggered by the same DSpace events as Solr is.
>
>     Unfortunately we have not kept log files for the periods with Solr data
>     gaps.
>
>     Has anyone else seen unexpected fluctuations in their stats?
>     Does anyone have any idea what could cause it? DSpace and Solr were
>     running at the time, since there is some data, just not enough.
>
>     To look at the data, I use the following query for views:
>     http://localhost:8080/solr/statistics/select/?q=type+%3A+2+&version=2.2&start=0&rows=0&indent=on&facet=true&facet.range=time&f.time.facet.range.start=2013-01-01T00:00:00Z&f.time.facet.range.gap=%2B1DAY&f.time.facet.range.end=2014-04-11T00:00:00Z
>
>
>     and this one for downloads:
>     http://localhost:8080/solr/statistics/select/?q=type+%3A+0+&version=2.2&start=0&rows=0&indent=on&facet=true&facet.range=time&f.time.facet.range.start=2013-01-01T00:00:00Z&f.time.facet.range.gap=%2B1DAY&f.time.facet.range.end=2014-04-11T00:00:00Z
>
>     Interestingly we can prove that there were more events.
>
>     Any comments welcome :-)
>
>     Best regards,
>     Anja
>
>     _______________________________________________
>     DSpace-tech mailing list
>     [email protected]
>     <mailto:[email protected]>
>     https://lists.sourceforge.net/lists/listinfo/dspace-tech
>     List Etiquette:
>     https://wiki.duraspace.org/display/DSPACE/Mailing+List+Etiquette
>
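
The daily facet range queries quoted above can also be built in code,
which makes it easier to rerun them for other date ranges. The sketch
below is only an illustration under stated assumptions: facet_url and
low_days are hypothetical helper names, the query is written in the
compact form type:2 rather than with the spaces used in the quoted URL,
and the daily counts are assumed to have been extracted from the Solr
response already as (day, count) pairs.

```python
from urllib.parse import urlencode


def facet_url(base, doc_type, start, end, gap="+1DAY"):
    # Parameter names are taken from the URLs quoted in this thread.
    params = {
        "q": f"type:{doc_type}",
        "rows": 0,
        "facet": "true",
        "facet.range": "time",
        "f.time.facet.range.start": start,
        "f.time.facet.range.gap": gap,
        "f.time.facet.range.end": end,
    }
    # urlencode escapes "+" as %2B and ":" as %3A
    return base + "?" + urlencode(params)


def low_days(daily_counts, threshold=10):
    """Return the days whose count is below the threshold."""
    return [day for day, count in daily_counts if count < threshold]
```

With an average of about 5000 events per day, a threshold of 10 picks
out the days described in the original message; the threshold is an
assumption and may need adjusting for other repositories.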
