On Jan 14, 2014 6:14 AM, "Sean Dague" <[email protected]> wrote:
>
> I'm doing some fundamental refactors on the ER bot to help us try to
> figure out why we are often not tagging bugs that we should be, and have
> found that we're no longer really indexing in real time (which may be a
> huge part of this).
>
> Basically we've got a more or less hard timeout of 13 minutes (it's up
> to 20 attempts with a 40s wait between for random historical reasons)
> from gerrit fail reporting to having the console log indexed in ES. (We
> give it another 13 minutes after that to gather all the rest of the job
> appropriate logs).
>
> Because of the way we process events, timing out on one fail often means
> the next one actually might work, because you'll get 13 minutes from the
> time ER looked at your change, not since your change was posted (we're
> single threaded in this part of the loop).
>
> What I'm seeing right now is that starting up the bot locally it will
> always timeout waiting for results of the first failure that it gets,
> then if you get lucky, it might classify the 2nd fail.
>
> Given that, we really need to be tracking and alerting on ES delays
> somehow, otherwise we're going to lose a lot of the value on this.
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
>
>
> _______________________________________________
> OpenStack-Infra mailing list
> [email protected]
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-infra
>
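To make the numbers in the quoted message concrete, here is a minimal sketch of the polling behaviour it describes. The two constants come from the message (up to 20 attempts, 40s apart); the function names and the `is_indexed` callback are hypothetical and are not taken from the elastic-recheck code.

```python
import time

MAX_ATTEMPTS = 20   # from the thread: up to 20 attempts
WAIT_SECONDS = 40   # from the thread: 40s wait between attempts


def worst_case_timeout_minutes(attempts=MAX_ATTEMPTS, wait=WAIT_SECONDS):
    """Upper bound on the time spent waiting for one failure, in minutes."""
    return attempts * wait / 60.0


def wait_for_index(is_indexed, attempts=MAX_ATTEMPTS, wait=WAIT_SECONDS,
                   sleep=time.sleep):
    """Poll is_indexed() until it returns True or the attempts run out.

    Returns the 1-based attempt number that succeeded, or None on timeout.
    is_indexed is a hypothetical callback standing in for the ES query.
    """
    for attempt in range(1, attempts + 1):
        if is_indexed():
            return attempt
        sleep(wait)  # blocks the caller, as in a single-threaded loop
    return None
```

With these values the bound is 20 * 40s = 800s, a little over 13 minutes. Because the wait blocks the caller, the clock for the next failure only starts once the previous one has been indexed or has timed out, which matches the single-threaded effect described in the quoted message.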
There are a couple of things we can do about this. First, we should
re-enable the logstash 05-08 workers to double the worker count. Second,
we should enable the new geard graphite statistics so that we can see
queue length trends.

I can work on this when I get back from Perth, but don't let that stop
anyone from attacking it first.

Clark
