On 01/05/2014 05:49 PM, James E. Blair wrote:
Sean Dague <s...@dague.net> writes:

I think the main user-visible aspect of this decision is the delay
before unprocessed bugs are made visible.  If a bug starts affecting a
number of jobs, it might be nice to see what bug numbers people are
using for rechecks without waiting for the next cron run.

So my experience is that most rechecks happen > 1 hr after a patch
fails. And the people that are sitting on patches for bugs that have
never been seen before find their way to IRC.

The current state of the world is not all roses and unicorns. The
recheck daemon has died in the past, and it went unnoticed for
*weeks*. So a guarantee that we are only 1 hr delayed would actually
be, on average, better than the delays we've seen over the last six
months of following the event stream.

I wasn't suggesting that we keep the recheck daemon, I was suggesting
moving the real-time observation of rechecks into the elastic-recheck
daemon which will remain an important component of this system for the
foreseeable future.  It is fairly reliable, and if it does die, we will
desperately want to get it running again and fix the underlying problem,
because it is so helpful.

That's a possible place to put it. The daemon is a bit of a mess at the moment, so I was hoping not to refactor it until the end of the month, as part of cleaning it up to handle the additional jobs.
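For reference, the real-time piece mostly amounts to pattern-matching comment-added events off the Gerrit event stream. A rough sketch of that step (the event fields follow Gerrit's stream-events JSON; the function name and the exact recheck pattern here are illustrative, not the daemon's actual code):

```python
import re

# "recheck bug NNNN" / "recheck no bug" comment patterns, anchored at
# line start because Gerrit prefixes comments with "Patch Set N:".
RECHECK_RE = re.compile(r"^recheck (?:bug (\d+)|no bug)",
                        re.IGNORECASE | re.MULTILINE)

def extract_recheck(event):
    """Pull (change_number, bug_number_or_None) out of a Gerrit
    comment-added stream event; return None for anything else."""
    if event.get("type") != "comment-added":
        return None
    m = RECHECK_RE.search(event.get("comment", ""))
    if not m:
        return None
    # m.group(1) is the bug number, or None for "recheck no bug".
    return event["change"]["number"], m.group(1)
```

Keeping those tuples around instead of discarding them is the "don't discard it" point below: the daemon already sees every recheck as it happens.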

I also think that caching should probably actually happen in gerritlib
itself. There is a concern that too many things are hitting gerrit,
and the result is that everyone is implementing their own client side
caching to try to be nice. (like the pickles in Russell's review stats
programs). Client code seems like the wrong place to be doing it.

That's not a bad idea, however it doesn't really address the fact that
you're looking for events -- you need to run a very large bulk query to
find all of the reviews over a certain amount of time.  You could reduce
this by caching results and then only querying reviews that are newer
than the last update.  But even so, you'll always have to query for that
window.  That's not as bad as querying for the same two weeks of data
every X minutes, but since there's already a daemon watching all of the
events anyway in real time, you already have the information if you just
don't discard it.

I don't really want to trust that we won't fail, because we do. So we're going to need replay ability anyway.
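If we do go the local-cache route, I'd picture something roughly like this: remember when we last updated, query only that gap, and expire anything that fell out of the full window. All names here are hypothetical; run_query stands in for a gerritlib bulk query over the given -age interval.

```python
import time

FULL_WINDOW = 14 * 24 * 3600  # the full two-week window, in seconds

def refresh(cache, run_query, now=None):
    """Incrementally refresh a local cache of reviews.

    cache is {"reviews": {number: review}, "last_update": ts-or-None};
    run_query(age_seconds) stands in for a Gerrit bulk query like
    'status:open -age:<age>s' -- a sketch, not gerritlib's real API.
    """
    now = now if now is not None else time.time()
    last = cache.get("last_update")
    # First run fetches the full window; later runs only the gap
    # since the last update.
    window = FULL_WINDOW if last is None else int(now - last)
    for review in run_query(window):
        cache["reviews"][review["number"]] = review  # newer copy wins
    # Expire reviews that have fallen out of the two-week window.
    cutoff = now - FULL_WINDOW
    cache["reviews"] = {n: r for n, r in cache["reviews"].items()
                        if r["last_updated"] >= cutoff}
    cache["last_update"] = now
    return cache
```

The catch, as below, is that keying the merge on reviews alone isn't enough, because comments on old reviews keep arriving.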

But, part of the reason for this email was to sort these sorts of
issues out, so let me know if you think the caching issue is an
architectural blocker.

Because if we're generally agreed on the architecture forward and are
just reviewing for correctness, the code can move fast, and we can
actually have ER 1.0 by the end of the month. Architecture review in
gerrit is where we grind to a halt.

It looks like the bulk queries take about 4 full minutes of Gerrit CPU
time to fetch data from the last two weeks (and the last two weeks have
been quiet; I'd expect the next two weeks to take longer).  I don't
think it's going to kill us, but I think there are some really easy ways
to make this way more efficient, which isn't just about being nice to
Gerrit, but is also about being responsive for users.

Interesting, I thought this was more like 1 minute. 4 minutes is definitely more worrying.

My first preference is still to use the real-time data that the e-r
daemon collects already and feed it to the dashboard.

If you feel like the inter-process communication needed for that will
slow you down too much, then my second preference would be to introduce
local caching of the results so that you can query for
"-age:<query-interval>" instead of the full two weeks every time.  (And
if it's generalized enough, sure let's add that to gerritlib.)

Yeah, the biggest complexity is the result merge. I was finding that -age:4h still ended up returning nearly 20% of the entire dataset, and wasn't as much faster as you'd expect.

But the new data and the old data overlap a lot, because you can only query by time on the review, not on the comments. And those interleave in funny ways.

I think the right way to do that would be to build on top of pandas' data series merge functionality. All good things, just new building blocks we don't have yet.
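Roughly, the merge looks like this in pandas (made-up records; I'm assuming the (review, comment timestamp) pair identifies a comment, which is why the overlap can be deduplicated at all):

```python
import pandas as pd

# Hypothetical comment records from two overlapping query windows.
old = pd.DataFrame({
    "review": [1001, 1001, 1002],
    "comment_ts": ["2014-01-01T10:00", "2014-01-02T11:00",
                   "2014-01-02T12:00"],
    "message": ["recheck bug 123", "recheck bug 456",
                "recheck bug 123"],
})
new = pd.DataFrame({
    "review": [1001, 1003],  # 1001 re-appears: old review, new window
    "comment_ts": ["2014-01-02T11:00", "2014-01-05T09:00"],
    "message": ["recheck bug 456", "recheck bug 789"],
})

# Stack both windows, drop the overlap on the (review, comment_ts)
# key, keeping the fresher copy, then restore a stable ordering.
merged = (pd.concat([old, new])
            .drop_duplicates(subset=["review", "comment_ts"],
                             keep="last")
            .sort_values(["review", "comment_ts"])
            .reset_index(drop=True))
```

The annoyance is exactly the one above: because you can only bound the query by review time, the new window keeps dragging in comments you already have, so every refresh pays for this dedup.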

I really think we at least ought to do one of those.  Running the same
bulk query repeatedly is, in this case, so inefficient that I think this
little bit of optimization is not premature.

Sure. I wonder how the various other review-stats tools are handling this case; adding Russell and Ilya (Stackalytics) into the mix, because it seems like we should have a common solution for all the tools hitting gerrit on cron for largely the same info.

        -Sean

--
Sean Dague
Samsung Research America
s...@dague.net / sean.da...@samsung.com
http://dague.net

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev