Re: IMO, we should make a 0.19.4 because of HBASE-1457

Ryan Rawson Sun, 31 May 2009 01:50:26 -0700

I have substantially fixed the patch, and I indicated all the bugs I fixed.
They are all serious, hard-to-find bugs, the kind that will make a cluster
inoperable.

I think we could stand to redo the entire region assignment. I ran into a
few bugs whereby we want to assign ROOT/META to the 'best' regionserver, but
that server doesn't check in because it's trapped trying to talk to the down
ROOT/META server (!).  Instead we now assign ROOT/META to the first server to
check in, which also speeds up recovery.
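
To illustrate the "first server to check in" idea, here is a minimal sketch.
It is not taken from the patch: the class CatalogAssigner and the method
onCheckIn are hypothetical names, and handing both catalog regions to the
same server is an assumption made to keep the example small.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch, not the HBase master code.
class CatalogAssigner {
    // Catalog regions still waiting for a host, ROOT first.
    private final Deque<String> unassigned = new ArrayDeque<>();

    CatalogAssigner() {
        unassigned.add("-ROOT-");
        unassigned.add(".META.");
    }

    // Called when a regionserver checks in. Whatever catalog regions are
    // still unassigned go to this server right away, instead of waiting
    // for a "best" server that may never report.
    List<String> onCheckIn(String serverName) {
        List<String> assigned = new ArrayList<>(unassigned);
        unassigned.clear();
        return assigned;
    }
}
```

With this, the first caller of onCheckIn gets both catalog regions and later
callers get an empty list.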

This all is sort of predicated on a non-master-push strategy; I'm not sure
about the complexity of having the master push assignments.

We could also probably have a dedicated thread to process regionserver
shutdowns. I ran into a few issues (fixed with a priority queue and reduced
timeouts) where we couldn't recover META because I had kill -9ed the META
server while a todo was in progress.  The todo would just hang waiting for
timeouts and for META to come back; in the meantime the
ProcessServerShutdown that would recover META was waiting.
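
The priority-queue fix can be sketched roughly as below. This is an
illustration only, not the code in the patch: TodoQueue, Op, and the two
priority levels are hypothetical, and the real todo operations carry far
more state than a name.

```java
import java.util.PriorityQueue;

// Hypothetical sketch: operations that recover a catalog region jump
// ahead of ordinary todo items, so they cannot get stuck behind an
// operation that is itself waiting on META.
class TodoQueue {
    static final class Op implements Comparable<Op> {
        final String name;
        final int priority; // lower value runs first
        final long seq;     // insertion order, used as FIFO tie-break

        Op(String name, int priority, long seq) {
            this.name = name;
            this.priority = priority;
            this.seq = seq;
        }

        @Override
        public int compareTo(Op other) {
            int c = Integer.compare(priority, other.priority);
            return c != 0 ? c : Long.compare(seq, other.seq);
        }
    }

    private final PriorityQueue<Op> queue = new PriorityQueue<>();
    private long nextSeq = 0;

    void add(String name, boolean recoversCatalog) {
        queue.add(new Op(name, recoversCatalog ? 0 : 1, nextSeq++));
    }

    // Returns the name of the next operation to run, or null if empty.
    String next() {
        Op op = queue.poll();
        return op == null ? null : op.name;
    }
}
```

If a region open is queued first and a shutdown of the META host is queued
after it with recoversCatalog set, next() returns the shutdown first.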

There were lots of weird race conditions when the cluster churn starts going
up while ROOT/META is unassigned/down/unavailable. I think I nailed a bunch
of them.

I hope this applies cleanly to 0.19.3, though it might take a little bit
of hacking.  It's ZK independent, I think...

On Sat, May 30, 2009 at 10:26 PM, Kirill Shabunov <[email protected]> wrote:

> +1
>
> --Kirill
>
>
> stack wrote:
>
>> What do people think?  0.19.x is being run in production and HBASE-1457
>> seems to nail issues we've been having recovering when regionservers
>> hosting -ROOT- and/or .META. go down.  There's some little issues that
>> need looking into but should be a patch later this w/e .
>>
>> St.Ack
>> P.S I'll also fix the javadoc warnings Billy Pearson identified in 0.19
>> branch.
>>
>>
