[jira] Commented: (HBASE-3362) If .META. offline between OPENING and OPENED, then wrong server location in .META. is possible

HBase Review Board (JIRA) Thu, 16 Dec 2010 10:17:23 -0800

    [ https://issues.apache.org/jira/browse/HBASE-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12972172#action_12972172 ]

HBase Review Board commented on HBASE-3362:
-------------------------------------------

Message from: [email protected]

bq.  On 2010-12-16 00:14:36, Jonathan Gray wrote:
bq.  > trunk/src/main/java/org/apache/hadoop/hbase/regionserver/handler/OpenRegionHandler.java, line 173
bq.  > <http://review.cloudera.org/r/1298/diff/1/?file=18309#file18309line173>
bq.  >
bq.  >     This is a busy wait loop?
bq.  >     
bq.  >     Should we add a wait/notify on something passed to the thread, with a timeout of the period?
bq.  >     
bq.  >     And then we should probably also have some kind of max timeout.  Even if it is minutes, there could be a weird cluster state where the RS misses META availability but someone else might handle it properly, so a max timeout might be good?
bq.  
bq.  stack wrote:
bq.      I need to add a small sleep.  I'd rather do this than wait/notify.  t.isAlive should be enough.  Regarding max timeout, I should add a check for whether the server is stopped ... and for the max timeout itself, what do you think?  Ten minutes?  Then abort?
bq.  
bq.  Jonathan Gray wrote:
bq.      I was thinking 5 minutes.
bq.      
bq.      How long are you going to sleep for?  That seems like a less than ideal way to do this.  I would prefer wait/notify, with the timeout on the wait being this 1/3 period, but a small sleep could work.  If it is really small, we're in a busy loop again.  If it is too big, we increase how long we have to wait.  This is on the critical path of every single region open.
bq.      
bq.      If we go down the path of threads doing work, I don't see why we wouldn't want to use wait/notify to let the blocked thread know when it's done.

5 minutes is not enough.  IIRC, it was more than 5 minutes before the region came 
back online.  Let me see.

I want to avoid the mother thread depending on the daughter thread signaling it 
to stop... seems redundant when I'm already watching the daughter with isAlive.

The sleep would be short.  1ms or so.  Normally we'd not trip into the sleep.  
The operation will have completed before we have a chance to sleep.  It'd only 
sleep when no progress can be made.

I'll add wait/notify for you to get this patch cleared past review, np.

- stack
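
For reference, the wait/notify variant discussed above can be sketched roughly as follows. This is only an illustration, not the code in the patch under review: the class TimedWaitSketch, the Done flag, and the parameters periodMs and maxWaitMs are hypothetical names, and the tick callback merely stands in for whatever the waiting thread does once per period.

```java
class TimedWaitSketch {
    /** Hypothetical completion flag; the daughter thread signals it when done. */
    static final class Done {
        private boolean done;

        synchronized void signal() {
            done = true;
            notifyAll();
        }

        /** Blocks for at most periodMs; returns true once signal() has been called. */
        synchronized boolean await(long periodMs) throws InterruptedException {
            long deadline = System.nanoTime() + periodMs * 1_000_000L;
            while (!done) {
                long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMs <= 0) {
                    break; // period elapsed; never call wait(0), which waits forever
                }
                wait(remainingMs); // loop guards against spurious wakeups
            }
            return done;
        }
    }

    /**
     * Runs work on a daughter thread. The caller wakes as soon as the work
     * finishes, or once per period to run tick. Returns false if maxWaitMs
     * elapsed before the work completed.
     */
    static boolean runWithProgress(Runnable work, Runnable tick,
                                   long periodMs, long maxWaitMs)
            throws InterruptedException {
        if (periodMs <= 0) {
            throw new IllegalArgumentException("periodMs must be positive");
        }
        Done done = new Done();
        Thread daughter = new Thread(() -> {
            try {
                work.run();
            } finally {
                done.signal(); // signal even if work throws
            }
        });
        daughter.setDaemon(true);
        daughter.start();

        long giveUpAt = System.nanoTime() + maxWaitMs * 1_000_000L;
        while (!done.await(periodMs)) {
            if (System.nanoTime() >= giveUpAt) {
                return false; // max timeout reached; caller decides whether to abort
            }
            tick.run();
        }
        return true;
    }
}
```

With this shape there is no sleep interval to tune: the waiting thread returns immediately when the work completes, and otherwise wakes only once per period.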

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1298/#review2083
-----------------------------------------------------------

> If .META. offline between OPENING and OPENED, then wrong server location in 
> .META. is possible
> ----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3362
>                 URL: https://issues.apache.org/jira/browse/HBASE-3362
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 0.90.0
>
>
> This is a good one.  It happened to me testing OOME in split logging.
> * Balancer moves region to new location, regionserver X.
> * New location regionserver X successfully opens the region and then goes to 
> update .META.
> * At this point, the server carrying .META. crashes.
> * Regionserver X is stuck waiting on .META. to come back online.  It takes so 
> long that the master times out the region-in-transition.
> * Master assigns the region elsewhere to regionserver Y
> * It opens successfully on regionserver Y and then it also parks waiting on 
> .META. coming online
> * .META. comes online
> * The two servers X and Y race to update .META.
> I saw a case where server X's edit went in after server Y's edit, which means 
> that lookups in .META. get the wrong server.  HBCK can detect this situation.
> RegionServer X, when it wakes up, correctly notices that it has lost control of 
> the region but the damage is done -- where damage is the .META. edit.
> Chatting with Jon, he suggested that regionserver X should 'rollback' the 
> .META. edit -- do an explicit delete of what it added.  This would work I think 
> but chatting more, I'll make a fix that keeps updating the ZooKeeper OPENING 
> state while the edit goes on in a separate thread.  Our continuous setting of 
> OPENING will make it so the region-in-transition does not time out.
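
The fix described in the last paragraph (run the .META. edit in a separate thread while the opening thread keeps re-asserting OPENING) might look roughly like the sketch below. It is a guess at the shape of the approach, not the actual OpenRegionHandler change: metaEdit, refreshOpening and serverStopped are hypothetical stand-ins for the real .META. update, the ZooKeeper state refresh and the server-stopped check, and no HBase or ZooKeeper API is used.

```java
import java.util.function.BooleanSupplier;

class OpeningKeepAliveSketch {
    enum Result { EDIT_DONE, LOST_REGION, SERVER_STOPPED, TIMED_OUT }

    /**
     * Runs metaEdit on its own thread and, once per period while it is still
     * running, calls refreshOpening so the region-in-transition does not time
     * out on the master. refreshOpening returns false if this server no
     * longer owns the region.
     */
    static Result editWhileRefreshing(Runnable metaEdit,
                                      BooleanSupplier refreshOpening,
                                      BooleanSupplier serverStopped,
                                      long periodMs, long maxWaitMs)
            throws InterruptedException {
        if (periodMs <= 0) {
            throw new IllegalArgumentException("periodMs must be positive");
        }
        Thread editor = new Thread(metaEdit, "meta-edit-sketch");
        editor.setDaemon(true);
        editor.start();

        long giveUpAt = System.nanoTime() + maxWaitMs * 1_000_000L;
        while (true) {
            editor.join(periodMs); // returns early if the edit finishes
            if (!editor.isAlive()) {
                return Result.EDIT_DONE;
            }
            Result giveUp = null;
            if (serverStopped.getAsBoolean()) {
                giveUp = Result.SERVER_STOPPED;
            } else if (System.nanoTime() >= giveUpAt) {
                giveUp = Result.TIMED_OUT;
            } else if (!refreshOpening.getAsBoolean()) {
                giveUp = Result.LOST_REGION;
            }
            if (giveUp != null) {
                // Best effort only: a real implementation would have to make
                // sure an abandoned edit cannot still land in .META. later.
                editor.interrupt();
                return giveUp;
            }
        }
    }
}
```

Here the waiting thread relies on join with a timeout plus isAlive, so the edit thread does not need to signal anything explicitly. The maximum wait and the server-stopped check correspond to the points raised in the review comments, and their values are left to the caller.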

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
