[jira] Commented: (HBASE-3147) Regions stuck in transition after rolling restart, perpetual timeout handling but nothing happens

HBase Review Board (JIRA) Tue, 26 Oct 2010 00:57:50 -0700

    [ https://issues.apache.org/jira/browse/HBASE-3147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924883#action_12924883 ]


HBase Review Board commented on HBASE-3147:
-------------------------------------------

Message from: [email protected]

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/1087/
-----------------------------------------------------------

(Updated 2010-10-26 00:55:02.299835)


Review request for hbase and stack.


Changes
-------

This patch is almost there.  It's much better.  Fixed the test for the .META. 
server: it was looking in the map of servers to regions, which won't work 
since that is a map of user regions only.  Instead, get it from the 
CatalogTracker.
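
A minimal sketch of why the lookup had to change, not the patch itself.  The 
class and method names below (MetaCarrierCheck, CatalogLocations, 
getMetaLocation) are hypothetical stand-ins, not the HBase API; the only thing 
taken from the note above is that the servers-to-regions map holds user 
regions only, so .META. has to be located through a catalog tracker.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MetaCarrierCheck {
    /** Hypothetical stand-in for a catalog tracker: it only knows where .META. lives. */
    static final class CatalogLocations {
        private final String metaServer;
        CatalogLocations(String metaServer) { this.metaServer = metaServer; }
        String getMetaLocation() { return metaServer; }
    }

    /** Broken check: the map holds user regions only, so .META. is never found in it. */
    static boolean carriedMetaViaRegionMap(Map<String, List<String>> serverToUserRegions,
                                           String server) {
        List<String> regions =
            serverToUserRegions.getOrDefault(server, Collections.<String>emptyList());
        return regions.contains(".META.");
    }

    /** Working check: ask the tracker which server currently hosts .META. */
    static boolean carriedMetaViaTracker(CatalogLocations tracker, String server) {
        return server.equals(tracker.getMetaLocation());
    }

    public static void main(String[] args) {
        Map<String, List<String>> userRegions = new HashMap<>();
        userRegions.put("rs1", Arrays.asList("usertable,a", "usertable,b"));
        CatalogLocations tracker = new CatalogLocations("rs1");
        System.out.println("via region map: " + carriedMetaViaRegionMap(userRegions, "rs1"));
        System.out.println("via tracker:    " + carriedMetaViaTracker(tracker, "rs1"));
    }
}
```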

Locally TestRegionRebalancing failed.  I need to look at that.

On the cluster, we turned up an unexpected state: a server was opening a 
region while it was also going down.  Need to dig in on that too.

Want to also add tests, at least for a moved .META.


Summary
-------

Adds new handling of the timeouts for PENDING_OPEN and PENDING_CLOSE in-memory 
master RIT states.
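
As a rough illustration only, a per-state timeout decision could look like the 
sketch below.  All names here (RitTimeoutSketch, RegionInTransition, 
actionFor) and the RESEND_CLOSE action are assumptions made for the example 
and are not taken from the patch; the only grounded part is that a region 
stuck in PENDING_OPEN for too long gets reassigned.

```java
final class RitTimeoutSketch {
    enum State { PENDING_OPEN, PENDING_CLOSE, OPEN }

    /** Hypothetical follow-up actions; RESEND_CLOSE is an assumption, not from the patch. */
    enum Action { NONE, REASSIGN, RESEND_CLOSE }

    /** In-memory record of a region in transition: its state and when it entered it. */
    static final class RegionInTransition {
        final String region;
        final State state;
        final long stampMs;
        RegionInTransition(String region, State state, long stampMs) {
            this.region = region;
            this.state = state;
            this.stampMs = stampMs;
        }
    }

    /** Decide what to do for one region once timeoutMs has elapsed since its stamp. */
    static Action actionFor(RegionInTransition rit, long nowMs, long timeoutMs) {
        if (nowMs - rit.stampMs < timeoutMs) {
            return Action.NONE; // not timed out yet
        }
        switch (rit.state) {
            case PENDING_OPEN:
                return Action.REASSIGN;
            case PENDING_CLOSE:
                return Action.RESEND_CLOSE;
            default:
                return Action.NONE; // states this sketch does not handle
        }
    }

    public static void main(String[] args) {
        RegionInTransition rit =
            new RegionInTransition("usertable,user510588360", State.PENDING_OPEN, 0L);
        System.out.println(actionFor(rit, 30_000L, 30_000L));
    }
}
```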

Adds some new broken RIT states into TestMasterFailover.

Some of these broken states don't seem possible to me but as long as we aren't 
breaking the existing behaviors and tests I think it's okay if we handle odd 
cases that can be mocked.  Who knows what will happen in the real world.

The reason TestMasterFailover didn't/doesn't really test for the issue in 
HBASE-3147 is that this new broken condition happens when an RS dies or goes 
offline, rather than during a master failover concurrent w/ RS failure.


v4 of the patch adds to Jon's fixes.  It adds a shutdown server handler for 
root and another for meta so the processing of servers hosting meta/root does 
not get frozen out.  I've seen this in my testing.
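
One way such a freeze can arise, sketched under the assumption that shutdown 
handlers share a bounded pool: an ordinary handler blocks waiting for .META., 
while the handler that would reassign .META. sits queued behind it.  The class 
and method names below (DedicatedMetaHandlerPool, ordinaryHandlerCompletes) 
are hypothetical, and the pool layout is illustrative rather than what the 
patch does.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class DedicatedMetaHandlerPool {
    /**
     * Returns true if the ordinary shutdown handler completes within waitMs.
     * With a shared single-thread pool the meta handler is stuck in the queue.
     */
    static boolean ordinaryHandlerCompletes(boolean dedicatedMetaPool, long waitMs) {
        ExecutorService general = Executors.newSingleThreadExecutor();
        ExecutorService meta =
            dedicatedMetaPool ? Executors.newSingleThreadExecutor() : general;
        final CountDownLatch metaAssigned = new CountDownLatch(1);
        final CountDownLatch ordinaryDone = new CountDownLatch(1);
        try {
            // Ordinary handler: cannot proceed until .META. has been assigned.
            general.execute(() -> {
                try {
                    metaAssigned.await();
                    ordinaryDone.countDown();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Handler for the server that carried .META.: reassigns it.
            meta.execute(metaAssigned::countDown);
            return ordinaryDone.await(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            general.shutdownNow();
            if (meta != general) {
                meta.shutdownNow();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("shared pool:    " + ordinaryHandlerCompletes(false, 200L));
        System.out.println("dedicated pool: " + ordinaryHandlerCompletes(true, 2000L));
    }
}
```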


This addresses bug HBASE-3147.
    http://issues.apache.org/jira/browse/HBASE-3147


Diffs (updated)
-----

  trunk/src/main/java/org/apache/hadoop/hbase/catalog/CatalogTracker.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/catalog/MetaReader.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/executor/EventHandler.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/executor/ExecutorService.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/master/HMaster.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/MetaServerShutdownHandler.java PRE-CREATION 
  trunk/src/main/java/org/apache/hadoop/hbase/master/handler/ServerShutdownHandler.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/MetaNodeTracker.java 1027351 
  trunk/src/main/java/org/apache/hadoop/hbase/zookeeper/ZKAssign.java 1027351 
  trunk/src/test/java/org/apache/hadoop/hbase/catalog/TestCatalogTracker.java 1027351 
  trunk/src/test/java/org/apache/hadoop/hbase/master/TestMasterFailover.java 1027351 

Diff: http://review.cloudera.org/r/1087/diff


Testing
-------

TestMasterFailover passes.


Thanks,

Jonathan




> Regions stuck in transition after rolling restart, perpetual timeout handling 
> but nothing happens
> -------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3147
>                 URL: https://issues.apache.org/jira/browse/HBASE-3147
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>             Fix For: 0.90.0
>
>         Attachments: HBASE-3147-v6.patch
>
>
> The rolling restart script is great for bringing on the weird stuff.  On my 
> little loaded cluster if I run it, it horks the cluster and it doesn't 
> recover.  I notice two issues that need fixing:
> 1. We'll miss noticing that a server was carrying .META. and it never gets 
> assigned -- the shutdown handlers get stuck in perpetual wait on a .META. 
> assign that will never happen.
> 2. Perpetual cycling of this sequence per region not successfully assigned:
> {code}
> 2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b. state=PENDING_OPEN, ts=1287869814294
> 2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN or OPENING for too long, reassigning region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
> 2010-10-23 21:37:57,404 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempting to transition node 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE
> 2010-10-23 21:37:57,404 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x2bd57d1475046a Attempt to transition the unassigned node for 7f2d92497d2d03917afd574ea2aca55b from RS_ZK_REGION_OPENING to M_ZK_REGION_OFFLINE failed, the node existed but was in the state M_ZK_REGION_OFFLINE
> 2010-10-23 21:37:57,404 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region transitioned OPENING to OFFLINE so skipping timeout, region=usertable,user510588360,1287547556587.7f2d92497d2d03917afd574ea2aca55b.
> ,,,
> {code}
> The timeout period elapses again and then the same sequence repeats.
> This is what I've been working on.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
