[ 
https://issues.apache.org/jira/browse/HBASE-3368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12989993#comment-12989993
 ] 

stack commented on HBASE-3368:
------------------------------

There is a problem with this 'fix'.  It leaves a region in RIT and it's not 
cleared because this happens:

{code}
2011-02-03 06:42:51,614 WARN org.apache.hadoop.hbase.master.AssignmentManager: 
Received OPENED for region 811f9efb3df65b2173d7ce0c80ac2a99 from server 
sv2borg184,60020,1296715278941 but region was in  the state null and not in 
expected PENDING_OPEN or OPENING states
{code}

The above happens because, on receipt of the split message, we offline the 
parent, which involves:

{code}
  public void regionOffline(final HRegionInfo regionInfo) {
    synchronized(this.regionsInTransition) {
      if (this.regionsInTransition.remove(regionInfo.getEncodedName()) != null) {
        this.regionsInTransition.notifyAll();
      }
    }
    // remove the region plan as well just in case.
    clearRegionPlan(regionInfo);
    setOffline(regionInfo);
  }
{code}

.. i.e. we remove the region from RIT on receipt of the split message even 
though it's in OPENING or OPENED state.
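
To make the ordering concrete, here is a toy sketch of the sequence. It is not the real AssignmentManager: the class RitRaceSketch, the State enum, and the startOpening and handleOpened methods are made up for illustration, and only regionOffline mirrors the method quoted above. It assumes the split message is handled before the OPENED event is processed.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the master's regions-in-transition (RIT) bookkeeping; not HBase code. */
class RitRaceSketch {
  enum State { PENDING_OPEN, OPENING, OPENED }

  // Hypothetical stand-in for AssignmentManager.regionsInTransition,
  // keyed by encoded region name.
  private final Map<String, State> regionsInTransition = new HashMap<>();

  /** The master has asked a region server to open the region. */
  synchronized void startOpening(String encodedName) {
    regionsInTransition.put(encodedName, State.OPENING);
  }

  /** Like regionOffline above: drops the RIT entry whatever state it is in. */
  synchronized void regionOffline(String encodedName) {
    regionsInTransition.remove(encodedName);
  }

  /** Handles OPENED; returns false if the region is not in an expected state. */
  synchronized boolean handleOpened(String encodedName) {
    State state = regionsInTransition.get(encodedName);
    if (state != State.PENDING_OPEN && state != State.OPENING) {
      System.out.println("Received OPENED for region " + encodedName
          + " but region was in the state " + state);
      return false;
    }
    regionsInTransition.put(encodedName, State.OPENED);
    return true;
  }

  public static void main(String[] args) {
    RitRaceSketch master = new RitRaceSketch();
    String region = "811f9efb3df65b2173d7ce0c80ac2a99";
    master.startOpening(region);
    // The split message is handled first and offlines the parent...
    master.regionOffline(region);
    // ...so the later OPENED event finds no RIT entry (state null).
    master.handleOpened(region);
  }
}
```

In this model, the OPENED event that arrives after the offline finds no entry, which corresponds to the "state null" in the warning above.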


> Split message can come in before region opened message; results in 'Region 
> has been PENDING_CLOSE for too long' cycle
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3368
>                 URL: https://issues.apache.org/jira/browse/HBASE-3368
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>            Assignee: stack
>            Priority: Blocker
>             Fix For: 0.90.0
>
>
> Another good one.  Look at these excerpts from master log:
> {code}
> 2010-12-16 00:49:45,749 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Received REGION_SPLIT: 
> TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b.: 
> Daughters; 
> TestTable,0078922610,1292460584999.c8b95dfc9a671083bafdaa0341279777., 
> TestTable,0078933586,  
> 1292460584999.7cc636c9a7274eec4e784df2efebbca3. from 
> XXX185,60020,1292460570976
> ....
> 2010-12-16 00:49:46,132 DEBUG 
> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region 
> TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b. on 
> XXX185,60020,1292460570976
> {code}
> ... so the split will have cleared the parent from in-memory data structures 
> and then the open handler will add them back (though region is offlined, 
> split).
> Then the balancer runs... only no one is holding the region that's being 
> balanced.
> Over on XXX185 I see the open and then split at these times:
> {code}
> 2010-12-16 00:49:43,740 DEBUG 
> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened 
> TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b.
> .....
> 2010-12-16 00:49:45,003 INFO 
> org.apache.hadoop.hbase.regionserver.SplitTransaction: Starting split of 
> region TestTable,0078922610,1292373363753.490b382bae33642d12cd717b5785698b.
> {code}
> So, the fact that it takes the Master a while to get around to the zk watcher 
> processing messes us up.
> Root problem is that we're using two different message buses, zk and then 
> heartbeat.  Intent is to do all over zk and remove heartbeat but looking at 
> what to do for 0.90.0.

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
