[ 
https://issues.apache.org/jira/browse/HBASE-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13554171#comment-13554171
 ] 

Ted Yu edited comment on HBASE-7551 at 1/15/13 7:47 PM:
--------------------------------------------------------

In trunk, without the patch, TestSplitTransactionOnCluster hung once in the 
call to admin.disableTable(tableName) at line 771.

Toward the end of the test output, I saw:
{code}
2013-01-15 10:51:12,747 WARN  [org.apache.hadoop.hdfs.server.datanode.DataBlockScanner@4efde2c1] util.NativeCodeLoader(52): Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2013-01-15 10:51:32,346 FATAL [10.10.8.197,64388,1358275830786-BalancerChore] hbase.Chore(79): 10.10.8.197,64388,1358275830786-BalancerChoreerror
java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer$1
  at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.getRegionLoadCost(StochasticLoadBalancer.java:673)
  at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.computeRegionLoadCost(StochasticLoadBalancer.java:647)
  at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.computeCost(StochasticLoadBalancer.java:427)
  at org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer.balanceCluster(StochasticLoadBalancer.java:192)
  at org.apache.hadoop.hbase.master.HMaster.balance(HMaster.java:1309)
  at org.apache.hadoop.hbase.master.balancer.BalancerChore.chore(BalancerChore.java:48)
{code}
                
                  
> nodeChildrenChange event may happen after the transition to 
> RS_ZK_REGION_SPLITTING in SplitTransaction causing the SPLIT event to be 
> missed in the master side.
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-7551
>                 URL: https://issues.apache.org/jira/browse/HBASE-7551
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.94.4
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: ramkrishna.s.vasudevan
>            Priority: Critical
>             Fix For: 0.96.0, 0.94.5
>
>         Attachments: 7551-0.94-test.txt, 7551-0.94-v1.txt, 7551-trunk.txt, 
> 7551-trunk-v2.txt, testSplitTransactionOnCluster-output.txt
>
>
> This came from HBASE-7468.
> I found the issue and am able to reproduce it.
> See the logs:
> {code}
> 2013-01-14 14:37:21,760 INFO  [main] regionserver.SplitTransaction(216): Starting split of region testShouldClearRITWhenNodeFoundInSplittingState,,1358154439514.a9e57d09c58b3ef3b949d602232fb2c2.
> 2013-01-14 14:37:21,760 DEBUG [main] regionserver.SplitTransaction(871): regionserver:61665-0x13c384e4e4f0002 Creating ephemeral node for a9e57d09c58b3ef3b949d602232fb2c2 in SPLITTING state
> 2013-01-14 14:37:21,844 DEBUG [main] zookeeper.ZKAssign(757): regionserver:61665-0x13c384e4e4f0002 Attempting to transition node a9e57d09c58b3ef3b949d602232fb2c2 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING
> 2013-01-14 14:37:21,849 DEBUG [Thread-873-EventThread] zookeeper.ZooKeeperWatcher(277): master:62334-0x13c384e4e4f001b Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/hbase/unassigned
> 2013-01-14 14:37:21,853 DEBUG [main] zookeeper.ZKUtil(1565): regionserver:61665-0x13c384e4e4f0002 Retrieved 140 byte(s) of data from znode /hbase/unassigned/a9e57d09c58b3ef3b949d602232fb2c2; data=region=testShouldClearRITWhenNodeFoundInSplittingState,,1358154439514.a9e57d09c58b3ef3b949d602232fb2c2., origin=Ram.Home,61665,1358154325430, state=RS_ZK_REGION_SPLITTING
> 2013-01-14 14:37:21,918 DEBUG [main] zookeeper.ZKAssign(820): regionserver:61665-0x13c384e4e4f0002 Successfully transitioned node a9e57d09c58b3ef3b949d602232fb2c2 from RS_ZK_REGION_SPLITTING to RS_ZK_REGION_SPLITTING
> 2013-01-14 14:37:21,919 DEBUG [Thread-873-EventThread] zookeeper.ZKUtil(417): master:62334-0x13c384e4e4f001b Set watcher on existing znode /hbase/unassigned/a9e57d09c58b3ef3b949d602232fb2c2
> {code}
> Here we can observe that the SPLITTING node was created first. We then 
> transition it from SPLITTING to SPLITTING so that the AM receives a 
> nodeDataChange event. But for the nodeDataChange event to be delivered, the 
> nodeChildrenChange event must happen first so that the master can set a 
> watcher on the node.
> When this hang happens, we can see that the watcher is set by the 
> nodeChildrenChange event only after the transition has already happened, so 
> the SPLITTING to SPLITTING event itself is missed.
> Ideally, the nodeChildrenChange handler iterates through the list of new 
> znodes under /hbase/unassigned and sets a watcher on each of them. One 
> possible reason for the delay is that there is more than one znode, so 
> setting the watches takes time. The order of execution also differs between 
> running the test from Eclipse and running it with mvn.
> My conclusion is that the test case reveals a real problem: the same thing 
> can happen in any scenario where the SPLITTING event is missed. Some of the 
> SPLIT-related bugs that were raised earlier may be due to this; that needs 
> analysis.
> Any suggestions are welcome. We should ensure that the transition from 
> SPLITTING to SPLITTING happens only after the master has set the watch on 
> the znode, and we should be sure of that.
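
The ordering problem described above can be illustrated with a small toy 
model. This is only a sketch and not HBase or ZooKeeper code: ToyZnode, 
watchData, setData and eventsSeenByMaster are hypothetical names invented for 
the illustration, and the model assumes only that a data watch is one-shot 
and fires for updates made after it was registered. It shows that a watch 
registered after the SPLITTING to SPLITTING update never sees that update.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the race; all names here are hypothetical, not HBase APIs. */
class WatchRaceSketch {

    /** Stand-in for a znode that supports one-shot data watches only. */
    static final class ToyZnode {
        private String data;
        private final List<Runnable> dataWatchers = new ArrayList<>();

        ToyZnode(String data) { this.data = data; }

        String getData() { return data; }

        /** Registers a one-shot watch; it fires only on a later setData call. */
        void watchData(Runnable watcher) { dataWatchers.add(watcher); }

        /** Updates the data, then fires and clears the watches registered so far. */
        void setData(String newData) {
            data = newData;
            List<Runnable> toFire = new ArrayList<>(dataWatchers);
            dataWatchers.clear();
            for (Runnable w : toFire) {
                w.run();
            }
        }
    }

    /** Returns how many data-change events the simulated master observed. */
    static int eventsSeenByMaster(boolean masterWatchesBeforeTransition) {
        int[] seen = {0};
        // The region server creates the node in SPLITTING state.
        ToyZnode node = new ToyZnode("RS_ZK_REGION_SPLITTING");
        if (masterWatchesBeforeTransition) {
            node.watchData(() -> seen[0]++); // watch is in place in time
        }
        // The SPLITTING to SPLITTING transition meant to notify the master.
        node.setData("RS_ZK_REGION_SPLITTING");
        if (!masterWatchesBeforeTransition) {
            node.watchData(() -> seen[0]++); // too late: the update already happened
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("watch set before transition: " + eventsSeenByMaster(true));
        System.out.println("watch set after transition:  " + eventsSeenByMaster(false));
    }
}
```

In this model the master sees one event when its watch is registered before 
the transition and none when it is registered afterwards, which matches the 
interleaving in the logs above.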

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
