[ 
https://issues.apache.org/jira/browse/HBASE-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275585#comment-14275585
 ] 

Rajeshbabu Chintaguntla commented on HBASE-12847:
-------------------------------------------------

The test case hangs when meta and the table region co reside in same region 
server. In the test case I am trying to process server shutdown of non dead 
server so while offlining the server meta also getting offlined which is not 
correct because while processing the SSH meta should move to other server. I 
will change the test case properly and upload patch. 

> TestZKLessSplitOnCluster frequently times out in 0.98 builds
> ------------------------------------------------------------
>
>                 Key: HBASE-12847
>                 URL: https://issues.apache.org/jira/browse/HBASE-12847
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>             Fix For: 0.98.10
>
>         Attachments: test.log.bad.gz, test.log.good.gz
>
>
> Gets hung up in testSSHCleanupDaugtherRegionsOfAbortedSplit waiting on 
> deleteTable
> {noformat}
> "Thread-334" prio=10 tid=0x00007f15382da800 nid=0x40ae in Object.wait() 
> [0x00007
> f1315f5d000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000007e1b525b8> (a 
> org.apache.hadoop.hbase.ipc.RpcClie
> nt$Call)
>         at org.apache.hadoop.hbase.ipc.RpcClient.call(RpcClient.java:1452)
>         - locked <0x00000007e1b525b8> (a 
> org.apache.hadoop.hbase.ipc.RpcClient$C
> all)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClient.callBlockingMethod(RpcClient.ja
> va:1661)
>         at 
> org.apache.hadoop.hbase.ipc.RpcClient$BlockingRpcChannelImplementatio
> n.callBlockingMethod(RpcClient.java:1719)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService
> $BlockingStub.disableTable(MasterProtos.java:43749)
>         at 
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplemen
> tation$5.disableTable(HConnectionManager.java:1995)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin$5.call(HBaseAdmin.java:947)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin$5.call(HBaseAdmin.java:942)
>         at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR
> etryingCaller.java:117)
>         - locked <0x00000007dfffe938> (a 
> org.apache.hadoop.hbase.client.RpcRetry
> ingCaller)
>         at 
> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcR
> etryingCaller.java:93)
>         - locked <0x00000007dfffe938> (a 
> org.apache.hadoop.hbase.client.RpcRetry
> ingCaller)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.
> java:3398)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin.disableTableAsync(HBaseAdmi
> n.java:942)
>         at 
> org.apache.hadoop.hbase.client.HBaseAdmin.disableTable(HBaseAdmin.jav
> a:974)
>         at 
> org.apache.hadoop.hbase.HBaseTestingUtility.deleteTable(HBaseTestingUtility.java:1532)
>         at 
> org.apache.hadoop.hbase.regionserver.TestSplitTransactionOnCluster.testSSHCleanupDaugtherRegionsOfAbortedSplit(TestSplitTransactionOnCluster.java:1172)
> {noformat}
> See attached test.log.good.gz and test.log.bad.gz.
> In test.log.bad at 2015-01-13 08:02:45,947 we acquire a lock on 
> testSSHCleanupDaugtherRegionsOfAbortedSplit to start a disable but do nothing 
> afterward. Nothing happens for one minute. Then looks like the client makes 
> another request but it can't get the table lock. There's no progress until 
> timeout.
> {noformat}
> 2015-01-13 08:02:45,947 INFO  [Thread-334] client.HBaseAdmin$5(945): Started 
> disable of testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:02:45,948 INFO  [FifoRpcScheduler.handler1-thread-4] 
> master.HMaster(2213): Client=apurtell//10.40.8.95 disable 
> testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:02:45,950 DEBUG [FifoRpcScheduler.handler1-thread-4] 
> lock.ZKInterProcessLockBase(226): Acquired a lock for 
> /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:367740000000001
> 2015-01-13 08:03:44,585 DEBUG 
> [ip-10-40-8-95.us-west-2.compute.internal,36774,1421136162827-BalancerChore] 
> master.HMaster(1553): Not running balancer because 3 region(s) in transition: 
> {e4f79ac2b4711f7d906a291d94302d6e={e4f79ac2b4711f7d906a291d94302d6e 
> state=SPLITTING, ts=1421136165837, 
> server=ip-10-40-8-95.us-west-2.compute.internal,47769,1421136163047}, 
> ebbcf66ec09960d42a2a49252599bb6d={ebbcf66ec09960d42a2a49252599bb6d 
> state=SPLITTI...
> 2015-01-13 08:03:46,210 INFO  [Thread-334] client.HBaseAdmin$5(945): Started 
> disable of testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:03:46,211 INFO  [FifoRpcScheduler.handler1-thread-3] 
> master.HMaster(2213): Client=apurtell//10.40.8.95 disable 
> testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:03:46,214 DEBUG [FifoRpcScheduler.handler1-thread-3] 
> master.TableLockManager$ZKTableLockManager$1(242): Table is locked by 
> [tableName=
> default+testSSHCleanupDaugtherRegionsOfAbortedSplit, 
> lockOwner=ip-10-40-8-95.us-west-2.compute.internal,36774,1421136162827, 
> threadId=486, purpose=C_M_DISABLE_TABLE, isShared=false, 
> createTime=1421136165948]
> {noformat}
> If we look at test.log.good where the same sequence of events begin, the 
> table is quickly disabled and deleted:
> {noformat}
> 2015-01-13 08:01:47,923 INFO  [Thread-333] client.HBaseAdmin$5(945): Started 
> disable of testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:01:47,924 INFO  [FifoRpcScheduler.handler1-thread-5] 
> master.HMaster(2213): Client=apurtell//10.40.8.95 disable 
> testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:01:47,926 DEBUG [FifoRpcScheduler.handler1-thread-5] 
> lock.ZKInterProcessLockBase(226): Acquired a lock for 
> /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000001
> 2015-01-13 08:01:47,932 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.DisableTableHandler(130): Attempting to disable table 
> testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:01:47,933 INFO  [Thread-333] 
> zookeeper.RecoverableZooKeeper(121): Process 
> identifier=catalogtracker-on-hconnection-0x60cf9dfa connecting to ZooKeeper 
> ensemble=localhost:59024
> 2015-01-13 08:01:47,933 DEBUG [Thread-333] catalog.CatalogTracker(198): 
> Starting catalog tracker 
> org.apache.hadoop.hbase.catalog.CatalogTracker@610b218c
> 2015-01-13 08:01:47,935 DEBUG [Thread-333-EventThread] 
> zookeeper.ZooKeeperWatcher(313): catalogtracker-on-hconnection-0x60cf9dfa0x0, 
> quorum=localhost:59024, baseZNode=/hbase Received ZooKeeper Event, type=None, 
> state=SyncConnected, path=null
> 2015-01-13 08:01:47,935 DEBUG [Thread-333] zookeeper.ZKUtil(430): 
> catalogtracker-on-hconnection-0x60cf9dfa0x0, quorum=localhost:59024, 
> baseZNode=/hbase Set watcher on existing znode=/hbase/meta-region-server
> 2015-01-13 08:01:47,936 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.DisableTableHandler(190): Disabled table, 
> testSSHCleanupDaugtherRegionsOfAbortedSplit, is done=true
> 2015-01-13 08:01:47,936 DEBUG [Thread-333-EventThread] 
> zookeeper.ZooKeeperWatcher(390): 
> catalogtracker-on-hconnection-0x60cf9dfa-0x14ae2500c300017 connected
> 2015-01-13 08:01:47,938 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> lock.ZKInterProcessLockBase(328): Released 
> /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000001
> 2015-01-13 08:01:47,941 DEBUG [Thread-333] catalog.CatalogTracker(222): 
> Stopping catalog tracker 
> org.apache.hadoop.hbase.catalog.CatalogTracker@610b218c
> 2015-01-13 08:01:47,945 INFO  [Thread-333] client.HBaseAdmin(999): Disabled 
> testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:01:47,946 INFO  [FifoRpcScheduler.handler1-thread-2] 
> master.HMaster(2047): Client=apurtell//10.40.8.95 delete 
> testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:01:47,948 DEBUG [FifoRpcScheduler.handler1-thread-2] 
> lock.ZKInterProcessLockBase(226): Acquired a lock for 
> /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000002
> 2015-01-13 08:01:47,953 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.TableEventHandler(123): Handling table operation C_M_DELETE_TABLE on 
> table testSSHCleanupDaugtherRegionsOfAbortedSplit
> 2015-01-13 08:01:47,958 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.DeleteTableHandler(93): Deleting regions from META
> 2015-01-13 08:01:47,962 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> catalog.MetaEditor(496): Deleted [{ENCODED => 
> bc81e04ede70636e30fd1224df8bc70b, NAME => 
> 'testSSHCleanupDaugtherRegionsOfAbortedSplit,,1421136107549.bc81e04ede70636e30fd1224df8bc70b.',
>  STARTKEY => '', ENDKEY => ''}]
> 2015-01-13 08:01:47,964 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.DeleteTableHandler(104): Archiving region 
> testSSHCleanupDaugtherRegionsOfAbortedSplit,,1421136107549.bc81e04ede70636e30fd1224df8bc70b.
>  from FS
> 2015-01-13 08:01:47,965 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> backup.HFileArchiver(93): ARCHIVING 
> hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b
> 2015-01-13 08:01:47,969 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> backup.HFileArchiver(134): Archiving [class 
> org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, 
> file:hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f]
> 2015-01-13 08:01:47,978 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> backup.HFileArchiver(438): Finished archiving from class 
> org.apache.hadoop.hbase.backup.HFileArchiver$FileablePath, 
> file:hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f/9b6aaa5d9dd44f3ab113a32c029d50ae,
>  to 
> hdfs://localhost:43493/user/apurtell/hbase/archive/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b/f/9b6aaa5d9dd44f3ab113a32c029d50ae
> 2015-01-13 08:01:47,979 INFO  [IPC Server handler 6 on 43493] 
> blockmanagement.BlockManager(1009): BLOCK* addToInvalidates: 
> blk_1073741839_1015 127.0.0.1:49933 127.0.0.1:55797 127.0.0.1:57090 
> 2015-01-13 08:01:47,979 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> backup.HFileArchiver(453): Deleted all region files in: 
> hdfs://localhost:43493/user/apurtell/hbase/.tmp/data/default/testSSHCleanupDaugtherRegionsOfAbortedSplit/bc81e04ede70636e30fd1224df8bc70b
> 2015-01-13 08:01:47,980 INFO  [IPC Server handler 2 on 43493] 
> blockmanagement.BlockManager(1009): BLOCK* addToInvalidates: 
> blk_1073741838_1014 127.0.0.1:57090 127.0.0.1:49933 127.0.0.1:55797 
> 2015-01-13 08:01:47,980 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.DeleteTableHandler(114): Table 
> 'testSSHCleanupDaugtherRegionsOfAbortedSplit' archived!
> 2015-01-13 08:01:47,980 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.DeleteTableHandler(117): Removing 
> 'testSSHCleanupDaugtherRegionsOfAbortedSplit' descriptor.
> 2015-01-13 08:01:47,981 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.DeleteTableHandler(121): Removing 
> 'testSSHCleanupDaugtherRegionsOfAbortedSplit' from region states.
> 2015-01-13 08:01:47,981 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.DeleteTableHandler(125): Marking 
> 'testSSHCleanupDaugtherRegionsOfAbortedSplit' as deleted.
> 2015-01-13 08:01:47,984 DEBUG [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> lock.ZKInterProcessLockBase(328): Released 
> /hbase/table-lock/testSSHCleanupDaugtherRegionsOfAbortedSplit/write-master:383350000000002
> {noformat}
> The first interesting line missing in test.log.bad is:
> {noformat}
> 2015-01-13 08:01:47,932 INFO  [MASTER_TABLE_OPERATIONS-ip-10-40-8-95:38335-0] 
> handler.DisableTableHandler(130): Attempting to disable table 
> testSSHCleanupDaugtherRegionsOfAbortedSplit
> {noformat}
> Nowhere in test.log.bad does "Attempting to disable table 
> testSSHCleanupDaugtherRegionsOfAbortedSplit" appear.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to