[
https://issues.apache.org/jira/browse/HBASE-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697403#comment-13697403
]
Enis Soztutar commented on HBASE-8646:
--------------------------------------
This is still related to my favorite issue, recoverLease. What seems to be
happening in the recent failure is that we call recoverLease:
{code}
2013-06-25 03:23:44,733 INFO
[SplitLogWorker-quirinus.apache.org,52386,1372130618556] util.FSHDFSUtils(86):
Recovering lease on dfs file
hdfs://localhost:33258/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
2013-06-25 03:23:44,734 INFO [IPC Server handler 0 on 33258]
namenode.FSNamesystem(1471): recoverLease: recover lease [Lease. Holder:
DFSClient_hb_rs_quirinus.apache.org,39607,1372130599654_914660597_108,
pendingcreates: 1],
src=/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
from client
DFSClient_hb_rs_quirinus.apache.org,39607,1372130599654_914660597_108
2013-06-25 03:23:44,734 INFO [IPC Server handler 0 on 33258]
namenode.FSNamesystem(2294): Recovering lease=[Lease. Holder:
DFSClient_hb_rs_quirinus.apache.org,39607,1372130599654_914660597_108,
pendingcreates: 1],
src=/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
2013-06-25 03:23:44,734 INFO
[SplitLogWorker-quirinus.apache.org,52386,1372130618556] util.FSHDFSUtils(156):
recoverLease=false, attempt=0 on
file=hdfs://localhost:33258/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
after 1ms
{code}
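For context, the split worker side of this is essentially a retry loop around
DistributedFileSystem.recoverLease(). A minimal sketch of that pattern (this is
not the actual FSHDFSUtils code, and the pause schedule is an illustrative
assumption):
{code}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoverySketch {
  /**
   * Keep asking the NN to recover the lease on a WAL until it reports the file
   * closed, or until we give up. recoverLease() returning false only means the
   * recovery has been scheduled, not that it has completed.
   */
  static boolean recoverWalLease(DistributedFileSystem dfs, Path wal,
      long firstPauseMs, long overallTimeoutMs)
      throws IOException, InterruptedException {
    long start = System.currentTimeMillis();
    for (int attempt = 0; ; attempt++) {
      boolean recovered = dfs.recoverLease(wal);
      long elapsed = System.currentTimeMillis() - start;
      // Corresponds to the "recoverLease=..., attempt=N ... after Xms" lines above.
      System.out.println("recoverLease=" + recovered + ", attempt=" + attempt
          + " on file=" + wal + " after " + elapsed + "ms");
      if (recovered || elapsed > overallTimeoutMs) {
        return recovered;
      }
      // Block recovery runs asynchronously on the NN/DN side, so pause before retrying.
      Thread.sleep(firstPauseMs * (attempt + 1));
    }
  }
}
{code}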
The NN then schedules block recovery and performs the lease recovery:
{code}
2013-06-25 03:23:47,735 INFO [IPC Server handler 4 on 33258]
namenode.FSNamesystem(1471): recoverLease: recover lease [Lease. Holder:
NN_Recovery, pendingcreates: 1],
src=/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
from client NN_Recovery
2013-06-25 03:23:47,736 WARN [DataStreamer for file
/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
block blk_3628383941313539055_1009] hdfs.DFSClient$DFSOutputStream(3211):
Error Recovery for block blk_3628383941313539055_1009 bad datanode[0]
127.0.0.1:55735
2013-06-25 03:23:47,736 DEBUG
[org.apache.hadoop.hdfs.server.datanode.DataNode$2@70533bef]
datanode.DataNode(2046): block=blk_3628383941313539055_1009, (length=381661),
syncList=[BlockRecord(info=BlockRecoveryInfo(block=blk_3628383941313539055_1009
wasRecoveredOnStartup=false) node=127.0.0.1:55735)], closeFile=true
2013-06-25 03:23:47,736 INFO [IPC Server handler 4 on 33258]
namenode.FSNamesystem(2294): Recovering lease=[Lease. Holder: NN_Recovery,
pendingcreates: 1],
src=/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
2013-06-25 03:23:47,737 INFO
[SplitLogWorker-quirinus.apache.org,52386,1372130618556] util.FSHDFSUtils(156):
recoverLease=false, attempt=1 on
file=hdfs://localhost:33258/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
after 3004ms
{code}
Note that the second call to recoverLease() also returns false, about 3 seconds
in. This may be because the DN heartbeat interval defaults to 3 sec, so the DN
only receives the blockRecovery request at the last second and has not finished
recovering the block yet. The third lease recovery attempt waits one more
minute, so the test times out before it gets a chance to run.
I'll put up a patch which increases the test timeout, and also raises the
initial sleep time for lease recovery to 4 sec by default.
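To make the timing concrete, here is a back-of-the-envelope sketch. The 4 sec
figure matches the proposal above, but the constant names and the exact
schedule are illustrative assumptions, not the patch itself:
{code}
// Rough timeline reconstructed from the log excerpts above.
public class LeaseRecoveryTiming {
  static final long DN_HEARTBEAT_MS = 3000;  // default DN heartbeat interval
  static final long FIRST_PAUSE_MS  = 4000;  // proposed initial sleep, > one heartbeat
  static final long TEST_TIMEOUT_MS = 60000; // how long the test is willing to wait

  public static void main(String[] args) {
    // attempt 0 at t=0ms:    NN only *schedules* block recovery -> recoverLease=false
    // attempt 1 at t~3004ms: DN has just picked up the recovery command on its 3 sec
    //                        heartbeat and has not committed the block yet -> false
    // attempt 2 waits ~1 more minute, landing past the test's 60 sec limit.
    long attempt2AtMs = 3004 + 60000;
    System.out.println("attempt 2 at ~" + attempt2AtMs + "ms > test timeout of "
        + TEST_TIMEOUT_MS + "ms, so the test flakes out first");
    System.out.println("an initial pause of " + FIRST_PAUSE_MS + "ms (> "
        + DN_HEARTBEAT_MS + "ms heartbeat) gives the DN a full heartbeat to act on"
        + " the block recovery command before the first retry");
  }
}
{code}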
> Intermittent TestIOFencing#testFencingAroundCompaction failure due to region
> getting stuck in compaction
> --------------------------------------------------------------------------------------------------------
>
> Key: HBASE-8646
> URL: https://issues.apache.org/jira/browse/HBASE-8646
> Project: HBase
> Issue Type: Test
> Reporter: Ted Yu
> Assignee: Enis Soztutar
>
> From
> http://54.241.6.143/job/HBase-TRUNK/org.apache.hbase$hbase-server/348/testReport/junit/org.apache.hadoop.hbase/TestIOFencing/testFencingAroundCompaction/
> (the underlying region is
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.):
> {code}
> 2013-05-29 19:25:20,363 DEBUG [pool-1-thread-1] catalog.CatalogTracker(208):
> Stopping catalog tracker
> org.apache.hadoop.hbase.catalog.CatalogTracker@6280d069
> 2013-05-29 19:25:20,366 INFO [pool-1-thread-1] hbase.TestIOFencing(255):
> Waiting for compaction to be about to start
> 2013-05-29 19:25:20,367 DEBUG [pool-1-thread-1]
> hbase.TestIOFencing$CompactionBlockerRegion(107): waiting for compaction to
> block
> 2013-05-29 19:25:20,367 DEBUG [pool-1-thread-1]
> hbase.TestIOFencing$CompactionBlockerRegion(109): compaction block reached
> 2013-05-29 19:25:20,367 INFO [pool-1-thread-1] hbase.TestIOFencing(257):
> Starting a new server
> 2013-05-29 19:25:20,424 DEBUG [pool-1-thread-1]
> client.HConnectionManager(2811):
> regionserver/ip-10-197-74-184.us-west-1.compute.internal/10.197.74.184:0
> HConnection server-to-server retries=100
> ...
> 2013-05-29 19:25:20,861 INFO [pool-1-thread-1] hbase.TestIOFencing(260):
> Killing region server ZK lease
> ...
> 2013-05-29 19:25:21,030 DEBUG
> [RS_CLOSE_REGION-ip-10-197-74-184.us-west-1.compute.internal,37836,1369855503920-0]
> handler.CloseRegionHandler(125): Processing close of
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.
> 2013-05-29 19:25:21,031 DEBUG
> [RS_CLOSE_REGION-ip-10-197-74-184.us-west-1.compute.internal,37836,1369855503920-0]
> regionserver.HRegion(928): Closing
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.: disabling
> compactions & flushes
> 2013-05-29 19:25:21,031 DEBUG
> [RS_CLOSE_REGION-ip-10-197-74-184.us-west-1.compute.internal,37836,1369855503920-0]
> regionserver.HRegion(1022): waiting for 1 compactions to complete for region
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.
> ...
> 2013-05-29 19:25:27,037 INFO [pool-1-thread-1] hbase.TestIOFencing(265):
> Waiting for the new server to pick up the region
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.
> {code}
> The test started a new region server. However, the region got stuck in:
> {code}
> public void waitForFlushesAndCompactions() {
>   synchronized (writestate) {
>     while (writestate.compacting > 0 || writestate.flushing) {
>       LOG.debug("waiting for " + writestate.compacting + " compactions"
>           + (writestate.flushing ? " & cache flush" : "")
>           + " to complete for region " + this);
>       try {
>         writestate.wait();
> {code}
> This led to the timeout:
> {code}
> assertTrue("Timed out waiting for new server to open region",
> System.currentTimeMillis() - startWaitTime < 60000);
> {code}