[
https://issues.apache.org/jira/browse/HBASE-8646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13697403#comment-13697403
]
Enis Soztutar commented on HBASE-8646:
--------------------------------------
This is still related to my favorite issue, recoverLease. What seems to be
happening in the recent failure is that we call recoverLease:
{code}
2013-06-25 03:23:44,733 INFO
[SplitLogWorker-quirinus.apache.org,52386,1372130618556] util.FSHDFSUtils(86):
Recovering lease on dfs file
hdfs://localhost:33258/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
2013-06-25 03:23:44,734 INFO [IPC Server handler 0 on 33258]
namenode.FSNamesystem(1471): recoverLease: recover lease [Lease. Holder:
DFSClient_hb_rs_quirinus.apache.org,39607,1372130599654_914660597_108,
pendingcreates: 1],
src=/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
from client
DFSClient_hb_rs_quirinus.apache.org,39607,1372130599654_914660597_108
2013-06-25 03:23:44,734 INFO [IPC Server handler 0 on 33258]
namenode.FSNamesystem(2294): Recovering lease=[Lease. Holder:
DFSClient_hb_rs_quirinus.apache.org,39607,1372130599654_914660597_108,
pendingcreates: 1],
src=/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
2013-06-25 03:23:44,734 INFO
[SplitLogWorker-quirinus.apache.org,52386,1372130618556] util.FSHDFSUtils(156):
recoverLease=false, attempt=0 on
file=hdfs://localhost:33258/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
after 1ms
{code}
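For context, the split worker side of this is essentially a retry loop around
DistributedFileSystem.recoverLease(). A minimal sketch of that pattern (this is
not the actual FSHDFSUtils code, and the pause schedule is an illustrative
assumption):
{code}
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class LeaseRecoverySketch {
  /**
   * Keep asking the NN to recover the lease on a WAL until it reports the file
   * closed, or until we give up. recoverLease() returning false only means the
   * recovery has been scheduled, not that it has completed.
   */
  static boolean recoverWalLease(DistributedFileSystem dfs, Path wal,
      long firstPauseMs, long overallTimeoutMs)
      throws IOException, InterruptedException {
    long start = System.currentTimeMillis();
    for (int attempt = 0; ; attempt++) {
      boolean recovered = dfs.recoverLease(wal);
      long elapsed = System.currentTimeMillis() - start;
      // Corresponds to the "recoverLease=..., attempt=N ... after Xms" lines above.
      System.out.println("recoverLease=" + recovered + ", attempt=" + attempt
          + " on file=" + wal + " after " + elapsed + "ms");
      if (recovered || elapsed > overallTimeoutMs) {
        return recovered;
      }
      // Block recovery runs asynchronously on the NN/DN side, so pause before retrying.
      Thread.sleep(firstPauseMs * (attempt + 1));
    }
  }
}
{code}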
The NN then schedules block recovery and performs the lease recovery:
{code}
2013-06-25 03:23:47,735 INFO [IPC Server handler 4 on 33258]
namenode.FSNamesystem(1471): recoverLease: recover lease [Lease. Holder:
NN_Recovery, pendingcreates: 1],
src=/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
from client NN_Recovery
2013-06-25 03:23:47,736 WARN [DataStreamer for file
/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
block blk_3628383941313539055_1009] hdfs.DFSClient$DFSOutputStream(3211):
Error Recovery for block blk_3628383941313539055_1009 bad datanode[0]
127.0.0.1:55735
2013-06-25 03:23:47,736 DEBUG
[org.apache.hadoop.hdfs.server.datanode.DataNode$2@70533bef]
datanode.DataNode(2046): block=blk_3628383941313539055_1009, (length=381661),
syncList=[BlockRecord(info=BlockRecoveryInfo(block=blk_3628383941313539055_1009
wasRecoveredOnStartup=false) node=127.0.0.1:55735)], closeFile=true
2013-06-25 03:23:47,736 INFO [IPC Server handler 4 on 33258]
namenode.FSNamesystem(2294): Recovering lease=[Lease. Holder: NN_Recovery,
pendingcreates: 1],
src=/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
2013-06-25 03:23:47,737 INFO
[SplitLogWorker-quirinus.apache.org,52386,1372130618556] util.FSHDFSUtils(156):
recoverLease=false, attempt=1 on
file=hdfs://localhost:33258/user/jenkins/hbase/.logs/quirinus.apache.org,39607,1372130599654-splitting/quirinus.apache.org%2C39607%2C1372130599654.1372130603749
after 3004ms
{code}
Note that the second call to recoverLease() also returns false, about 3 seconds
in. This may be because the DN heartbeat interval defaults to 3 sec, so the DN
only receives the blockRecovery request at the last second and has not finished
recovering the block yet. The third lease recovery attempt waits one more
minute, so the test times out before it gets a chance to run.
I'll put up a patch which increases the test timeout, and also raises the
initial sleep time for lease recovery to 4 sec by default.
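To make the timing concrete, here is a back-of-the-envelope sketch. The 4 sec
figure matches the proposal above, but the constant names and the exact
schedule are illustrative assumptions, not the patch itself:
{code}
// Rough timeline reconstructed from the log excerpts above.
public class LeaseRecoveryTiming {
  static final long DN_HEARTBEAT_MS = 3000;  // default DN heartbeat interval
  static final long FIRST_PAUSE_MS  = 4000;  // proposed initial sleep, > one heartbeat
  static final long TEST_TIMEOUT_MS = 60000; // how long the test is willing to wait

  public static void main(String[] args) {
    // attempt 0 at t=0ms:    NN only *schedules* block recovery -> recoverLease=false
    // attempt 1 at t~3004ms: DN has just picked up the recovery command on its 3 sec
    //                        heartbeat and has not committed the block yet -> false
    // attempt 2 waits ~1 more minute, landing past the test's 60 sec limit.
    long attempt2AtMs = 3004 + 60000;
    System.out.println("attempt 2 at ~" + attempt2AtMs + "ms > test timeout of "
        + TEST_TIMEOUT_MS + "ms, so the test flakes out first");
    System.out.println("an initial pause of " + FIRST_PAUSE_MS + "ms (> "
        + DN_HEARTBEAT_MS + "ms heartbeat) gives the DN a full heartbeat to act on"
        + " the block recovery command before the first retry");
  }
}
{code}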
> Intermittent TestIOFencing#testFencingAroundCompaction failure due to region
> getting stuck in compaction
> --------------------------------------------------------------------------------------------------------
>
> Key: HBASE-8646
> URL: https://issues.apache.org/jira/browse/HBASE-8646
> Project: HBase
> Issue Type: Test
> Reporter: Ted Yu
> Assignee: Enis Soztutar
>
> From
> http://54.241.6.143/job/HBase-TRUNK/org.apache.hbase$hbase-server/348/testReport/junit/org.apache.hadoop.hbase/TestIOFencing/testFencingAroundCompaction/
> (the underlying region is
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.):
> {code}
> 2013-05-29 19:25:20,363 DEBUG [pool-1-thread-1] catalog.CatalogTracker(208):
> Stopping catalog tracker
> org.apache.hadoop.hbase.catalog.CatalogTracker@6280d069
> 2013-05-29 19:25:20,366 INFO [pool-1-thread-1] hbase.TestIOFencing(255):
> Waiting for compaction to be about to start
> 2013-05-29 19:25:20,367 DEBUG [pool-1-thread-1]
> hbase.TestIOFencing$CompactionBlockerRegion(107): waiting for compaction to
> block
> 2013-05-29 19:25:20,367 DEBUG [pool-1-thread-1]
> hbase.TestIOFencing$CompactionBlockerRegion(109): compaction block reached
> 2013-05-29 19:25:20,367 INFO [pool-1-thread-1] hbase.TestIOFencing(257):
> Starting a new server
> 2013-05-29 19:25:20,424 DEBUG [pool-1-thread-1]
> client.HConnectionManager(2811):
> regionserver/ip-10-197-74-184.us-west-1.compute.internal/10.197.74.184:0
> HConnection server-to-server retries=100
> ...
> 2013-05-29 19:25:20,861 INFO [pool-1-thread-1] hbase.TestIOFencing(260):
> Killing region server ZK lease
> ...
> 2013-05-29 19:25:21,030 DEBUG
> [RS_CLOSE_REGION-ip-10-197-74-184.us-west-1.compute.internal,37836,1369855503920-0]
> handler.CloseRegionHandler(125): Processing close of
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.
> 2013-05-29 19:25:21,031 DEBUG
> [RS_CLOSE_REGION-ip-10-197-74-184.us-west-1.compute.internal,37836,1369855503920-0]
> regionserver.HRegion(928): Closing
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.: disabling
> compactions & flushes
> 2013-05-29 19:25:21,031 DEBUG
> [RS_CLOSE_REGION-ip-10-197-74-184.us-west-1.compute.internal,37836,1369855503920-0]
> regionserver.HRegion(1022): waiting for 1 compactions to complete for region
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.
> ...
> 2013-05-29 19:25:27,037 INFO [pool-1-thread-1] hbase.TestIOFencing(265):
> Waiting for the new server to pick up the region
> tabletest,,1369855507443.c251a1d71e75fed8e490db63419edcf1.
> {code}
> The test started a new region server. However, the region got stuck in:
> {code}
> public void waitForFlushesAndCompactions() {
>   synchronized (writestate) {
>     while (writestate.compacting > 0 || writestate.flushing) {
>       LOG.debug("waiting for " + writestate.compacting + " compactions"
>           + (writestate.flushing ? " & cache flush" : "")
>           + " to complete for region " + this);
>       try {
>         writestate.wait();
> {code}
> This led to the timeout:
> {code}
> assertTrue("Timed out waiting for new server to open region",
> System.currentTimeMillis() - startWaitTime < 60000);
> {code}