[ https://issues.apache.org/jira/browse/HBASE-12430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14197614#comment-14197614 ]

Nick Dimiduk commented on HBASE-12430:
--------------------------------------

Linking to HBASE-6738; there's some good reasoning in that ticket about this 
part of the recovery process.

> Contention in lease recovery can delay log splitting unnecessarily
> ------------------------------------------------------------------
>
>                 Key: HBASE-12430
>                 URL: https://issues.apache.org/jira/browse/HBASE-12430
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 0.98.4
>            Reporter: Nick Dimiduk
>
> I'm not deeply familiar with this area, so please bear with me.
> In a run of IntegrationTestMTTR with CM, I'm seeing a case where RS recovery 
> is in progress. Splitting of one of the WAL files is started by an RS, and 
> some tmp files are written to HDFS. CM then kills that RS. Other RSs try to 
> complete the same work but fail to write their temp files into the same 
> location because none of them holds a lease on the output file. Log lines 
> look like:
> {noformat}
> 2014-11-03 12:57:14,093 INFO  [RS_LOG_REPLAY_OPS-ip-172-31-4-166:60020-1] 
> wal.HLogSplitter: Processed 99 edits across 12 regions; log 
> file=hdfs://ip-172-31-4-163.ec2.internal:8020/apps/hbase/data/WALs/ip-172-31-4-162.ec2.internal,60020,1415017856808-splitting/ip-172-31-4-162.ec2.internal%2C60020%2C1415017856808.1415018131158
>  is corrupted = false progress failed = true
> 2014-11-03 12:57:14,093 WARN  [RS_LOG_REPLAY_OPS-ip-172-31-4-166:60020-1] 
> regionserver.SplitLogWorker: log splitting of 
> WALs/ip-172-31-4-162.ec2.internal,60020,1415017856808-splitting/ip-172-31-4-162.ec2.internal%2C60020%2C1415017856808.1415018131158
>  failed, returning error
> org.apache.hadoop.io.MultipleIOException: 11 exceptions 
> [org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
>  No lease on 
> /apps/hbase/data/data/default/IntegrationTestIngestWithTags/0c55ce7c53f996cd97f55385eee222c2/recovered.edits/0000000000000030557.temp
>  (inode 28346): File does not exist. [Lease.  Holder: 
> DFSClient_hb_rs_ip-172-31-4-166.ec2.internal,60020,1415019284535_-996811059_38,
>  pendingcreates: 49]
> {noformat}
> Splitting does eventually complete, but it takes almost 15 minutes.
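> Purely for illustration (this is not SplitLogWorker code; the namenode 
> address, the path, and the reason the file disappears are assumptions on my 
> part), a standalone sketch of one way a writer can end up with exactly this 
> LeaseExpiredException:
> {code:java}
> // Standalone sketch: two independent DFS clients, standing in for two
> // region servers, touching the same temp output path.
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class RecoveredEditsLeaseRace {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     URI nn = URI.create("hdfs://namenode:8020"); // placeholder namenode
>     Path temp = new Path("/hbase/recovered.edits/0000000000000030557.temp");
>
>     // Two distinct client instances, so each holds its own HDFS lease.
>     FileSystem workerA = FileSystem.newInstance(nn, conf);
>     FileSystem workerB = FileSystem.newInstance(nn, conf);
>
>     FSDataOutputStream out = workerA.create(temp, true); // worker A writes
>     out.write(new byte[]{1});
>     out.hflush();
>
>     // The path is removed out from under worker A (HDFS allows deleting a
>     // file that is still open); its inode and lease disappear with it.
>     workerB.delete(temp, false);
>
>     // Completing the file now fails with LeaseExpiredException:
>     // "No lease on ... File does not exist", as in the log above.
>     out.close();
>   }
> }
> {code}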
> I don't have a fix in mind. I've thought we should recover edits into a 
> worker-specific directory and then do an atomic rename to the "official" 
> split destination, but that change can't be deployed safely across a rolling 
> restart. I've also considered managing the recovery more explicitly, but I 
> think the current behavior of multiple RSs competing for the work is meant to 
> facilitate speculative execution of splitting. Other ideas?
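> For illustration only, here's a minimal sketch of that worker-specific-
> directory idea; the per-worker subdirectory under recovered.edits/ and the 
> final file simply dropping the .temp suffix are assumptions of mine, not 
> current HLogSplitter behavior:
> {code:java}
> // Illustrative sketch of the idea only; the layout and names are assumed.
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class WorkerScopedRecoveredEdits {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>
>     // Each split worker writes only under its own subdirectory, so no two
>     // workers can ever contend for a lease on the same output path.
>     String worker = "ip-172-31-4-166.ec2.internal,60020,1415019284535";
>     Path editsDir = new Path(
>         "/apps/hbase/data/data/default/IntegrationTestIngestWithTags/"
>         + "0c55ce7c53f996cd97f55385eee222c2/recovered.edits");
>     Path workerTemp = new Path(editsDir, worker + "/0000000000000030557.temp");
>     Path finalFile = new Path(editsDir, "0000000000000030557");
>
>     FSDataOutputStream out = fs.create(workerTemp, true);
>     // ... write the recovered edits for this region ...
>     out.close();
>
>     // Publish with a single rename, which is atomic on the NameNode. The
>     // first worker to rename wins; a slower worker sees rename() return
>     // false (destination already exists) and just discards its copy.
>     if (!fs.rename(workerTemp, finalFile)) {
>       fs.delete(workerTemp, false);
>     }
>   }
> }
> {code}
> The rename would be the only cross-worker interaction, which is what would 
> make the scheme tolerant of a worker dying mid-write.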



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
