[jira] [Resolved] (HBASE-8276) Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the distributed log manager"

Lars Hofhansl (JIRA) Fri, 05 Apr 2013 20:25:20 -0700

     [ https://issues.apache.org/jira/browse/HBASE-8276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Lars Hofhansl resolved HBASE-8276.
----------------------------------

    Resolution: Fixed
    
> Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the 
> distributed log manager"
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8276
>                 URL: https://issues.apache.org/jira/browse/HBASE-8276
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.94.7
>
>         Attachments: hbase-8276.patch, hbase-8276-v1.patch
>
>
> In recent tests, we found that when some data nodes are down, file 
> operations become slow, bounded by the underlying HDFS timeout (normally 30 
> secs, with a socket connection timeout of maybe around 1 min). Because the 
> split log task heartbeat timeout is only 25 secs, a split log task is 
> preempted by SplitLogManager and assigned to another worker after the 25 
> secs. On a small cluster, you'll see the same task bounced back and forth 
> for a while. I pasted a snippet of the related logs below. You can search 
> for "preempted from" to see where a task is preempted.
> {code}
> 2013-04-01 21:22:08,599 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159,
>  length=127639653
> 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovering file 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,603 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Finished lease recover attempt for 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,629 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/73387f8d327a45bacf069bd631d70b3b/recovered.edits/0000000000003703447.temp,
>  length=0
> 2013-04-01 21:22:09,629 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b749cbceaaf037c97f70cc2a6f48f2b8/recovered.edits/0000000000003703446.temp,
>  length=0
> 2013-04-01 21:22:09,630 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/c26b9d4a042d42c1194a8c2f389d33c8/recovered.edits/0000000000003703448.temp,
>  length=0
> 2013-04-01 21:22:09,666 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/adabdb40ccd52140f09f953ff41fd829/recovered.edits/0000000000003703451.temp,
>  length=0
> 2013-04-01 21:22:09,722 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/19f463fe74f4365e7df3e5fdb13aecad/recovered.edits/0000000000003703468.temp,
>  length=0
> 2013-04-01 21:22:09,734 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b3e759a3fc9c4e83064961cc3cd4a911/recovered.edits/0000000000003703459.temp,
>  length=0
> 2013-04-01 21:22:09,770 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/6f078553be50897a986734ae043a5889/recovered.edits/0000000000003703454.temp,
>  length=0
> 2013-04-01 21:22:34,985 INFO 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: task 
> /hbase/splitlog/hdfs%3A%2F%2Fip-10-137-16-140.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-137-20-188.ec2.internal%2C60020%2C1364849530779-splitting%2Fip-10-137-20-188.ec2.internal%252C60020%252C1364849530779.1364865506159
>  preempted from ip-10-151-29-196.ec2.internal,60020,1364849530671, current 
> task state and owner=unassigned 
> ip-10-137-16-140.ec2.internal,60000,1364849528428
> {code}
> The exact same issue is fixed by hbase-6738 in trunk, so this is the 
> backport.
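
The timing mismatch described above can be illustrated with a small model. This is only a sketch under stated assumptions, not HBase code: the class PreemptionSketch and its methods are hypothetical, and it assumes a worker sends no heartbeat while it is blocked in a slow HDFS call. The 25, 30 and 60 second figures come from the description; the 5 second figure is made up for illustration.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the timing problem; none of this is HBase code. */
final class PreemptionSketch {

    // Heartbeat timeout quoted in the issue description (25 secs).
    static final long HEARTBEAT_TIMEOUT_MS = TimeUnit.SECONDS.toMillis(25);

    /**
     * Assumption: a worker blocked in a file operation sends no heartbeat,
     * so it loses the task once the block outlasts the heartbeat timeout.
     */
    static boolean isPreempted(long blockedMs, long heartbeatTimeoutMs) {
        return blockedMs > heartbeatTimeoutMs;
    }

    /**
     * Counts how many consecutive attempts are preempted before one attempt
     * blocks for less than the timeout (or the attempts run out).
     */
    static int resubmissions(long[] blockedMsPerAttempt, long heartbeatTimeoutMs) {
        int count = 0;
        for (long blockedMs : blockedMsPerAttempt) {
            if (!isPreempted(blockedMs, heartbeatTimeoutMs)) {
                break; // this attempt keeps the task
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        long hdfsTimeoutMs = TimeUnit.SECONDS.toMillis(30);   // from the description
        long socketTimeoutMs = TimeUnit.SECONDS.toMillis(60); // "around 1 min"
        long fastOpMs = TimeUnit.SECONDS.toMillis(5);         // illustrative value

        System.out.println("blocked for HDFS timeout, preempted: "
                + isPreempted(hdfsTimeoutMs, HEARTBEAT_TIMEOUT_MS));
        System.out.println("blocked for socket timeout, preempted: "
                + isPreempted(socketTimeoutMs, HEARTBEAT_TIMEOUT_MS));
        System.out.println("preempted attempts before one succeeds: "
                + resubmissions(new long[] {hdfsTimeoutMs, socketTimeoutMs, fastOpMs},
                        HEARTBEAT_TIMEOUT_MS));
    }
}
```

In this model, any attempt that blocks for longer than the heartbeat timeout loses the task, which matches the bouncing seen in the log. The sketch does not describe how hbase-6738 changes the behaviour.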

