[ 
https://issues.apache.org/jira/browse/HBASE-8276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13624140#comment-13624140
 ] 

Jeffrey Zhong commented on HBASE-8276:
--------------------------------------

[~lhofhansl] For your question:
{quote}
In 0.94 can we get away with just increasing the timeout to 5mins?
{quote}
A good question. I guess not. The reason we can raise the timeout to 5 mins is 
that the patch adds logic to immediately resubmit a task when a dead server is 
detected, without waiting for the timeout (before the fix, SplitLogManager 
waited out the 25-sec timeout even when the server was already marked dead in 
the master). Back to your question: if we just bump the timeout to 5 mins 
without the patch, it will hurt recovery time, because log splitting slows 
down in the normal dead-server scenario.
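To make the trade-off concrete, here is a minimal sketch (hypothetical names, not actual HBase code) of the resubmission decision before and after the fix. It assumes a heartbeat-age check plus, post-fix, an immediate-resubmit path for workers the master already knows are dead:

```java
// Hypothetical sketch illustrating why the fix lets the task timeout be
// raised to 5 mins without slowing recovery when the worker is truly dead.
public class ResubmitSketch {
    static final long OLD_TIMEOUT_MS = 25_000;   // pre-fix heartbeat timeout
    static final long NEW_TIMEOUT_MS = 300_000;  // 5 mins, post-fix

    // Pre-fix: resubmit purely on heartbeat age; a live but slow worker
    // (e.g. stuck in a 30-60 sec HDFS timeout) gets preempted.
    static boolean shouldResubmitOld(long msSinceHeartbeat) {
        return msSinceHeartbeat > OLD_TIMEOUT_MS;
    }

    // Post-fix: resubmit immediately once the worker is known dead;
    // otherwise wait for the (now much longer) timeout.
    static boolean shouldResubmitNew(long msSinceHeartbeat, boolean workerDead) {
        return workerDead || msSinceHeartbeat > NEW_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        // Live worker stuck in a 60-sec HDFS socket timeout: the old logic
        // preempts it (causing the bouncing), the new logic leaves it alone.
        System.out.println(shouldResubmitOld(60_000));        // true
        System.out.println(shouldResubmitNew(60_000, false)); // false
        // Dead worker: the new logic resubmits at once instead of waiting
        // out the 5-minute timeout.
        System.out.println(shouldResubmitNew(1_000, true));   // true
    }
}
```

So the longer timeout only bites when the worker is alive but slow, which is exactly the case where preemption was harmful.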
                
> Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the 
> distributed log manager"
> -----------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8276
>                 URL: https://issues.apache.org/jira/browse/HBASE-8276
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>         Attachments: hbase-8276.patch, hbase-8276-v1.patch
>
>
> In recent tests, we found that when some data nodes are down, file 
> operations slow down depending on the underlying HDFS timeouts (normally 30 
> secs; the socket connection timeout may be around 1 min). Since the split 
> log task heartbeat timeout is only 25 secs, a split log task will be 
> preempted by SplitLogManager and assigned to someone else after 25 secs. On 
> a small cluster, you'll see the same task bouncing back and forth for a 
> while. I pasted a snippet of the related logs below. Search for "preempted 
> from" to see a task being preempted.
> {code}
> 2013-04-01 21:22:08,599 INFO 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog: 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159,
>  length=127639653
> 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Recovering file 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,603 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: 
> Finished lease recover attempt for 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,629 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/73387f8d327a45bacf069bd631d70b3b/recovered.edits/0000000000003703447.temp,
>  length=0
> 2013-04-01 21:22:09,629 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b749cbceaaf037c97f70cc2a6f48f2b8/recovered.edits/0000000000003703446.temp,
>  length=0
> 2013-04-01 21:22:09,630 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/c26b9d4a042d42c1194a8c2f389d33c8/recovered.edits/0000000000003703448.temp,
>  length=0
> 2013-04-01 21:22:09,666 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/adabdb40ccd52140f09f953ff41fd829/recovered.edits/0000000000003703451.temp,
>  length=0
> 2013-04-01 21:22:09,722 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/19f463fe74f4365e7df3e5fdb13aecad/recovered.edits/0000000000003703468.temp,
>  length=0
> 2013-04-01 21:22:09,734 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b3e759a3fc9c4e83064961cc3cd4a911/recovered.edits/0000000000003703459.temp,
>  length=0
> 2013-04-01 21:22:09,770 WARN 
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old 
> edits file. It could be the result of a previous failed split attempt. 
> Deleting 
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/6f078553be50897a986734ae043a5889/recovered.edits/0000000000003703454.temp,
>  length=0
> 2013-04-01 21:22:34,985 INFO 
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: task 
> /hbase/splitlog/hdfs%3A%2F%2Fip-10-137-16-140.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-137-20-188.ec2.internal%2C60020%2C1364849530779-splitting%2Fip-10-137-20-188.ec2.internal%252C60020%252C1364849530779.1364865506159
>  preempted from ip-10-151-29-196.ec2.internal,60020,1364849530671, current 
> task state and owner=unassigned 
> ip-10-137-16-140.ec2.internal,60000,1364849528428
> {code}
> The exact same issue is fixed by hbase-6738 in trunk, so here comes the 
> backport. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira