[
https://issues.apache.org/jira/browse/HBASE-8276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lars Hofhansl resolved HBASE-8276.
----------------------------------
Resolution: Fixed
> Backport hbase-6738 to 0.94 "Too aggressive task resubmission from the
> distributed log manager"
> -----------------------------------------------------------------------------------------------
>
> Key: HBASE-8276
> URL: https://issues.apache.org/jira/browse/HBASE-8276
> Project: HBase
> Issue Type: Bug
> Reporter: Jeffrey Zhong
> Assignee: Jeffrey Zhong
> Fix For: 0.94.7
>
> Attachments: hbase-8276.patch, hbase-8276-v1.patch
>
>
> In recent tests, we found situations that when some data nodes are down and
> file operations are slow depending on underlying hdfs timeout(normally 30
> secs and socket connection timeout maybe around 1 min). While split log task
> heart beat time out is only 25 secs, a split log task will be preempted by
> SplitLogManager and assign to someone else after the 25 secs. On a small
> cluster, you'll see the same task is keeping bounced back & force for a
> while. I pasted a snippet of related logs below. You can search "preempted
> from" to see a task is preempted.
> {code}
> 2013-04-01 21:22:08,599 INFO
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog:
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159,
> length=127639653
> 2013-04-01 21:22:08,599 INFO org.apache.hadoop.hbase.util.FSHDFSUtils:
> Recovering file
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,603 INFO org.apache.hadoop.hbase.util.FSHDFSUtils:
> Finished lease recover attempt for
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/.logs/ip-10-137-20-188.ec2.internal,60020,1364849530779-splitting/ip-10-137-20-188.ec2.internal%2C60020%2C1364849530779.1364865506159
> 2013-04-01 21:22:09,629 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old
> edits file. It could be the result of a previous failed split attempt.
> Deleting
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/73387f8d327a45bacf069bd631d70b3b/recovered.edits/0000000000003703447.temp,
> length=0
> 2013-04-01 21:22:09,629 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old
> edits file. It could be the result of a previous failed split attempt.
> Deleting
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b749cbceaaf037c97f70cc2a6f48f2b8/recovered.edits/0000000000003703446.temp,
> length=0
> 2013-04-01 21:22:09,630 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old
> edits file. It could be the result of a previous failed split attempt.
> Deleting
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/c26b9d4a042d42c1194a8c2f389d33c8/recovered.edits/0000000000003703448.temp,
> length=0
> 2013-04-01 21:22:09,666 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old
> edits file. It could be the result of a previous failed split attempt.
> Deleting
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/adabdb40ccd52140f09f953ff41fd829/recovered.edits/0000000000003703451.temp,
> length=0
> 2013-04-01 21:22:09,722 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old
> edits file. It could be the result of a previous failed split attempt.
> Deleting
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/19f463fe74f4365e7df3e5fdb13aecad/recovered.edits/0000000000003703468.temp,
> length=0
> 2013-04-01 21:22:09,734 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old
> edits file. It could be the result of a previous failed split attempt.
> Deleting
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/b3e759a3fc9c4e83064961cc3cd4a911/recovered.edits/0000000000003703459.temp,
> length=0
> 2013-04-01 21:22:09,770 WARN
> org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Found existing old
> edits file. It could be the result of a previous failed split attempt.
> Deleting
> hdfs://ip-10-137-16-140.ec2.internal:8020/apps/hbase/data/IntegrationTestLoadAndVerify/6f078553be50897a986734ae043a5889/recovered.edits/0000000000003703454.temp,
> length=0
> 2013-04-01 21:22:34,985 INFO
> org.apache.hadoop.hbase.regionserver.SplitLogWorker: task
> /hbase/splitlog/hdfs%3A%2F%2Fip-10-137-16-140.ec2.internal%3A8020%2Fapps%2Fhbase%2Fdata%2F.logs%2Fip-10-137-20-188.ec2.internal%2C60020%2C1364849530779-splitting%2Fip-10-137-20-188.ec2.internal%252C60020%252C1364849530779.1364865506159
> preempted from ip-10-151-29-196.ec2.internal,60020,1364849530671, current
> task state and owner=unassigned
> ip-10-137-16-140.ec2.internal,60000,1364849528428
> {code}
> The exact same issue is fixed by hbase-6738 in trunk so here comes the
> backport.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira