[
https://issues.apache.org/jira/browse/HBASE-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844663#comment-13844663
]
Ted Yu commented on HBASE-10000:
--------------------------------
The hdfs deployment is hadoop 2.2.
From the namenode log:
{code}
2013-12-10 19:13:22,629 INFO namenode.FSNamesystem
(FSNamesystem.java:recoverLeaseInternal(2322)) - recoverLease: [Lease. Holder:
DFSClient_hb_rs_hor13n05.gq1.ygridcore.net,60020,1386702460286_-1928541312_28,
pendingcreates: 22],
src=/apps/hbase/data/WALs/hor13n05.gq1.ygridcore.net,60020,1386702460286-splitting/hor13n05.gq1.ygridcore.net%2C60020%2C1386702460286.1386702750923
from client
DFSClient_hb_rs_hor13n05.gq1.ygridcore.net,60020,1386702460286_-1928541312_28
2013-12-10 19:13:22,629 INFO namenode.FSNamesystem
(FSNamesystem.java:internalReleaseLease(3507)) - Recovering [Lease. Holder:
DFSClient_hb_rs_hor13n05.gq1.ygridcore.net,60020,1386702460286_-1928541312_28,
pendingcreates: 22],
src=/apps/hbase/data/WALs/hor13n05.gq1.ygridcore.net,60020,1386702460286-splitting/hor13n05.gq1.ygridcore.net%2C60020%2C1386702460286.1386702750923
2013-12-10 19:13:22,630 INFO BlockStateChange
(BlockInfoUnderConstruction.java:initializeBlockRecovery(291)) - BLOCK*
blk_1073761381_20789{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1,
replicas=[ReplicaUnderConstruction[68.142.246.24:50010|RBW],
ReplicaUnderConstruction[68.142.246.26:50010|RBW],
ReplicaUnderConstruction[68.142.246.23:50010|RBW]]} recovery started,
primary=ReplicaUnderConstruction[68.142.246.26:50010|RBW]
2013-12-10 19:13:22,630 WARN hdfs.StateChange
(FSNamesystem.java:internalReleaseLease(3611)) - DIR*
NameSystem.internalReleaseLease: File
/apps/hbase/data/WALs/hor13n05.gq1.ygridcore.net,60020,1386702460286-splitting/hor13n05.gq1.ygridcore.net%2C60020%2C1386702460286.1386702750923
has not been closed. Lease recovery is in progress. RecoveryId = 20976 for
block blk_1073761381_20789{blockUCState=UNDER_RECOVERY, primaryNodeIndex=1,
replicas=[ReplicaUnderConstruction[68.142.246.24:50010|RBW],
ReplicaUnderConstruction[68.142.246.26:50010|RBW],
ReplicaUnderConstruction[68.142.246.23:50010|RBW]]}
...
2013-12-10 19:13:23,978 INFO namenode.FSNamesystem
(FSNamesystem.java:recoverLeaseInternal(2322)) - recoverLease: [Lease. Holder:
DFSClient_NONMAPREDUCE_-1595419144_1, pendingcreates: 1],
src=/apps/hbase/data/WALs/hor13n05.gq1.ygridcore.net,60020,1386702460286-splitting/hor13n05.gq1.ygridcore.net%2C60020%2C1386702460286.1386702750923
from client DFSClient_NONMAPREDUCE_-1595419144_1
2013-12-10 19:13:23,978 INFO namenode.FSNamesystem
(FSNamesystem.java:internalReleaseLease(3507)) - Recovering [Lease. Holder:
DFSClient_NONMAPREDUCE_-1595419144_1, pendingcreates: 1],
src=/apps/hbase/data/WALs/hor13n05.gq1.ygridcore.net,60020,1386702460286-splitting/hor13n05.gq1.ygridcore.net%2C60020%2C1386702460286.1386702750923
2013-12-10 19:13:23,978 INFO BlockStateChange
(BlockInfoUnderConstruction.java:initializeBlockRecovery(291)) - BLOCK*
blk_1073761381_20789{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0,
replicas=[ReplicaUnderConstruction[68.142.246.24:50010|RBW],
ReplicaUnderConstruction[68.142.246.26:50010|RBW],
ReplicaUnderConstruction[68.142.246.23:50010|RBW]]} recovery started,
primary=ReplicaUnderConstruction[68.142.246.24:50010|RBW]
2013-12-10 19:13:23,978 WARN hdfs.StateChange
(FSNamesystem.java:internalReleaseLease(3611)) - DIR*
NameSystem.internalReleaseLease: File
/apps/hbase/data/WALs/hor13n05.gq1.ygridcore.net,60020,1386702460286-splitting/hor13n05.gq1.ygridcore.net%2C60020%2C1386702460286.1386702750923
has not been closed. Lease recovery is in progress. RecoveryId = 20993 for
block blk_1073761381_20789{blockUCState=UNDER_RECOVERY, primaryNodeIndex=0,
replicas=[ReplicaUnderConstruction[68.142.246.24:50010|RBW],
ReplicaUnderConstruction[68.142.246.26:50010|RBW],
ReplicaUnderConstruction[68.142.246.23:50010|RBW]]}
...
2013-12-10 19:13:24,507 INFO namenode.FSNamesystem
(FSNamesystem.java:commitBlockSynchronization(3792)) -
commitBlockSynchronization(newblock=BP-2115634419-68.142.246.20-1386671426957:blk_1073761381_20789,
file=/apps/hbase/data/WALs/hor13n05.gq1.ygridcore.net,60020,1386702460286-splitting/hor13n05.gq1.ygridcore.net%2C60020%2C1386702460286.1386702750923,
newgenerationstamp=20993, newlength=88018762, newtargets=[68.142.246.24:50010,
68.142.246.26:50010, 68.142.246.23:50010]) successful
{code}
The recovery took ~23 seconds.
The benefit of this route is that WAL split selection and lease recovery now
proceed in parallel.
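As a rough sketch of what the master-side fan-out could look like (not the patch itself; fs, splittingDir, LOG and the pool size are stand-ins for this example), assuming hadoop2's DistributedFileSystem#recoverLease():
{code}
// Illustrative only: ask the namenode to start lease recovery for every
// outstanding WAL up front, before any split task is handed out.
final DistributedFileSystem dfs = (DistributedFileSystem) fs;
ExecutorService pool = Executors.newFixedThreadPool(4);
for (FileStatus wal : dfs.listStatus(splittingDir)) {
  final Path walPath = wal.getPath();
  pool.submit(new Runnable() {
    @Override
    public void run() {
      try {
        // recoverLease() returns quickly; the namenode drives recovery asynchronously
        dfs.recoverLease(walPath);
      } catch (IOException e) {
        LOG.warn("lease recovery request failed for " + walPath, e);
      }
    }
  });
}
pool.shutdown();
{code}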
By the time a SplitLogWorker picks up the unclosed log, lease recovery is already
underway. subsequentPause is 61 seconds by default, and within this period the
SplitLogWorker would be able to detect the completion of lease recovery.
In the case above, 23 < 61, meaning recoverDFSFileLease() would return in the
first iteration.
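For illustration, the worker-side wait can be thought of as the following loop (a minimal sketch, not the actual recoverDFSFileLease() body; it assumes isFileClosed() is callable directly, whereas the real code reaches it via reflection as shown further down, and walPath/subsequentPause are stand-ins):
{code}
// Minimal sketch of the worker-side wait on an unclosed WAL.
static void waitOnWalLease(DistributedFileSystem dfs, Path walPath, long subsequentPause)
    throws IOException, InterruptedException {
  boolean recovered = dfs.recoverLease(walPath); // true means the file is already closed
  while (!recovered) {
    Thread.sleep(subsequentPause);               // 61 seconds by default
    recovered = dfs.isFileClosed(walPath);       // so a ~23 second recovery completes within one iteration
  }
}
{code}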
I plan to do more cluster testing for future patches.
bq. Just hadoop2?
The HDP 1.3 hdfs component has DistributedFileSystem#isFileClosed() as well. A test
is being formulated to verify that the absence of isFileClosed() would not slow
down lease recovery.
I plan to separate the following code from recoverDFSFileLease() into its own
method:
{code}
try {
  isFileClosedMeth = dfs.getClass().getMethod("isFileClosed",
      new Class[]{ Path.class });
} catch (NoSuchMethodException nsme) {
  LOG.debug("isFileClosed not available");
}
{code}
This way, the test can override the new method and perform verification.
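A rough sketch of what the extracted method and the test-side override could look like (the name getIsFileClosedMethod() and the anonymous subclass are only illustrative, not the actual patch):
{code}
// Illustrative only: the reflection lookup pulled out of recoverDFSFileLease().
Method getIsFileClosedMethod(DistributedFileSystem dfs) {
  try {
    return dfs.getClass().getMethod("isFileClosed", new Class[]{ Path.class });
  } catch (NoSuchMethodException nsme) {
    LOG.debug("isFileClosed not available");
    return null;
  }
}

// A test could then simulate a deployment whose hdfs lacks isFileClosed():
FSHDFSUtils noIsFileClosed = new FSHDFSUtils() {
  @Override
  Method getIsFileClosedMethod(DistributedFileSystem dfs) {
    return null; // pretend the API is absent and verify recovery is not slowed down
  }
};
{code}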
> Initiate lease recovery for outstanding WAL files at the very beginning of
> recovery
> -----------------------------------------------------------------------------------
>
> Key: HBASE-10000
> URL: https://issues.apache.org/jira/browse/HBASE-10000
> Project: HBase
> Issue Type: Improvement
> Reporter: Ted Yu
> Assignee: Ted Yu
> Fix For: 0.98.1
>
> Attachments: 10000-0.96-v5.txt, 10000-0.96-v6.txt,
> 10000-recover-ts-with-pb-2.txt, 10000-recover-ts-with-pb-3.txt,
> 10000-recover-ts-with-pb-4.txt, 10000-recover-ts-with-pb-5.txt,
> 10000-recover-ts-with-pb-6.txt, 10000-v4.txt, 10000-v5.txt, 10000-v6.txt
>
>
> At the beginning of recovery, master can send lease recovery requests
> concurrently for outstanding WAL files using a thread pool.
> Each split worker would first check whether the WAL file it processes is
> closed.
> Thanks to Nicolas Liochon and Jeffery, discussion with whom gave rise to this
> idea.