[
https://issues.apache.org/jira/browse/HDFS-2994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13279468#comment-13279468
]
amith commented on HDFS-2994:
-----------------------------
I have seen this problem occur when lease recovery is in progress concurrently with an append call on the same file.
Currently FSDirectory.replaceNode() is called from 2 methods:
FSNamesystem#finalizeINodeFileUnderConstruction()
FSNamesystem#prepareFileForWrite()
These methods replace the INode entry in the NN metadata (in the INode
structure, from INodeFile to INodeFileUnderConstruction and back).
If we observe the constructors used in these methods:
{code}
public LocatedBlock prepareFileForWrite(String src, INode file,
    String leaseHolder, String clientMachine, DatanodeDescriptor clientNode,
    boolean writeToEditLog)
    throws UnresolvedLinkException, IOException {
  INodeFile node = (INodeFile) file;
  INodeFileUnderConstruction cons = new INodeFileUnderConstruction(
      node.getLocalNameBytes(),
      node.getReplication(),
      node.getModificationTime(),
      node.getPreferredBlockSize(),
      node.getBlocks(),
      node.getPermissionStatus(),
      leaseHolder,
      clientMachine,
      clientNode);
  dir.replaceNode(src, node, cons);
  leaseManager.addLease(cons.getClientName(), src);
  LocatedBlock ret = blockManager.convertLastBlockToUnderConstruction(cons);
  if (writeToEditLog) {
    getEditLog().logOpenFile(src, cons);
  }
  return ret;
}
{code}
The INodeFileUnderConstruction constructor fails to capture the INode.parent
attribute, so cons ends up with a null parent instead of a reference to the
original node's parent.
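The pattern can be illustrated with a minimal, self-contained sketch (the class and method names below are illustrative, not the real HDFS classes): a replacement node built by copying selected fields of the original ends up with a null parent unless the link is copied explicitly.

```java
// Minimal illustration of the bug pattern (hypothetical classes, not HDFS code).
class Node {
    Node parent;        // back-pointer to the containing directory node
    final String name;

    Node(String name) {
        this.name = name;
    }

    // Mirrors prepareFileForWrite(): the replacement is built from selected
    // fields of the original, but parent is never carried over, so the
    // copy's parent silently stays null.
    static Node copyLosingParent(Node original) {
        return new Node(original.name);
    }

    // The fix: propagate the parent reference when building the replacement.
    static Node copyKeepingParent(Node original) {
        Node copy = new Node(original.name);
        copy.parent = original.parent;
        return copy;
    }
}
```

With a child whose parent is set, copyLosingParent() returns a copy whose parent is null, which is exactly the state replaceNode() later trips over.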
Similarly
{code}
private void finalizeINodeFileUnderConstruction(String src,
    INodeFileUnderConstruction pendingFile)
    throws IOException, UnresolvedLinkException {
  assert hasWriteLock();
  leaseManager.removeLease(pendingFile.getClientName(), src);
  // The file is no longer pending.
  // Create permanent INode, update blocks
  INodeFile newFile = pendingFile.convertToInodeFile();
  dir.replaceNode(src, pendingFile, newFile);
  // close file and persist block allocations for this file
  dir.closeFile(src, newFile);
  checkReplicationFactor(newFile);
}
{code}
Here pendingFile.convertToInodeFile() also loses the parent attribute,
leaving a null entry where the parent reference should be.
I have also modified removeNode():
{code}
boolean removeNode() {
  if (parent == null) {
    return false;
  } else {
    parent.removeChild(this);
-   parent = null;
    return true;
  }
}
{code}
since in
{code}
INode myFile = dir.getFileINode(src);
recoverLeaseInternal(myFile, src, holder, clientMachine, false);
{code}
recoverLeaseInternal() causes myFile to lose its parent attribute, since
removeNode() nulls the parent while the caller still holds a reference to
myFile.
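A minimal sketch of why keeping the back-pointer matters here (illustrative classes, not the actual FSDirectory/INode code): if removal nulls the parent, a caller that still holds the old node can no longer locate where a replacement should go.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical tree node illustrating the removeNode() change.
class TreeNode {
    TreeNode parent;
    final List<TreeNode> children = new ArrayList<>();

    void addChild(TreeNode c) {
        children.add(c);
        c.parent = this;
    }

    // Original behaviour: detach from the parent and null the back-pointer,
    // so a caller still holding this node loses its location in the tree.
    boolean removeNodeNullingParent() {
        if (parent == null) return false;
        parent.children.remove(this);
        parent = null;
        return true;
    }

    // Proposed behaviour: detach from the parent but keep the back-pointer,
    // so a later replaceNode-style operation can still find the parent.
    boolean removeNodeKeepingParent() {
        if (parent == null) return false;
        parent.children.remove(this);
        return true;
    }
}
```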
A test has been added to verify this behaviour. It creates 3 clients with
different
{code}
mapreduce.task.attempt.id
{code}
values, so that each client has a distinct lease holder and lease recovery is
triggered when the file is accessed by another client.
> If lease is recovered successfully inline with create, create can fail
> ----------------------------------------------------------------------
>
> Key: HDFS-2994
> URL: https://issues.apache.org/jira/browse/HDFS-2994
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 0.24.0
> Reporter: Todd Lipcon
> Assignee: amith
> Attachments: HDFS-2994_1.patch
>
>
> I saw the following logs on my test cluster:
> {code}
> 2012-02-22 14:35:22,887 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: startFile: recover lease
> [Lease. Holder: DFSClient_attempt_1329943893604_0007_m_000376_0_453973131_1,
> pendingcreates: 1], src=/benchmarks/TestDFSIO/io_data/test_io_6 from client
> DFSClient_attempt_1329943893604_0007_m_000376_0_453973131_1
> 2012-02-22 14:35:22,887 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease.
> Holder: DFSClient_attempt_1329943893604_0007_m_000376_0_453973131_1,
> pendingcreates: 1], src=/benchmarks/TestDFSIO/io_data/test_io_6
> 2012-02-22 14:35:22,888 WARN org.apache.hadoop.hdfs.StateChange: BLOCK*
> internalReleaseLease: All existing blocks are COMPLETE, lease removed, file
> closed.
> 2012-02-22 14:35:22,888 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> FSDirectory.replaceNode: failed to remove
> /benchmarks/TestDFSIO/io_data/test_io_6
> 2012-02-22 14:35:22,888 WARN org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.startFile: FSDirectory.replaceNode: failed to remove
> /benchmarks/TestDFSIO/io_data/test_io_6
> {code}
> It seems like, if {{recoverLeaseInternal}} succeeds in {{startFileInternal}},
> then the INode will be replaced with a new one, meaning the later
> {{replaceNode}} call can fail.