[
https://issues.apache.org/jira/browse/HADOOP-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12565274#action_12565274
]
dhruba borthakur commented on HADOOP-2647:
------------------------------------------
Maybe we can do better than this. In the current code, if the client encounters
an RPC error while fetching a new block id from the namenode, it does not
retry. It throws an exception to the application. This becomes especially bad
if the namenode is in the middle of a GC and does not respond in time. The
reason the client throws an exception is because it does not know whether the
namenode successfully allocated a block for this file.
One possible enhancement would be to make the client retry the addBlock RPC if
needed. The client can send the block list that it currently has. The namenode
can match the block list sent by the client with what it has in its own
metadata and then send back a new blockid (or a previously allocated blockid
that the client had not yet received because the earlier RPC timed out). This
will make the client more robust!
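
A rough sketch of what that retry could look like, using hypothetical names
(Namenode, an addBlock that takes the client's block list, addBlockWithRetry)
rather than the actual DFSClient/ClientProtocol code; the key point is that
resending the client's current block list makes the RPC safe to repeat:

    import java.io.IOException;

    // Hypothetical sketch only; names and signatures do not reflect the real
    // Hadoop ClientProtocol. The client resends the block list it already has,
    // so the namenode can reconcile its metadata and return either a new block
    // id or one it had already allocated whose reply never reached the client.
    public class AddBlockRetrySketch {

      interface Namenode {
        long addBlock(String src, String clientName, long[] blocksClientHas)
            throws IOException;
      }

      private static final int MAX_RETRIES = 5;
      private final Namenode namenode;

      AddBlockRetrySketch(Namenode namenode) {
        this.namenode = namenode;
      }

      long addBlockWithRetry(String src, String clientName, long[] blocksClientHas)
          throws IOException {
        IOException lastFailure = null;
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
          try {
            // The namenode matches blocksClientHas against its own metadata
            // for src. If it already allocated a block that the client never
            // heard about (the earlier RPC timed out), it returns that block
            // id again instead of allocating a new one.
            return namenode.addBlock(src, clientName, blocksClientHas);
          } catch (IOException e) {
            lastFailure = e;
            try {
              Thread.sleep(1000L * (attempt + 1)); // simple linear backoff
            } catch (InterruptedException ie) {
              Thread.currentThread().interrupt();
              throw new IOException("interrupted while retrying addBlock");
            }
          }
        }
        throw lastFailure;
      }
    }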
> dfs -put hangs
> --------------
>
> Key: HADOOP-2647
> URL: https://issues.apache.org/jira/browse/HADOOP-2647
> Project: Hadoop Core
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.15.1
> Environment: LINUX
> Reporter: lohit vijayarenu
> Assignee: Raghu Angadi
> Fix For: 0.15.4
>
> Attachments: HADOOP-2647.patch
>
>
> We saw a case where dfs -put hung while copying a 2GB file for over 20 hours.
> When we took a look at the stack trace of the process, the main thread was
> waiting for confirmation from the namenode for complete status.
> Only 4 blocks were copied and the 5th block was missing when we ran fsck on the
> partially transferred file.
> There are 2 problems we saw here.
> 1. DFS client hung without a timeout when there is no response from namenode.
> 2. In IOUtils::copyBytes(InputStream in, OutputStream out, int buffSize,
> boolean close), if there is an exception during the copy, out.close() is called
> but the exception is not caught, which is why we see a close call in the stack
> trace (a sketch of this pattern follows the stack trace below).
> When we checked the block IDs in the namenode log, the block which was missing
> had only one response to the namenode instead of three.
> This close state, coupled with the namenode not reporting the error back, might
> have caused the whole process to hang.
> Opening this JIRA to see if we could add checks to the two problems mentioned
> above.
> <stack trace of main thread>
> "main" prio=10 tid=0x0805a000 nid=0x5b53 waiting on condition
> [0xf7e64000..0xf7e65288] java.lang.Thread.State: TIMED_WAITING (sleeping)
> at java.lang.Thread.sleep(Native Method)
> at
> org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1751) -
> locked <0x77d593a0> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream) at
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
> at
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64) at
> org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
> at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:83) at
> org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:140)
> at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:826)
> at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:114)
> at org.apache.hadoop.fs.FsShell.run(FsShell.java:1354)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
> at org.apache.hadoop.fs.FsShell.main(FsShell.java:1472)
> </stack trace of main thread>
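
To illustrate problem 2 above: a simplified sketch of the copy pattern (not the
actual IOUtils.copyBytes source) showing how an unconditional close() on the
error path can both mask the original exception and, when close() blocks the
way DFSOutputStream.close() does here, hang the process. A real fix also needs
the timeout from problem 1, since catching exceptions alone does not unblock a
hanging close():

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    public class CopyBytesSketch {

      // Risky pattern: an exception thrown during the copy falls straight into
      // close(); if close() blocks (e.g. waiting on the namenode), the copy
      // never returns and the original exception is never reported.
      static void copyBytesRisky(InputStream in, OutputStream out, int buffSize)
          throws IOException {
        byte[] buf = new byte[buffSize];
        int bytesRead;
        try {
          while ((bytesRead = in.read(buf)) >= 0) {
            out.write(buf, 0, bytesRead);
          }
        } finally {
          out.close(); // may block indefinitely if the stream is wedged
          in.close();
        }
      }

      // Safer pattern: only expect a clean close after a successful copy; on
      // the error path, swallow close() failures so the original exception
      // propagates. A blocking close() still needs a timeout to be fully safe.
      static void copyBytesSafer(InputStream in, OutputStream out, int buffSize)
          throws IOException {
        byte[] buf = new byte[buffSize];
        int bytesRead;
        boolean copied = false;
        try {
          while ((bytesRead = in.read(buf)) >= 0) {
            out.write(buf, 0, bytesRead);
          }
          copied = true;
        } finally {
          if (copied) {
            // Normal path: let close() errors surface to the caller.
            try {
              out.close();
            } finally {
              in.close();
            }
          } else {
            // Error path: close quietly so the copy exception propagates.
            try { out.close(); } catch (IOException ignored) { }
            try { in.close(); } catch (IOException ignored) { }
          }
        }
      }
    }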
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.