[ https://issues.apache.org/jira/browse/HADOOP-2647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12564227#action_12564227 ]

Raghu Angadi commented on HADOOP-2647:
--------------------------------------

Lohit, Koji and I looked at the logs, and this is what appears to have happened.

Client:
{code}
//...
    try {
      out.write(buf);
    } finally {
      out.close(); //<== stuck here
    }
{code}
That's a valid way to close. This is what happened:
 - Namenode allocates block b_x at time t, and datanodes report the block 
completed at t+4sec (128 MB block).
 - Namenode allocates the next block b_y two minutes later. No datanode 
reported this block as written.
 - Namenode allocates another block b_z 100 milliseconds later.

So on the client side, write() failed while the client was trying to get b_y, 
and the client proxy threw a SocketTimeoutException. The app then called 
out.close(), which again tried to write a block, and this time it was b_z. 
The client never knew about b_y at all.

The client-side bug is that DFSClient's output stream (DFSOutputStream) does 
not remember that there was an error. It should.
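
For illustration, here is a minimal sketch of that "remember the error" idea, 
written against a plain java.io.FilterOutputStream rather than the actual 
DFSOutputStream code; the class and field names are made up for this example.
{code}
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Sketch only: an output stream that remembers the first write failure so
 * that a later close() surfaces that error instead of quietly carrying on.
 */
public class ErrorRememberingOutputStream extends FilterOutputStream {

  private IOException firstError;            // null until a write fails

  public ErrorRememberingOutputStream(OutputStream out) {
    super(out);
  }

  @Override
  public void write(int b) throws IOException {
    checkError();
    try {
      out.write(b);
    } catch (IOException e) {
      firstError = e;                        // remember the failure
      throw e;
    }
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    checkError();
    try {
      out.write(b, off, len);
    } catch (IOException e) {
      firstError = e;                        // remember the failure
      throw e;
    }
  }

  @Override
  public void close() throws IOException {
    if (firstError != null) {
      // A previous write failed: release the underlying stream as a best
      // effort and rethrow the original error instead of trying to finish
      // the data as if nothing had gone wrong.
      try {
        out.close();
      } catch (IOException suppressed) {
        // keep firstError as the primary failure
      }
      throw firstError;
    }
    super.close();
  }

  private void checkError() throws IOException {
    if (firstError != null) {
      throw firstError;
    }
  }
}
{code}
With something like this, the out.close() in the finally block above would 
rethrow the original write failure rather than silently going on to allocate 
and write another block.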


> dfs -put hangs
> --------------
>
>                 Key: HADOOP-2647
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2647
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: dfs
>    Affects Versions: 0.15.1
>         Environment: LINUX
>            Reporter: lohit vijayarenu
>
> We saw a case where dfs -put hung while copying a 2GB file for over 20 hours.
> When we took a look at the stack trace of the process, the main thread was 
> waiting for confirmation of complete status from the namenode.
> Only 4 blocks were copied and the 5th block was missing when we ran fsck on 
> the partially transferred file.
> There are two problems we saw here.
> 1. The DFS client hung without a timeout when there was no response from 
> the namenode.
> 2. In IOUtils::copyBytes(InputStream in, OutputStream out, int buffSize, 
> boolean close), if there is an exception during the copy, out.close() is 
> still called and the exception itself is not caught, which is why we see a 
> close call in the stack trace.
> When we checked block IDs in the namenode log, the block that was missing 
> had only one response to the namenode instead of three.
> This close state, coupled with the namenode not reporting the error back, 
> might have caused the whole process to hang.
> Opening this JIRA to see if we could add checks for the two problems 
> mentioned above.
> <stack trace of main thread>
> "main" prio=10 tid=0x0805a000 nid=0x5b53 waiting on condition [0xf7e64000..0xf7e65288]
>   java.lang.Thread.State: TIMED_WAITING (sleeping)
>   at java.lang.Thread.sleep(Native Method)
>   at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:1751)
>   - locked <0x77d593a0> (a org.apache.hadoop.dfs.DFSClient$DFSOutputStream)
>   at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:49)
>   at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:64)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:55)
>   at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:83)
>   at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:140)
>   at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:826)
>   at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:114)
>   at org.apache.hadoop.fs.FsShell.run(FsShell.java:1354)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
>   at org.apache.hadoop.fs.FsShell.main(FsShell.java:1472)
> </stack trace of main thread>
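
To make problem 2 from the description concrete, here is a minimal sketch of 
the copy-then-close pattern it describes (a hypothetical helper, not the 
actual Hadoop 0.15 IOUtils source): the copy loop does not catch the write 
failure, yet the finally block still calls close() on the same stream.
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopySketch {

  // Any exception from read()/write() propagates uncaught, but the finally
  // block still calls close() on the output stream first. With an output
  // stream that hangs in close() after a failed write, the copy never
  // returns, which matches the stack trace above.
  static void copyBytes(InputStream in, OutputStream out,
                        int buffSize, boolean close) throws IOException {
    byte[] buf = new byte[buffSize];
    try {
      int bytesRead = in.read(buf);
      while (bytesRead >= 0) {
        out.write(buf, 0, bytesRead);
        bytesRead = in.read(buf);
      }
    } finally {
      if (close) {
        out.close();   // <== called even after a failed write
        in.close();
      }
    }
  }
}
{code}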
