[ 
https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin Patrick McCabe updated HDFS-7915:
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.7.0
           Status: Resolved  (was: Patch Available)

Committed.  Thanks, guys.

I will file a follow-up to look into whether we can do more logging.  Note that in 
the specific case where we caught this bug (writeArray failing), we actually 
got as much logging as possible from the DataNode.  Everything we needed was 
logged there, including the failed domain socket I/O stack traces.  Similarly, 
I can't think of any DFSClient logs we needed and didn't get.  We got the 
domain socket I/O stack traces there as well.  What we don't know is why the 
write failed, but we logged as much information as the kernel gave us (it 
returned EAGAIN, which in this context means the write timed out).

In general, socket reads and writes can fail, and HDFS needs to be able to 
handle that.  The cause of the timeout in the case we saw is outside the scope 
of this JIRA.

> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell 
> the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-7915
>                 URL: https://issues.apache.org/jira/browse/HDFS-7915
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Colin Patrick McCabe
>            Assignee: Colin Patrick McCabe
>             Fix For: 2.7.0
>
>         Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch, 
> HDFS-7915.004.patch, HDFS-7915.005.patch, HDFS-7915.006.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell 
> the DFSClient about it because of a network error.  In 
> {{DataXceiver#requestShortCircuitFds}}, the DataNode can succeed at the first 
> part (mark the slot as used) and fail at the second part (tell the DFSClient 
> what it did). The "try" block for unregistering the slot only covers a 
> failure in the first part, not the second part. In this way, the DFSClient 
> and the DataNode can end up with diverging views of which slots are 
> allocated.
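
The failure mode described above can be illustrated with a minimal sketch. This is not the committed patch or the real DataXceiver code; SlotRegistry, ClientReply and requestSlot are hypothetical stand-ins, assumed here only to show the general idea of a cleanup path that covers both the slot registration and the reply to the client.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the DataNode's record of which slots are in use.
class SlotRegistry {
    private final Set<Integer> used = new HashSet<>();
    void register(int slotId) { used.add(slotId); }
    void unregister(int slotId) { used.remove(slotId); }
    boolean isUsed(int slotId) { return used.contains(slotId); }
}

// Hypothetical stand-in for the reply written to the DFSClient over the socket.
interface ClientReply {
    void send(int slotId) throws IOException;
}

class ShortCircuitSketch {
    // Returns true only if the slot was registered AND the client was told.
    static boolean requestSlot(SlotRegistry registry, ClientReply reply, int slotId) {
        boolean success = false;
        try {
            registry.register(slotId); // first part: mark the slot as used
            reply.send(slotId);        // second part: tell the client; may fail
            success = true;
            return true;
        } catch (IOException e) {
            return false;              // e.g. the socket write timed out
        } finally {
            // Covers a failure in either part, so both sides stay in agreement.
            if (!success) {
                registry.unregister(slotId);
            }
        }
    }
}
```

If the cleanup only guarded the register call, a failed send would leave the slot marked as used on the server while the client never learned of it, which is the divergence the description refers to.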



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
