[ https://issues.apache.org/jira/browse/HDFS-7915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Colin Patrick McCabe updated HDFS-7915:
---------------------------------------
Resolution: Fixed
Fix Version/s: 2.7.0
Status: Resolved (was: Patch Available)
Committed. Thanks, guys.
I will file a follow-up to look into whether we can add more logging. Note that in
the specific case where we caught this bug (writeArray failing), we actually
got as much logging as possible from the DataNode. Everything we needed was
logged there, including the failed domain socket I/O stack traces. Similarly,
I can't think of any DFSClient logs we needed and didn't get. We got the
domain socket I/O stack traces there as well. What we don't know is why the
write failed, but we logged as much information as the kernel gave us (it
returned EAGAIN, which in this case indicates a timeout).
In general, socket reads and writes can fail, and HDFS needs to be able to
handle that. The cause of the timeout in the case we saw is outside the scope
of this JIRA.
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell
> the DFSClient about it because of a network error
> -----------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-7915
> URL: https://issues.apache.org/jira/browse/HDFS-7915
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Colin Patrick McCabe
> Assignee: Colin Patrick McCabe
> Fix For: 2.7.0
>
> Attachments: HDFS-7915.001.patch, HDFS-7915.002.patch,
> HDFS-7915.004.patch, HDFS-7915.005.patch, HDFS-7915.006.patch
>
>
> The DataNode can sometimes allocate a ShortCircuitShm slot and fail to tell
> the DFSClient about it because of a network error. In
> {{DataXceiver#requestShortCircuitFds}}, the DataNode can succeed at the first
> part (mark the slot as used) and fail at the second part (tell the DFSClient
> what it did). The "try" block for unregistering the slot only covers a
> failure in the first part, not the second part. In this way, the DFSClient
> and the DataNode can end up with diverging views of which slots are
> allocated.
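
To illustrate the failure mode described in the issue, here is a minimal
sketch of the general pattern, not the committed patch. All names in it
(SlotRegistry, ResponseSender, requestSlot) are hypothetical stand-ins and
are not the real HDFS classes. The point is only that the cleanup has to
cover both parts: marking the slot as used and telling the client about it.

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-ins for illustration; these are not HDFS classes.
class SlotCleanupSketch {
    /** Tracks which slot ids the server side believes are in use. */
    static final class SlotRegistry {
        private final Set<Integer> used = new HashSet<>();
        void register(int slotId) { used.add(slotId); }
        void unregister(int slotId) { used.remove(slotId); }
        boolean isUsed(int slotId) { return used.contains(slotId); }
    }

    /** Models the channel back to the client; sending may fail. */
    interface ResponseSender {
        void send(int slotId) throws IOException;
    }

    /**
     * Marks the slot as used, then tells the client. If the second part
     * fails, the slot is unregistered so both sides keep the same view.
     */
    static boolean requestSlot(SlotRegistry registry, int slotId,
                               ResponseSender sender) {
        boolean success = false;
        registry.register(slotId);          // first part: mark the slot as used
        try {
            sender.send(slotId);            // second part: tell the client
            success = true;
        } catch (IOException e) {
            System.err.println("failed to send response for slot "
                + slotId + ": " + e.getMessage());
        } finally {
            if (!success) {
                registry.unregister(slotId); // undo the allocation on any failure
            }
        }
        return success;
    }

    public static void main(String[] args) {
        SlotRegistry registry = new SlotRegistry();
        boolean ok = requestSlot(registry, 1, id -> { });
        boolean failed = requestSlot(registry, 2, id -> {
            throw new IOException("simulated timeout");
        });
        System.out.println(ok + " " + registry.isUsed(1));     // true true
        System.out.println(failed + " " + registry.isUsed(2)); // false false
    }
}
```

In the sketch, the success flag is only set after the send completes, so a
failure in either part leaves the registry unchanged from the client's point
of view.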