[
https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504148#comment-14504148
]
Colin Patrick McCabe commented on HADOOP-11802:
-----------------------------------------------
Hi Eric,
Good catch. I think the issue here is that there is a lot of buffering in the
domain socket, so it's difficult to get the DataNode to fail when it writes to
the socket. In my experience, the write will succeed even when the other end
has already shut down the socket. The buffer size can be adjusted by
configuring SO_RCVBUF, but even the smallest value still buffers enough that
the unit test passes under every condition.

This buffering is not a problem in practice: in the event of a communication
failure, the client will close the socket, which triggers the DataNode to free
the resources. However, it does make unit testing by injecting faults on the
client side more difficult.

The solution to this problem is to inject the failure directly on the DataNode
side. The latest patch does this. I have confirmed that the new test fails
without the fix applied.
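
For readers unfamiliar with server-side fault injection, the idea can be
sketched roughly as follows. This is only an illustration and not the code in
the patch: the names FaultInjector, beforeSendResponse, and Server are
hypothetical. The production hook does nothing, and a test swaps in a subclass
that throws, so the failure happens on the server no matter how much the
socket buffers.

```java
import java.io.IOException;

public class FaultInjectionSketch {
    /** Hypothetical hook class; the default instance used in production does nothing. */
    public static class FaultInjector {
        private static volatile FaultInjector instance = new FaultInjector();

        public static FaultInjector get() { return instance; }

        public static void set(FaultInjector injector) { instance = injector; }

        /** Called by the server just before it would write its response. */
        public void beforeSendResponse() throws IOException { }
    }

    /** Hypothetical stand-in for the server-side (DataNode) request path. */
    public static class Server {
        public boolean resourcesReleased = false;

        /** Returns true if the response was "sent"; releases resources on failure. */
        public boolean handleRequest() {
            boolean success = false;
            try {
                FaultInjector.get().beforeSendResponse();
                // A real server would write its response to the socket here.
                success = true;
            } catch (IOException e) {
                // Treated the same way as a failed write: fall through to cleanup.
            } finally {
                if (!success) {
                    resourcesReleased = true;
                }
            }
            return success;
        }
    }

    public static void main(String[] args) {
        FaultInjector original = FaultInjector.get();
        // A test installs an injector that fails on the server side.
        FaultInjector.set(new FaultInjector() {
            @Override
            public void beforeSendResponse() throws IOException {
                throw new IOException("injected failure");
            }
        });
        try {
            Server server = new Server();
            boolean sent = server.handleRequest();
            System.out.println("sent=" + sent + " released=" + server.resourcesReleased);
        } finally {
            FaultInjector.set(original); // always restore the no-op injector
        }
    }
}
```

Because the hook runs inside the server's own error-handling path, the test
exercises the same cleanup code that a genuine write failure would reach.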
> DomainSocketWatcher thread terminates sometimes after there is an I/O error
> during requestShortCircuitShm
> ---------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-11802
> URL: https://issues.apache.org/jira/browse/HADOOP-11802
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Eric Payne
> Assignee: Colin Patrick McCabe
> Attachments: HADOOP-11802.001.patch, HADOOP-11802.002.patch
>
>
> In {{DataXceiver#requestShortCircuitShm}}, we attempt to recover from some
> errors by closing the {{DomainSocket}}. However, this violates the invariant
> that the domain socket should never be closed when it is being managed by the
> {{DomainSocketWatcher}}. Instead, we should call {{shutdown}} on the
> {{DomainSocket}}. When this bug hits, it terminates the
> {{DomainSocketWatcher}} thread.
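
The invariant described above can be illustrated with a deliberately
simplified model. This is not Hadoop's implementation: FakeSocket and
FakeWatcher are hypothetical stand-ins, and the assumption is only that the
watcher owns the final close of any socket it manages. A handler that calls
shutdown lets the watcher retire the socket cleanly; a handler that calls
close directly makes the watcher's next pass fail.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

public class WatcherInvariantSketch {
    /** Hypothetical stand-in for a domain socket. */
    public static class FakeSocket {
        private boolean open = true;
        private boolean shutdown = false;

        /** Stops further I/O but leaves the socket open for the watcher to close. */
        public void shutdown() { shutdown = true; }

        public void close() { open = false; }

        public boolean isOpen() { return open; }

        public boolean isShutdown() { return shutdown; }
    }

    /** Hypothetical watcher that owns the final close of every socket it manages. */
    public static class FakeWatcher {
        private final Set<FakeSocket> watched = new HashSet<>();

        public void add(FakeSocket socket) { watched.add(socket); }

        public int size() { return watched.size(); }

        /** One pass of the watcher loop. */
        public void poll() {
            Iterator<FakeSocket> it = watched.iterator();
            while (it.hasNext()) {
                FakeSocket socket = it.next();
                if (!socket.isOpen()) {
                    // Models the bug: someone closed a socket the watcher still manages.
                    throw new IllegalStateException("socket closed while still being watched");
                }
                if (socket.isShutdown()) {
                    it.remove();     // stop watching first,
                    socket.close();  // then close from the watcher itself
                }
            }
        }
    }

    public static void main(String[] args) {
        FakeWatcher watcher = new FakeWatcher();
        FakeSocket socket = new FakeSocket();
        watcher.add(socket);
        socket.shutdown();  // the safe way for an error handler to give up on the socket
        watcher.poll();
        System.out.println("open=" + socket.isOpen() + " watched=" + watcher.size());
    }
}
```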