[ 
https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504148#comment-14504148
 ] 

Colin Patrick McCabe commented on HADOOP-11802:
-----------------------------------------------

Hi Eric,

Good catch.  I think the issue here is that the domain socket does a lot of 
buffering, which makes it difficult to get the DataNode to fail when it writes 
to the socket.  In my experience, the write will succeed even when the other 
end has already shut down the socket.  The buffer size can be changed by 
setting SO_RCVBUF, but even the smallest value still buffers enough that the 
unit test passes regardless, with or without the fix.  This buffering is not a 
problem in practice, since in the event of a communication failure the client 
will close the socket, which triggers the DataNode to free its resources.  
However, it does make unit testing by injecting faults on the client side more 
difficult.

The solution to this problem is to inject the failure directly on the DataNode 
side.  The latest patch does this.  I have confirmed that the unit test fails 
without the fix applied.
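
To illustrate the general idea of server-side fault injection, here is a 
minimal, self-contained sketch.  It is not the code from the patch: 
ShmFaultInjector, FakeSocket, and ShmRequestHandler are hypothetical names made 
up for this example, and FakeSocket only records which calls were made.  The 
handler asks an injector hook before writing, and a test swaps in an injector 
that throws, so the error path runs no matter how much the socket buffers.

```java
import java.io.IOException;

// Hypothetical hook (not Hadoop's actual API). Production code uses the
// default no-op instance; a test installs a subclass that throws.
class ShmFaultInjector {
    private static volatile ShmFaultInjector instance = new ShmFaultInjector();

    static ShmFaultInjector get() { return instance; }
    static void set(ShmFaultInjector injector) { instance = injector; }

    // Called on the server side just before the response is written.
    void beforeSendShmResponse() throws IOException { }
}

// Hypothetical stand-in for a socket that a watcher thread manages.
// It only records whether shutdown() or close() was called.
class FakeSocket {
    boolean wasShutdown = false;
    boolean wasClosed = false;

    void shutdown() { wasShutdown = true; }
    void close() { wasClosed = true; }
}

class ShmRequestHandler {
    // Returns true if the response was "sent", false if an I/O error
    // was handled.
    static boolean handle(FakeSocket sock) {
        try {
            ShmFaultInjector.get().beforeSendShmResponse();
            // A real handler would write the response to the socket here.
            return true;
        } catch (IOException e) {
            // Shut the socket down rather than closing it, since the
            // watcher still owns it and is responsible for closing it.
            sock.shutdown();
            return false;
        }
    }
}

class FaultInjectionSketch {
    public static void main(String[] args) {
        ShmFaultInjector original = ShmFaultInjector.get();
        // Install an injector that always fails the write.
        ShmFaultInjector.set(new ShmFaultInjector() {
            @Override
            void beforeSendShmResponse() throws IOException {
                throw new IOException("injected failure");
            }
        });
        try {
            FakeSocket sock = new FakeSocket();
            boolean sent = ShmRequestHandler.handle(sock);
            System.out.println("sent=" + sent
                + " shutdown=" + sock.wasShutdown
                + " closed=" + sock.wasClosed);
        } finally {
            // Restore the default injector so other tests are unaffected.
            ShmFaultInjector.set(original);
        }
    }
}
```

A test written this way checks that the error path calls shutdown and never 
close.  If the handler called close instead, the assertion on the socket state 
would fail, which is the property a regression test for this bug needs.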

> DomainSocketWatcher thread terminates sometimes after there is an I/O error 
> during requestShortCircuitShm
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11802
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11802
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Eric Payne
>            Assignee: Colin Patrick McCabe
>         Attachments: HADOOP-11802.001.patch, HADOOP-11802.002.patch
>
>
> In {{DataXceiver#requestShortCircuitShm}}, we attempt to recover from some 
> errors by closing the {{DomainSocket}}.  However, this violates the invariant 
> that the domain socket should never be closed when it is being managed by the 
> {{DomainSocketWatcher}}.  Instead, we should call {{shutdown}} on the 
> {{DomainSocket}}.  When this bug hits, it terminates the 
> {{DomainSocketWatcher}} thread.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
