[ 
https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509297#comment-13509297
 ] 

Todd Lipcon commented on HADOOP-6762:
-------------------------------------

Also figured I'd write up a short summary of this, since the above discussion 
is long and somewhat hard to follow after 2.5 years and ~15 attachments :)

The issue at hand is what happens when an IPC caller thread (i.e., the user 
thread that is making an IPC call, for example to the NN) is interrupted while 
in the process of writing the call to the wire. Java NIO's semantics are that a 
ClosedByInterruptException is thrown on the blocked thread, _and also that the 
underlying channel is closed_. In the context of IPC, this meant that the 
caller thread would receive a ClosedByInterruptException, and that any other 
threads which were sharing the same IPC socket would then receive 
ClosedChannelExceptions, even though those other threads were never meant to be 
interrupted.
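
As a rough illustration of the NIO behavior described above (not code from the patch), the sketch below uses a {{java.nio.channels.Pipe}} as a stand-in for the shared IPC socket. The class and method names are hypothetical. The caller thread sets its own interrupt status before writing, which NIO treats the same as an interrupt arriving during a blocked write.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.Pipe;

// Hypothetical demo class; the Pipe stands in for a shared IPC socket.
public class InterruptClosesChannelDemo {

    /** Returns {callerSawClosedByInterrupt, channelStillOpen, otherUserSawClosedChannel}. */
    static boolean[] run() throws Exception {
        Pipe pipe = Pipe.open();
        Pipe.SinkChannel shared = pipe.sink();
        boolean[] result = new boolean[3];

        Thread caller = new Thread(() -> {
            // Simulate the user thread being interrupted around its write.
            Thread.currentThread().interrupt();
            try {
                shared.write(ByteBuffer.wrap(new byte[] {1}));
            } catch (ClosedByInterruptException e) {
                result[0] = true; // the interrupted thread gets this exception
            } catch (IOException e) {
                result[0] = false; // not expected in this sketch
            }
        });
        caller.start();
        caller.join();

        // The channel was closed as a side effect of the interrupt.
        result[1] = shared.isOpen();

        // A different, never-interrupted user of the same channel now fails.
        try {
            shared.write(ByteBuffer.wrap(new byte[] {2}));
        } catch (ClosedChannelException e) {
            result[2] = true;
        }
        pipe.source().close();
        return result;
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = run();
        System.out.println("caller saw ClosedByInterruptException: " + r[0]);
        System.out.println("channel still open: " + r[1]);
        System.out.println("other user saw ClosedChannelException: " + r[2]);
    }
}
```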

The solution is to change the call-sending code such that the actual write() 
call happens on a new thread, created by the {{SEND_PARAMS_EXECUTOR}} in the 
patch. Since the user code has no reference to this thread, it won't ever get 
interrupted, even if someone interrupts the user thread making the call. So, 
the user thread will receive an InterruptedException, but any other threads 
using the same socket continue to run unaffected.
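
A minimal sketch of that approach, assuming a cached pool of daemon threads and a caller that simply waits on a {{Future}}. The names {{SENDER}}, {{send}} and the demo class are hypothetical, and a {{Pipe}} again stands in for the IPC socket; this is not the code from the patch.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.WritableByteChannel;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class illustrating "write on a thread the user cannot interrupt".
public class SendOnExecutorDemo {
    // Stand-in for SEND_PARAMS_EXECUTOR: daemon threads that user code
    // never holds a reference to, so user code cannot interrupt them.
    private static final ExecutorService SENDER =
        Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "ipc-sender");
            t.setDaemon(true);
            return t;
        });

    /** Writes payload on a SENDER thread; the caller only waits on the Future. */
    static void send(WritableByteChannel ch, byte[] payload)
            throws InterruptedException, IOException {
        Callable<Void> task = () -> {
            ByteBuffer buf = ByteBuffer.wrap(payload);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            return null;
        };
        Future<Void> pending = SENDER.submit(task);
        try {
            pending.get(); // an interrupt surfaces here, not inside write()
        } catch (ExecutionException e) {
            throw new IOException(e.getCause());
        }
    }

    /** Returns {callerObservedInterrupt, channelStillOpen, bothBytesDelivered}. */
    static boolean[] run() throws Exception {
        Pipe pipe = Pipe.open();
        Pipe.SinkChannel shared = pipe.sink();
        boolean[] result = new boolean[3];

        Thread caller = new Thread(() -> {
            Thread.currentThread().interrupt();
            try {
                send(shared, new byte[] {1});
                // get() may return normally if the write had already finished;
                // in that case the interrupt flag is still set on this thread.
                result[0] = Thread.currentThread().isInterrupted();
            } catch (InterruptedException e) {
                result[0] = true;
            } catch (IOException e) {
                result[0] = false; // not expected in this sketch
            }
        });
        caller.start();
        caller.join();

        // A second user of the same channel is unaffected by the interrupt.
        send(shared, new byte[] {2});
        result[1] = shared.isOpen();

        // Both writes should arrive, including the interrupted caller's.
        ByteBuffer in = ByteBuffer.allocate(2);
        while (in.hasRemaining() && pipe.source().read(in) >= 0) {
            // keep reading until both bytes are in
        }
        result[2] = in.position() == 2;
        shared.close();
        pipe.source().close();
        return result;
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = run();
        System.out.println("caller observed interrupt: " + r[0]);
        System.out.println("channel still open: " + r[1]);
        System.out.println("both bytes delivered: " + r[2]);
    }
}
```

Note that in this sketch the interrupted caller's write is not cancelled: it still completes on the sender thread after the caller has returned.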
                
> exception while doing RPC I/O closes channel
> --------------------------------------------
>
>                 Key: HADOOP-6762
>                 URL: https://issues.apache.org/jira/browse/HADOOP-6762
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: sam rash
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-6762-10.txt, hadoop-6762-1.txt, 
> hadoop-6762-2.txt, hadoop-6762-3.txt, hadoop-6762-4.txt, hadoop-6762-6.txt, 
> hadoop-6762-7.txt, hadoop-6762-8.txt, hadoop-6762-9.txt, HADOOP-6762.patch, 
> hadoop-6762.txt, hadoop-6762.txt, hadoop-6762.txt
>
>
> If a single process creates two unique fileSystems to the same NN using 
> FileSystem.newInstance(), and one of them issues a close(), the leasechecker 
> thread is interrupted.  This interrupt races with the rpc namenode.renew() 
> and can cause a ClosedByInterruptException.  This closes the underlying 
> channel, and the other filesystem, which shares the connection, will get errors.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
