[
https://issues.apache.org/jira/browse/HADOOP-6762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13509297#comment-13509297
]
Todd Lipcon commented on HADOOP-6762:
-------------------------------------
Also figured I'd write up a short summary of this, since the above discussion
is long and somewhat hard to follow after 2.5 years and ~15 attachments :)
The issue at hand is what happens when an IPC caller thread (i.e. the user
thread that is making an IPC call, for example to the NN) is interrupted while
in the process of writing the call to the wire. Java NIO's semantics are that a
ClosedByInterruptException is thrown on the blocked thread, _and also that the
underlying channel is closed_. In the context of IPC, this meant that the
caller thread would receive a ClosedByInterruptException, and that any other
threads which were sharing the same IPC socket would then receive
ClosedChannelExceptions, even though those other threads were never meant to be
interrupted.
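
To make the NIO behavior concrete, here is a minimal, self-contained sketch. It is not Hadoop code: it uses a {{java.nio.channels.Pipe}} and a blocking read rather than an IPC socket and a write, on the assumption that any interruptible channel shows the same semantics. The class and method names are made up for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.Pipe;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: the names here are hypothetical, not from the Hadoop patch.
class InterruptClosesChannel {

    /**
     * Interrupts a thread doing a blocking read on a pipe and reports whether
     * the thread saw ClosedByInterruptException AND the channel ended up closed.
     */
    static boolean channelClosedByInterrupt() {
        try {
            Pipe pipe = Pipe.open();
            Pipe.SourceChannel source = pipe.source();
            AtomicReference<Throwable> seen = new AtomicReference<>();

            Thread reader = new Thread(() -> {
                try {
                    // Nothing is ever written to the pipe, so this read blocks.
                    source.read(ByteBuffer.allocate(16));
                } catch (IOException e) {
                    seen.set(e);
                }
            });
            reader.start();
            Thread.sleep(200);   // give the reader a moment to block
            reader.interrupt();  // interrupt the thread, not the channel
            reader.join();

            // The interrupt closed the channel for every thread that shares it.
            boolean closed = !source.isOpen();
            pipe.sink().close();
            return closed && seen.get() instanceof ClosedByInterruptException;
        } catch (IOException | InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("channel closed by interrupt: " + channelClosedByInterrupt());
    }
}
```

Any other thread that later touched {{source}} would fail with a ClosedChannelException, which is the collateral damage described above.
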
The solution is to change the call-sending code such that the actual write()
call happens on a new thread, created by the {{SEND_PARAMS_EXECUTOR}} in the
patch. Since the user code has no reference to this thread, it won't ever get
interrupted, even if someone interrupts the user thread making the call. So,
the user thread will receive an InterruptedException, but any other threads
using the same socket continue to run unaffected.
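
The shape of that approach can be sketched roughly as follows. This is not the code from the patch: {{SENDER}} is a hypothetical stand-in for {{SEND_PARAMS_EXECUTOR}}, {{send()}} and {{demo()}} are invented names, and a pipe stands in for the shared IPC socket.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.WritableByteChannel;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: a sketch of the idea, not the Hadoop implementation.
class SendOnExecutor {

    // Hypothetical stand-in for SEND_PARAMS_EXECUTOR; daemon threads so the JVM can exit.
    private static final ExecutorService SENDER = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "ipc-sender");
        t.setDaemon(true);
        return t;
    });

    /** Writes payload on an executor thread; the caller only waits on a Future. */
    static void send(WritableByteChannel ch, byte[] payload)
            throws IOException, InterruptedException {
        Callable<Void> task = () -> {
            ByteBuffer buf = ByteBuffer.wrap(payload);
            while (buf.hasRemaining()) {
                ch.write(buf);   // runs on a thread that user code cannot interrupt
            }
            return null;
        };
        Future<Void> f = SENDER.submit(task);
        try {
            f.get();             // an interrupt of the caller surfaces here
        } catch (ExecutionException e) {
            Throwable cause = e.getCause();
            if (cause instanceof IOException) {
                throw (IOException) cause;
            }
            throw new IOException(cause);
        }
    }

    /** Returns true if an interrupted caller gets InterruptedException while the channel stays open. */
    static boolean demo() {
        try {
            Pipe pipe = Pipe.open();
            // Far larger than a pipe buffer and nobody reads, so the write blocks.
            byte[] big = new byte[8 * 1024 * 1024];

            Thread.currentThread().interrupt();   // simulate someone interrupting the caller
            boolean interrupted = false;
            try {
                send(pipe.sink(), big);
            } catch (InterruptedException e) {
                interrupted = true;
            }
            boolean stillOpen = pipe.sink().isOpen();

            // Clean up; this also unblocks the sender thread with an I/O error.
            pipe.sink().close();
            pipe.source().close();
            return interrupted && stillOpen;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("caller interrupted, channel still open: " + demo());
    }
}
```

The design choice is that the interrupt is absorbed by {{Future.get()}}, which is not an NIO operation, so the channel is never touched by an interrupted thread.
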
> exception while doing RPC I/O closes channel
> --------------------------------------------
>
> Key: HADOOP-6762
> URL: https://issues.apache.org/jira/browse/HADOOP-6762
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 0.20.2
> Reporter: sam rash
> Assignee: Todd Lipcon
> Priority: Critical
> Attachments: hadoop-6762-10.txt, hadoop-6762-1.txt,
> hadoop-6762-2.txt, hadoop-6762-3.txt, hadoop-6762-4.txt, hadoop-6762-6.txt,
> hadoop-6762-7.txt, hadoop-6762-8.txt, hadoop-6762-9.txt, HADOOP-6762.patch,
> hadoop-6762.txt, hadoop-6762.txt, hadoop-6762.txt
>
>
> If a single process creates two unique fileSystems to the same NN using
> FileSystem.newInstance(), and one of them issues a close(), the leasechecker
> thread is interrupted. This interrupt races with the rpc namenode.renew()
> and can cause a ClosedByInterruptException. This closes the underlying
> channel, and the other filesystem, which shares the connection, will get errors.