[ https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Colin Patrick McCabe updated HADOOP-11802:
------------------------------------------
Description: In {{DataXceiver#requestShortCircuitShm}}, we attempt to
recover from some errors by closing the {{DomainSocket}}. However, this
violates the invariant that the domain socket should never be closed when it is
being managed by the {{DomainSocketWatcher}}. Instead, we should call
{{shutdown}} on the {{DomainSocket}}. When this bug hits, it terminates the
{{DomainSocketWatcher}} thread. (was: In the main finally block of the
{{DomainSocketWatcher#watcherThread}}, the call to {{sendCallback}} can
encounter an {{IllegalStateException}}, and leave some cleanup tasks undone.
{code}
} finally {
  lock.lock();
  try {
    kick(); // allow the handler for notificationSockets[0] to read a byte
    for (Entry entry : entries.values()) {
      // We do not remove from entries as we iterate, because that can
      // cause a ConcurrentModificationException.
      sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
    }
    entries.clear();
    fdSet.close();
  } finally {
    lock.unlock();
  }
}
{code}
The exception causes {{watcherThread}} to skip the calls to {{entries.clear()}}
and {{fdSet.close()}}.
{code}
2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
        at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
        at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
        at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
        at java.lang.Thread.run(Thread.java:722)
{code}
Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or
HADOOP-10404. The cluster installation is running code with all of these fixes.)
> DomainSocketWatcher thread terminates sometimes after there is an I/O error
> during requestShortCircuitShm
> ---------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-11802
> URL: https://issues.apache.org/jira/browse/HADOOP-11802
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Eric Payne
> Assignee: Eric Payne
>
> In {{DataXceiver#requestShortCircuitShm}}, we attempt to recover from some
> errors by closing the {{DomainSocket}}. However, this violates the invariant
> that the domain socket should never be closed when it is being managed by the
> {{DomainSocketWatcher}}. Instead, we should call {{shutdown}} on the
> {{DomainSocket}}. When this bug hits, it terminates the
> {{DomainSocketWatcher}} thread.
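
To illustrate the invariant described above, here is a minimal sketch in plain Java. It is a toy model, not the Hadoop code and not the patch for this issue: {{ToySocket}}, {{ToyWatcher}}, and {{recoverFromError}} are hypothetical names invented for the example, and the sockets are just flags rather than real file descriptors. The assumption it encodes is that a watcher owns the lifetime of every socket registered with it, so error handling elsewhere may shut a socket down but must leave the close to the watcher.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only: none of these classes are Hadoop's. */
final class WatchedSocketSketch {

  /** Hypothetical stand-in for a socket; tracks two flags instead of a real fd. */
  static final class ToySocket {
    final int fd;
    private boolean open = true;       // descriptor still valid
    private boolean shutDown = false;  // I/O disabled, descriptor still valid

    ToySocket(int fd) { this.fd = fd; }
    void shutdown() { shutDown = true; }
    void close() { open = false; }
    boolean isOpen() { return open; }
    boolean isShutDown() { return shutDown; }
  }

  /** Hypothetical watcher that owns the lifetime of every socket added to it. */
  static final class ToyWatcher {
    private final Map<Integer, ToySocket> entries = new LinkedHashMap<>();

    void add(ToySocket s) { entries.put(s.fd, s); }
    boolean isWatching(ToySocket s) { return entries.containsKey(s.fd); }

    /** One pass: fail on sockets closed behind our back, retire shut-down ones. */
    void poll() {
      Iterator<ToySocket> it = entries.values().iterator();
      while (it.hasNext()) {
        ToySocket s = it.next();
        if (!s.isOpen()) {
          throw new IllegalStateException("fd " + s.fd + " was closed while watched");
        }
        if (s.isShutDown()) {
          it.remove();  // the watcher drops the entry first...
          s.close();    // ...and only then closes the descriptor itself
        }
      }
    }
  }

  /** Error recovery that respects the invariant: shut down, never close. */
  static void recoverFromError(ToySocket s) {
    s.shutdown();
  }

  public static void main(String[] args) {
    ToyWatcher watcher = new ToyWatcher();
    ToySocket sock = new ToySocket(3);
    watcher.add(sock);
    recoverFromError(sock);
    watcher.poll();
    System.out.println("still watched: " + watcher.isWatching(sock));
  }
}
```

In this model, calling {{close()}} directly on a watched {{ToySocket}} makes the next {{poll()}} throw an {{IllegalStateException}}, which is the toy analogue of the watcher thread terminating. Going through {{recoverFromError}} lets the watcher retire the socket cleanly.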