[
https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486155#comment-14486155
]
Colin Patrick McCabe edited comment on HADOOP-11802 at 4/8/15 10:18 PM:
------------------------------------------------------------------------
I thought about this a little bit more, and I wonder whether this finally block
inside requestShortCircuitShm is causing a "double removal":
{code}
public void requestShortCircuitShm(String clientName) throws IOException {
  NewShmInfo shmInfo = null;
  boolean success = false;
  DomainSocket sock = peer.getDomainSocket();
  try {
    ...
  } finally {
    ...
    if ((!success) && (peer == null)) {
      // If we failed to pass the shared memory segment to the client,
      // close the UNIX domain socket now. This will trigger the
      // DomainSocketWatcher callback, cleaning up the segment.
      IOUtils.cleanup(null, sock);
    }
    IOUtils.cleanup(null, shmInfo);
  }
}
{code}
Closing the socket will remove that shmID, but so will closing the NewShmInfo
object... let me look into this.
[edit: NewShmInfo#close just closes the shared memory segment, but not the
domain socket. Since DomainSocketWatcher is watching the domain socket rather
than the shm fd, doing both close operations should not be a problem.]
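For illustration only, here is a minimal sketch of the failure mode that was suspected above: removing the same segment ID twice from a registry that treats a missing ID as an illegal state. The class and method names (DoubleRemovalSketch, Registry, register, secondRemovalThrows) are hypothetical stand-ins, not the real ShortCircuitRegistry code, and the check is only modeled on the "failed to remove" message in the stack trace. As the edit note says, the two close operations in the finally block do not both take this path.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a shm registry; NOT the real ShortCircuitRegistry.
class DoubleRemovalSketch {
  static class Registry {
    private final Map<String, Object> segments = new HashMap<>();

    void register(String shmId) {
      segments.put(shmId, new Object());
    }

    // Modeled on the "failed to remove" check seen in the stack trace:
    // removing an ID that is no longer registered is an illegal state.
    void removeShm(String shmId) {
      if (segments.remove(shmId) == null) {
        throw new IllegalStateException("failed to remove " + shmId);
      }
    }
  }

  // Returns true if a second removal of the same ID throws.
  static boolean secondRemovalThrows(String shmId) {
    Registry registry = new Registry();
    registry.register(shmId);
    registry.removeShm(shmId); // first cleanup path succeeds
    try {
      registry.removeShm(shmId); // second cleanup path, same ID
      return false;
    } catch (IllegalStateException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    System.out.println(secondRemovalThrows("shm-1"));
  }
}
```

Any two code paths that both end up calling a removal like this for the same ID would trip the check on the second call, which is why it matters which file descriptor the watcher is actually registered on.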
> DomainSocketWatcher#watcherThread can encounter IllegalStateException in
> finally block when calling sendCallback
> ----------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-11802
> URL: https://issues.apache.org/jira/browse/HADOOP-11802
> Project: Hadoop Common
> Issue Type: Bug
> Affects Versions: 2.7.0
> Reporter: Eric Payne
> Assignee: Eric Payne
>
> In the main finally block of the {{DomainSocketWatcher#watcherThread}}, the
> call to {{sendCallback}} can encounter an {{IllegalStateException}}, and
> leave some cleanup tasks undone.
> {code}
> } finally {
>   lock.lock();
>   try {
>     kick(); // allow the handler for notificationSockets[0] to read a byte
>     for (Entry entry : entries.values()) {
>       // We do not remove from entries as we iterate, because that can
>       // cause a ConcurrentModificationException.
>       sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>     }
>     entries.clear();
>     fdSet.close();
>   } finally {
>     lock.unlock();
>   }
> }
> {code}
> The exception causes {{watcherThread}} to skip the calls to
> {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
>     at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
>     at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
>     at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
>     at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
>     at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
>     at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
>     at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or
> HADOOP-10404. The cluster installation is running code with all of these
> fixes.
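To make the skipped-cleanup behavior in the description concrete, here is a small sketch with the same shape as the quoted finally block: callbacks run in a loop and the collection is cleared afterwards. All names (SkippedCleanupSketch, Callback, closeAllUnguarded, closeAllGuarded, sampleEntries) are hypothetical and this is not the DomainSocketWatcher code. The guarded variant is just one possible way to keep the cleanup running, not necessarily the fix adopted for this issue.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical names throughout; this is NOT the DomainSocketWatcher code.
class SkippedCleanupSketch {
  interface Callback {
    void handle();
  }

  // Same shape as the quoted finally block: run every callback, then clear.
  // An exception thrown inside the loop skips the clear() call.
  static void closeAllUnguarded(List<Callback> entries) {
    for (Callback entry : entries) {
      entry.handle();
    }
    entries.clear();
  }

  // Variant that catches per-callback failures so clear() always runs.
  // Returns the number of callbacks that threw.
  static int closeAllGuarded(List<Callback> entries) {
    int failures = 0;
    for (Callback entry : entries) {
      try {
        entry.handle();
      } catch (RuntimeException e) {
        failures++;
      }
    }
    entries.clear();
    return failures;
  }

  // Two entries: the first throws, the second succeeds.
  static List<Callback> sampleEntries() {
    List<Callback> entries = new ArrayList<>();
    entries.add(() -> { throw new IllegalStateException("failed to remove"); });
    entries.add(() -> { });
    return entries;
  }

  public static void main(String[] args) {
    List<Callback> first = sampleEntries();
    try {
      closeAllUnguarded(first);
    } catch (IllegalStateException e) {
      System.out.println("unguarded left " + first.size() + " entries");
    }
    List<Callback> second = sampleEntries();
    int failures = closeAllGuarded(second);
    System.out.println("guarded left " + second.size() + " entries, "
        + failures + " failure(s)");
  }
}
```

In the unguarded version the list still holds both entries after the exception, which corresponds to entries.clear() and fdSet.close() being skipped in the report.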