[ https://issues.apache.org/jira/browse/HADOOP-11802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493182#comment-14493182 ]

Eric Payne commented on HADOOP-11802:
-------------------------------------

Thanks again, [~cmccabe], for your comments and for taking the time to look at this issue.

One thing to note is that just prior to these problems, a 195-second GC was 
taking place on the DN.

I added a catch of {{Throwable}} in the main thread of the 
{{DomainSocketWatcher}} and reproduced the problem. AFAICT, the following 
represents what is happening:

- A request for a short-circuit read is received.
- {{DataXceiver#requestShortCircuitShm}} calls 
{{ShortCircuitRegistry#createNewMemorySegment}}, which creates a shared memory 
segment and associates it with the passed domain socket in the 
{{DomainSocketWatcher}}. Then, in that thread, {{createNewMemorySegment}} waits 
on that socket/shm entry in {{DomainSocketWatcher#add}}.
{code}
  public NewShmInfo createNewMemorySegment(String clientName,
...
    watcher.add(sock, shm);
...
{code}
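For context on what that wait looks like: from my reading of {{DomainSocketWatcher#add}}, the caller queues the new entry, kicks the watcher thread, and then blocks on a condition variable until the watcher thread has picked the entry up. This is a simplified paraphrase, not the exact source:
{code}
// Simplified sketch of DomainSocketWatcher#add (paraphrased, not verbatim).
public void add(DomainSocket sock, Handler handler) {
  lock.lock();
  try {
    Entry entry = new Entry(sock, handler);
    toAdd.add(entry);
    kick();  // wake the watcher thread by writing to notificationSockets[0]
    while (toAdd.contains(entry)) {
      try {
        // The calling thread (here, the DataXceiver) blocks until the
        // watcher thread moves the entry from toAdd into entries and
        // signals this condition.
        processedCond.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
  } finally {
    lock.unlock();
  }
}
{code}
So {{createNewMemorySegment}} should only return once the watcher thread has actually registered the fd, which is what makes the wake-up described next so odd.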
- It's at this point that things get confusing, and I'm still working out why 
this happens. The wait wakes up, but things are not normal; yet it wasn't woken 
up by an exception, either. You can tell that no exception was thrown inside 
{{createNewMemorySegment}} to wake it up, because the following code goes on to 
call {{sendShmSuccessResponse}}, which is where the next bad thing happens:
{code}
    public void requestShortCircuitShm(String clientName) throws IOException {
...
      try {
        shmInfo = datanode.shortCircuitRegistry.
            createNewMemorySegment(clientName, sock);
        // After calling #{ShortCircuitRegistry#createNewMemorySegment}, the
        // socket is managed by the DomainSocketWatcher, not the DataXceiver.
        releaseSocket();
      } catch (UnsupportedOperationException e) {
        sendShmErrorResponse(ERROR_UNSUPPORTED, 
            "This datanode has not been configured to support " +
            "short-circuit shared memory segments.");
        return;
      } catch (IOException e) {
        sendShmErrorResponse(ERROR,
            "Failed to create shared file descriptor: " + e.getMessage());
        return;
      }
      sendShmSuccessResponse(sock, shmInfo);
...
{code}
- At this point, the call to {{sendShmSuccessResponse}} gets an exception:
{noformat}
2015-04-04 13:12:30,973 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]]
      INFO DataNode.clienttrace: cliID: DFSClient_attempt_1427231924849_569269_m_002116_0_-161414780_1,
      src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a,
      srvID: a2d3bac0-e98b-4b73-a5a1-82c7eb557a7a, success: false
2015-04-04 13:12:30,984 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]]
      ERROR datanode.DataNode: host.domain.com:1004:DataXceiver error processing
      REQUEST_SHORT_CIRCUIT_SHM operation  src: unix:/home/gs/var/run/hdfs/dn_socket dst: <local>
java.net.SocketException: write(2) error: Broken pipe
        at org.apache.hadoop.net.unix.DomainSocket.writeArray0(Native Method)
        at org.apache.hadoop.net.unix.DomainSocket.access$300(DomainSocket.java:45)
        at org.apache.hadoop.net.unix.DomainSocket$DomainOutputStream.write(DomainSocket.java:601)
        at com.google.protobuf.CodedOutputStream.refreshBuffer(CodedOutputStream.java:833)
        at com.google.protobuf.CodedOutputStream.flush(CodedOutputStream.java:843)
        at com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:91)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.sendShmSuccessResponse(DataXceiver.java:380)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.requestShortCircuitShm(DataXceiver.java:418)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opRequestShortCircuitShm(Receiver.java:214)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:95)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:722)
{noformat}
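My working theory on the broken pipe: the DN was stalled in the 195-second GC pause, so the client gave up waiting for the shm response and closed its end of the domain socket; when the DN finally got around to writing the success response, write(2) failed with EPIPE. Writing to a unix domain socket whose peer has already closed fails exactly this way. Here is a standalone illustration of that behavior (JDK 16+; plain JDK unix-socket channels, not the Hadoop {{DomainSocket}} code):
{code}
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class BrokenPipeDemo {
  public static void main(String[] args) throws Exception {
    Path path = Path.of("/tmp/broken-pipe-demo.sock");
    Files.deleteIfExists(path);
    UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(path);
    try (ServerSocketChannel server =
             ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
      server.bind(addr);
      SocketChannel client = SocketChannel.open(addr);
      try (SocketChannel toClient = server.accept()) {
        client.close();     // client times out and hangs up, like the DFSClient
        Thread.sleep(100);  // give the close a moment to take effect
        // The "server" (our DN stand-in) now sends its late response.
        toClient.write(ByteBuffer.wrap(new byte[] {1}));  // EPIPE expected here
        toClient.write(ByteBuffer.wrap(new byte[] {1}));  // (second write, in
                                                          // case the first was
                                                          // buffered)
      } catch (IOException e) {
        System.out.println("write failed: " + e);  // "Broken pipe"
      } finally {
        Files.deleteIfExists(path);
      }
    }
  }
}
{code}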
- The exception bubbles back up to {{DataXceiver#requestShortCircuitShm}}, 
which cleans up by closing the socket:
{code}
...
      if ((!success) && (peer == null)) {
        // If we failed to pass the shared memory segment to the client,
        // close the UNIX domain socket now.  This will trigger the 
        // DomainSocketWatcher callback, cleaning up the segment.
        IOUtils.cleanup(null, sock);
      }
{code}
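For reference, the check that is about to blow up lives in {{DomainSocketWatcher#sendCallback}}. Paraphrasing from the error message below and my reading of the code (treat the exact shape as an assumption on my part), it is a Guava precondition along these lines:
{code}
// Paraphrase, not verbatim: sendCallback assumes that only the watcher
// thread ever closes a socket that is still registered in fdSet, so an fd
// that was closed behind its back trips the precondition.
Preconditions.checkArgument(!entry.getDomainSocket().isClosed(),
    this + ": file descriptor " + fd +
    " was closed while still in the poll(2) loop.");
{code}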
- Then the main {{DomainSocketWatcher}} thread wakes up (after the regular 
timeout interval has expired) and tries to call {{sendCallbackAndRemove}}. 
Because the socket's fd was closed by the {{DataXceiver}} cleanup above while 
it was still registered in the poll(2) loop, this encounters the following 
{{IllegalArgumentException}}:
{code}
  final Thread watcherThread = new Thread(new Runnable() {
...
        while (true) {
          lock.lock();
          try {
            for (int fd : fdSet.getAndClearReadableFds()) {
              sendCallbackAndRemove("getAndClearReadableFds", entries, fdSet,
                  fd);
            }
...
{code}
{noformat}
ERROR unix.DomainSocketWatcher: org.apache.hadoop.net.unix.DomainSocketWatcher$2@76845081
      terminating on Throwable
java.lang.IllegalArgumentException: DomainSocketWatcher(103231254): file descriptor 249 was closed while still in the poll(2) loop.
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:421)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallbackAndRemove(DomainSocketWatcher.java:448)
        at org.apache.hadoop.net.unix.DomainSocketWatcher.access$500(DomainSocketWatcher.java:52)
        at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:470)
        at java.lang.Thread.run(Thread.java:745)
{noformat}
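One direction I am considering as a fix (a sketch only, not a tested patch): isolate each callback in the poll loop so that a single fd that was closed out from under the watcher cannot terminate the whole thread and skip the remaining cleanup. Roughly:
{code}
// Sketch only: per-fd isolation inside the watcher loop.
private void processReadableFds() {
  lock.lock();
  try {
    for (int fd : fdSet.getAndClearReadableFds()) {
      try {
        sendCallbackAndRemove("getAndClearReadableFds", entries, fdSet, fd);
      } catch (RuntimeException e) {
        // e.g. the IllegalArgumentException above, when the fd was closed
        // behind the watcher's back. Log it and keep servicing the other
        // fds instead of letting watcherThread die.
        LOG.error("error handling fd " + fd + "; continuing", e);
      }
    }
  } finally {
    lock.unlock();
  }
}
{code}
This would not explain the underlying close-ordering race, but it would keep the watcher thread alive so that {{entries.clear()}} and {{fdSet.close()}} still run.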

I welcome your suggestions and look forward to your feedback.
Thanks,
Eric

> DomainSocketWatcher#watcherThread can encounter IllegalStateException in finally block when calling sendCallback
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-11802
>                 URL: https://issues.apache.org/jira/browse/HADOOP-11802
>             Project: Hadoop Common
>          Issue Type: Bug
>    Affects Versions: 2.7.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>
> In the main finally block of the {{DomainSocketWatcher#watcherThread}}, the 
> call to {{sendCallback}} can encounter an {{IllegalStateException}}, and 
> leave some cleanup tasks undone.
> {code}
>       } finally {
>         lock.lock();
>         try {
>           kick(); // allow the handler for notificationSockets[0] to read a byte
>           for (Entry entry : entries.values()) {
>             // We do not remove from entries as we iterate, because that can
>             // cause a ConcurrentModificationException.
>             sendCallback("close", entries, fdSet, entry.getDomainSocket().fd);
>           }
>           entries.clear();
>           fdSet.close();
>         } finally {
>           lock.unlock();
>         }
>       }
> {code}
> The exception causes {{watcherThread}} to skip the calls to 
> {{entries.clear()}} and {{fdSet.close()}}.
> {code}
> 2015-04-02 11:48:09,941 [DataXceiver for client unix:/home/gs/var/run/hdfs/dn_socket [Waiting for operation #1]] INFO DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_-807148576_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: n/a, srvID: e6b6cdd7-1bf8-415f-a412-32d8493554df, success: false
> 2015-04-02 11:48:09,941 [Thread-14] ERROR unix.DomainSocketWatcher: Thread[Thread-14,5,main] terminating on unexpected exception
> java.lang.IllegalStateException: failed to remove b845649551b6b1eab5c17f630e42489d
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:145)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry.removeShm(ShortCircuitRegistry.java:119)
>         at org.apache.hadoop.hdfs.server.datanode.ShortCircuitRegistry$RegisteredShm.handle(ShortCircuitRegistry.java:102)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.sendCallback(DomainSocketWatcher.java:402)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.access$1100(DomainSocketWatcher.java:52)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:522)
>         at java.lang.Thread.run(Thread.java:722)
> {code}
> Please note that this is not a duplicate of HADOOP-11333, HADOOP-11604, or 
> HADOOP-10404. The cluster installation is running code with all of these 
> fixes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
