[ https://issues.apache.org/jira/browse/HADOOP-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16540322#comment-16540322 ]

Yongjun Zhang commented on HADOOP-15538:
----------------------------------------

Thanks a lot [~mi...@cloudera.com]. Good inputs!

The Java lock of interest here is a Java object (stateLock in 
SocketChannelImpl), used in a synchronized block; an example is below. Assume 
the unknown thread is an external thread in the C++ world that reached Java via 
JNI (which should be the case for us, since the process is impalad).

The lock acquisition/release looks like the following:
{code:java}
    private void ensureWriteOpen() throws ClosedChannelException {
        synchronized (stateLock) {
            if (!isOpen())
                throw new ClosedChannelException();
            if (!isOutputOpen)
                throw new ClosedChannelException();
            if (!isConnected())
                throw new NotYetConnectedException();
        }
    }
{code}
So the speculated scenario is: an external thread unknown to Java acquired the 
lock upon entering the synchronized block, then disappeared abnormally without 
releasing the lock.

Under Java's synchronized-block semantics, even when the thread's execution 
leaves the block, with or without an exception being thrown inside it, the lock 
is released before control gets out of the block.
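
To make this concrete, here is a minimal, self-contained sketch (my own example, not the actual SocketChannelImpl code): even when the owning thread throws inside the synchronized block, the monitor is released as the exception unwinds, and another thread can take it right away.
{code:java}
public class MonitorReleaseDemo {
    private static final Object stateLock = new Object();

    public static void main(String[] args) throws InterruptedException {
        Thread t = new Thread(() -> {
            synchronized (stateLock) {
                // The exception is thrown while holding the monitor...
                throw new IllegalStateException("boom");
            }
        });
        t.setUncaughtExceptionHandler((thread, e) -> { /* ignore for the demo */ });
        t.start();
        t.join();

        // ...yet the monitor is released as the stack unwinds out of the
        // synchronized block, so this acquisition succeeds immediately.
        synchronized (stateLock) {
            System.out.println("lock acquired after the other thread died");
        }
    }
}
{code}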

However, the lock was not released in our case. This means the thread's 
execution was essentially terminated inside the synchronized block, without 
handling any error and without ever leaving the block, or the thread is 
"frozen" inside the synchronized block. A memory stomp might be a reason, but I 
wonder how.
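
For comparison, the closest pure-Java analogue of the symptom is an owner thread that stays parked inside the block forever. A hypothetical sketch (the class and thread names are made up; in our case the owner is a native thread invisible to Java, which is why jstack prints UNKNOWN_owner_addr instead of a thread name):
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.LockSupport;

public class FrozenOwnerDemo {
    private static final Object stateLock = new Object();

    // Stand-in for SocketChannelImpl.ensureWriteOpen(): briefly takes stateLock.
    static void ensureWriteOpen() {
        synchronized (stateLock) {
            // state checks would go here
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch acquired = new CountDownLatch(1);
        Thread owner = new Thread(() -> {
            synchronized (stateLock) {
                acquired.countDown();
                while (true) {
                    LockSupport.park(); // "frozen" while holding the monitor
                }
            }
        }, "frozen-owner");
        owner.start();
        acquired.await();

        // Hangs by design: a jstack taken here shows this thread
        // "waiting to lock" the monitor held by "frozen-owner".
        ensureWriteOpen();
    }
}
{code}
In the real jstacks, though, the owner is not any Java thread at all, which is what makes the report so odd.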

Discussions in [https://bugs.openjdk.java.net/browse/JDK-8007476] stated:
{quote}The VM should never encounter a monitor with an owner thread that 
doesn't exist. Seems to me this may be a symptom rather than the actual problem.

Looking at the NFC.c code it seems to me that the code is fragile because it 
has sequences of the form:

clazz = (*env)->FindClass(env, "cvm/jit/share/TestException");
 mid = (*env)->GetMethodID(env, clazz, "<init>", "()V");
 throwObj = (jthrowable) (*env)->NewObject(env, clazz, mid);

but there is no checking to see if any exceptions occurred. Not saying that is 
the case here but if for some reason there is a problem who knows what the 
failure mode might be
{quote}
and
{quote}My point is that the reason for the assert condition not holding (the 
owner of a monitor apparently not being in the Threadslist) may not be due to 
any inherent bug in deadlock detection or monitor management but may due to 
some other problem induced by the testcase - eg a memory stomp due to native 
code failing to check for errors after invoking JNI functions.
{quote}

> Possible RPC deadlock in Client
> -------------------------------
>
>                 Key: HADOOP-15538
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15538
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>            Priority: Major
>         Attachments: t1+13min.jstack, t1.jstack
>
>
> We have a jstack collection that spans 13 minutes, with one frame per ~1.5 
> minutes. For each of the frames, I observed the following:
> {code:java}
> Found one Java-level deadlock:
> =============================
> "IPC Parameter Sending Thread #294":
>   waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a 
> java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x00007f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Parameter Sending Thread #294":
>         at 
> sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268)
>         - waiting to lock <0x0000000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>         - locked <0x0000000621745380> (a java.lang.Object)
>         at 
> org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at 
> org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
>         at 
> java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         - locked <0x0000000621749850> (a java.io.BufferedOutputStream)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>         at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1072)
>         - locked <0x000000062174b878> (a java.io.DataOutputStream)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Found one Java-level deadlock:
> =============================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>   waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a 
> java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x00007f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>         at 
> sun.nio.ch.SocketChannelImpl.readerCleanup(SocketChannelImpl.java:279)
>         - waiting to lock <0x0000000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:390)
>         - locked <0x0000000621745370> (a java.lang.Object)
>         at 
> org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at 
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:553)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         - locked <0x00000006217476f0> (a java.io.BufferedInputStream)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006)
> Found 2 deadlocks.
> {code}
> This happens with jdk1.8.0_162 on 2.6.32-696.18.7.el6.x86_64.
> The code appears to match 
> [https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/tree/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java].
> The first thread is blocked at:
> [https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=268]
> The second thread is blocked at:
>  
> [https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=279]
> There are two issues here:
>  # There seems to be a real deadlock, because the stacks remain the same even 
> though the first and last jstack frames captured are 13 minutes apart.
>  # The Java deadlock report seems to be problematic: two threads in a 
> deadlock should not be blocked on the same lock, but they appear to be in 
> this case, namely the same SocketChannelImpl's stateLock.
> I found a relevant JDK JIRA, 
> [https://bugs.openjdk.java.net/browse/JDK-8007476]; it describes a case where 
> two deadlocks are reported but they are really the same deadlock.
> I don't see a similar report about this issue in the JDK JIRA database, and 
> I'm thinking about filing a JDK JIRA for it, but would like to have some 
> discussion here first.
> Issue #1 is important because the client is hanging, which indicates a real 
> problem; however, until issue #2 is fixed and the report is correct, it's not 
> clear what the deadlock really looks like.
> Thanks.


