[ 
https://issues.apache.org/jira/browse/HADOOP-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16538911#comment-16538911
 ] 

Misha Dmitriev commented on HADOOP-15538:
-----------------------------------------

So, as far as I understand, we have two threads that try to grab the same lock 
(which is an instance of java.lang.Object referenced by 
SocketChannelImpl.stateLock). When the JVM's deadlock detection mechanism tries 
to find out who currently holds this lock, it cannot find any Java thread 
responsible for it. Such a situation is considered a variation of a deadlock, 
though it's not a classical one with two threads and two locks. Rather, it's a 
single lock, but it can never be grabbed by the waiting threads, because the 
only thread that can unlock it somehow disappeared. Note that the JVM's 
message, as well as the comments in the JVM code, are somewhat cryptic, and it 
took me some head-scratching and guessing before I understood what they are 
trying to say.

I don't think this is a case where some normal Java thread threw an exception 
and exited, but didn't clean up one of the locks that it was holding. At least 
I've never seen such a situation in the past. Probably such a bug would be 
relatively easy to reproduce, and thus would have been fixed long ago. So I 
think we have something really non-standard in play here, and therefore the 
following exotic scenarios are more likely:
 # The lock is still being held by some thread that the JVM doesn't know about, 
e.g. one started from native code.
 # The thread that was holding the lock exited in some non-standard, 
non-graceful way, perhaps because of a failure in native code. I am not sure 
what happens in such a case, and my theory is that if the thread is terminated 
by the OS and the JVM doesn't have a chance to interfere, all the Java locks 
that such a thread holds won't be unlocked. So we really have an "orphaned" 
lock.
 # Native code generally doing something bad. Note that in the JDK bug 
mentioned above, one of the JDK guys gave the following possible reason for 
running into this condition: "My point is that the reason for the assert 
condition not holding (the owner of a monitor apparently not being in the 
Threadslist) may not be due to any inherent bug in deadlock detection or 
monitor management but may due to some other problem induced by the testcase - 
e.g. a memory stomp due to native code failing to check for errors after 
invoking JNI functions."

The bottom line is that I think we should really look for native code in the 
app that runs this HDFS client, and then check whether that native code is 
doing something unusual.
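
On Linux, a quick first pass at that is simply listing the native libraries 
mapped into the client process, e.g. by reading /proc/<pid>/maps (or 
/proc/self/maps from inside the process) and looking for anything beyond the 
usual JDK and Hadoop native libraries. A rough sketch, assuming a Linux host 
(which matches the el6 kernel mentioned in the description); the class name is 
made up:

{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Rough sketch: list the .so files mapped into the current JVM process,
// as a first step toward finding third-party native code in the client.
public class ListNativeLibs {
    public static void main(String[] args) throws IOException {
        Files.lines(Paths.get("/proc/self/maps"))
             // the last whitespace-separated column is the mapped path
             .map(line -> line.substring(line.lastIndexOf(' ') + 1))
             .filter(path -> path.contains(".so"))
             .distinct()
             .sorted()
             .forEach(System.out::println);
    }
}
{code}

The same information can be pulled from outside the process with "pmap <pid>" 
or by reading /proc/<pid>/maps for the process that hosts this HDFS client.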

> Possible RPC deadlock in Client
> -------------------------------
>
>                 Key: HADOOP-15538
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15538
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>            Reporter: Yongjun Zhang
>            Assignee: Yongjun Zhang
>            Priority: Major
>         Attachments: t1+13min.jstack, t1.jstack
>
>
> We have a jstack collection that spans 13 minutes, with one frame per ~1.5 
> minutes. For each of the frames, I observed the following:
> {code:java}
> Found one Java-level deadlock:
> =============================
> "IPC Parameter Sending Thread #294":
>   waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x00007f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Parameter Sending Thread #294":
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268)
>         - waiting to lock <0x0000000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>         - locked <0x0000000621745380> (a java.lang.Object)
>         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         - locked <0x0000000621749850> (a java.io.BufferedOutputStream)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>         at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1072)
>         - locked <0x000000062174b878> (a java.io.DataOutputStream)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Found one Java-level deadlock:
> =============================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>   waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x00007f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>         at sun.nio.ch.SocketChannelImpl.readerCleanup(SocketChannelImpl.java:279)
>         - waiting to lock <0x0000000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:390)
>         - locked <0x0000000621745370> (a java.lang.Object)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:553)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         - locked <0x00000006217476f0> (a java.io.BufferedInputStream)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006)
> Found 2 deadlocks.
> {code}
> This happens with jdk1.8.0_162 on 2.6.32-696.18.7.el6.x86_64.
> The code appears to match 
> [https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/tree/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java].
> The first thread is blocked at:
> [https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=268]
> The second thread is blocked at:
>  
> [https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=279]
> There are two issues here:
>  # There seems to be a real deadlock, because the stacks remain the same even 
> though the first and last jstack frames captured are 13 minutes apart.
>  # The Java deadlock report seems to be problematic: two deadlocked threads 
> should not be blocked on the same lock, but they appear to be in this case, 
> namely the same SocketChannelImpl stateLock.
> I found a relevant JDK JIRA, 
> [https://bugs.openjdk.java.net/browse/JDK-8007476]; it explains a case where 
> two deadlocks are reported but they are really the same deadlock.
> I don't see a similar report about this issue in the JDK JIRA database, and 
> I'm thinking about filing a JDK JIRA for it, but would like to have some 
> discussion here first.
> Issue #1 is important, because the client is hanging, which indicates a real 
> problem; however, until issue #2 is fixed and the report is correct, it's not 
> clear what the deadlock really looks like.
> Thanks.



