[
https://issues.apache.org/jira/browse/HADOOP-15538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513183#comment-16513183
]
Yongjun Zhang edited comment on HADOOP-15538 at 6/15/18 12:45 AM:
------------------------------------------------------------------
Thanks a lot [~daryn] and [~billyean].
Please see the two attached jstack files: one taken at the beginning (t1.jstack),
the other 13 minutes later (t1+13min.jstack).
My read is that JDK-8007476 fixed a JDK internal error that crashed the jstack
dump itself: it introduced "which is held by UNKNOWN_owner_addr=" so that the
dump prints a reasonable value for the other party holding the lock and can
report the pair of deadlocked threads instead of crashing. In the example
provided there, the two threads point to each other because they are the two
threads involved in the same deadlock.
I think the Java version here has the JDK-8007476 fix, which is why we also see
"which is held by UNKNOWN_owner_addr=" reported; however, the other party of
the deadlock is not reported here.
Problems here:
# Did the deadlock finder miss some threads involved in the deadlock?
# If it did not, why are the two threads listed blocked on the same lock
(stateLock)? And why do both threads point to the same culprit address reported
as "which is held by UNKNOWN_owner_addr="?
# Who actually holds the stateLock? (See the sketch right after this list for
one way to probe this at runtime.)
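One way I'm thinking of probing question 3 the next time it happens (just a
rough, untested sketch; it would have to run inside the affected JVM, e.g. from
a small diagnostic thread or javaagent we add for debugging) is to ask the same
detector jstack uses through ThreadMXBean; getThreadInfo() with
lockedMonitors=true may report the owning thread by name, or at least confirm
that the JVM cannot map the owner to a Java thread:
{code}
import java.lang.management.ManagementFactory;
import java.lang.management.MonitorInfo;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper; must be invoked from inside the affected JVM
// (for example from a diagnostic thread or javaagent added for debugging).
public final class DeadlockProbe {
    public static void dumpDeadlockInfo() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Same Java-level deadlock detector that jstack runs.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            System.out.println("No deadlocked threads reported");
            return;
        }
        // lockedMonitors=true / lockedSynchronizers=true so we can see which
        // thread owns each contended monitor and in which frame it was taken.
        for (ThreadInfo ti : mx.getThreadInfo(ids, true, true)) {
            System.out.printf("%s is blocked on %s owned by %s (tid=%d)%n",
                ti.getThreadName(), ti.getLockName(),
                ti.getLockOwnerName(), ti.getLockOwnerId());
            for (MonitorInfo mi : ti.getLockedMonitors()) {
                System.out.printf("  holds %s, locked at %s%n",
                    mi, mi.getLockedStackFrame());
            }
        }
    }
}
{code}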
Haibo's suggestion of getting a heap dump seems like a good direction to dive
into. Unfortunately the problem is intermittent and it takes time to see it
again, but I will look into it.
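Since it is hard to catch by hand, one option I'm considering (again only a
sketch; the dump path and polling interval below are placeholders) is a small
watchdog thread inside the client JVM that polls the deadlock detector and
triggers a heap dump through HotSpotDiagnosticMXBean as soon as it fires:
{code}
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

// Hypothetical watchdog; would be started as a daemon thread in the client JVM,
// e.g. new Thread(new DeadlockHeapDumpWatchdog(), "deadlock-watchdog").start()
public class DeadlockHeapDumpWatchdog implements Runnable {
    private final ThreadMXBean threads = ManagementFactory.getThreadMXBean();

    @Override
    public void run() {
        try {
            // Poll the same detector jstack uses; act once, then exit.
            while (threads.findDeadlockedThreads() == null) {
                Thread.sleep(60_000L);  // polling interval is a placeholder
            }
            HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
            // live=false keeps unreachable objects too, in case they matter here.
            diag.dumpHeap("/tmp/hadoop-client-deadlock.hprof", false);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
{code}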
If any of you have more comments or insights to share, I would really
appreciate it.
Thanks.
> Possible deadlock in Client
> ---------------------------
>
> Key: HADOOP-15538
> URL: https://issues.apache.org/jira/browse/HADOOP-15538
> Project: Hadoop Common
> Issue Type: Bug
> Components: common
> Reporter: Yongjun Zhang
> Priority: Major
> Attachments: t1+13min.jstack, t1.jstack
>
>
> We have a jstack collection that spans 13 minutes, one frame per ~1.5
> minutes, and in each frame I observed the following:
> {code}
> Found one Java-level deadlock:
> =============================
> "IPC Parameter Sending Thread #294":
>   waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x00007f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Parameter Sending Thread #294":
>         at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:268)
>         - waiting to lock <0x0000000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
>         - locked <0x0000000621745380> (a java.lang.Object)
>         at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
>         at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
>         at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>         at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>         - locked <0x0000000621749850> (a java.io.BufferedOutputStream)
>         at java.io.DataOutputStream.flush(DataOutputStream.java:123)
>         at org.apache.hadoop.ipc.Client$Connection$3.run(Client.java:1072)
>         - locked <0x000000062174b878> (a java.io.DataOutputStream)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Found one Java-level deadlock:
> =============================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>   waiting to lock monitor 0x00007f68f21f3188 (object 0x0000000621745390, a java.lang.Object),
>   which is held by UNKNOWN_owner_addr=0x00007f68332e2800
> Java stack information for the threads listed above:
> ===================================================
> "IPC Client (297602875) connection to x.y.z.p:8020 from impala":
>         at sun.nio.ch.SocketChannelImpl.readerCleanup(SocketChannelImpl.java:279)
>         - waiting to lock <0x0000000621745390> (a java.lang.Object)
>         at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:390)
>         - locked <0x0000000621745370> (a java.lang.Object)
>         at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:553)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         - locked <0x00000006217476f0> (a java.io.BufferedInputStream)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1113)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1006)
> Found 2 deadlocks.
> {code}
> This happens with jdk1.8.0_162 on 2.6.32-696.18.7.el6.x86_64.
> The code appears to match
> https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/tree/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java.
> The first thread is blocked at:
> https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=268
> The second thread is blocked at:
> https://insight.io/github.com/AdoptOpenJDK/openjdk-jdk8u/blob/dev/jdk/src/share/classes/sun/nio/ch/SocketChannelImpl.java?line=279
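> To make the lock nesting easier to see, here is a much-simplified illustration
> of the relevant structure in that file (not the real JDK code; the field names
> follow the JDK 8u source linked above, and the comments refer to the monitor
> addresses in the jstack output):
> {code}
> // Much-simplified illustration of JDK 8 SocketChannelImpl locking, matching the stacks above
> class SocketChannelImplSketch {
>     private final Object readLock  = new Object();  // <0x...745370> in the jstack
>     private final Object writeLock = new Object();  // <0x...745380> in the jstack
>     private final Object stateLock = new Object();  // <0x...745390>, the contended monitor
>
>     int read(java.nio.ByteBuffer buf) {
>         synchronized (readLock) {           // held by the "IPC Client ..." thread
>             try {
>                 return 0; /* ... actual socket read ... */
>             } finally {
>                 synchronized (stateLock) {  // readerCleanup(), line ~279: where the reader waits
>                     /* clear reader thread, handle a pending close */
>                 }
>             }
>         }
>     }
>
>     int write(java.nio.ByteBuffer buf) {
>         synchronized (writeLock) {          // held by "IPC Parameter Sending Thread #294"
>             synchronized (stateLock) {      // ensureWriteOpen(), line ~268: where the writer waits
>                 /* check the channel is still open */
>             }
>             return 0; /* ... actual socket write ... */
>         }
>     }
> }
> {code}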
> There are two issues here:
> # There seems to be a real deadlock, because the stacks remain the same even
> though the first and last jstack frames captured are 13 minutes apart.
> # The Java deadlock report seems to be problematic: two threads that are
> deadlocked with each other should not be blocked on the same lock, but they
> appear to be in this case, namely the same SocketChannelImpl stateLock.
> I found a relevant JDK JIRA, https://bugs.openjdk.java.net/browse/JDK-8007476,
> which explains a case where two deadlocks are reported that are really the
> same deadlock.
> I don't see a similar report about this issue in the JDK JIRA database, and
> I'm thinking about filing a JDK JIRA for it, but would like to have some
> discussion here first.
> Issue #1 is important because the client is hanging, which indicates a real
> problem; however, until issue #2 is fixed and the report is correct, it's not
> clear what the deadlock really looks like.
> Thanks.