[ https://issues.apache.org/jira/browse/FLINK-12547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Haibo Sun updated FLINK-12547:
------------------------------
Description:
The jstack output is as follows (it was captured on an older Flink version, but
the master branch has the same problem).
{code:java}
"Source: Custom Source (76/400)" #68 prio=5 os_prio=0 tid=0x00007f8139cd3000 nid=0xe2 runnable [0x00007f80da5fd000]
   java.lang.Thread.State: RUNNABLE
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:170)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
    at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
    at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:164)
    at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
    at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
    at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
    - locked <0x000000062cf2a188> (a java.lang.Object)
    at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:968)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:604)
    at java.lang.Thread.run(Thread.java:834)

   Locked ownable synchronizers:
    - None
{code}
The root cause is that SO_TIMEOUT is not set on the blob client's socket
connection. When packet loss becomes severe because of high CPU load on the
machine, the blob client cannot detect that the connection to the server has
been lost, so the read blocks indefinitely in the native method.
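
For illustration, here is a minimal sketch of how SO_TIMEOUT bounds a blocking
socket read. It is not the Flink fix itself: the class SoTimeoutSketch, the
method readTimesOut, and the 200 ms timeout are hypothetical names and values
chosen for this example, and a local ServerSocket that never writes stands in
for an unresponsive blob server. Only standard java.net calls are used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class; not part of Flink.
class SoTimeoutSketch {

    /**
     * Connects to a server that never sends data and attempts one read.
     * Returns true if the read was aborted by SO_TIMEOUT, false if it
     * returned normally.
     */
    static boolean readTimesOut(int soTimeoutMillis) throws IOException {
        // Stand-in for an unresponsive server: the OS completes the TCP
        // handshake through the listen backlog, but nothing is ever written.
        try (ServerSocket silentServer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {

            client.connect(
                    new InetSocketAddress(
                            silentServer.getInetAddress(),
                            silentServer.getLocalPort()),
                    1000); // connect timeout in milliseconds

            // Without this call SO_TIMEOUT is 0, which means read() may
            // block forever, as in the stack trace above.
            client.setSoTimeout(soTimeoutMillis);

            InputStream in = client.getInputStream();
            try {
                in.read();
                return false;
            } catch (SocketTimeoutException e) {
                // The read gave up after soTimeoutMillis instead of hanging.
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

With a timeout in place, the calling thread gets a SocketTimeoutException and
can retry or fail the download, rather than staying in socketRead0 while
holding the lock taken in registerTask.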
> Deadlock when the task thread downloads jars using BlobClient
> -------------------------------------------------------------
>
> Key: FLINK-12547
> URL: https://issues.apache.org/jira/browse/FLINK-12547
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Operators
> Affects Versions: 1.8.0
> Reporter: Haibo Sun
> Assignee: Haibo Sun
> Priority: Major