[
https://issues.apache.org/jira/browse/FLINK-12547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated FLINK-12547:
-----------------------------------
Labels: pull-request-available (was: )
> Deadlock when the task thread downloads jars using BlobClient
> -------------------------------------------------------------
>
> Key: FLINK-12547
> URL: https://issues.apache.org/jira/browse/FLINK-12547
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Operators
> Affects Versions: 1.8.0
> Reporter: Haibo Sun
> Assignee: Haibo Sun
> Priority: Major
> Labels: pull-request-available
>
> The jstack is as follows (this jstack is from an old Flink version, but the
> master branch has the same problem).
> {code:java}
> "Source: Custom Source (76/400)" #68 prio=5 os_prio=0 tid=0x00007f8139cd3000 nid=0xe2 runnable [0x00007f80da5fd000]
> java.lang.Thread.State: RUNNABLE
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:152)
> at org.apache.flink.runtime.blob.BlobInputStream.read(BlobInputStream.java:140)
> at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:164)
> at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
> at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
> at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
> - locked <0x000000062cf2a188> (a java.lang.Object)
> at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:968)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:604)
> at java.lang.Thread.run(Thread.java:834)
> Locked ownable synchronizers:
> - None
> {code}
>
> The root cause is that SO_TIMEOUT is not set on the blob client's socket
> connection. When packet loss is severe because of high CPU load on the
> machine, the blob client cannot detect that the server has disconnected,
> so the read blocks indefinitely in the native method (socketRead0).
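
The effect of SO_TIMEOUT described above can be illustrated with a minimal, standalone sketch. This is not Flink code and not necessarily how the fix is implemented; the class name SoTimeoutSketch, the method readTimesOut, and the 200 ms value are hypothetical and chosen only for illustration. The sketch connects to a local server that never sends data and shows that, with SO_TIMEOUT set, the blocking read ends with a SocketTimeoutException instead of waiting forever.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical illustration class; not part of Flink.
public class SoTimeoutSketch {

    /**
     * Connects to a local server that never sends any data and reports
     * whether the blocking read was aborted by SO_TIMEOUT.
     */
    static boolean readTimesOut(int soTimeoutMillis) throws IOException {
        // The server socket is never accept()ed and never writes, which
        // stands in for a peer that has silently gone away.
        try (ServerSocket silentServer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {

            client.connect(
                    new InetSocketAddress(
                            silentServer.getInetAddress(),
                            silentServer.getLocalPort()),
                    1000);

            // Without this call the timeout defaults to 0 (infinite) and
            // read() below could block in socketRead0 indefinitely.
            client.setSoTimeout(soTimeoutMillis);

            InputStream in = client.getInputStream();
            try {
                in.read();
                return false; // data or EOF arrived; not expected here
            } catch (SocketTimeoutException e) {
                return true; // the read gave up after soTimeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

After a timeout the caller still has to decide what to do next (retry, reconnect, or fail the task); the sketch only shows that the thread regains control.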