[
https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056020#comment-17056020
]
Andrey Zagrebin commented on FLINK-16468:
-----------------------------------------
Thanks for reporting this, [~longtimer].
Could you attach the full logs?
Could you also enable debug logging for
org.apache.flink.runtime.blob.BlobClient so that we can see the underlying
reason for each retry?
Have you changed the "blob.fetch.retries" option?
The problem may also be that the socket is not closed properly after some other
failure.
cc [~NicoK]
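
For reference, a sketch of the two settings mentioned above. It assumes the
log4j 1.x configuration that Flink 1.9 ships in conf/log4j.properties and the
standard conf/flink-conf.yaml; file locations and the logging backend may
differ in your deployment, and the retry count shown is only an example value.

```
# conf/log4j.properties (assumed location, log4j 1.x syntax)
log4j.logger.org.apache.flink.runtime.blob.BlobClient=DEBUG

# conf/flink-conf.yaml (assumed location; 5 is an example value)
blob.fetch.retries: 5
```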
> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>
> Key: FLINK-16468
> URL: https://issues.apache.org/jira/browse/FLINK-16468
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.2
> Environment: Ubuntu Linux servers, current on the latest Ubuntu patch
> release, running the Java 8 JRE
> Reporter: Jason Kania
> Priority: Major
>
> In situations where the BlobClient retrieval fails as in the following log,
> rapid retries will exhaust the open sockets. All the retries happen within a
> few milliseconds.
> {{2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient -
> Failed to fetch BLOB
> cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7
> from aaa-1/10.0.1.1:45145 and store it under
> /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-00000004
> Retrying...}}
> The above is output repeatedly until the following error occurs:
> {{java.io.IOException: Could not connect to BlobServer at address
> aaa-1/10.0.1.1:45145}}
> {{ at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)}}
> {{ at
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)}}
> {{ at
> org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)}}
> {{ at
> org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)}}
> {{ at
> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)}}
> {{ at
> org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)}}
> {{ at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)}}
> {{ at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)}}
> {{ at java.lang.Thread.run(Thread.java:748)}}
> {{Caused by: java.net.SocketException: Too many open files}}
> {{ at java.net.Socket.createImpl(Socket.java:478)}}
> {{ at java.net.Socket.connect(Socket.java:605)}}
> {{ at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)}}
> {{ ... 8 more}}
> The retries should have some form of backoff in this situation to avoid
> flooding the logs and exhausting other resources on the server.
>
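
To illustrate the kind of backoff requested in the description, here is a
minimal sketch of a retry loop with capped exponential delays. It is not
Flink's BlobClient code: the class BackoffRetry, the interface IoCall, and the
parameters baseMillis and maxMillis are hypothetical names chosen for this
example. It also assumes that the wrapped operation closes its own socket on
failure (for example with try-with-resources), since an unclosed socket per
attempt would still leak file descriptors regardless of the delay.

```java
import java.io.IOException;

public final class BackoffRetry {

    /** Hypothetical functional interface: an operation that may fail with an IOException. */
    @FunctionalInterface
    public interface IoCall<T> {
        T call() throws IOException;
    }

    private BackoffRetry() {}

    /**
     * Delay before the retry that follows failed attempt number {@code attempt} (0-based):
     * baseMillis * 2^attempt, capped at maxMillis. Assumes a small positive baseMillis.
     */
    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        int shift = Math.min(attempt, 30); // bound the shift so the value cannot overflow
        return Math.min(maxMillis, baseMillis << shift);
    }

    /**
     * Runs the operation, retrying up to maxRetries times after an IOException and
     * sleeping between attempts. Rethrows the last IOException if every attempt fails.
     */
    public static <T> T callWithBackoff(
            IoCall<T> operation, int maxRetries, long baseMillis, long maxMillis)
            throws IOException, InterruptedException {
        if (maxRetries < 0) {
            throw new IllegalArgumentException("maxRetries must be >= 0");
        }
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return operation.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxRetries) {
                    // Wait before the next attempt instead of retrying immediately.
                    Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated fetch: fails twice, then succeeds.
        String result = callWithBackoff(() -> {
            if (calls[0]++ < 2) {
                throw new IOException("simulated fetch failure");
            }
            return "fetched";
        }, 5, 100, 2000);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

With a 100 ms base and a 2000 ms cap, the waits between attempts would be
100, 200, 400, 800, 1600 and then 2000 ms, so a handful of retries is spread
over seconds instead of a few milliseconds.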