[ https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057498#comment-17057498 ]

Jason Kania commented on FLINK-16468:
-------------------------------------

[~NicoK], I will try the blob.fetch.retries settings and see.
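
For reference, my understanding is that the option can go into flink-conf.yaml or onto a programmatic Configuration; the value 10 below is only an example and the class is a throwaway illustration, not anything from the code base:

{code:java}
import org.apache.flink.configuration.Configuration;

public class BlobRetrySetting {
    public static void main(String[] args) {
        // Equivalent flink-conf.yaml entry:
        //   blob.fetch.retries: 10
        // 10 is an arbitrary example value; the default is 5 as far as I can tell.
        Configuration conf = new Configuration();
        conf.setInteger("blob.fetch.retries", 10);
    }
}
{code}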

As for the deployment, it is only 4 slots in one task manager on a 2 CPU system, 
so I was not expecting to exhaust the number of sockets either. It appears to be 
the sheer number of retries in such a short time that did it: the closed 
connections were still in TIME_WAIT at the OS level and not yet available for 
reuse, which is likely what caused the exhaustion.

If the issue happens again, I will see whether more information is available. 
However, a backoff algorithm does seem to be a good plan; a rough sketch of what 
I mean follows.
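
To make the suggestion concrete, here is a minimal sketch of the kind of retry loop I have in mind. It is not Flink's actual code; the class and method names (BackoffSketch, fetchOnce, fetchWithBackoff) and all of the numbers are made up for illustration.

{code:java}
import java.io.IOException;

// Hypothetical sketch of a retry loop with exponential backoff; names and
// numbers are illustrative, not taken from the Flink code base.
public class BackoffSketch {

    static void fetchOnce() throws IOException {
        // placeholder for a single BLOB fetch, e.g. opening the BlobClient socket
    }

    static void fetchWithBackoff(int maxAttempts) throws IOException, InterruptedException {
        long delayMs = 50;                              // initial pause between attempts
        for (int attempt = 1; ; attempt++) {
            try {
                fetchOnce();
                return;                                 // success, stop retrying
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e;                            // out of attempts, surface the failure
                }
                Thread.sleep(delayMs);                  // give TIME_WAIT sockets time to clear
                delayMs = Math.min(delayMs * 2, 5_000); // double the delay, capped at 5 seconds
            }
        }
    }
}
{code}

Even a fixed delay between attempts would stop the client from burning through ephemeral ports, but doubling the delay keeps the total wait bounded while still backing off quickly.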

> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>
>                 Key: FLINK-16468
>                 URL: https://issues.apache.org/jira/browse/FLINK-16468
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.2
>         Environment: Linux Ubuntu servers on the patch-current latest 
> Ubuntu release, Java 8 JRE
>            Reporter: Jason Kania
>            Priority: Major
>
> In situations where BlobClient retrieval fails, as in the following log, 
> rapid retries exhaust the available sockets; all of the retries happen 
> within a few milliseconds.
> {code}
> 2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient - Failed to fetch BLOB cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7 from aaa-1/10.0.1.1:45145 and store it under /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-00000004 Retrying...
> {code}
> The above is output repeatedly until the following error occurs:
> {code}
> java.io.IOException: Could not connect to BlobServer at address aaa-1/10.0.1.1:45145
>     at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)
>     at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)
>     at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>     at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)
>     at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>     at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.SocketException: Too many open files
>     at java.net.Socket.createImpl(Socket.java:478)
>     at java.net.Socket.connect(Socket.java:605)
>     at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)
>     ... 8 more
> {code}
>  The retries should have some form of backoff in this situation to avoid 
> flooding the logs and exhausting other resources on the server.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
