[
https://issues.apache.org/jira/browse/FLINK-16468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056755#comment-17056755
]
Nico Kruber commented on FLINK-16468:
-------------------------------------
Hi [~longtimer], I verified in the code that the sockets opened for each retry
are closed properly; however, TCP sockets enter a TIME_WAIT state and linger
for a while before they are fully cleaned up [1]. To cope with that, you could
change your kernel's settings to enable fast reuse of sockets [1], or raise
the limit on the number of open file descriptors, but I agree that some form
of (exponential) back-off would be a better solution. Reducing
{{blob.fetch.retries}} may also be an option for now. I am still wondering why
you exhaust the available sockets, though. Are you perhaps deploying to a
large set of TMs on the same machine, or to one TM with a lot of slots or a
huge number of tasks?
[1] https://vincent.bernat.ch/en/blog/2014-tcp-time-wait-state-linux
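The kernel tweaks mentioned above could look like the following on Linux. Note that {{net.ipv4.tcp_tw_reuse}} only affects outgoing connections and its exact semantics depend on the kernel version, so treat this as an illustration under those assumptions rather than a recommendation:

```shell
# Allow reuse of sockets in TIME_WAIT for new outgoing connections
# (Linux-specific; behavior varies across kernel versions).
sysctl -w net.ipv4.tcp_tw_reuse=1

# Inspect and raise the per-process open-file limit for the shell
# that launches the TaskManager (subject to the hard limit /
# /etc/security/limits.conf).
ulimit -n
ulimit -n 65536
```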
> BlobClient rapid retrieval retries on failure opens too many sockets
> --------------------------------------------------------------------
>
> Key: FLINK-16468
> URL: https://issues.apache.org/jira/browse/FLINK-16468
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.9.2
> Environment: Linux Ubuntu servers running the latest patch-current
> Ubuntu release, Java 8 JRE
> Reporter: Jason Kania
> Priority: Major
>
> In situations where the BlobClient retrieval fails as in the following log,
> rapid retries will exhaust the open sockets. All the retries happen within a
> few milliseconds.
> {{2020-03-06 17:19:07,116 ERROR org.apache.flink.runtime.blob.BlobClient -
> Failed to fetch BLOB
> cddd17ef76291dd60eee9fd36085647a/p-bcd61652baba25d6863cf17843a2ef64f4c801d5-c1781532477cf65ff1c1e7d72dccabc7
> from aaa-1/10.0.1.1:45145 and store it under
> /tmp/blobStore-7328ed37-8bc7-4af7-a56c-474e264157c9/incoming/temp-00000004
> Retrying...}}
> The above is output repeatedly until the following error occurs:
> {{java.io.IOException: Could not connect to BlobServer at address aaa-1/10.0.1.1:45145}}
> {{ at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:100)}}
> {{ at org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:143)}}
> {{ at org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)}}
> {{ at org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:202)}}
> {{ at org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)}}
> {{ at org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:915)}}
> {{ at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:595)}}
> {{ at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)}}
> {{ at java.lang.Thread.run(Thread.java:748)}}
> {{Caused by: java.net.SocketException: Too many open files}}
> {{ at java.net.Socket.createImpl(Socket.java:478)}}
> {{ at java.net.Socket.connect(Socket.java:605)}}
> {{ at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:95)}}
> {{ ... 8 more}}
> The retries should have some form of backoff in this situation to avoid
> flooding the logs and exhausting other resources on the server.
>
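The back-off the report asks for could be sketched as below. This is a minimal illustration, not Flink's actual {{BlobClient}} logic; the class and method names ({{BlobFetchBackoff}}, {{backoffMillis}}) are hypothetical:

```java
// Hypothetical sketch of capped exponential back-off between BLOB fetch
// retries; not part of Flink's BlobClient API.
public class BlobFetchBackoff {

    // Delay before the given retry attempt (0-based): doubles each time
    // and is capped at maxMs so repeated failures never wait forever.
    static long backoffMillis(int attempt, long baseMs, long maxMs) {
        long delay = baseMs << Math.min(attempt, 20); // cap the shift to avoid overflow
        return Math.min(delay, maxMs);
    }

    public static void main(String[] args) throws InterruptedException {
        int maxRetries = 5; // analogous to blob.fetch.retries
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            // In a real client the fetch would be attempted here; on
            // failure, sleep before the next try instead of retrying
            // within a few milliseconds.
            long wait = backoffMillis(attempt, 100, 10_000);
            System.out.println("attempt " + attempt + " failed, sleeping " + wait + " ms");
            Thread.sleep(wait);
        }
    }
}
```

With a base of 100 ms and a 10 s cap, five retries spread over roughly three seconds instead of a few milliseconds, which avoids exhausting sockets stuck in TIME_WAIT.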
--
This message was sent by Atlassian Jira
(v8.3.4#803005)