[ 
https://issues.apache.org/jira/browse/FLINK-10213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652566#comment-16652566
 ] 

ASF GitHub Bot commented on FLINK-10213:
----------------------------------------

joey opened a new pull request #6862: [FLINK-10213] Task managers cache a 
negative DNS lookup of the blob server indefinitely
URL: https://github.com/apache/flink/pull/6862
 
 
   ## What is the purpose of the change
   
   This pull request fixes a bug where a failed (negative) DNS lookup of the 
blob server address could be cached for the lifetime of the task manager.
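   
   As a rough illustration of why the failure sticks, the sketch below shows 
   that the unresolved state lives inside the `InetSocketAddress` instance 
   itself. The class name `UnresolvedFlagDemo` and the helper `describe` are 
   hypothetical, and `InetSocketAddress.createUnresolved` is used here only to 
   simulate a lookup that failed, so the example needs no DNS.

```java
import java.net.InetSocketAddress;

public class UnresolvedFlagDemo {

    /** Hypothetical helper: reports the resolution state stored in the instance. */
    static String describe(InetSocketAddress address) {
        return address.getHostString() + ":" + address.getPort()
                + (address.isUnresolved() ? " (unresolved)" : " (resolved)");
    }

    public static void main(String[] args) {
        // Assumption: createUnresolved stands in for a constructor call whose
        // hostname lookup failed. No DNS query is made here.
        InetSocketAddress cached = InetSocketAddress.createUnresolved("jobmanager", 6124);

        // The instance is immutable, so it never attempts the lookup again.
        System.out.println(describe(cached));

        // An unresolved address carries no InetAddress at all.
        System.out.println("address: " + cached.getAddress());
    }
}
```

   Any code that keeps such an instance around and never checks 
   `isUnresolved()` keeps seeing the same failed lookup.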
   
   ## Brief change log
   
   * Updated BlobClient to create the socket using the hostname and port rather 
than reusing the stored InetSocketAddress instance, so that the hostname is 
resolved again on each connection attempt.
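   
   A minimal sketch of the idea behind this change, not the actual diff: 
   connecting with a stored, unresolved `InetSocketAddress` fails, while 
   rebuilding the address from its hostname and port triggers a new lookup. 
   The class name `BlobConnectSketch`, both method names, and the timeout value 
   are hypothetical. A literal loopback address stands in for the blob server 
   hostname so that the example runs without DNS.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class BlobConnectSketch {

    // Hypothetical timeout, chosen only for this example.
    static final int CONNECT_TIMEOUT_MS = 1000;

    /** Connects with the stored instance as-is, unresolved flag included. */
    static boolean connectWithCachedAddress(InetSocketAddress cached) {
        try (Socket socket = new Socket()) {
            socket.connect(cached, CONNECT_TIMEOUT_MS);
            return true;
        } catch (IOException e) {
            // An unresolved instance fails with UnknownHostException,
            // which is a subclass of IOException.
            return false;
        }
    }

    /** Rebuilds the address from hostname and port, so resolution is retried. */
    static boolean connectWithHostAndPort(InetSocketAddress cached) {
        InetSocketAddress fresh =
                new InetSocketAddress(cached.getHostName(), cached.getPort());
        try (Socket socket = new Socket()) {
            socket.connect(fresh, CONNECT_TIMEOUT_MS);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // A local server socket plays the role of the blob server.
        try (ServerSocket server = new ServerSocket(0)) {
            // Assumption: createUnresolved simulates an earlier failed lookup.
            InetSocketAddress cached =
                    InetSocketAddress.createUnresolved("127.0.0.1", server.getLocalPort());

            System.out.println("cached instance connects: "
                    + connectWithCachedAddress(cached));
            System.out.println("host and port connects:   "
                    + connectWithHostAndPort(cached));
        }
    }
}
```

   The trade-off is one extra name lookup per connection attempt in exchange 
   for recovering once the hostname becomes resolvable.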
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   
   * Added a test case to BlobClientTest for unresolved InetSocketAddresses. 
The test fails without the fix and passes with it.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** 
/ don't know)
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / **no**)
     - If yes, how is the feature documented? (**not applicable** / docs / 
JavaDocs / not documented)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Task managers cache a negative DNS lookup of the blob server indefinitely
> -------------------------------------------------------------------------
>
>                 Key: FLINK-10213
>                 URL: https://issues.apache.org/jira/browse/FLINK-10213
>             Project: Flink
>          Issue Type: Bug
>          Components: TaskManager
>    Affects Versions: 1.5.0
>            Reporter: Joey Echeverria
>            Priority: Major
>              Labels: pull-request-available
>
> When the task manager establishes a connection with the resource manager, it 
> gets the hostname and port of the blob server and uses that to create an 
> instance of an {{InetSocketAddress}}. Per the documentation of the 
> constructor:
> {quote}An attempt will be made to resolve the hostname into an InetAddress. 
> If that attempt fails, the address will be flagged as _unresolved_{quote}
> Flink never checks whether the address is unresolved. Later, when executing 
> a task that needs to download from the blob server, it uses that same 
> {{InetSocketAddress}} instance to attempt to connect a {{Socket}}. This 
> results in an exception similar to:
> {noformat}
> java.io.IOException: Failed to fetch BLOB 
> 97799b827ef073e04178a99f0f40b00e/p-6d8ec2ad31337110819c7c3641fdb18d3793a7fb-72bf00066308f4b4d2a9c5aea593b41f
>  from jobmanager:6124 and store it under 
> /tmp/blobStore-d135961a-03cb-4542-af6d-cf378ff83c12/incoming/temp-00018669
>       at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:191)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>       at 
> org.apache.flink.runtime.blob.AbstractBlobCache.getFileInternal(AbstractBlobCache.java:181)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>       at 
> org.apache.flink.runtime.blob.PermanentBlobCache.getFile(PermanentBlobCache.java:206)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>       at 
> org.apache.flink.runtime.execution.librarycache.BlobLibraryCacheManager.registerTask(BlobLibraryCacheManager.java:120)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>       at 
> org.apache.flink.runtime.taskmanager.Task.createUserCodeClassloader(Task.java:863)
>  [flink-dist_2.11-1.5.0.jar:1.5.0]
>       at org.apache.flink.runtime.taskmanager.Task.run(Task.java:579) 
> [flink-dist_2.11-1.5.0.jar:1.5.0]
>       at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
> Caused by: java.io.IOException: Could not connect to BlobServer at address 
> flink-jobmanager:6124
>       at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:124) 
> ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>       at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>       ... 6 more
> Caused by: java.net.UnknownHostException: jobmanager
>       at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:184) 
> ~[?:1.8.0_171]
>       at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) 
> ~[?:1.8.0_171]
>       at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_171]
>       at java.net.Socket.connect(Socket.java:538) ~[?:1.8.0_171]
>       at org.apache.flink.runtime.blob.BlobClient.<init>(BlobClient.java:118) 
> ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>       at 
> org.apache.flink.runtime.blob.BlobClient.downloadFromBlobServer(BlobClient.java:165)
>  ~[flink-dist_2.11-1.5.0.jar:1.5.0]
>       ... 6 more
> {noformat}
> Since the {{InetSocketAddress}} is reused, every task that needs to download 
> from the blob server on that task manager fails repeatedly, and the only 
> current workaround is to manually restart the task manager.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
