[jira] [Commented] (FLINK-19067) resource_manager and dispatcher register on different nodes in HA mode will cause FileNotFoundException

Robert Metzger (Jira) Fri, 15 Jan 2021 05:43:18 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266048#comment-17266048
 ]


Robert Metzger commented on FLINK-19067:
----------------------------------------

Thank you for your detailed analysis. I had a look at the code too, and I agree 
that you can run into this situation.

Let's assume we have the following setup:

Node1
 - lead Dispatcher
 - BlobServer1

Node2
 - lead ResourceManager
 - BlobServer2

Node3
 - TaskManager

Job submissions will go to the Dispatcher on Node1. The JobSubmitHandler will 
call DispatcherGateway.getBlobServerPort, which returns the port of 
BlobServer1, so the job files are uploaded to BlobServer1.

During job execution, the TaskManager on Node3 will execute a task from the 
job. As part of the TaskExecutorRegistrationSuccess, we include the 
ClusterInformation, which contains the BlobServer address. The 
TaskExecutorRegistrationSuccess comes from the ResourceManager on Node2, 
which returns its local ClusterInformation, with the address of BlobServer2. 
The TaskManager would therefore request the job files from BlobServer2, 
although they were uploaded to BlobServer1.
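
To make the suspected mismatch concrete, here is a minimal toy model of the 
setup above. It is only a sketch: ToyBlobServer, canFetch and the "job-jar" key 
are hypothetical names and not Flink classes, and it assumes that each 
BlobServer only sees its own node-local storage (no shared HA storage 
directory).

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

/** Toy model of the scenario; none of these types are Flink's real classes. */
public class BlobServerMismatchSketch {

    /** Hypothetical stand-in for a BlobServer backed by node-local storage only. */
    static final class ToyBlobServer {
        final String name;
        private final Map<String, byte[]> blobs = new HashMap<>();

        ToyBlobServer(String name) {
            this.name = name;
        }

        void put(String key, byte[] data) {
            blobs.put(key, data);
        }

        byte[] get(String key) throws FileNotFoundException {
            byte[] data = blobs.get(key);
            if (data == null) {
                throw new FileNotFoundException(key + " not found on " + name);
            }
            return data;
        }
    }

    /**
     * Uploads a blob to uploadTo (the server the job submission talks to) and
     * reports whether it can be read from fetchFrom (the server whose address
     * the TaskManager was given).
     */
    static boolean canFetch(ToyBlobServer uploadTo, ToyBlobServer fetchFrom) {
        uploadTo.put("job-jar", new byte[] {1, 2, 3});
        try {
            fetchFrom.get("job-jar");
            return true;
        } catch (FileNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ToyBlobServer blobServer1 = new ToyBlobServer("BlobServer1"); // Node1, lead Dispatcher
        ToyBlobServer blobServer2 = new ToyBlobServer("BlobServer2"); // Node2, lead ResourceManager

        // Dispatcher and ResourceManager leaders on the same node: fetch works.
        System.out.println("same node: " + canFetch(blobServer1, blobServer1));
        // Leaders on different nodes: the blob is missing on BlobServer2.
        System.out.println("different nodes: " + canFetch(blobServer1, blobServer2));
    }
}
```

With these assumptions, the fetch only succeeds when the upload and the 
download go to the same server, which matches the reported 
FileNotFoundException when the two leaders end up on different nodes.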

[~trohrmann] Could you take a quick look at this to verify or correct my 
thinking?

> resource_manager and dispatcher register on different nodes in HA mode will 
> cause FileNotFoundException
> -------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-19067
>                 URL: https://issues.apache.org/jira/browse/FLINK-19067
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.1
>            Reporter: JieFang.He
>            Priority: Major
>         Attachments: flink-jobmanager-deployer-hejiefang01.log, 
> flink-jobmanager-deployer-hejiefang02.log, 
> flink-taskmanager-deployer-hejiefang01.log, 
> flink-taskmanager-deployer-hejiefang02.log
>
>
> When running examples/batch/WordCount.jar, it fails with the exception:
> {code:java}
> Caused by: java.io.FileNotFoundException: /data2/flink/storageDir/default/blob/job_d29414828f614d5466e239be4d3889ac/blob_p-a2ebe1c5aa160595f214b4bd0f39d80e42ee2e93-f458f1c12dc023e78d25f191de1d7c4b (No such file or directory)
>  at java.io.FileInputStream.open0(Native Method)
>  at java.io.FileInputStream.open(FileInputStream.java:195)
>  at java.io.FileInputStream.<init>(FileInputStream.java:138)
>  at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>  at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143)
>  at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:105)
>  at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:87)
>  at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:501)
>  at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:231)
>  at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
> {code}
>  
> I think the reason is that the job files are uploaded to the dispatcher node, 
> but the task fetches the job files from the resource_manager node. So in HA 
> mode, it needs to be ensured that both are on the same node.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
