[jira] [Commented] (FLINK-19067) resource_manager and dispatcher register on different nodes in HA mode will cause FileNotFoundException

Till Rohrmann (Jira) Fri, 15 Jan 2021 06:28:39 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266070#comment-17266070
 ]


Till Rohrmann commented on FLINK-19067:
---------------------------------------

I think this scenario can happen as you describe it, [~rmetzger] and 
[~hejiefang], if you haven't configured a proper 
{{high-availability.storageDir}}. Note that {{high-availability.storageDir}} 
needs to be reachable from all nodes. Otherwise HA won't work properly because 
the blobs, submitted {{JobGraphs}} and {{Checkpoints}} cannot be persisted 
where every node can read them.

In your case, a {{high-availability.storageDir}} which is accessible from all 
nodes will solve the problem because every {{BlobServer}} then has access to 
blobs which are uploaded to any other {{BlobServer}}.
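
For illustration, a minimal sketch of what such a setup could look like in 
{{flink-conf.yaml}}, assuming ZooKeeper-based HA and an HDFS directory. The 
quorum hosts and the {{hdfs:///flink/ha/}} path are placeholders, not values 
taken from this issue; any file system that all nodes can reach should work 
the same way.

{code:yaml}
# flink-conf.yaml, identical on every JobManager and TaskManager node
high-availability: zookeeper
# Placeholder quorum (assumption): replace with your own ZooKeeper hosts.
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# Must point to storage that every node can reach (e.g. HDFS, S3, or a shared
# mount). The path below is a placeholder (assumption). A plain local path
# such as the /data2/flink/storageDir from the stack trace is resolved by the
# local file system of each node, so it only works if it is a shared mount.
high-availability.storageDir: hdfs:///flink/ha/
{code}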

I agree that this is not elegant and we should actually change the HA 
services so that there is only a single leader elector for the whole JobManager 
process instead of separate leader elections for the {{JobMaster}}, 
{{Dispatcher}} and {{ResourceManager}}. However, as a "quick fix" you can 
configure a proper {{high-availability.storageDir}} and the problem should be 
gone.

> resource_manager and dispatcher register on different nodes in HA mode will 
> cause FileNotFoundException
> -------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-19067
>                 URL: https://issues.apache.org/jira/browse/FLINK-19067
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.1
>            Reporter: JieFang.He
>            Assignee: Robert Metzger
>            Priority: Major
>         Attachments: flink-jobmanager-deployer-hejiefang01.log, 
> flink-jobmanager-deployer-hejiefang02.log, 
> flink-taskmanager-deployer-hejiefang01.log, 
> flink-taskmanager-deployer-hejiefang02.log
>
>
> When running examples/batch/WordCount.jar, it fails with the exception:
> {code:java}
> Caused by: java.io.FileNotFoundException: /data2/flink/storageDir/default/blob/job_d29414828f614d5466e239be4d3889ac/blob_p-a2ebe1c5aa160595f214b4bd0f39d80e42ee2e93-f458f1c12dc023e78d25f191de1d7c4b (No such file or directory)
>  at java.io.FileInputStream.open0(Native Method)
>  at java.io.FileInputStream.open(FileInputStream.java:195)
>  at java.io.FileInputStream.<init>(FileInputStream.java:138)
>  at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>  at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143)
>  at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:105)
>  at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:87)
>  at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:501)
>  at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:231)
>  at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
> {code}
>  
> I think the reason is that the job files are uploaded to the dispatcher node, 
> but the task fetches them from the resource_manager node. So in HA mode, 
> Flink needs to ensure that both run on the same node.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
