[
https://issues.apache.org/jira/browse/FLINK-19067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17266070#comment-17266070
]
Till Rohrmann commented on FLINK-19067:
---------------------------------------
I think this scenario can happen as you describe it, [~rmetzger] and
[~hejiefang], if you haven't configured a proper
{{high-availability.storageDir}}. Note that {{high-availability.storageDir}}
needs to be reachable from all nodes. Otherwise HA won't work properly because
you cannot persist the blobs, submitted {{JobGraphs}} and {{Checkpoints}}.
In your case, a {{high-availability.storageDir}} that is accessible from all
nodes will solve the problem because every {{BlobServer}} then has access to
blobs uploaded to any other {{BlobServer}}.
I agree that this is not ideal and we should actually change the HA
services so that there is only a single leader elector for the whole JobManager
process instead of separate leader elections for the {{JobMaster}},
{{Dispatcher}} and {{ResourceManager}}. However, as a "quick fix" you can
configure a proper {{high-availability.storageDir}} and the problem should be
gone.
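For illustration, a minimal sketch of what such a setup could look like in
{{flink-conf.yaml}}, assuming ZooKeeper-based HA and a shared HDFS directory.
The ZooKeeper hosts and the HDFS path are placeholders and not taken from your
setup; any file system that every node can reach should work the same way.
{code:yaml}
# Assumption: ZooKeeper is used for leader election.
high-availability: zookeeper
# Placeholder quorum; replace with your own ZooKeeper hosts.
high-availability.zookeeper.quorum: zk-host1:2181,zk-host2:2181,zk-host3:2181
# Must point to storage reachable from all nodes (placeholder path).
# A node-local path such as /data2/flink/storageDir is only visible
# on the machine that wrote the blob.
high-availability.storageDir: hdfs:///flink/ha/
{code}
The stack trace in the issue goes through {{LocalFileSystem}}, which suggests
that the storage directory there currently resolves to a node-local path.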
> resource_manager and dispatcher register on different nodes in HA mode will
> cause FileNotFoundException
> -------------------------------------------------------------------------------------------------------
>
> Key: FLINK-19067
> URL: https://issues.apache.org/jira/browse/FLINK-19067
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.11.1
> Reporter: JieFang.He
> Assignee: Robert Metzger
> Priority: Major
> Attachments: flink-jobmanager-deployer-hejiefang01.log,
> flink-jobmanager-deployer-hejiefang02.log,
> flink-taskmanager-deployer-hejiefang01.log,
> flink-taskmanager-deployer-hejiefang02.log
>
>
> When running examples/batch/WordCount.jar, it fails with the exception:
> {code:java}
> Caused by: java.io.FileNotFoundException: /data2/flink/storageDir/default/blob/job_d29414828f614d5466e239be4d3889ac/blob_p-a2ebe1c5aa160595f214b4bd0f39d80e42ee2e93-f458f1c12dc023e78d25f191de1d7c4b (No such file or directory)
>     at java.io.FileInputStream.open0(Native Method)
>     at java.io.FileInputStream.open(FileInputStream.java:195)
>     at java.io.FileInputStream.<init>(FileInputStream.java:138)
>     at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:50)
>     at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143)
>     at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:105)
>     at org.apache.flink.runtime.blob.FileSystemBlobStore.get(FileSystemBlobStore.java:87)
>     at org.apache.flink.runtime.blob.BlobServer.getFileInternal(BlobServer.java:501)
>     at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:231)
>     at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
> {code}
>
> I think the reason is that the job files are uploaded to the dispatcher node,
> but the task fetches them from the resource_manager node. So in HA mode,
> Flink needs to ensure that both are on the same node.
>