[
https://issues.apache.org/jira/browse/FLINK-4150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15382419#comment-15382419
]
ASF GitHub Bot commented on FLINK-4150:
---------------------------------------
Github user tillrohrmann commented on the issue:
https://github.com/apache/flink/pull/2256
In general, I think it would be helpful for users to be able to retrieve
checkpoints of a failed job. I could imagine a scenario where a job is faulty
but one only runs into the fault after some time. Being able to transform a
checkpoint into a savepoint and then restart the failed job with a corrected
jar could be helpful.
Thus, I think we should only remove the persisted job data if the job has
reached FINISHED or CANCELED. Admittedly, this is a very conservative approach,
but then users are less likely to lose data.
However, this should be out of the scope of this PR.
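The conservative cleanup rule described above can be sketched as a simple guard. This is a minimal illustration, not Flink's actual API; the enum and method names are hypothetical, standing in for the real `JobStatus` handling in the JobManager:

```java
// Hypothetical sketch of the proposed cleanup policy: persisted HA job
// data (checkpoints, blobs) is removed only when the job reached a clean
// terminal state, so data of FAILED jobs remains recoverable.
enum JobStatus { CREATED, RUNNING, FINISHED, CANCELED, FAILED }

class HaCleanupPolicy {
    // Remove persisted job data only for FINISHED or CANCELED jobs.
    static boolean shouldRemovePersistedJobData(JobStatus status) {
        return status == JobStatus.FINISHED || status == JobStatus.CANCELED;
    }

    public static void main(String[] args) {
        // A failed job keeps its data; a finished job is cleaned up.
        System.out.println(shouldRemovePersistedJobData(JobStatus.FAILED));
        System.out.println(shouldRemovePersistedJobData(JobStatus.FINISHED));
    }
}
```

Under this policy a FAILED job's checkpoints stay in the BlobStore and can later be turned into a savepoint for a resubmission with a fixed jar.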
> Problem with Blobstore in Yarn HA setting on recovery after cluster shutdown
> ----------------------------------------------------------------------------
>
> Key: FLINK-4150
> URL: https://issues.apache.org/jira/browse/FLINK-4150
> Project: Flink
> Issue Type: Bug
> Components: Job-Submission
> Reporter: Stefan Richter
> Assignee: Ufuk Celebi
> Priority: Blocker
> Fix For: 1.1.0
>
>
> Submitting a job in Yarn with HA can lead to the following exception:
> {code}
> org.apache.flink.streaming.runtime.tasks.StreamTaskException: Cannot load
> user class: org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
> ClassLoader info: URL ClassLoader:
> file:
> '/tmp/blobStore-ccec0f4a-3e07-455f-945b-4fcd08f5bac1/cache/blob_7fafffe9595cd06aff213b81b5da7b1682e1d6b0'
> (invalid JAR: zip file is empty)
> Class not resolvable through given classloader.
> at
> org.apache.flink.streaming.api.graph.StreamConfig.getStreamOperator(StreamConfig.java:207)
> at
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:222)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> Some job information, including the Blob ids, is stored in Zookeeper. The
> actual Blobs are stored in a dedicated BlobStore if the recovery mode is set
> to Zookeeper. This BlobStore is typically located in a FS like HDFS. When the
> cluster is shut down, the path for the BlobStore is deleted. When the cluster
> is then restarted, recovering jobs cannot be restored because their Blob ids
> stored in Zookeeper now point to deleted files.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)