[
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835800#comment-16835800
]
Henrik edited comment on FLINK-12381 at 5/8/19 6:34 PM:
--------------------------------------------------------
Yes, I suppose you can see it like that (a new cluster).
So does that mean that Flink is useless without HA? Because if I don't
have HA, and the node I'm running it on, or the k8s pod I'm running it in,
restarts, then it's a new cluster?
Ideally, I would not have to change the specification of a job unless the job
itself had changed. I.e. having to manually change the jobid whenever the pod
is restarted goes against the declarative management of resources in a k8s
cluster.
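For the record, the manual workaround being described looks roughly like the
sketch below. It is an assumption-laden illustration only: it assumes the
standalone job cluster entrypoint accepts a {{--job-id}} flag, and the job
classname is a hypothetical placeholder. Generating a fresh 32-hex-character
job id per restart avoids reusing the previous run's checkpoint path under
gs://.../checkpoints/<jobid>/.

```shell
# Sketch of the manual per-restart workaround (not an endorsed fix).
# Generate a fresh 32-hex-char job id so checkpoint paths don't collide.
JOB_ID=$(head -c16 /dev/urandom | od -An -tx1 | tr -d ' \n')
echo "${JOB_ID}"

# Hypothetical invocation; --job-classname value is a placeholder:
# standalone-job.sh start-foreground \
#   --job-classname com.example.MyJob \
#   --job-id "${JOB_ID}"
```

This is exactly the kind of imperative, per-restart mutation that a declarative
k8s manifest cannot express cleanly, which is the point of the complaint above.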
was (Author: haf):
Yes, you can see it like that (a new cluster), I suppose.
So does that mean that flink is useless without HA then? Because if I don't
have HA, and the node I'm running it on, or the k8s pod I'm running it in,
restarts, it's a new cluster?
> W/o HA, upon a full restart, checkpointing crashes
> --------------------------------------------------
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Coordination
> Affects Versions: 1.8.0
> Environment: Same as FLINK-12379, FLINK-12377, FLINK-12376
> Reporter: Henrik
> Priority: Major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException: 'gs://example_bucket/flink/checkpoints/00000000000000000000000000000000/chk-16/_metadata' already exists
>     at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
>     at com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
>     at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
>     at org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
>     at org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
>     at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
>     at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
>     ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or refuse to start
> the job at all. Partial, undefined failure is not what should happen.
>
> Repro:
> # Set up a single-purpose job cluster (which could use much better docs btw!)
> # Let it run with GCS checkpointing for a while with rocksdb/gs://example
> # Kill it
> # Start it
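> The repro steps above can be sketched as shell. This is illustrative only:
> the image tag, container name, and job classname are placeholders not taken
> from this report, and the GCS settings mirror the Environment issues.
> {code:shell}
> # 1. Start a single-purpose job cluster (placeholder classname)
> docker run -d --name flink-job flink:1.8.0 standalone-job start-foreground \
>   --job-classname com.example.MyJob
> # 2. Let it checkpoint to GCS for a while, e.g. with
> #    state.backend: rocksdb
> #    state.checkpoints.dir: gs://example_bucket/flink/checkpoints
> # 3. Kill it
> docker kill flink-job
> # 4. Start it again: the fixed all-zero job id reuses the same checkpoint
> #    path, so chk-N/_metadata already exists and finalization crashes.
> docker start flink-job
> {code}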
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)