[
https://issues.apache.org/jira/browse/FLINK-12381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Flink Jira Bot updated FLINK-12381:
-----------------------------------
Labels: stale-major (was: )
I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help
the community manage its development. I see this issues has been marked as
Major but is unassigned and neither itself nor its Sub-Tasks have been updated
for 30 days. I have gone ahead and added a "stale-major" to the issue". If this
ticket is a Major, please either assign yourself or give an update. Afterwards,
please remove the label or in 7 days the issue will be deprioritized.
> W/o HA, upon a full restart, checkpointing crashes
> --------------------------------------------------
>
> Key: FLINK-12381
> URL: https://issues.apache.org/jira/browse/FLINK-12381
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Checkpointing, Runtime / Coordination
> Affects Versions: 1.8.0
> Environment: Same as FLINK-\{12379, 12377, 12376}
> Reporter: Henrik
> Priority: Major
> Labels: stale-major
>
> {code:java}
> Caused by: org.apache.hadoop.fs.FileAlreadyExistsException:
> 'gs://example_bucket/flink/checkpoints/00000000000000000000000000000000/chk-16/_metadata'
> already exists
> at
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.createChannel(GoogleHadoopOutputStream.java:85)
> at
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopOutputStream.<init>(GoogleHadoopOutputStream.java:74)
> at
> com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemBase.create(GoogleHadoopFileSystemBase.java:797)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
> at
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:141)
> at
> org.apache.flink.runtime.fs.hdfs.HadoopFileSystem.create(HadoopFileSystem.java:37)
> at
> org.apache.flink.runtime.state.filesystem.FsCheckpointMetadataOutputStream.<init>(FsCheckpointMetadataOutputStream.java:65)
> at
> org.apache.flink.runtime.state.filesystem.FsCheckpointStorageLocation.createMetadataOutputStream(FsCheckpointStorageLocation.java:104)
> at
> org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:259)
> at
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:829)
> ... 8 more
> {code}
> Instead, it should either just overwrite the checkpoint or fail to start the
> job completely. Partial and undefined failure is not what should happen.
>
> Repro:
> # Set up a single purpose job cluster (which could use much better docs btw!)
> # Let it run with GCS checkpointing for a while with rocksdb/gs://example
> # Kill it
> # Start it
--
This message was sent by Atlassian Jira
(v8.3.4#803005)