StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-595176653

> @StephanEwen, I think it could be okay to delete directories based on JobStatus and CheckpointProperties, as the current implementations of the checkpoint store and the checkpoint ID counter each have their own responsibility to decide whether to discard checkpoints. In my view, checkpoint storage should have similar responsibility ownership to the checkpoint store. I would prefer that each storage have its own implementation for discarding its directories, to offer flexibility.

I don't agree here. This really sounds to me like both making abstractions harder for the future and inconsistencies waiting to happen. Can there be a case in which different storages should behave differently? Can it be that one storage would retain data when "Job=Canceled" and "Properties=NeverRetain", but delete it when "Job=Canceled" and "Properties=RetainOnFailure"?

About the checkpoint ID counter: that one has a pretty clear life cycle, exactly that of the job; no ZK data should be left when the job finishes/cancels/fails. There are no retention policies or the like involved. The JobStatus there only differentiates between shutdown-without-delete (suspend = lost leadership) and shutdown-with-delete. Arguably, that should actually be a boolean flag, not the JobStatus being passed in.

> When talking about cleanup without recursion: I think your assumption would not occur in a real environment, because a previous job would always have a different job-id sub-directory, which is definitely not the same as the current job's. We would at most clean up the parent directory at the job-id level, so I think we would not clean up other or previous jobs.

That sounds exactly right: we should not try to clean up data from a previous job. That risks data loss in certain setups, when users configure things in an "unconventional" way.

> Moreover, for HDFS (or other non-object storage), the directory can only be discarded if it is empty. However, we often come across the problem that the checkpoint coordinator does not delete files fast enough, which leaves the sub-directory not exactly empty. In the end, we cannot ensure the directory is cleaned up as expected. To make cleanup safer, we could add documentation telling users not to set the savepoint target location under the job-id sub-directory. What do you think?

I am not quite convinced here either. This again sounds like risking data loss when users make a setup error, and "fixing by documentation" often does not prevent these things. What about "incremental savepoints" in the future? They would leave data in the "shared" directory.
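For illustration, here is a minimal sketch of the boolean-flag idea mentioned above. This is a hypothetical variant of the `CheckpointIDCounter` shutdown contract, not the actual Flink interface; the parameter name is made up for the example:

```java
// Hypothetical sketch: shutting down the checkpoint ID counter with an
// explicit boolean instead of a JobStatus. The caller decides the policy:
// globally terminal states (FINISHED / CANCELED / FAILED) clean up the
// ZK data, while suspend (lost leadership) keeps it for the next leader.
public interface CheckpointIDCounter {

    void start() throws Exception;

    /**
     * Shuts down the counter.
     *
     * @param cleanUpData true for shutdown-with-delete (job reached a
     *                    globally terminal state), false for
     *                    shutdown-without-delete (suspend / lost leadership)
     */
    void shutdown(boolean cleanUpData) throws Exception;

    /** Returns the current checkpoint ID and increments the counter. */
    long getAndIncrement() throws Exception;
}
```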
