StephanEwen commented on issue #10768: [FLINK-15012][checkpoint] Introduce 
shutdown to CheckpointStorageCoordinatorView to clean up checkpoint directory.
URL: https://github.com/apache/flink/pull/10768#issuecomment-595176653
 
 
   > @StephanEwen , I think it could be okay to delete directories based on 
JobStatus and CheckpointProperties, as the current implementations of the 
checkpoint store and the checkpoint id counter each have their own 
responsibility to decide whether to discard checkpoints. From my point of view, 
checkpoint storage should have a similar ownership of responsibility to the 
checkpoint store. I would prefer that each storage have its own implementation 
for discarding its directories, to offer agility.
   
   I don't agree here. This really sounds to me like it both makes future 
abstractions harder and creates inconsistencies waiting to happen.
   
   Can there be a case in which different storages should behave differently? 
Can it be that one storage would retain data when "Job=Canceled" and 
"Properties=NeverRetain", but delete it when "Job=Canceled" and 
"Properties=RetainOnFailure"?
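   To illustrate the consistency argument, here is a minimal sketch, with 
hypothetical enum and method names (not Flink's actual API): a single, 
storage-independent rule decides whether checkpoint data is discarded at 
shutdown, so no storage implementation can answer the question differently.

```java
// Hypothetical sketch, not Flink's actual API: one storage-independent
// rule decides whether checkpoint data is discarded at job shutdown.
public class RetentionDecision {

    enum JobStatus { FINISHED, CANCELED, FAILED, SUSPENDED }

    enum Retention { NEVER_RETAIN, RETAIN_ON_FAILURE, RETAIN_ON_CANCELLATION }

    static boolean discardOnShutdown(JobStatus status, Retention retention) {
        if (status == JobStatus.SUSPENDED) {
            return false; // suspend = lost leadership: never delete
        }
        if (retention == Retention.NEVER_RETAIN) {
            return true;  // discard in every other terminal state
        }
        switch (status) {
            case FINISHED: return true;  // a finished job no longer needs checkpoints
            case CANCELED: return retention != Retention.RETAIN_ON_CANCELLATION;
            default:       return false; // FAILED: both RETAIN_* settings keep data
        }
    }

    public static void main(String[] args) {
        // "Job=Canceled, Properties=NeverRetain" always means "discard" ...
        System.out.println(discardOnShutdown(JobStatus.CANCELED, Retention.NEVER_RETAIN));
        // ... while RETAIN_ON_CANCELLATION always means "keep".
        System.out.println(discardOnShutdown(JobStatus.CANCELED, Retention.RETAIN_ON_CANCELLATION));
    }
}
```

Because every storage would call the same rule, a divergent answer for the 
same (status, retention) pair is impossible by construction.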
   
   About the "checkpoint id counter": that one has a pretty clear life cycle, 
exactly that of the job; no ZK data should be left when the job 
finishes/cancels/fails. No retention policies apply there. The JobStatus is 
only used to differentiate shutdown-without-delete (suspend = lost leadership) 
from shutdown-with-delete. Arguably, that should be a boolean flag rather than 
passing the JobStatus.
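   As a sketch of that last point, with hypothetical names (this is not 
Flink's actual CheckpointIDCounter interface): since the JobStatus only 
distinguishes shutdown-with-delete from shutdown-without-delete, a boolean 
expresses the contract directly.

```java
// Illustrative sketch with hypothetical names, not Flink's real interface:
// the counter's shutdown only needs to know whether to remove its state.
public class CounterShutdownSketch {

    interface IdCounter {
        void shutdown(boolean removeState);
    }

    // A ZooKeeper-like counter: its state is removed exactly when asked to be.
    static class ZkLikeCounter implements IdCounter {
        private boolean stateRemoved;

        @Override
        public void shutdown(boolean removeState) {
            if (removeState) {
                stateRemoved = true; // finished/canceled/failed: leave no ZK data
            }
            // removeState == false covers suspend (lost leadership): keep state
        }

        boolean stateRemoved() { return stateRemoved; }
    }

    public static void main(String[] args) {
        ZkLikeCounter onCancel = new ZkLikeCounter();
        onCancel.shutdown(true);              // terminal shutdown
        System.out.println(onCancel.stateRemoved());

        ZkLikeCounter onSuspend = new ZkLikeCounter();
        onSuspend.shutdown(false);            // lost leadership
        System.out.println(onSuspend.stateRemoved());
    }
}
```

The boolean makes the two-way contract explicit at the call site, instead of 
forcing every implementation to interpret a multi-valued JobStatus.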
   
   > When talking about cleanup without recursion: I think your assumption 
might not occur in a real environment, as a previous job would always have a 
different job-id sub-directory, which is definitely not the same as the 
current job's. We would at most clean up the parent directory at the job-id 
level, and I think we would not clean up other or previous jobs.
   
   That sounds exactly right: we should not try to clean up data from a 
previous job. That risks data loss in certain setups, when users configure 
things in an "unconventional" way.
   
   > Moreover, for HDFS (or other non-object storage), the directory can only 
be discarded if it is empty. However, we often run into the problem that the 
checkpoint coordinator does not delete files fast enough, which leaves the 
sub-directory not exactly empty. In the end, we cannot ensure the directory is 
cleaned up as expected.
   > To make cleanup safer, we could add documentation saying not to set the 
savepoint target location under the job-id sub-directory. What do you think?
   
   I am not quite convinced here either. This again sounds like risking data 
loss when users make a setup error. "Fixing by documentation" often does not 
prevent these things.
   What about "incremental savepoints" in the future? They would leave data in 
the "shared" directory.
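   For context, the failure mode discussed above can be sketched with the 
standard `java.nio.file` API (the helper name is illustrative, not anything 
from Flink): a plain, non-recursive directory delete fails while 
late-discarded files are still present, because `Files.delete` throws 
`DirectoryNotEmptyException` on a non-empty directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the discussed failure mode: non-recursive deletion of a
// checkpoint directory fails as long as any file is still inside it.
public class NonRecursiveCleanup {

    // Returns true only if the directory was actually removed.
    static boolean tryDeleteIfEmpty(Path dir) throws IOException {
        try {
            Files.delete(dir); // non-recursive by definition
            return true;
        } catch (DirectoryNotEmptyException e) {
            return false; // e.g. the coordinator has not discarded all files yet
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("chk-");
        Path straggler = Files.createFile(dir.resolve("late-file"));

        System.out.println(tryDeleteIfEmpty(dir)); // still contains a file

        Files.delete(straggler);
        System.out.println(tryDeleteIfEmpty(dir)); // now empty, delete succeeds
    }
}
```

This is why, under slow file discarding, the job-id directory can be left 
behind even when the intent was to remove it.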
