I agree with you that we should extend the documentation around this. Moreover, I support having specific unit tests for this.
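[For reference, the deleteOnExit approach suggested in the thread below could look roughly like this in driver code. This is only a sketch: the checkpoint directory path and the session setup are illustrative assumptions, not something taken from the PR.]

```scala
// Sketch: register the checkpoint directory with Hadoop's FileSystem so it is
// deleted by the FileSystem shutdown hook when the JVM exits normally.
// The path and app name below are illustrative assumptions.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("checkpoint-cleanup-sketch")
  .getOrCreate()
val sc = spark.sparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")

// getCheckpointDir returns Option[String] once a checkpoint dir is set
val checkpointPath = new Path(sc.getCheckpointDir.get)
val fs = FileSystem.get(checkpointPath.toUri, sc.hadoopConfiguration)
fs.deleteOnExit(checkpointPath) // removed on normal JVM exit, not on a crash
```

[Note that, as discussed below, this only covers a normal exit: the FileSystem shutdown hook does not run after an unexpected termination, which is consistent with keeping checkpoint data around for recovery.]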
> There is clearly some demand for Spark to automatically clean up checkpoints on shutdown

What about what I suggested on the PR? To clean up the checkpoint directory at shutdown, one can register the directory to be deleted at exit:

    FileSystem fs = FileSystem.get(conf);
    fs.deleteOnExit(checkpointPath);

On Thu, Mar 11, 2021 at 6:06 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> OK, perhaps the best course of action is to leave the current behavior as-is but clarify the documentation for `.checkpoint()` and/or `cleanCheckpoints`.
>
> I personally find it confusing that `cleanCheckpoints` doesn't address shutdown behavior, and the Stack Overflow links I shared <https://issues.apache.org/jira/browse/SPARK-33000> show that many people are in the same situation. There is clearly some demand for Spark to automatically clean up checkpoints on shutdown. But perhaps that should be... a new config? a rejected feature? something else? I dunno.
>
> Does anyone else have thoughts on how to approach this?
>
> On Wed, Mar 10, 2021 at 4:39 PM Attila Zsolt Piros <piros.attila.zs...@gmail.com> wrote:
>
>> > Checkpoint data is left behind after a normal shutdown, not just an unexpected shutdown. The PR description includes a simple demonstration of this.
>>
>> I think I might have overemphasized the "unexpected" adjective a bit, to show you the value in the current behavior.
>>
>> The feature configured with "spark.cleaner.referenceTracking.cleanCheckpoints" is about out-of-scope references, without ANY shutdown.
>>
>> It would be hard to distinguish, at that level (ShutdownHookManager), the unexpected from the intentional exits. The user code (run by the driver) could contain a System.exit() which was added by the developer for numerous reasons, so distinguishing unexpected from intentional exits is not really an option. Even a third-party library can contain a System.exit(). Would that be an unexpected exit or an intentional one?
>> You can see it is hard to tell.
>>
>> To test the real feature behind "spark.cleaner.referenceTracking.cleanCheckpoints", you can create a reference within a scope which is then closed -- for example, within the body of a function (without a return value), storing it only in a local variable. After the scope is closed -- in the case of our function, when the caller gets control back -- you have a chance to see the context cleaner working (you might even need to trigger a GC too).
>>
>> On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> Checkpoint data is left behind after a normal shutdown, not just an unexpected shutdown. The PR description includes a simple demonstration of this.
>>>
>>> If the current behavior is truly intended -- which I find difficult to believe given how confusing <https://stackoverflow.com/q/52630858/877069> it <https://stackoverflow.com/q/60009856/877069> is <https://stackoverflow.com/q/61454740/877069> -- then at the very least we need to update the documentation for both `.checkpoint()` and `cleanCheckpoints` to make that clear.
>>>
>>> > This way even after an unexpected exit the next run of the same app should be able to pick up the checkpointed data.
>>>
>>> The use case you are describing potentially makes sense. But preserving checkpoint data after an unexpected shutdown -- even when `cleanCheckpoints` is set to true -- is a new guarantee that is not currently expressed in the API or documentation. At least as far as I can tell.
>>>
>>> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <piros.attila.zs...@gmail.com> wrote:
>>>
>>>> Hi Nick!
>>>>
>>>> I am not sure you are fixing a problem here. I think what you see as a problem is actually intended behaviour.
>>>>
>>>> Checkpoint data should outlive unexpected shutdowns.
>>>> So there is a very important difference between a reference going out of scope during a normal execution (in this case cleanup is expected, depending on the config you mentioned) and a reference going out of scope because of an unexpected error (in this case you should keep the checkpoint data).
>>>>
>>>> This way, even after an unexpected exit, the next run of the same app should be able to pick up the checkpointed data.
>>>>
>>>> Best Regards,
>>>> Attila
>>>>
>>>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Hello people,
>>>>>
>>>>> I'm working on a fix for SPARK-33000 <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not clean up checkpointed RDDs/DataFrames on shutdown, even if the appropriate configs are set.
>>>>>
>>>>> In the course of developing a fix, another contributor pointed out <https://github.com/apache/spark/pull/31742#issuecomment-790987483> that checkpointed data may not be the only type of resource that needs a fix for shutdown cleanup.
>>>>>
>>>>> I'm looking for a committer who might have an opinion on how Spark should clean up disk-based resources on shutdown. The last people who contributed significantly to the ContextCleaner, where this cleanup happens, were @witgo <https://github.com/witgo> and @andrewor14 <https://github.com/andrewor14>. But that was ~6 years ago, and I don't think they are active on the project anymore.
>>>>>
>>>>> Any takers to take a look and give their thoughts? The PR is small <https://github.com/apache/spark/pull/31742>: +39 / -2.
>>>>>
>>>>> Nick
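[The scope-based test Attila describes earlier in the thread -- checkpointing inside a function so the reference goes out of scope, then triggering a GC -- could be sketched roughly as follows. The config key is the real Spark setting; the checkpoint path, sleep-based wait, and dataset are illustrative assumptions.]

```scala
// Sketch: exercise spark.cleaner.referenceTracking.cleanCheckpoints by letting
// a checkpointed RDD go out of scope, then nudging the GC so the async
// ContextCleaner can notice the dead reference and remove the checkpoint files.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("clean-checkpoints-sketch")
  .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
val sc = new SparkContext(conf)
sc.setCheckpointDir("/tmp/spark-checkpoints-test") // illustrative path

def checkpointInLocalScope(): Unit = {
  val rdd = sc.parallelize(1 to 100) // held only in a local variable
  rdd.checkpoint()
  rdd.count() // action materializes the checkpoint on disk
} // rdd goes out of scope here; no reference escapes the function

checkpointInLocalScope()
System.gc()         // make the out-of-scope RDD collectable
Thread.sleep(5000)  // crude wait for the asynchronous cleaner (assumption)
// The checkpoint files for that RDD should now have been removed by the
// ContextCleaner -- which, as noted above, is independent of any shutdown.
```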