OK, perhaps the best course of action is to leave the current behavior as-is but clarify the documentation for `.checkpoint()` and/or `cleanCheckpoints`.
I personally find it confusing that `cleanCheckpoints` doesn't address shutdown behavior, and the Stack Overflow links I shared <https://issues.apache.org/jira/browse/SPARK-33000> show that many people are in the same situation. There is clearly some demand for Spark to automatically clean up checkpoints on shutdown. But perhaps that should be... a new config? A rejected feature? Something else? I dunno. Does anyone else have thoughts on how to approach this?

On Wed, Mar 10, 2021 at 4:39 PM Attila Zsolt Piros <piros.attila.zs...@gmail.com> wrote:

> > Checkpoint data is left behind after a normal shutdown, not just an
> > unexpected shutdown. The PR description includes a simple demonstration
> > of this.
>
> I think I might have overemphasized the "unexpected" adjective a bit to
> show you the value in the current behavior.
>
> The feature configured with
> "spark.cleaner.referenceTracking.cleanCheckpoints" is about out-of-scope
> references, without ANY shutdown.
>
> It would be hard to distinguish at that level (ShutdownHookManager) the
> unexpected exits from the intentional ones.
> The user code (run by the driver) could contain a System.exit() which was
> added by the developer for numerous reasons, so distinguishing unexpected
> from intentional exits is not really an option.
> Even a third-party library can contain a System.exit(). Would that be an
> unexpected exit or an intentional one? You can see it is hard to tell.
>
> To test the real feature behind
> "spark.cleaner.referenceTracking.cleanCheckpoints", you can create a
> reference within a scope which is then closed: for example, within the
> body of a function (without a return value), stored only in a local
> variable. After the scope is closed (in our example, when the caller gets
> control back), you have a chance to see the context cleaner working (you
> might even need to trigger a GC too).
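Attila's suggested test — a checkpoint reference held only in a local variable, cleaned up after the scope closes and a GC runs, with no shutdown involved — can be sketched with a toy model. This is plain Python using `weakref`, standing in for the JVM reference tracking that ContextCleaner does; the class and path are illustrative, not Spark APIs:

```python
import gc
import weakref

class CheckpointData:
    """Stands in for an RDD's checkpoint files on disk (toy model, not Spark)."""
    def __init__(self, path):
        self.path = path

cleaned = []

def make_and_drop():
    # The reference lives only in this local variable, so it goes out of
    # scope when the function returns -- mirroring Attila's suggested test.
    data = CheckpointData("/tmp/checkpoint-0")
    # Register a cleanup callback that fires once the object is collected,
    # loosely analogous to ContextCleaner's reference tracking.
    weakref.finalize(data, cleaned.append, "/tmp/checkpoint-0")

make_and_drop()
gc.collect()  # "you might even need to trigger a GC too"
print(cleaned)  # ['/tmp/checkpoint-0'] -- cleaned with no shutdown involved
```

The point of the analogy: the cleanup trigger is the reference going out of scope plus a GC cycle, which is an entirely different event from the application shutting down.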
>
> On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Checkpoint data is left behind after a normal shutdown, not just an
>> unexpected shutdown. The PR description includes a simple demonstration
>> of this.
>>
>> If the current behavior is truly intended -- which I find difficult to
>> believe given how confusing <https://stackoverflow.com/q/52630858/877069>
>> it <https://stackoverflow.com/q/60009856/877069> is
>> <https://stackoverflow.com/q/61454740/877069> -- then at the very least
>> we need to update the documentation for both `.checkpoint()` and
>> `cleanCheckpoints` to make that clear.
>>
>> > This way even after an unexpected exit the next run of the same app
>> > should be able to pick up the checkpointed data.
>>
>> The use case you are describing potentially makes sense. But preserving
>> checkpoint data after an unexpected shutdown -- even when
>> `cleanCheckpoints` is set to true -- is a new guarantee that is not
>> currently expressed in the API or documentation, at least as far as I
>> can tell.
>>
>> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <piros.attila.zs...@gmail.com> wrote:
>>
>>> Hi Nick!
>>>
>>> I am not sure you are fixing a problem here. I think what you see as a
>>> problem is actually intended behaviour.
>>>
>>> Checkpoint data should outlive unexpected shutdowns. So there is a very
>>> important difference between a reference going out of scope during
>>> normal execution (in this case cleanup is expected, depending on the
>>> config you mentioned) and a reference going out of scope because of an
>>> unexpected error (in this case you should keep the checkpoint data).
>>>
>>> This way, even after an unexpected exit, the next run of the same app
>>> should be able to pick up the checkpointed data.
>>>
>>> Best Regards,
>>> Attila
>>>
>>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Hello people,
>>>>
>>>> I'm working on a fix for SPARK-33000
>>>> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
>>>> clean up checkpointed RDDs/DataFrames on shutdown, even if the
>>>> appropriate configs are set.
>>>>
>>>> In the course of developing a fix, another contributor pointed out
>>>> <https://github.com/apache/spark/pull/31742#issuecomment-790987483>
>>>> that checkpointed data may not be the only type of resource that needs
>>>> a fix for shutdown cleanup.
>>>>
>>>> I'm looking for a committer who might have an opinion on how Spark
>>>> should clean up disk-based resources on shutdown. The last people who
>>>> contributed significantly to the ContextCleaner, where this cleanup
>>>> happens, were @witgo <https://github.com/witgo> and @andrewor14
>>>> <https://github.com/andrewor14>. But that was ~6 years ago, and I
>>>> don't think they are active on the project anymore.
>>>>
>>>> Any takers to take a look and give their thoughts? The PR is small
>>>> <https://github.com/apache/spark/pull/31742>: +39 / -2.
>>>>
>>>> Nick
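The shutdown-hook ambiguity discussed above — that at the ShutdownHookManager level there is no way to tell an intentional exit from an "unexpected" one buried in user or library code — can be illustrated with a small toy model in plain Python, with `atexit` standing in for Spark's ShutdownHookManager. This is an analogy under that assumption, not Spark code:

```python
import subprocess
import sys
import textwrap

# Toy model: atexit stands in for Spark's ShutdownHookManager. The hook
# fires identically whether the program returns normally or calls exit(),
# so the hook alone cannot classify the shutdown as intentional or not.
PROGRAM = textwrap.dedent("""
    import atexit, sys
    atexit.register(lambda: print("shutdown hook ran"))
    if sys.argv[1] == "exit":
        sys.exit(1)  # e.g. a System.exit() buried in user code or a library
    # otherwise: fall through to a normal, intentional termination
""")

for mode in ("normal", "exit"):
    result = subprocess.run(
        [sys.executable, "-c", PROGRAM, mode],
        capture_output=True, text=True,
    )
    print(mode, "->", result.stdout.strip())
# normal -> shutdown hook ran
# exit -> shutdown hook ran
```

Both runs produce the same hook output, which is the crux of the design question in this thread: any "clean checkpoints on shutdown" behavior wired into a shutdown hook would delete data on unexpected exits too.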