OK, perhaps the best course of action is to leave the current behavior
as-is but clarify the documentation for `.checkpoint()` and/or
`cleanCheckpoints`.

I personally find it confusing that `cleanCheckpoints` doesn't address
shutdown behavior, and the Stack Overflow links I shared
<https://issues.apache.org/jira/browse/SPARK-33000> show that many people
are in the same situation. There is clearly some demand for Spark to
automatically clean up checkpoints on shutdown. But perhaps that should
be... a new config? a rejected feature? something else? I dunno.

Does anyone else have thoughts on how to approach this?

On Wed, Mar 10, 2021 at 4:39 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> > Checkpoint data is left behind after a normal shutdown, not just an
> unexpected shutdown. The PR description includes a simple demonstration of
> this.
>
> I think I might overemphasized a bit the "unexpected" adjective to show
> you the value in the current behavior.
>
> The feature configured with
> "spark.cleaner.referenceTracking.cleanCheckpoints" is about out of scoped
> references without ANY shutdown.
>
> It would be hard to distinguish that level (ShutdownHookManager) the
> unexpected from the intentional exits.
> As the user code (run by driver) could contain a System.exit() which was
> added by the developer for numerous reasons (this way distinguishing
> unexpected and not unexpected is not really an option).
> Even a third party library can contain s System.exit(). Would that be an
> unexpected exit or intentional? You can see it is hard to tell.
>
> To test the real feature
> behind "spark.cleaner.referenceTracking.cleanCheckpoints" you can create a
> reference within a scope which is closed. For example within the body of a
> function (without return value) and store it only in a local
> variable. After the scope is closed in case of our function when the caller
> gets the control back you have chance to see the context cleaner working
> (you might even need to trigger a GC too).
>
> On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Checkpoint data is left behind after a normal shutdown, not just an
>> unexpected shutdown. The PR description includes a simple demonstration of
>> this.
>>
>> If the current behavior is truly intended -- which I find difficult to
>> believe given how confusing <https://stackoverflow.com/q/52630858/877069>
>> it <https://stackoverflow.com/q/60009856/877069> is
>> <https://stackoverflow.com/q/61454740/877069> -- then at the very least
>> we need to update the documentation for both `.checkpoint()` and
>> `cleanCheckpoints` to make that clear.
>>
>> > This way even after an unexpected exit the next run of the same app
>> should be able to pick up the checkpointed data.
>>
>> The use case you are describing potentially makes sense. But preserving
>> checkpoint data after an unexpected shutdown -- even when
>> `cleanCheckpoints` is set to true -- is a new guarantee that is not
>> currently expressed in the API or documentation. At least as far as I can
>> tell.
>>
>> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
>> piros.attila.zs...@gmail.com> wrote:
>>
>>> Hi Nick!
>>>
>>> I am not sure you are fixing a problem here. I think what you see is as
>>> problem is actually an intended behaviour.
>>>
>>> Checkpoint data should outlive the unexpected shutdowns. So there is a
>>> very important difference between the reference goes out of scope during a
>>> normal execution (in this case cleanup is expected depending on the config
>>> you mentioned) and when a references goes out of scope because of an
>>> unexpected error (in this case you should keep the checkpoint data).
>>>
>>> This way even after an unexpected exit the next run of the same app
>>> should be able to pick up the checkpointed data.
>>>
>>> Best Regards,
>>> Attila
>>>
>>>
>>>
>>>
>>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Hello people,
>>>>
>>>> I'm working on a fix for SPARK-33000
>>>> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
>>>> cleanup checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>>>> configs are set.
>>>>
>>>> In the course of developing a fix, another contributor pointed out
>>>> <https://github.com/apache/spark/pull/31742#issuecomment-790987483>
>>>> that checkpointed data may not be the only type of resource that needs a
>>>> fix for shutdown cleanup.
>>>>
>>>> I'm looking for a committer who might have an opinion on how Spark
>>>> should clean up disk-based resources on shutdown. The last people who
>>>> contributed significantly to the ContextCleaner, where this cleanup
>>>> happens, were @witgo <https://github.com/witgo> and @andrewor14
>>>> <https://github.com/andrewor14>. But that was ~6 years ago, and I
>>>> don't think they are active on the project anymore.
>>>>
>>>> Any takers to take a look and give their thoughts? The PR is small
>>>> <https://github.com/apache/spark/pull/31742>. +39 / -2.
>>>>
>>>> Nick
>>>>
>>>>

Reply via email to