I agree with you that we should extend the documentation around this.
Moreover, I support adding specific unit tests for this.

> There is clearly some demand for Spark to automatically clean up
checkpoints on shutdown

What about what I suggested on the PR? To clean up the checkpoint directory
at shutdown, one can register the directory to be deleted at exit:

 FileSystem fs = FileSystem.get(conf);
 fs.deleteOnExit(checkpointPath);
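
For reference, a slightly fuller sketch of the same idea (the jsc variable
and the use of JavaSparkContext are my assumptions; any Hadoop Configuration
that can resolve the checkpoint path's filesystem would do):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 // jsc is an existing JavaSparkContext whose checkpoint directory was
 // set earlier via jsc.setCheckpointDir(...).
 Path checkpointPath = new Path(jsc.getCheckpointDir().get());
 Configuration conf = jsc.hadoopConfiguration();
 FileSystem fs = checkpointPath.getFileSystem(conf);
 // Registered paths are removed when the FileSystem is closed, which
 // Hadoop does from a JVM shutdown hook.
 fs.deleteOnExit(checkpointPath);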

On Thu, Mar 11, 2021 at 6:06 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> OK, perhaps the best course of action is to leave the current behavior
> as-is but clarify the documentation for `.checkpoint()` and/or
> `cleanCheckpoints`.
>
> I personally find it confusing that `cleanCheckpoints` doesn't address
> shutdown behavior, and the Stack Overflow links I shared
> <https://issues.apache.org/jira/browse/SPARK-33000> show that many people
> are in the same situation. There is clearly some demand for Spark to
> automatically clean up checkpoints on shutdown. But perhaps that should
> be... a new config? a rejected feature? something else? I dunno.
>
> Does anyone else have thoughts on how to approach this?
>
> On Wed, Mar 10, 2021 at 4:39 PM Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> > Checkpoint data is left behind after a normal shutdown, not just an
>> unexpected shutdown. The PR description includes a simple demonstration of
>> this.
>>
>> I think I might have overemphasized the "unexpected" adjective a bit to
>> show you the value in the current behavior.
>>
>> The feature configured with
>> "spark.cleaner.referenceTracking.cleanCheckpoints" is about out-of-scope
>> references, without ANY shutdown.
>>
>> It would be hard, at that level (ShutdownHookManager), to distinguish
>> unexpected exits from intentional ones.
>> The user code (run by the driver) could contain a System.exit() added by
>> the developer for any number of reasons, so distinguishing unexpected
>> from intentional exits is not really an option.
>> Even a third-party library can contain a System.exit(). Would that be an
>> unexpected exit or an intentional one? You can see it is hard to tell.
>>
>> To test the real feature
>> behind "spark.cleaner.referenceTracking.cleanCheckpoints" you can create a
>> reference within a scope that is then closed. For example, create the
>> checkpointed RDD within the body of a function (one without a return
>> value) and store it only in a local variable. After the scope is closed,
>> i.e. when the caller gets control back, you have a chance to see the
>> context cleaner working (you might even need to trigger a GC too).
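>>
>> A minimal sketch of what I mean (the Java API, the method name, and the
>> local variable are just for illustration; assume the checkpoint directory
>> is already set on the context):
>>
>>  import java.util.Arrays;
>>  import org.apache.spark.api.java.JavaRDD;
>>  import org.apache.spark.api.java.JavaSparkContext;
>>
>>  // The RDD reference lives only in a local variable, so it goes out of
>>  // scope as soon as this method returns.
>>  static void checkpointInLocalScope(JavaSparkContext jsc) {
>>    JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(1, 2, 3));
>>    rdd.checkpoint();
>>    rdd.count();  // run an action so the checkpoint is actually written
>>  }
>>
>>  // ... back in the caller:
>>  checkpointInLocalScope(jsc);
>>  System.gc();  // the cleaner works off weak references, so a GC may be
>>                // needed before the checkpoint files get removed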
>>
>> On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Checkpoint data is left behind after a normal shutdown, not just an
>>> unexpected shutdown. The PR description includes a simple demonstration of
>>> this.
>>>
>>> If the current behavior is truly intended -- which I find difficult to
>>> believe given how confusing
>>> <https://stackoverflow.com/q/52630858/877069> it
>>> <https://stackoverflow.com/q/60009856/877069> is
>>> <https://stackoverflow.com/q/61454740/877069> -- then at the very least
>>> we need to update the documentation for both `.checkpoint()` and
>>> `cleanCheckpoints` to make that clear.
>>>
>>> > This way even after an unexpected exit the next run of the same app
>>> should be able to pick up the checkpointed data.
>>>
>>> The use case you are describing potentially makes sense. But preserving
>>> checkpoint data after an unexpected shutdown -- even when
>>> `cleanCheckpoints` is set to true -- is a new guarantee that is not
>>> currently expressed in the API or documentation. At least as far as I can
>>> tell.
>>>
>>> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
>>> piros.attila.zs...@gmail.com> wrote:
>>>
>>>> Hi Nick!
>>>>
>>>> I am not sure you are fixing a problem here. I think what you see as a
>>>> problem is actually intended behaviour.
>>>>
>>>> Checkpoint data should outlive unexpected shutdowns. So there is a very
>>>> important difference between a reference going out of scope during a
>>>> normal execution (in this case cleanup is expected, depending on the
>>>> config you mentioned) and a reference going out of scope because of an
>>>> unexpected error (in this case you should keep the checkpoint data).
>>>>
>>>> This way even after an unexpected exit the next run of the same app
>>>> should be able to pick up the checkpointed data.
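>>>>
>>>> To illustrate: on the streaming side this recovery pattern is even part
>>>> of the public API (the path and the factory method below are
>>>> placeholders, not real names):
>>>>
>>>>  import org.apache.spark.streaming.api.java.JavaStreamingContext;
>>>>
>>>>  // On restart, getOrCreate() rebuilds the context from the checkpoint
>>>>  // directory if one exists; otherwise it calls the creating function.
>>>>  JavaStreamingContext ssc = JavaStreamingContext.getOrCreate(
>>>>      "/tmp/app-checkpoint",             // placeholder path
>>>>      () -> createNewStreamingContext()  // hypothetical factory method
>>>>  );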
>>>>
>>>> Best Regards,
>>>> Attila
>>>>
>>>>
>>>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Hello people,
>>>>>
>>>>> I'm working on a fix for SPARK-33000
>>>>> <https://issues.apache.org/jira/browse/SPARK-33000>. Spark does not
>>>>> clean up checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>>>>> configs are set.
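>>>>>
>>>>> For example, here is a minimal setup where one would expect the
>>>>> checkpoint data to be cleaned up (app name, master, and paths are
>>>>> placeholders):
>>>>>
>>>>>  import org.apache.spark.SparkConf;
>>>>>  import org.apache.spark.api.java.JavaSparkContext;
>>>>>
>>>>>  SparkConf conf = new SparkConf()
>>>>>      .setAppName("checkpoint-cleanup-demo")  // placeholder
>>>>>      .setMaster("local[*]")
>>>>>      .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true");
>>>>>  JavaSparkContext jsc = new JavaSparkContext(conf);
>>>>>  jsc.setCheckpointDir("/tmp/spark-checkpoints");  // placeholder path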
>>>>>
>>>>> In the course of developing a fix, another contributor pointed out
>>>>> <https://github.com/apache/spark/pull/31742#issuecomment-790987483>
>>>>> that checkpointed data may not be the only type of resource that needs a
>>>>> fix for shutdown cleanup.
>>>>>
>>>>> I'm looking for a committer who might have an opinion on how Spark
>>>>> should clean up disk-based resources on shutdown. The last people who
>>>>> contributed significantly to the ContextCleaner, where this cleanup
>>>>> happens, were @witgo <https://github.com/witgo> and @andrewor14
>>>>> <https://github.com/andrewor14>. But that was ~6 years ago, and I
>>>>> don't think they are active on the project anymore.
>>>>>
>>>>> Anyone willing to take a look and give their thoughts? The PR is small
>>>>> <https://github.com/apache/spark/pull/31742>. +39 / -2.
>>>>>
>>>>> Nick
>>>>>
>>>>>
