Re: Shutdown cleanup of disk-based resources that Spark creates

2021-04-06 Thread Steve Loughran
On Thu, 11 Mar 2021 at 19:58, Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> I agree with you about extending the documentation around this. Moreover,
> I support having specific unit tests for this.
>
> > There is clearly some demand for Spark to automatically clean up
> checkpoints on shutdown
>
> What about what I suggested on the PR? To clean up the checkpoint
> directory at shutdown, one can register the directory to be deleted at exit:
>
>  FileSystem fs = FileSystem.get(conf); // conf: the Hadoop Configuration
>  // the path is deleted when the FileSystem is closed, which Hadoop's own
>  // JVM shutdown hook does for cached filesystems
>  fs.deleteOnExit(checkpointPath);
>
I wouldn't recommend that. It's really for testing, and it should probably
get tagged as deprecated. Better for your own cleanup code to have some
atomic bool which makes the decision.


   1. It does the deletes sequentially: the more paths, the longer it takes.
   2. It doesn't notice/skip if a file has changed since it was added.
   3. It doesn't distinguish between files and dirs. So if you have a file
   /temp/1 and then replace it with a dir /temp/1, the entire tree gets
   deleted on shutdown. Is that what you wanted?

I've played with some optimisation of the s3a case (
https://github.com/apache/hadoop/pull/1924 ); but really it should be
something like:

- store any checksum/timestamp/size on submit, plus the dir/file status
- only delete on a match
- do this in a thread pool -- though you can't always create threads on
shutdown, can you?
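
Very roughly, something like the sketch below -- untested, and the class
and method names are invented for illustration:

  import java.io.IOException;
  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  class MatchingDeleteOnExit {
    private final FileSystem fs;
    // remember what each path looked like when it was registered
    private final Map<Path, FileStatus> registered = new ConcurrentHashMap<>();

    MatchingDeleteOnExit(FileSystem fs) { this.fs = fs; }

    void register(Path path) throws IOException {
      registered.put(path, fs.getFileStatus(path));
    }

    // to be invoked from a shutdown hook
    void deleteAllMatching() throws InterruptedException {
      ExecutorService pool = Executors.newFixedThreadPool(8);
      registered.forEach((path, then) -> pool.submit(() -> {
        try {
          FileStatus now = fs.getFileStatus(path);
          // skip anything that changed type or was rewritten since registration
          if (now.isDirectory() == then.isDirectory()
              && now.getModificationTime() == then.getModificationTime()) {
            fs.delete(path, true);
          }
        } catch (IOException ignored) {
          // already gone, or the filesystem is closing under us
        }
      }));
      pool.shutdown();
      pool.awaitTermination(30, TimeUnit.SECONDS);
    }
  }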

But of course, do that and something somewhere will break.

Safer to roll your own.
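
For example, a minimal sketch (again untested, with invented names):

  import java.io.IOException;
  import java.util.List;
  import java.util.concurrent.CopyOnWriteArrayList;
  import java.util.concurrent.atomic.AtomicBoolean;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  class CheckpointCleanup {
    private final AtomicBoolean cleanOnExit = new AtomicBoolean(false);
    private final List<Path> tracked = new CopyOnWriteArrayList<>();

    CheckpointCleanup(FileSystem fs) {
      Runtime.getRuntime().addShutdownHook(new Thread(() -> {
        if (!cleanOnExit.get()) {
          return;  // e.g. unexpected exit: keep the checkpoint data
        }
        for (Path path : tracked) {
          try {
            fs.delete(path, true);
          } catch (IOException ignored) {
            // best effort; the JVM is going down anyway
          }
        }
      }));
    }

    void track(Path path) { tracked.add(path); }

    // flip this once the application has completed normally
    void setCleanOnExit(boolean clean) { cleanOnExit.set(clean); }
  }

One caveat: the cached FileSystem may already have been closed by Hadoop's
own shutdown hook by the time yours runs, so in practice you'd register the
hook through Hadoop's ShutdownHookManager with a suitable priority.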


Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Attila Zsolt Piros
I agree with you about extending the documentation around this. Moreover,
I support having specific unit tests for this.

> There is clearly some demand for Spark to automatically clean up
> checkpoints on shutdown

What about what I suggested on the PR? To clean up the checkpoint directory
at shutdown, one can register the directory to be deleted at exit:

 FileSystem fs = FileSystem.get(conf); // conf: the Hadoop Configuration
 // the path is deleted when the FileSystem is closed, which Hadoop's own
 // JVM shutdown hook does for cached filesystems
 fs.deleteOnExit(checkpointPath);

On Thu, Mar 11, 2021 at 6:06 PM Nicholas Chammas 
wrote:

> OK, perhaps the best course of action is to leave the current behavior
> as-is but clarify the documentation for `.checkpoint()` and/or
> `cleanCheckpoints`.
>
> I personally find it confusing that `cleanCheckpoints` doesn't address
> shutdown behavior, and the Stack Overflow links I shared show that many people
> are in the same situation. There is clearly some demand for Spark to
> automatically clean up checkpoints on shutdown. But perhaps that should
> be... a new config? a rejected feature? something else? I dunno.
>
> Does anyone else have thoughts on how to approach this?
>
> On Wed, Mar 10, 2021 at 4:39 PM Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> > Checkpoint data is left behind after a normal shutdown, not just an
>> unexpected shutdown. The PR description includes a simple demonstration of
>> this.
>>
>> I think I might have overemphasized the "unexpected" adjective a bit, to
>> show you the value in the current behavior.
>>
>> The feature configured with
>> "spark.cleaner.referenceTracking.cleanCheckpoints" is about out-of-scope
>> references, without ANY shutdown.
>>
>> It would be hard, at that level (ShutdownHookManager), to distinguish
>> unexpected from intentional exits.
>> The user code (run by the driver) could contain a System.exit() added by
>> the developer for any number of reasons, so distinguishing unexpected from
>> intentional exits is not really an option.
>> Even a third-party library can contain a System.exit(). Would that be an
>> unexpected exit or an intentional one? You can see it is hard to tell.
>>
>> To test the real feature behind
>> "spark.cleaner.referenceTracking.cleanCheckpoints", create a reference
>> within a scope that is then closed: for example, store it only in a local
>> variable inside the body of a function (one without a return value). After
>> the scope is closed, i.e. when the caller gets control back, you have a
>> chance to see the context cleaner working (you might even need to trigger
>> a GC too).
>>
>> On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Checkpoint data is left behind after a normal shutdown, not just an
>>> unexpected shutdown. The PR description includes a simple demonstration of
>>> this.
>>>
>>> If the current behavior is truly intended -- which I find difficult to
>>> believe given how confusing it is -- then at the very least
>>> we need to update the documentation for both `.checkpoint()` and
>>> `cleanCheckpoints` to make that clear.
>>>
>>> > This way even after an unexpected exit the next run of the same app
>>> should be able to pick up the checkpointed data.
>>>
>>> The use case you are describing potentially makes sense. But preserving
>>> checkpoint data after an unexpected shutdown -- even when
>>> `cleanCheckpoints` is set to true -- is a new guarantee that is not
>>> currently expressed in the API or documentation. At least as far as I can
>>> tell.
>>>
>>> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
>>> piros.attila.zs...@gmail.com> wrote:
>>>
 Hi Nick!

 I am not sure you are fixing a problem here. I think what you see as a
 problem is actually intended behaviour.

 Checkpoint data should outlive unexpected shutdowns. So there is a very
 important difference between a reference going out of scope during normal
 execution (in which case cleanup is expected, depending on the config you
 mentioned) and a reference going out of scope because of an unexpected
 error (in which case you should keep the checkpoint data).

 This way, even after an unexpected exit, the next run of the same app
 should be able to pick up the checkpointed data.

 Best Regards,
 Attila




 On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> Hello people,
>
> I'm working on a fix for SPARK-33000. Spark does not clean up
> checkpointed RDDs/DataFrames on shutdown, even if the appropriate
> configs are set.
>
> In the course of developing a fix, another contributor pointed out
> that checkpointed data may not be the only type of resource that needs a
> fix for shutdown cleanup.

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-11 Thread Nicholas Chammas
OK, perhaps the best course of action is to leave the current behavior
as-is but clarify the documentation for `.checkpoint()` and/or
`cleanCheckpoints`.

I personally find it confusing that `cleanCheckpoints` doesn't address
shutdown behavior, and the Stack Overflow links I shared show that many people
are in the same situation. There is clearly some demand for Spark to
automatically clean up checkpoints on shutdown. But perhaps that should
be... a new config? a rejected feature? something else? I dunno.

Does anyone else have thoughts on how to approach this?

On Wed, Mar 10, 2021 at 4:39 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> > Checkpoint data is left behind after a normal shutdown, not just an
> unexpected shutdown. The PR description includes a simple demonstration of
> this.
>
> I think I might have overemphasized the "unexpected" adjective a bit, to
> show you the value in the current behavior.
>
> The feature configured with
> "spark.cleaner.referenceTracking.cleanCheckpoints" is about out-of-scope
> references, without ANY shutdown.
>
> It would be hard, at that level (ShutdownHookManager), to distinguish
> unexpected from intentional exits.
> The user code (run by the driver) could contain a System.exit() added by
> the developer for any number of reasons, so distinguishing unexpected from
> intentional exits is not really an option.
> Even a third-party library can contain a System.exit(). Would that be an
> unexpected exit or an intentional one? You can see it is hard to tell.
>
> To test the real feature behind
> "spark.cleaner.referenceTracking.cleanCheckpoints", create a reference
> within a scope that is then closed: for example, store it only in a local
> variable inside the body of a function (one without a return value). After
> the scope is closed, i.e. when the caller gets control back, you have a
> chance to see the context cleaner working (you might even need to trigger
> a GC too).
>
> On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Checkpoint data is left behind after a normal shutdown, not just an
>> unexpected shutdown. The PR description includes a simple demonstration of
>> this.
>>
>> If the current behavior is truly intended -- which I find difficult to
>> believe given how confusing it is -- then at the very least
>> we need to update the documentation for both `.checkpoint()` and
>> `cleanCheckpoints` to make that clear.
>>
>> > This way even after an unexpected exit the next run of the same app
>> should be able to pick up the checkpointed data.
>>
>> The use case you are describing potentially makes sense. But preserving
>> checkpoint data after an unexpected shutdown -- even when
>> `cleanCheckpoints` is set to true -- is a new guarantee that is not
>> currently expressed in the API or documentation. At least as far as I can
>> tell.
>>
>> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
>> piros.attila.zs...@gmail.com> wrote:
>>
>>> Hi Nick!
>>>
>>> I am not sure you are fixing a problem here. I think what you see as a
>>> problem is actually intended behaviour.
>>>
>>> Checkpoint data should outlive unexpected shutdowns. So there is a very
>>> important difference between a reference going out of scope during normal
>>> execution (in which case cleanup is expected, depending on the config you
>>> mentioned) and a reference going out of scope because of an unexpected
>>> error (in which case you should keep the checkpoint data).
>>>
>>> This way, even after an unexpected exit, the next run of the same app
>>> should be able to pick up the checkpointed data.
>>>
>>> Best Regards,
>>> Attila
>>>
>>>
>>>
>>>
>>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Hello people,

 I'm working on a fix for SPARK-33000. Spark does not clean up
 checkpointed RDDs/DataFrames on shutdown, even if the appropriate
 configs are set.

 In the course of developing a fix, another contributor pointed out that
 checkpointed data may not be the only type of resource that needs a fix
 for shutdown cleanup.

 I'm looking for a committer who might have an opinion on how Spark
 should clean up disk-based resources on shutdown. The last people who
 contributed significantly to the ContextCleaner, where this cleanup
 happens, were @witgo and @andrewor14. But that was ~6 years ago, and I
 don't think they are active on the project anymore.

 Any takers to take a look and give their thoughts? The PR is small. +39 / -2.

Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Attila Zsolt Piros
> Checkpoint data is left behind after a normal shutdown, not just an
> unexpected shutdown. The PR description includes a simple demonstration of
> this.

I think I might have overemphasized the "unexpected" adjective a bit, to
show you the value in the current behavior.

The feature configured with
"spark.cleaner.referenceTracking.cleanCheckpoints" is about out-of-scope
references, without ANY shutdown.

It would be hard, at that level (ShutdownHookManager), to distinguish
unexpected from intentional exits.
The user code (run by the driver) could contain a System.exit() added by
the developer for any number of reasons, so distinguishing unexpected from
intentional exits is not really an option.
Even a third-party library can contain a System.exit(). Would that be an
unexpected exit or an intentional one? You can see it is hard to tell.

To test the real feature behind
"spark.cleaner.referenceTracking.cleanCheckpoints", create a reference
within a scope that is then closed: for example, store it only in a local
variable inside the body of a function (one without a return value). After
the scope is closed, i.e. when the caller gets control back, you have a
chance to see the context cleaner working (you might even need to trigger
a GC too).
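
In code, something along these lines (an untested sketch; the class name,
checkpoint dir and sleep are just illustrative):

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.JavaSparkContext;

  public class CleanCheckpointsDemo {

    // the RDD reference lives only in this method's local scope
    static void checkpointInScope(JavaSparkContext sc) {
      JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
      rdd.checkpoint();
      rdd.count();  // materializes the checkpoint on disk
    }               // rdd becomes unreachable when we return

    public static void main(String[] args) throws InterruptedException {
      SparkConf conf = new SparkConf()
          .setAppName("cleanCheckpointsDemo")
          .setMaster("local[2]")
          .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true");
      JavaSparkContext sc = new JavaSparkContext(conf);
      sc.setCheckpointDir("/tmp/checkpoints");

      checkpointInScope(sc);
      System.gc();         // nudge the collector so the cleaner notices
      Thread.sleep(5000);  // cleanup is asynchronous; give it a moment
      // the checkpoint files under /tmp/checkpoints should now be gone
      sc.stop();
    }
  }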

On Wed, Mar 10, 2021 at 10:09 PM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Checkpoint data is left behind after a normal shutdown, not just an
> unexpected shutdown. The PR description includes a simple demonstration of
> this.
>
> If the current behavior is truly intended -- which I find difficult to
> believe given how confusing it is -- then at the very least
> we need to update the documentation for both `.checkpoint()` and
> `cleanCheckpoints` to make that clear.
>
> > This way even after an unexpected exit the next run of the same app
> should be able to pick up the checkpointed data.
>
> The use case you are describing potentially makes sense. But preserving
> checkpoint data after an unexpected shutdown -- even when
> `cleanCheckpoints` is set to true -- is a new guarantee that is not
> currently expressed in the API or documentation. At least as far as I can
> tell.
>
> On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
> piros.attila.zs...@gmail.com> wrote:
>
>> Hi Nick!
>>
>> I am not sure you are fixing a problem here. I think what you see as a
>> problem is actually intended behaviour.
>>
>> Checkpoint data should outlive unexpected shutdowns. So there is a very
>> important difference between a reference going out of scope during normal
>> execution (in which case cleanup is expected, depending on the config you
>> mentioned) and a reference going out of scope because of an unexpected
>> error (in which case you should keep the checkpoint data).
>>
>> This way, even after an unexpected exit, the next run of the same app
>> should be able to pick up the checkpointed data.
>>
>> Best Regards,
>> Attila
>>
>>
>>
>>
>> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Hello people,
>>>
>>> I'm working on a fix for SPARK-33000. Spark does not clean up
>>> checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>>> configs are set.
>>>
>>> In the course of developing a fix, another contributor pointed out that
>>> checkpointed data may not be the only type of resource that needs a fix
>>> for shutdown cleanup.
>>>
>>> I'm looking for a committer who might have an opinion on how Spark
>>> should clean up disk-based resources on shutdown. The last people who
>>> contributed significantly to the ContextCleaner, where this cleanup
>>> happens, were @witgo and @andrewor14. But that was ~6 years ago, and I
>>> don't think they are active on the project anymore.
>>>
>>> Any takers to take a look and give their thoughts? The PR is small. +39 / -2.
>>>
>>> Nick
>>>
>>>


Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Nicholas Chammas
Checkpoint data is left behind after a normal shutdown, not just an
unexpected shutdown. The PR description includes a simple demonstration of
this.

If the current behavior is truly intended -- which I find difficult to
believe given how confusing it is -- then at the very least we
need to update the documentation for both `.checkpoint()` and
`cleanCheckpoints` to make that clear.

> This way, even after an unexpected exit, the next run of the same app
> should be able to pick up the checkpointed data.

The use case you are describing potentially makes sense. But preserving
checkpoint data after an unexpected shutdown -- even when
`cleanCheckpoints` is set to true -- is a new guarantee that is not
currently expressed in the API or documentation. At least as far as I can
tell.

On Wed, Mar 10, 2021 at 3:10 PM Attila Zsolt Piros <
piros.attila.zs...@gmail.com> wrote:

> Hi Nick!
>
> I am not sure you are fixing a problem here. I think what you see as a
> problem is actually intended behaviour.
>
> Checkpoint data should outlive unexpected shutdowns. So there is a very
> important difference between a reference going out of scope during normal
> execution (in which case cleanup is expected, depending on the config you
> mentioned) and a reference going out of scope because of an unexpected
> error (in which case you should keep the checkpoint data).
>
> This way, even after an unexpected exit, the next run of the same app
> should be able to pick up the checkpointed data.
>
> Best Regards,
> Attila
>
>
>
>
> On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Hello people,
>>
>> I'm working on a fix for SPARK-33000. Spark does not clean up
>> checkpointed RDDs/DataFrames on shutdown, even if the appropriate
>> configs are set.
>>
>> In the course of developing a fix, another contributor pointed out that
>> checkpointed data may not be the only type of resource that needs a fix
>> for shutdown cleanup.
>>
>> I'm looking for a committer who might have an opinion on how Spark should
>> clean up disk-based resources on shutdown. The last people who contributed
>> significantly to the ContextCleaner, where this cleanup happens, were
>> @witgo and @andrewor14. But that was ~6 years ago, and I don't think
>> they are active on the project anymore.
>>
>> Any takers to take a look and give their thoughts? The PR is small. +39 / -2.
>>
>> Nick
>>
>>


Re: Shutdown cleanup of disk-based resources that Spark creates

2021-03-10 Thread Attila Zsolt Piros
Hi Nick!

I am not sure you are fixing a problem here. I think what you see as a
problem is actually intended behaviour.

Checkpoint data should outlive unexpected shutdowns. So there is a very
important difference between a reference going out of scope during normal
execution (in which case cleanup is expected, depending on the config you
mentioned) and a reference going out of scope because of an unexpected
error (in which case you should keep the checkpoint data).

This way, even after an unexpected exit, the next run of the same app
should be able to pick up the checkpointed data.

Best Regards,
Attila




On Wed, Mar 10, 2021 at 8:10 PM Nicholas Chammas 
wrote:

> Hello people,
>
> I'm working on a fix for SPARK-33000. Spark does not clean up
> checkpointed RDDs/DataFrames on shutdown, even if the appropriate
> configs are set.
>
> In the course of developing a fix, another contributor pointed out that
> checkpointed data may not be the only type of resource that needs a fix
> for shutdown cleanup.
>
> I'm looking for a committer who might have an opinion on how Spark should
> clean up disk-based resources on shutdown. The last people who contributed
> significantly to the ContextCleaner, where this cleanup happens, were
> @witgo and @andrewor14. But that was ~6 years ago, and I don't think
> they are active on the project anymore.
>
> Any takers to take a look and give their thoughts? The PR is small. +39 / -2.
>
> Nick
>
>