Github user szhem commented on the issue:
https://github.com/apache/spark/pull/19410
Hi @asolimando,
I believe that the solution with weak references will work and probably
with `ContextCleaner` too, but there are some points I'd like to discuss if you
don't mind
- Let's take `ContextCleaner` first. In that case we should have a pretty
simple solution, but the backward compatibility of `PeriodicRDDCheckpointer`
and `PeriodicGraphCheckpointer` will be lost, because
- it will be necessary to update these classes to prevent deleting
checkpoint files
- user will always have to provide
`spark.cleaner.referenceTracking.cleanCheckpoints` property in order to clean
unnecessary checkpoints.
- the users who haven't specified
`spark.cleaner.referenceTracking.cleanCheckpoints` previously (and I believe
there will be most of them) will be affected by this new unexpected behaviour
- In case of custom solution based on weak references
- it will be necessary to poll a reference queue at some place and moment
to remove unnecessary checkpoint files.
- polling the reference queue in the separate thread will complicate the
code
- polling the reference queue synchronously does not guarantee deleting
all the unnecessary checkpoint files.
In case of reference queue, could you please recommend the convenient place
in the source code to do it?
As for me
- setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by
default should work
- setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by
default will allow us to simplify `PeriodicRDDCheckpointer` and
`PeriodicGraphCheckpointer` too by deleting unnecessary code that cleans up
checkpoint files
- setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true` by
default sounds reasonable and should not break the code of those users who
didn't do it previously
- but setting `spark.cleaner.referenceTracking.cleanCheckpoints` to `true`
will probably break the backward compatibility (although this PR tries to
preserve it)
What do you think? Will the community accept setting
`spark.cleaner.referenceTracking.cleanCheckpoints` to `true`by default?
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]