Github user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/1082#issuecomment-50360869
Since broadcast variables also have an `unpersist()` method, should
`unpersistAll()` unpersist them too? If not, we should probably give it a more
specific name, such as `unpersistAllRdds()`.
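For reference, the RDD-only behavior is already easy to express on top of the existing `SparkContext.getPersistentRDDs` API. Here is a minimal sketch (not necessarily what this PR does; the helper object and method name are made up for illustration):

```scala
import org.apache.spark.SparkContext

object UnpersistUtil {
  // Unpersist every RDD currently marked as persistent in this SparkContext.
  // This only touches cached RDDs; broadcast variables would need separate handling.
  def unpersistAllRdds(sc: SparkContext, blocking: Boolean = false): Unit =
    sc.getPersistentRDDs.values.foreach(_.unpersist(blocking))
}
```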
What do you think is the typical use-case for wanting to unpersist all
state while retaining RDD lineage, etc.? I imagine that a Spark Job Server that
runs multiple jobs using the same context might want to unpersist _portions_ of
its state as jobs finish, but bulk-unpersisting all state could be a bad
approach there, since it might hurt the performance of jobs that are still
running.
It seems like this job-server use-case could be addressed by a mechanism
similar to region-based memory management, in which unpersistable objects are
created in some persistence context, persistence contexts can be nested, and an
entire persistence context's resources can be freed at once. This captures
`unpersistAll()` as a special case in which every object is allocated as part
of the root persistence context. Individual jobs could have their own
persistence contexts, allowing users to unpersist only the new broadcast
variables / RDDs associated with a particular job / unit of work.
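To make that concrete, here is a rough sketch of what such an API could look like. None of these names (`PersistenceContext`, `Unpersistable`, `register`, `newChild`, `close`) exist in Spark today; they are purely illustrative. A job server would open one child context per job, register that job's cached RDDs and broadcasts in it, and close the context when the job finishes; `unpersistAll()` then falls out as closing the root context.

```scala
import scala.collection.mutable

// Anything that can be unpersisted (cached RDDs, broadcast variables, ...).
trait Unpersistable {
  def unpersist(blocking: Boolean): Unit
}

class PersistenceContext {
  private val registered = mutable.Buffer.empty[Unpersistable]
  private val children = mutable.Buffer.empty[PersistenceContext]

  // Called whenever an RDD is cached or a broadcast is created in this context.
  def register(obj: Unpersistable): Unit = registered += obj

  // Open a nested context, e.g. one per job in a job server.
  def newChild(): PersistenceContext = {
    val child = new PersistenceContext
    children += child
    child
  }

  // Free everything allocated in this context and in any nested contexts;
  // enclosing contexts are left untouched.
  def close(blocking: Boolean = false): Unit = {
    children.foreach(_.close(blocking))
    registered.foreach(_.unpersist(blocking))
    children.clear()
    registered.clear()
  }
}
```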
This "persistence context" approach might be over-engineered, but I think
it would be helpful to come up with a small set of real use-cases where batch
unpersistence is needed to see whether `unpersistAll()` adequately addresses
them. `unpersist()` is essentially manual memory management, so maybe we can
borrow patterns from there, too.