Github user holdenk commented on the pull request:
https://github.com/apache/spark/pull/11919#issuecomment-212672336
@MLnick As mentioned in the line comments, that approach turns out to be less
simple than planned: checkpointing discards all of the parent dependency
information we need in order to clean up the shuffle files. I could refactor
this so that we capture the dependency information first - but a count() on a
cached RDD should be low enough cost that I'm not sure it would be worth it.
What are your thoughts?
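(For context, a minimal sketch - not part of this PR - of the lineage truncation being discussed: once an RDD is checkpointed and materialized by an action, Spark replaces its `dependencies` with a dependency on the checkpoint data, so the original `ShuffleDependency` is no longer reachable. The checkpoint directory path here is an arbitrary assumption.)

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointLineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("checkpoint-lineage-demo"))
    sc.setCheckpointDir("/tmp/ckpt") // assumed scratch directory

    // reduceByKey introduces a ShuffleDependency on the parent RDD
    val shuffled = sc.parallelize(1 to 100).map(x => (x % 10, x)).reduceByKey(_ + _)
    println(shuffled.dependencies) // includes the ShuffleDependency

    shuffled.cache()
    shuffled.checkpoint()
    shuffled.count() // action materializes the checkpoint

    // After checkpointing, the lineage is truncated: the dependencies now
    // point at the checkpoint data, and the shuffle parent is gone
    println(shuffled.dependencies)

    sc.stop()
  }
}
```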