Github user tdas commented on the pull request:
https://github.com/apache/spark/pull/828#issuecomment-43825145
I don't think this cachePoint is a good idea at all. While it *can* give
better performance, it fundamentally breaks the fault-tolerance properties of
RDDs. If you cachePoint() an RDD with MEMORY_ONLY and an executor then dies,
there is no way to recover the lost partitions, because no lineage
information remains describing how that RDD was created. All Spark operations
maintain this guarantee of fault tolerance despite failed workers, and
breaking it is a bad idea. So this is a fundamentally unsafe operation to
expose to the end user.
In fact, this is the same reason why checkpoint() was implemented using
HDFS: the fault-tolerance property is maintained (data is saved to
fault-tolerant storage) even if executors die.
That said, there is a good middle ground here. We can do what
cachePoint() does while ensuring that the data is replicated across
executors (a better fault-tolerance guarantee) but not expose it to users
(so that it does not break public API semantics). This would be an ALS-only solution.
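To make the trade-off concrete, here is a sketch (not code from this PR; `sc`, `input`, and `expensiveStep` are hypothetical, and running it requires a Spark runtime) contrasting the safe checkpoint() route with the replicated in-memory middle ground:

```scala
import org.apache.spark.storage.StorageLevel

// Safe route: checkpoint() writes the RDD to fault-tolerant storage
// (e.g. HDFS), so lineage can be truncated without losing the ability
// to recover partitions after an executor failure.
sc.setCheckpointDir("hdfs:///spark-checkpoints")
val checkpointed = input.map(expensiveStep)
checkpointed.checkpoint()
checkpointed.count() // an action is needed to materialize the checkpoint

// Middle ground discussed above: keep the data in executor memory but
// replicated (MEMORY_ONLY_2 stores each partition on two executors),
// so the loss of a single executor does not lose partitions outright --
// though simultaneous loss of both replicas is still unrecoverable
// once lineage is dropped.
val replicated = input.map(expensiveStep).persist(StorageLevel.MEMORY_ONLY_2)
```

For context, Spark later exposed essentially this idea publicly as RDD.localCheckpoint(), which truncates lineage while relying on (replicated) executor-local storage rather than HDFS, with the same caveat about weaker fault tolerance.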