Github user tdas commented on the pull request:
https://github.com/apache/spark/pull/828#issuecomment-43825145
I don't think this cachePoint is a good idea at all. While it *can* give
better performance, it fundamentally breaks the fault-tolerance properties of
RDDs. If you cachePoint() an RDD with MEMORY_ONLY and an executor then dies,
there is no way to recover the lost partitions, because no lineage
information remains describing how that RDD was created. All Spark operations
maintain this guarantee of fault tolerance despite failed workers, and
breaking it is a bad idea. So this is a fundamentally unsafe operation to
expose to the end user.
In fact, this is the same reason why checkpoint() was implemented using
HDFS: the fault-tolerance property is maintained (data is saved to
fault-tolerant storage) even if executors die.
That said, there is a good middle ground here. We can do what
cachePoint() does while ensuring that the data is replicated across
executors (a better fault-tolerance guarantee) but not expose it to users
(so that it does not break public API semantics). This would be an ALS-only solution.
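To make the trade-off concrete, here is a sketch (not code from this PR; `sc`, `input`, and `expensiveStep` are hypothetical, and running it requires a Spark runtime) contrasting the safe checkpoint() route with the replicated in-memory middle ground:

```scala
import org.apache.spark.storage.StorageLevel

// Safe route: checkpoint() writes the RDD to fault-tolerant storage
// (e.g. HDFS), so lineage can be truncated without losing the ability
// to recover partitions after an executor failure.
sc.setCheckpointDir("hdfs:///spark-checkpoints")
val checkpointed = input.map(expensiveStep)
checkpointed.checkpoint()
checkpointed.count() // an action is needed to materialize the checkpoint

// Middle ground discussed above: keep the data in executor memory but
// replicated (MEMORY_ONLY_2 stores each partition on two executors),
// so the loss of a single executor does not lose partitions outright --
// though simultaneous loss of both replicas is still unrecoverable
// once lineage is dropped.
val replicated = input.map(expensiveStep).persist(StorageLevel.MEMORY_ONLY_2)
```

For context, Spark later exposed essentially this idea publicly as RDD.localCheckpoint(), which truncates lineage while relying on (replicated) executor-local storage rather than HDFS, with the same caveat about weaker fault tolerance.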