Github user hhbyyh commented on the issue:
https://github.com/apache/spark/pull/17014
I'm trying to refresh my memory and clarify the goals for this topic.
Basically we want to achieve:
1. Avoid double caching. If the input Dataset is already cached, then we should
not cache the internal RDD.
2. If the input Dataset is not cached, some algorithms may need internal RDD
caching to avoid the MLlib warning and unnecessary re-computation. But I'm not
sure about the scope (should we add this for all the algorithms? That would be
a behavior change for many of them). I don't think we have a reliable way to
detect whether a Dataset should be cached (its parent may already be cached),
so I'm not sure we should take action based on such a speculative condition.
3. Avoid public API change.
Let me know if I missed anything. I'll check the code now.
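For point 1, a minimal sketch of what I have in mind (not the actual PR code): an
algorithm could skip internal caching when the caller has already persisted the
input. This assumes `Dataset.storageLevel` (available since Spark 2.1) is the
signal; note it does not detect a cached parent, which is exactly the open
question in point 2. The method name and storage level below are illustrative
only.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

// Hypothetical training entry point, for illustration only.
def fitWithConditionalCaching[T](dataset: Dataset[T]): Unit = {
  // Only cache the internal RDD if the caller has not cached the Dataset.
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  val instances = dataset.rdd  // internal representation used by the algorithm

  if (handlePersistence) {
    instances.persist(StorageLevel.MEMORY_AND_DISK)
  }
  try {
    // ... run the iterative training loop over `instances` ...
  } finally {
    if (handlePersistence) {
      instances.unpersist()
    }
  }
}
```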