[GitHub] spark issue #17014: [SPARK-18608][ML] Fix double-caching in ML algorithms

WeichenXu123 Tue, 29 Aug 2017 19:09:40 -0700

Github user WeichenXu123 commented on the issue:

    https://github.com/apache/spark/pull/17014
  
    cc @zhengruifeng 
    I updated my comment; please check it again, thanks!
    ------------
    I read the PR again, and it still does not resolve the double-caching
    issue in KMeans.
    In KMeans, your code checks
    val handlePersistence = dataset.storageLevel == StorageLevel.NONE
    Suppose the dataset is already cached:
    handlePersistence will be false, so the following code does not persist
    the instances RDD:
    if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
    BUT!!
    MLlibKMeans checks the storage level of the instance RDD that is passed
    in. That RDD is derived from the dataset and was never persisted itself,
    so its storage level is still NONE, and MLlibKMeans caches it again.
    ==> Double caching.
    
    So in this case we need to pass the handlePersistence flag to the mllib
    implementation code.
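
To make the scenario concrete, here is a minimal sketch in plain Scala with no Spark dependency. Everything in it is a hypothetical stand-in: `FakeData` models only a storage level, `fit` plays the role of the ml-side KMeans, and the two `mllibRun...` methods play the role of the mllib-side code. It is a toy model of the reasoning above, not the code in the PR.

```scala
object DoubleCachingSketch {
  sealed trait Level
  case object NONE extends Level
  case object MEMORY_AND_DISK extends Level

  // Hypothetical stand-in for a dataset or RDD: tracks only its own storage level.
  final class FakeData(var level: Level = NONE) {
    def persist(l: Level): Unit = { level = l }
    // Derived data starts uncached, regardless of the parent's level.
    def derive(): FakeData = new FakeData(NONE)
  }

  // mllib-side stand-in that decides from the storage level of what it receives.
  def mllibRunCheckingLevel(instances: FakeData): Unit =
    if (instances.level == NONE) instances.persist(MEMORY_AND_DISK)

  // mllib-side stand-in that trusts the caller's flag instead.
  def mllibRunWithFlag(instances: FakeData, handlePersistence: Boolean): Unit =
    if (handlePersistence) instances.persist(MEMORY_AND_DISK)

  // ml-side stand-in; returns the storage level the derived instances end up with.
  def fit(dataset: FakeData, passFlag: Boolean): Level = {
    val handlePersistence = dataset.level == NONE
    val instances = dataset.derive()
    if (passFlag) mllibRunWithFlag(instances, handlePersistence)
    else mllibRunCheckingLevel(instances)
    instances.level
  }

  def main(args: Array[String]): Unit = {
    val cached = new FakeData(MEMORY_AND_DISK)
    println(fit(cached, passFlag = false)) // MEMORY_AND_DISK: cached a second time
    println(fit(cached, passFlag = true))  // NONE: no second cache
  }
}
```

With an already-cached input, the level-checking variant persists the derived instances again, while the flag-passing variant leaves them uncached.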




