yeyuqiang commented on a change in pull request #27052:
URL: https://github.com/apache/spark/pull/27052#discussion_r469348971



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala
##########
@@ -232,15 +227,13 @@ class KMeans private (
     val zippedData = data.zip(norms).map { case ((v, w), norm) =>
       (new VectorWithNorm(v, norm), w)
     }
-    zippedData.persist(StorageLevel.MEMORY_AND_DISK)
-    val model = runAlgorithmWithWeight(zippedData, instr)
-    zippedData.unpersist()
 
-    // Warn at the end of the run as well, for increased visibility.
     if (data.getStorageLevel == StorageLevel.NONE) {

Review comment:
      Hi, I was testing Spark KMeans. There seems to be an issue here: no matter whether we persist the parent RDD, `data.getStorageLevel` will always be `NONE`, because the following operation creates a fresh, unpersisted RDD. This leads to double caching.
   
   ```scala
   def run(data: RDD[Vector]): KMeansModel = {
       val instances: RDD[(Vector, Double)] = data.map {
         case (point) => (point, 1.0)
       }
       runWithWeight(instances, None)
   }
   ```
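
   One way to avoid the double caching would be to capture the parent RDD's storage level *before* the `map` hides it, and only persist the mapped RDD when the parent was not already cached. This is just an illustrative sketch; the `handlePersistence` flag and the adjusted `runWithWeight` signature are hypothetical, not the actual Spark API:

   ```scala
   import org.apache.spark.rdd.RDD
   import org.apache.spark.storage.StorageLevel

   // Sketch only: check the user's RDD, not the freshly mapped one.
   def run(data: RDD[Vector]): KMeansModel = {
     // true iff the caller did NOT persist the input themselves
     val handlePersistence = data.getStorageLevel == StorageLevel.NONE
     val instances: RDD[(Vector, Double)] = data.map(point => (point, 1.0))
     // Hypothetical overload that persists `instances` only when
     // handlePersistence is true, avoiding a second copy in memory.
     runWithWeight(instances, handlePersistence, None)
   }
   ```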




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
