Dong Wang created SPARK-29823:
---------------------------------

             Summary: Wrong persist strategy in mllib.clustering.KMeans.run()
                 Key: SPARK-29823
                 URL: https://issues.apache.org/jira/browse/SPARK-29823
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.4.3
            Reporter: Dong Wang

In mllib.clustering.KMeans.run(), the RDD norms is persisted. However, norms has only a single child RDD, zippedData, so persisting it is unnecessary. On the other hand, zippedData is used multiple times in runAlgorithm, so zippedData is the RDD that should be persisted.

{code:scala}
  private[spark] def run(
      data: RDD[Vector],
      instr: Option[Instrumentation]): KMeansModel = {

    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data is not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }

    // Compute squared norms and cache them.
    val norms = data.map(Vectors.norm(_, 2.0))
    norms.persist() // Unnecessary persist. Only used to generate zippedData.
    val zippedData = data.zip(norms).map { case (v, norm) =>
      new VectorWithNorm(v, norm)
    } // needs to persist
    val model = runAlgorithm(zippedData, instr)
    norms.unpersist() // Change to zippedData.unpersist()
{code}

This issue was reported by our tool CacheCheck, which dynamically detects misuses of the persist()/unpersist() APIs.
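
The change proposed in this report could look roughly like the sketch below. It is only an illustration under stated assumptions, not the actual Spark patch: PersistSketch and multiPass are hypothetical names, multiPass stands in for runAlgorithm, and a plain (Vector, Double) pair is used in place of VectorWithNorm, which as far as I know is not part of Spark's public API.

{code:scala}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

object PersistSketch {

  // Hypothetical stand-in for runAlgorithm: it makes several passes over
  // its input, which is why the input is worth persisting.
  def multiPass(points: RDD[(Vector, Double)], iterations: Int): Double = {
    var total = 0.0
    for (_ <- 0 until iterations) {
      total += points.map { case (_, norm) => norm }.sum()
    }
    total
  }

  def run(data: RDD[Vector]): Double = {
    // norms has a single consumer (the zip below), so it is not persisted.
    val norms = data.map(Vectors.norm(_, 2.0))

    // The zipped RDD is read once per iteration, so persist this one instead.
    val zippedData = data.zip(norms)
    zippedData.persist()
    try {
      multiPass(zippedData, iterations = 3)
    } finally {
      // Release the cached blocks once the multi-pass computation is done.
      zippedData.unpersist()
    }
  }
}
{code}

With this arrangement only the RDD that is actually reused is cached, and the unpersist() call matches the RDD that was persisted.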