Dong Wang created SPARK-29823:
---------------------------------

             Summary: Wrong persist strategy in mllib.clustering.KMeans.run()
                 Key: SPARK-29823
                 URL: https://issues.apache.org/jira/browse/SPARK-29823
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.4.3
            Reporter: Dong Wang


In mllib.clustering.KMeans.run(), the RDD norms is persisted, but it has only a
single child RDD, zippedData, so the persist is unnecessary. On the other hand,
zippedData is used multiple times in runAlgorithm, so zippedData is the RDD
that should be persisted.

{code:scala}
  private[spark] def run(
      data: RDD[Vector],
      instr: Option[Instrumentation]): KMeansModel = {
    if (data.getStorageLevel == StorageLevel.NONE) {
      logWarning("The input data is not directly cached, which may hurt performance if its"
        + " parent RDDs are also uncached.")
    }
    // Compute squared norms and cache them.
    val norms = data.map(Vectors.norm(_, 2.0))
    norms.persist() // Unnecessary persist. Only used to generate zippedData.
    val zippedData = data.zip(norms).map { case (v, norm) =>
      new VectorWithNorm(v, norm)
    } // needs to persist
    val model = runAlgorithm(zippedData, instr)
    norms.unpersist() // Change to zippedData.unpersist()
{code}
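
A minimal sketch of the proposed change, written as a standalone helper rather than as a patch to KMeans itself. VecWithNorm, PersistSketch, and withZippedData are hypothetical names used only for illustration (Spark's own VectorWithNorm is private to the clustering package), and the default storage level is an assumption carried over from the norms.persist() call above.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical stand-in for Spark's package-private VectorWithNorm.
final case class VecWithNorm(vector: Vector, norm: Double)

object PersistSketch {
  // Hypothetical helper: persists zippedData (the RDD that is reused)
  // instead of norms (which has a single child), and always unpersists it.
  def withZippedData[T](data: RDD[Vector])(body: RDD[VecWithNorm] => T): T = {
    // norms feeds only zippedData, so it is left unpersisted.
    val norms = data.map(Vectors.norm(_, 2.0))
    val zippedData = data.zip(norms).map { case (v, norm) =>
      VecWithNorm(v, norm)
    }
    zippedData.persist() // default storage level, as in the original call
    try body(zippedData)
    finally zippedData.unpersist()
  }
}
```

In run(), the body passed to such a helper would be the call to runAlgorithm(zippedData, instr); the try/finally only ensures the cached blocks are released if the algorithm throws.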

This issue was reported by our tool CacheCheck, which dynamically detects
misuses of the persist()/unpersist() API.


