[
https://issues.apache.org/jira/browse/SPARK-29827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean R. Owen resolved SPARK-29827.
----------------------------------
Resolution: Duplicate
Same general answer - it's not clear that persisting is a win here. input is
assumed to be persisted if the user wants to spend the resources - you actually
quote the warning there.
These are also not bugs.
> Wrong persist strategy in mllib.clustering.BisectingKMeans.run
> --------------------------------------------------------------
>
> Key: SPARK-29827
> URL: https://issues.apache.org/jira/browse/SPARK-29827
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.4.3
> Reporter: Dong Wang
> Priority: Major
>
> There are three persist misuses in mllib.clustering.BisectingKMeans.run.
> * First, the rdd {color:#de350b}_input_{color} should be persisted, because
> it was not only used by the action _first(),_ but also used by other __
> actions in the following code.
> * Second, the rdd {color:#de350b}_assignments_{color} should be persisted.
> It was used in the fuction _summarize()_ more than once, which containts an
> action on _assignments_.
> * Third, once the rdd _{color:#de350b}assignments{color}_ is persisted_,_
> persisting the rdd {color:#de350b}_norms_{color} would be unnecessary.
> Because {color:#de350b}_norms_ {color} is an intermediate rdd. Since its
> child rdd {color:#de350b}_assignments_{color} is persisted, it is unnecessary
> to persist {color:#de350b}_norms_{color} anymore.
> {code:scala}
> private[spark] def run(
> input: RDD[Vector],
> instr: Option[Instrumentation]): BisectingKMeansModel = {
> if (input.getStorageLevel == StorageLevel.NONE) {
> logWarning(s"The input RDD ${input.id} is not directly cached, which
> may hurt performance if"
> + " its parent RDDs are also not cached.")
> }
> // Needs to persist input
> val d = input.map(_.size).first()
> logInfo(s"Feature dimension: $d.")
> val dMeasure: DistanceMeasure =
> DistanceMeasure.decodeFromString(this.distanceMeasure)
> // Compute and cache vector norms for fast distance computation.
> val norms = input.map(v => Vectors.norm(v,
> 2.0)).persist(StorageLevel.MEMORY_AND_DISK) // Unnecessary persist
> val vectors = input.zip(norms).map { case (x, norm) => new
> VectorWithNorm(x, norm) }
> var assignments = vectors.map(v => (ROOT_INDEX, v)) // Needs to persist
> var activeClusters = summarize(d, assignments, dMeasure)
> {code}
> This issue is reported by our tool CacheCheck, which is used to dynamically
> detecting persist()/unpersist() api misuses.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]