Dong Wang created SPARK-29813:

             Summary: Missing persist in mllib.PrefixSpan.findFrequentItems()
                 Key: SPARK-29813
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.4.3
            Reporter: Dong Wang

This piece of code chains reduceByKey, sortBy, and collect, and the latter 
two can each trigger a job. But data is not persisted, so it may be recomputed.

  private[fpm] def findFrequentItems[Item: ClassTag](
      data: RDD[Array[Array[Item]]],
      minCount: Long): Array[Item] = {

    data.flatMap { itemsets =>
      val uniqItems = mutable.Set.empty[Item]
      itemsets.foreach(set => uniqItems ++= set)
      uniqItems.toIterator.map((_, 1L))
    }.reduceByKey(_ + _).filter { case (_, count) =>
      count >= minCount
    }.sortBy(-_._2).map(_._1).collect()
  }

This issue is reported by our tool CacheCheck, which dynamically detects 
persist()/unpersist() API misuses.
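
A minimal sketch of one way the input could be persisted around this 
computation. It is not the change adopted by the Spark project. The object 
name FrequentItemsSketch, the method name findFrequentItemsCached, and the 
MEMORY_AND_DISK storage level are assumptions made for illustration. The 
sketch caches data only when the caller has not already done so, and 
releases it afterwards.

```scala
import scala.collection.mutable
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object FrequentItemsSketch {

  // Hypothetical variant of findFrequentItems (name and storage level are
  // assumptions): persists the input for the duration of the computation.
  def findFrequentItemsCached[Item: ClassTag](
      data: RDD[Array[Array[Item]]],
      minCount: Long): Array[Item] = {

    // Only manage the cache here if the caller has not persisted the RDD.
    val persistedHere = data.getStorageLevel == StorageLevel.NONE
    if (persistedHere) data.persist(StorageLevel.MEMORY_AND_DISK)

    try {
      data.flatMap { itemsets =>
        // Count each item at most once per sequence.
        val uniqItems = mutable.Set.empty[Item]
        itemsets.foreach(set => uniqItems ++= set)
        uniqItems.iterator.map((_, 1L))
      }.reduceByKey(_ + _).filter { case (_, count) =>
        count >= minCount
      }.sortBy(-_._2).map(_._1).collect()
    } finally {
      // Release the cache only if this method created it.
      if (persistedHere) data.unpersist()
    }
  }
}
```

Checking getStorageLevel first avoids overriding a storage level the caller 
chose and avoids unpersisting an RDD that the caller still needs.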

This message was sent by Atlassian Jira
