[ https://issues.apache.org/jira/browse/SPARK-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354890#comment-14354890 ]
Littlestar commented on SPARK-6240:
-----------------------------------

OK, I know now. Thanks. I had only noticed that Array[Item] is not a distributed structure.

> Spark MLlib fpm#FPGrowth genFreqItems use Array[Item] may outOfMemory for Large Sets
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-6240
>                 URL: https://issues.apache.org/jira/browse/SPARK-6240
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Littlestar
>            Priority: Minor
>
> Spark MLlib fpm#FPGrowth genFreqItems use Array[Item] may outOfMemory for Large Sets
> {noformat}
> private def genFreqItems[Item: ClassTag](
>     data: RDD[Array[Item]],
>     minCount: Long,
>     partitioner: Partitioner): Array[Item] = {
>   data.flatMap { t =>
>     val uniq = t.toSet
>     if (t.size != uniq.size) {
>       throw new SparkException(s"Items in a transaction must be unique but got ${t.toSeq}.")
>     }
>     t
>   }.map(v => (v, 1L))
>     .reduceByKey(partitioner, _ + _)
>     .filter(_._2 >= minCount)
>     .collect()
>     .sortBy(-_._2)
>     .map(_._1)
> }
> {noformat}
> I used 10*10000*10000 records for the test, to output all pairs simultaneously.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
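For reference, the logic of genFreqItems can be sketched locally without Spark. This is a hedged, minimal sketch (names FreqItemsSketch and genFreqItemsLocal are hypothetical, not part of MLlib): it mirrors the count / filter-by-minCount / sort-descending pipeline on plain Scala collections. The point the discussion above makes is that in the real method, .collect() materializes the result on the driver, which is safe only when the set of frequent items (after the minCount filter) is small.

```scala
import scala.reflect.ClassTag

// Hypothetical local analogue of MLlib's genFreqItems, for illustration only.
object FreqItemsSketch {
  def genFreqItemsLocal[Item: ClassTag](
      data: Seq[Array[Item]],
      minCount: Long): Array[Item] = {
    data.flatMap { t =>
      val uniq = t.toSet
      // Same invariant the real method enforces via SparkException.
      require(t.length == uniq.size,
        s"Items in a transaction must be unique but got ${t.toSeq}.")
      t
    }.groupBy(identity)                                   // item -> occurrences
      .map { case (item, occs) => (item, occs.size.toLong) } // item -> count
      .filter(_._2 >= minCount)                           // keep frequent items
      .toSeq
      .sortBy(-_._2)                                      // descending by count
      .map(_._1)
      .toArray                                            // driver-side array
  }

  def main(args: Array[String]): Unit = {
    val transactions = Seq(Array("a", "b", "c"), Array("a", "b"), Array("a", "d"))
    // a occurs 3 times, b twice; c and d fall below minCount = 2.
    println(genFreqItemsLocal(transactions, minCount = 2L).toSeq) // List(a, b)
  }
}
```

Only the final array of frequent items lives on the driver; the per-transaction data stays distributed in the real Spark version, which is why collect() is usually acceptable here despite the large input.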