[
https://issues.apache.org/jira/browse/SPARK-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354890#comment-14354890
]
Littlestar commented on SPARK-6240:
-----------------------------------
OK, I understand now, thanks.
I had only just noticed that Array[Item] is not a distributed structure.
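As a rough back-of-envelope check (plain Scala, no Spark; the transaction width and minCount below are assumed for illustration), the `filter(_._2 >= minCount)` step bounds how many items can survive to the `collect()`: at most totalItemOccurrences / minCount distinct items can each appear minCount times, so the driver-side array is typically small even for a large input.

```scala
// Hypothetical bound on the size of the collected Array[Item].
// numTransactions matches the 10*10000*10000 figure from the test below;
// avgItemsPerTxn and minCount are assumptions for illustration only.
object FreqItemsBound {
  def main(args: Array[String]): Unit = {
    val numTransactions  = 10L * 10000 * 10000   // 1e9, from the reported test
    val avgItemsPerTxn   = 10L                   // assumed transaction width
    val minCount         = 1000000L              // assumed ceil(minSupport * numTransactions)
    val totalOccurrences = numTransactions * avgItemsPerTxn
    // Each surviving item needs at least minCount occurrences,
    // so the count of frequent items cannot exceed this ratio.
    val maxFreqItems = totalOccurrences / minCount
    println(s"at most $maxFreqItems frequent items reach the driver")
  }
}
```

Under these assumptions the collected array holds at most 10,000 items, which is why the non-distributed `Array[Item]` is normally acceptable; memory pressure would only arise with a very low minCount.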
> Spark MLlib fpm#FPGrowth genFreqItems use Array[Item] may outOfMemory for
> Large Sets
> ------------------------------------------------------------------------------------
>
> Key: SPARK-6240
> URL: https://issues.apache.org/jira/browse/SPARK-6240
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Littlestar
> Priority: Minor
>
> Spark MLlib fpm#FPGrowth genFreqItems use Array[Item] may outOfMemory for
> Large Sets
> {noformat}
>   private def genFreqItems[Item: ClassTag](
>       data: RDD[Array[Item]],
>       minCount: Long,
>       partitioner: Partitioner): Array[Item] = {
>     data.flatMap { t =>
>       val uniq = t.toSet
>       if (t.size != uniq.size) {
>         throw new SparkException(s"Items in a transaction must be unique but got ${t.toSeq}.")
>       }
>       t
>     }.map(v => (v, 1L))
>       .reduceByKey(partitioner, _ + _)
>       .filter(_._2 >= minCount)
>       .collect()
>       .sortBy(-_._2)
>       .map(_._1)
>   }
> {noformat}
> I used 10*10000*10000 records for the test, outputting all pairs simultaneously.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)