[ https://issues.apache.org/jira/browse/SPARK-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14354890#comment-14354890 ]

Littlestar commented on SPARK-6240:
-----------------------------------

OK, I know now, thanks.
I had just noticed that Array[Item] is not a distributed structure.
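
For reference, a minimal sketch (not the actual MLlib implementation; the object and method names here are made up for illustration) of the same counting pipeline with the result kept distributed as an RDD[Item] rather than collected into a local Array[Item] on the driver:
{noformat}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

object FreqItemsSketch {
  // Hypothetical variant of genFreqItems: identical counting logic, but the
  // frequent items stay on the cluster instead of being collect()ed.
  // (The uniqueness check from the real code is omitted for brevity; this
  // version simply de-duplicates each transaction.)
  def genFreqItemsDistributed[Item: ClassTag](
      data: RDD[Array[Item]],
      minCount: Long,
      numPartitions: Int): RDD[Item] = {
    data.flatMap(_.toSet)                                      // de-duplicate per transaction
      .map(v => (v, 1L))
      .reduceByKey(new HashPartitioner(numPartitions), _ + _)  // count occurrences
      .filter(_._2 >= minCount)                                // keep frequent items only
      .sortBy(-_._2)                                           // descending frequency
      .map(_._1)                                               // still an RDD: nothing lands on the driver
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("freq-items-sketch").setMaster("local[*]"))
    val transactions = sc.parallelize(
      Seq(Array("a", "b"), Array("a", "c"), Array("a", "b")))
    // Tiny demo data, so collecting here is safe.
    genFreqItemsDistributed(transactions, minCount = 2L, numPartitions = 2)
      .collect()
      .foreach(println)
    sc.stop()
  }
}
{noformat}
Note that FP-Growth later builds an item-to-rank map from this array and uses it on every executor, so keeping the result distributed would require further downstream changes; the sketch only shows where the driver-side materialization happens.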

> Spark MLlib fpm#FPGrowth genFreqItems uses Array[Item], which may cause OutOfMemoryError for large sets
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6240
>                 URL: https://issues.apache.org/jira/browse/SPARK-6240
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Littlestar
>            Priority: Minor
>
> Spark MLlib fpm#FPGrowth genFreqItems uses Array[Item], which may cause OutOfMemoryError for large sets.
> {noformat}
>   private def genFreqItems[Item: ClassTag](
>       data: RDD[Array[Item]],
>       minCount: Long,
>       partitioner: Partitioner): Array[Item] = {
>     data.flatMap { t =>
>       val uniq = t.toSet
>       if (t.size != uniq.size) {
>         throw new SparkException(s"Items in a transaction must be unique but got ${t.toSeq}.")
>       }
>       t
>     }.map(v => (v, 1L))
>       .reduceByKey(partitioner, _ + _)
>       .filter(_._2 >= minCount)
>       .collect()
>       .sortBy(-_._2)
>       .map(_._1)
>   }
> {noformat}
> I tested with 10*10000*10000 records, outputting all pairs simultaneously.
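
A lighter-weight option, sketched under the assumption that a size cap is acceptable (maxFreqItems is a hypothetical parameter, not a Spark setting), is to count the frequent items on the cluster before collecting, so an oversized result fails fast with a clear error instead of an executor-to-driver OOM. This should be pasteable into spark-shell:
{noformat}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.reflect.ClassTag

def genFreqItemsGuarded[Item: ClassTag](
    data: RDD[Array[Item]],
    minCount: Long,
    maxFreqItems: Long): Array[Item] = {
  val freqCounts = data.flatMap(_.toSet)
    .map(v => (v, 1L))
    .reduceByKey(_ + _)
    .filter(_._2 >= minCount)
    .persist(StorageLevel.MEMORY_AND_DISK)  // reused by count() and collect()
  val n = freqCounts.count()                // cluster-side size check, no data moved yet
  require(n <= maxFreqItems,
    s"$n frequent items exceed the limit $maxFreqItems; consider raising minCount")
  val result = freqCounts.collect().sortBy(-_._2).map(_._1)
  freqCounts.unpersist()
  result
}
{noformat}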


