[
https://issues.apache.org/jira/browse/SPARK-2612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangrui Meng resolved SPARK-2612.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.1.0
> ALS has data skew for popular product
> -------------------------------------
>
> Key: SPARK-2612
> URL: https://issues.apache.org/jira/browse/SPARK-2612
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.0.0
> Reporter: Peng Zhang
> Assignee: Peng Zhang
> Fix For: 1.1.0
>
>
> Rating inputs usually contain a few popular products that are related to
> many users.
> groupByKey() in updateFeatures() may cause an extra shuffle stage that
> gathers all the data for a popular product into one task, because its RDD's
> partitioner may not be the one used by join().
> The following join() then needs to shuffle the aggregated product data. The
> shuffle block can easily exceed 2G, and the shuffle fails as described
> in SPARK-1476.
> Increasing the number of blocks doesn't help.
> IMHO, groupByKey() should use the same partitioner as the other RDD in
> join(). Then groupByKey() and join() will be in the same stage, and shuffle
> data coming from many previous tasks will not hit the "2G" limit.
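
The proposal above can be illustrated with a minimal sketch. It is not the patch that resolved this issue, only an example of passing one partitioner to both groupByKey() and join(). The object name, the `links` and `messages` RDDs, and the partition count are hypothetical stand-ins, not names taken from ALS.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
// Pair-RDD implicits are needed on early Spark 1.x; the import is harmless on later versions.
import org.apache.spark.SparkContext._

object CoPartitionedGroupJoin {
  // Returns true when the grouped RDD, the other join input, and the join
  // result all share the same partitioner.
  def run(sc: SparkContext): Boolean = {
    // Assumption: 4 partitions, chosen only for illustration.
    val partitioner = new HashPartitioner(4)

    // Hypothetical per-product data that is already partitioned by product id.
    val links = sc.parallelize(Seq(1 -> "a", 2 -> "b", 3 -> "c"))
      .partitionBy(partitioner)
      .cache()

    // Hypothetical messages keyed by product id; product 1 is the "popular" one.
    val messages = sc.parallelize(
      Seq(1 -> 0.1, 1 -> 0.2, 1 -> 0.3, 2 -> 0.4, 3 -> 0.5))

    // Pass the partitioner explicitly so the grouped output is laid out like `links`.
    val grouped = messages.groupByKey(partitioner)

    // Both inputs share the partitioner, so join() needs no additional shuffle
    // of the already aggregated per-product data.
    val joined = grouped.join(links, partitioner)
    joined.count() // force evaluation

    grouped.partitioner == links.partitioner &&
      joined.partitioner == Some(partitioner)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("copartition-sketch")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

With a shared partitioner, the only shuffle is the one feeding groupByKey(), and each reduce task fetches its input as many smaller blocks from the upstream map tasks instead of one large aggregated block per popular product.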