[
https://issues.apache.org/jira/browse/SPARK-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958438#comment-13958438
]
DB Tsai commented on SPARK-1401:
--------------------------------
In SPARK-1212, this issue is addressed by aggregate, so I am going to close it.
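A minimal, self-contained sketch of the aggregate-based approach referred to above (not the actual SPARK-1212 patch): every row of a partition is folded into one shared (gradientSum, lossSum) buffer, so no per-row gradient object is allocated. The squared-loss gradient, the (label, features) tuple layout, and the tiny in-line dataset are illustration-only assumptions.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

object AggregateGradientSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

    val numFeatures = 3
    val weights = Array(0.1, 0.2, 0.3)

    // (label, features) rows; in GradientDescent this would be the training RDD.
    val data = sc.parallelize(Seq(
      (1.0, Array(1.0, 0.0, 2.0)),
      (0.0, Array(0.5, 1.0, 0.0))))

    // aggregate folds every row of a partition into one shared
    // (gradientSum, lossSum) buffer instead of emitting a gradient per row.
    val (gradientSum, lossSum) =
      data.aggregate((new Array[Double](numFeatures), 0.0))(
        // seqOp: add one row's squared-loss gradient into the partition buffer
        { case ((grad, loss), (label, features)) =>
          var diff = -label
          var i = 0
          while (i < numFeatures) { diff += weights(i) * features(i); i += 1 }
          i = 0
          while (i < numFeatures) { grad(i) += diff * features(i); i += 1 }
          (grad, loss + 0.5 * diff * diff)
        },
        // combOp: merge the per-partition buffers
        { case ((g1, l1), (g2, l2)) =>
          var i = 0
          while (i < numFeatures) { g1(i) += g2(i); i += 1 }
          (g1, l1 + l2)
        })

    println(s"gradientSum = ${gradientSum.mkString(", ")}, lossSum = $lossSum")
    sc.stop()
  }
}
{code}

Here combOp only merges one buffer per partition, which gives the same effect the mapPartitions proposal below was asking for.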
> Use mapPartitions instead of map to avoid creating expensive objects in
> GradientDescent optimizer
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-1401
> URL: https://issues.apache.org/jira/browse/SPARK-1401
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: DB Tsai
> Priority: Minor
> Labels: easyfix, performance
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In GradientDescent, currently each row of the input data creates its own
> gradient matrix object, and then we sum them up in the reducer.
> We found that when the number of features is on the order of thousands, this
> becomes the bottleneck. The situation was worse when we tested with a Newton
> optimizer, because the Hessian matrix is so large.
> In our testing, when the number of features is in the hundreds of thousands,
> GC kicks in for each row of input, and it sometimes brings down the workers.
> By aggregating the lossSum and gradientSum using mapPartitions, we solved
> the GC issue and scale better with the number of features.
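For comparison, a sketch of the mapPartitions formulation described in the report, reusing data, weights, and numFeatures from the aggregate sketch above (still illustration-only assumptions): each partition emits exactly one (gradientSum, lossSum) pair, so the reducer sums one object per partition rather than one per row.

{code:scala}
// Each partition accumulates into a single local buffer and emits one pair.
val perPartition = data.mapPartitions { iter =>
  val grad = new Array[Double](numFeatures)
  var loss = 0.0
  iter.foreach { case (label, features) =>
    var diff = -label
    var i = 0
    while (i < numFeatures) { diff += weights(i) * features(i); i += 1 }
    i = 0
    while (i < numFeatures) { grad(i) += diff * features(i); i += 1 }
    loss += 0.5 * diff * diff
  }
  Iterator((grad, loss))   // one result object per partition, not per row
}

// The reducer now only sums one (gradientSum, lossSum) pair per partition.
val (gradientSum, lossSum) = perPartition.reduce { case ((g1, l1), (g2, l2)) =>
  var i = 0
  while (i < numFeatures) { g1(i) += g2(i); i += 1 }
  (g1, l1 + l2)
}
{code}

Either way, the per-row gradient allocation, and the GC pressure it causes at hundreds of thousands of features, goes away.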