[
https://issues.apache.org/jira/browse/SPARK-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
DB Tsai closed SPARK-1401.
--------------------------
Resolution: Duplicate
Fix Version/s: 0.9.1
> Use mapPartitions instead of map to avoid creating expensive objects in
> GradientDescent optimizer
> -----------------------------------------------------------------------------------------------
>
> Key: SPARK-1401
> URL: https://issues.apache.org/jira/browse/SPARK-1401
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: DB Tsai
> Priority: Minor
> Labels: easyfix, performance
> Fix For: 0.9.1
>
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In GradientDescent, each row of the input data currently creates its own
> gradient matrix object, and these objects are then summed in the reducer.
> We found that when the number of features is on the order of thousands, this
> becomes the bottleneck. The situation was worse when we tested with a Newton
> optimizer, because the dimension of the Hessian matrix is so large.
> In our testing, when the number of features is in the hundreds of thousands,
> the GC kicks in for every row of input, and it sometimes brings down the
> workers.
> By aggregating lossSum and gradientSum per partition using mapPartitions (see
> the sketch below), we solved the GC issue and scale better with the number of
> features.
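> A minimal sketch of the idea, assuming Breeze dense vectors and a
> hypothetical accumulateGradient helper with a least-squares loss (the real
> MLlib Gradient API differs): each partition reuses a single gradient
> accumulator instead of allocating a new gradient object per row.
> {code:scala}
> import breeze.linalg.{DenseVector => BDV, axpy}
> import org.apache.spark.rdd.RDD
>
> // Hypothetical helper: adds this row's gradient into cumGradient in place
> // and returns the row's loss, so no per-row gradient vector is allocated.
> def accumulateGradient(
>     features: BDV[Double],
>     label: Double,
>     weights: BDV[Double],
>     cumGradient: BDV[Double]): Double = {
>   // Least-squares placeholder loss; axpy adds diff * features into
>   // cumGradient without creating a temporary vector.
>   val diff = (weights dot features) - label
>   axpy(diff, features, cumGradient)
>   0.5 * diff * diff
> }
>
> def gradientAndLoss(
>     data: RDD[(Double, BDV[Double])],
>     weights: BDV[Double]): (BDV[Double], Double) = {
>   val n = weights.length
>   data.mapPartitions { iter =>
>     // One reusable accumulator per partition instead of one object per row,
>     // which is what keeps the GC quiet for high-dimensional features.
>     val gradSum = BDV.zeros[Double](n)
>     var lossSum = 0.0
>     iter.foreach { case (label, features) =>
>       lossSum += accumulateGradient(features, label, weights, gradSum)
>     }
>     Iterator.single((gradSum, lossSum))
>   }.reduce { case ((g1, l1), (g2, l2)) =>
>     (g1 += g2, l1 + l2) // += mutates g1 in place and returns it
>   }
> }
> {code}
> With this shape the reducer only combines one partial sum per partition,
> rather than one gradient object per input row.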