[ https://issues.apache.org/jira/browse/SPARK-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958438#comment-13958438 ]

DB Tsai commented on SPARK-1401:
--------------------------------

This issue is addressed by aggregate in SPARK-1212. Going to close it.

> Use mapPartitions instead of map to avoid creating expensive objects in 
> GradientDescent optimizer
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-1401
>                 URL: https://issues.apache.org/jira/browse/SPARK-1401
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: DB Tsai
>            Priority: Minor
>              Labels: easyfix, performance
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In GradientDescent, each row of the input data currently creates its own 
> gradient matrix object, and then we sum them up in the reducer. 
> We found that when the number of features is on the order of thousands, this 
> becomes the bottleneck. The situation was worse when we tested with a Newton 
> optimizer, because the dimension of the Hessian matrix is so large. 
> In our testing, when the number of features is in the hundreds of thousands, the GC 
> kicks in for each row of input, and it sometimes brings down the workers. 
> By aggregating the lossSum and gradientSum using mapPartitions, we solved 
> the GC issue and scale better with the number of features (see the sketch below).
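
For reference, a minimal Scala sketch of the two patterns discussed above. The gradient helper (a least-squares gradient), Breeze dense vectors, and method names here are illustrative assumptions, not the actual MLlib GradientDescent code.

    import org.apache.spark.rdd.RDD
    import breeze.linalg.{DenseVector => BDV}

    object GradientSketch {
      // Illustrative per-example gradient for least squares: (w.x - y) * x.
      def gradient(w: BDV[Double], x: BDV[Double], y: Double): BDV[Double] =
        x * ((w dot x) - y)

      // Per-row map: allocates a fresh gradient vector for every example and
      // sums them in the reducer -- the GC-heavy pattern described above.
      def perRow(data: RDD[(Double, BDV[Double])], w: BDV[Double]): BDV[Double] =
        data.map { case (y, x) => gradient(w, x, y) }.reduce(_ + _)

      // aggregate: one zero vector per partition task, updated in place, so
      // only a handful of large vectors are ever allocated.
      def perPartition(data: RDD[(Double, BDV[Double])], w: BDV[Double]): BDV[Double] =
        data.aggregate(BDV.zeros[Double](w.length))(
          (acc, p) => { acc += gradient(w, p._2, p._1); acc },
          (a, b) => { a += b; a }
        )
    }

The same effect can be obtained with mapPartitions by folding each partition's iterator into a single local accumulator and reducing the per-partition results; aggregate simply packages that pattern.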



--
This message was sent by Atlassian JIRA
(v6.2#6252)
