[ https://issues.apache.org/jira/browse/SPARK-1401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

DB Tsai closed SPARK-1401.
--------------------------

       Resolution: Duplicate
    Fix Version/s: 0.9.1

> Use mapPartitions instead of map to avoid creating expensive objects in 
> GradientDescent optimizer
> -----------------------------------------------------------------------------------------------
>
>                 Key: SPARK-1401
>                 URL: https://issues.apache.org/jira/browse/SPARK-1401
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: DB Tsai
>            Priority: Minor
>              Labels: easyfix, performance
>             Fix For: 0.9.1
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In GradientDescent, each row of the input data currently creates its own 
> gradient vector object, and these objects are then summed in the reducer. 
> We found that when the number of features is on the order of thousands, this 
> becomes the bottleneck. The situation was worse when we tested with a Newton 
> optimizer, because the dimension of the Hessian matrix is so large. 
> In our testing, when the number of features is in the hundreds of thousands, 
> GC kicks in for each row of input and sometimes brings down the workers. 
> By aggregating the lossSum and gradientSum using mapPartitions, we solved 
> the GC issue and scale better with the number of features.
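>
> A minimal Scala sketch of the idea (hypothetical names: data, weights, 
> numFeatures, and computeGradient are illustrative stand-ins, not the 
> actual MLlib code):
>
> // Assumed inputs: data is an RDD[(Double, Array[Double])] of
> // (label, features); weights is the current Array[Double] of model
> // weights; numFeatures = weights.length.
>
> // Hypothetical per-example gradient for least squares: adds
> // x * (x . w - y) into grad in place and returns the example's loss.
> def computeGradient(x: Array[Double], y: Double, w: Array[Double],
>                     grad: Array[Double]): Double = {
>   var dot = 0.0
>   var i = 0
>   while (i < x.length) { dot += x(i) * w(i); i += 1 }
>   val diff = dot - y
>   i = 0
>   while (i < x.length) { grad(i) += diff * x(i); i += 1 }
>   0.5 * diff * diff
> }
>
> // Before: map allocates a fresh gradient array for every row, so the
> // reducer has to merge one short-lived object per input row.
> val (gradBefore, lossBefore) = data.map { case (label, features) =>
>   val grad = new Array[Double](numFeatures)   // one object per row
>   val loss = computeGradient(features, label, weights, grad)
>   (grad, loss)
> }.reduce { case ((g1, l1), (g2, l2)) =>
>   var i = 0
>   while (i < g1.length) { g1(i) += g2(i); i += 1 }
>   (g1, l1 + l2)
> }
>
> // After: mapPartitions reuses a single accumulator per partition, so
> // allocations scale with the number of partitions, not rows.
> val (gradientSum, lossSum) = data.mapPartitions { iter =>
>   val grad = new Array[Double](numFeatures)   // one object per partition
>   var loss = 0.0
>   iter.foreach { case (label, features) =>
>     loss += computeGradient(features, label, weights, grad)
>   }
>   Iterator.single((grad, loss))
> }.reduce { case ((g1, l1), (g2, l2)) =>
>   var i = 0
>   while (i < g1.length) { g1(i) += g2(i); i += 1 }
>   (g1, l1 + l2)
> }
>
> The reduce step is unchanged; only the number of intermediate gradient 
> objects shrinks from one per row to one per partition.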



--
This message was sent by Atlassian JIRA
(v6.2#6252)
