DB Tsai created SPARK-1401:
------------------------------

             Summary: Use mapPartitions instead of map to avoid creating 
expensive objects in GradientDescent optimizer
                 Key: SPARK-1401
                 URL: https://issues.apache.org/jira/browse/SPARK-1401
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: DB Tsai
            Priority: Minor


In GradientDescent, currently, each row of the input data creates its own 
gradient vector object, and then we sum them up in the reducer. 

We found that when the number of features is on the order of thousands, this 
object creation becomes the bottleneck. The situation was worse when we tested 
with a Newton optimizer, since the dimension of the Hessian matrix is far 
larger. 

In our testing, when the number of features is in the hundreds of thousands, 
the GC kicks in for each row of input, and it sometimes brings down the 
workers. 

By aggregating the lossSum and gradientSum per partition using mapPartitions, 
we solved the GC issue and scale better with the number of features.
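Below is a minimal sketch of the before/after pattern, not the actual 
GradientDescent code: the computeGradient stub, the RDD element type, and the 
raw Array[Double] buffers are illustrative assumptions. The idea is that map 
allocates one gradient buffer per row (and reduce allocates another per merge), 
while mapPartitions allocates a single buffer per partition and folds every row 
into it in place.

{code:scala}
import org.apache.spark.rdd.RDD

// Hypothetical per-example gradient: accumulates the gradient of one
// example into a preallocated buffer and returns that example's loss.
// (Illustrative signature only, not the actual MLlib Gradient API.)
def computeGradient(label: Double, features: Array[Double],
                    weights: Array[Double], cumGradient: Array[Double]): Double = {
  // ... add this example's gradient into cumGradient in place ...
  0.0
}

// Before: one allocation per row in map, plus one per pairwise merge in reduce.
def gradientSumPerRow(data: RDD[(Double, Array[Double])],
                      weights: Array[Double], n: Int): (Array[Double], Double) =
  data.map { case (label, features) =>
    val grad = new Array[Double](n)            // fresh buffer for every row
    val loss = computeGradient(label, features, weights, grad)
    (grad, loss)
  }.reduce { case ((g1, l1), (g2, l2)) =>
    val sum = new Array[Double](n)             // another buffer per merge
    var i = 0
    while (i < n) { sum(i) = g1(i) + g2(i); i += 1 }
    (sum, l1 + l2)
  }

// After: one allocation per partition; reduce only merges one result
// per partition, in place.
def gradientSumPerPartition(data: RDD[(Double, Array[Double])],
                            weights: Array[Double], n: Int): (Array[Double], Double) =
  data.mapPartitions { iter =>
    val gradientSum = new Array[Double](n)     // single buffer per partition
    var lossSum = 0.0
    iter.foreach { case (label, features) =>
      lossSum += computeGradient(label, features, weights, gradientSum)
    }
    Iterator((gradientSum, lossSum))
  }.reduce { case ((g1, l1), (g2, l2)) =>
    var i = 0
    while (i < n) { g1(i) += g2(i); i += 1 }   // merge in place, no new buffer
    (g1, l1 + l2)
  }
{code}

With this pattern the number of gradient-sized allocations drops from one per 
row to one per partition, which is what relieves the GC pressure described 
above.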



