DB Tsai created SPARK-1401:
------------------------------
Summary: Use mapPartitions instead of map to avoid creating
expensive objects in GradientDescent optimizer
Key: SPARK-1401
URL: https://issues.apache.org/jira/browse/SPARK-1401
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: DB Tsai
Priority: Minor
In GradientDescent, each row of the input data currently creates its own
gradient matrix object, and these objects are then summed in the reducer.
We found that when the number of features is on the order of thousands, this
becomes the bottleneck. The situation was even worse when we tested a Newton
optimizer, because the dimension of the Hessian matrix is so large.
In our testing, when the number of features is in the hundreds of thousands,
GC kicks in for each row of input and sometimes brings down the workers.
By aggregating lossSum and gradientSum with mapPartitions, we eliminated the
GC issue and scale better with the number of features.
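For illustration, a minimal sketch of the per-partition aggregation (not the
actual patch): the RDD layout, the compute callback that adds a point's
gradient in place, and the plain-array vectors are all assumptions.
{code:scala}
import org.apache.spark.rdd.RDD

def aggregateGradient(
    data: RDD[(Double, Array[Double])],
    weights: Array[Double],
    // illustrative callback: adds one point's gradient into cumGradient
    // in place and returns that point's loss
    compute: (Double, Array[Double], Array[Double], Array[Double]) => Double)
  : (Array[Double], Double) = {
  data.mapPartitions { iter =>
    // One gradient buffer and one loss accumulator per partition,
    // instead of one gradient object per row.
    val gradientSum = new Array[Double](weights.length)
    var lossSum = 0.0
    iter.foreach { case (label, features) =>
      lossSum += compute(label, features, weights, gradientSum)
    }
    Iterator.single((gradientSum, lossSum))
  }.reduce { case ((g1, l1), (g2, l2)) =>
    // Elementwise sum of the per-partition gradients.
    var i = 0
    while (i < g1.length) { g1(i) += g2(i); i += 1 }
    (g1, l1 + l2)
  }
}
{code}
Because each partition reuses a single gradientSum buffer, object churn
scales with the number of partitions rather than the number of rows.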
--
This message was sent by Atlassian JIRA
(v6.2#6252)