Xiangrui Meng created SPARK-8520:
------------------------------------
Summary: Improve GLM's scalability on number of features
Key: SPARK-8520
URL: https://issues.apache.org/jira/browse/SPARK-8520
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Priority: Critical
MLlib's GLM implementation uses the driver to collect gradient updates. When
there are many features (>20 million), the driver becomes the performance
bottleneck. In practice, problems with a large feature dimension are common,
typically as a result of hashing or other feature transformations, so it is
important to improve MLlib's scalability in the number of features.
There are a couple of possible solutions:
1. Still use the driver to collect updates, but reduce the amount of data it
collects at each iteration.
2. Apply 2D partitioning to the training data and store the model coefficients
in a distributed fashion (e.g., vector-free L-BFGS).
3. Use a parameter server.
4. ...
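
As a rough illustration of option 1, here is a minimal sketch in plain Python (not Spark code). The issue does not say how the collected data would be reduced. This sketch assumes one possibility: each partition sends only the gradient entries for features that appear in its examples, instead of a dense vector over all features. The names `partition_gradient` and `merge_on_driver`, the least-squares loss, and the toy data are all hypothetical.

```python
from typing import Dict, List, Tuple

SparseVector = Dict[int, float]          # feature index -> value
Example = Tuple[SparseVector, float]     # (sparse features, label)


def partition_gradient(examples: List[Example], weights: SparseVector) -> SparseVector:
    """Least-squares gradient over one partition.

    Only indices touched by this partition's examples are stored.
    """
    grad: SparseVector = {}
    for features, label in examples:
        prediction = sum(weights.get(i, 0.0) * x for i, x in features.items())
        error = prediction - label
        for i, x in features.items():
            grad[i] = grad.get(i, 0.0) + error * x
    return grad


def merge_on_driver(partials: List[SparseVector]) -> SparseVector:
    """Sum sparse partial gradients, standing in for the driver-side merge."""
    total: SparseVector = {}
    for partial in partials:
        for i, g in partial.items():
            total[i] = total.get(i, 0.0) + g
    return total


# Toy setup (assumed values): 20 million features, two partitions,
# one hashed example each, and an all-zero model.
num_features = 20_000_000
partitions: List[List[Example]] = [
    [({3: 1.0, 17: 2.0}, 1.0)],
    [({3: 1.0, 19_999_999: 1.0}, 2.0)],
]
weights: SparseVector = {}

partials = [partition_gradient(p, weights) for p in partitions]
gradient = merge_on_driver(partials)

entries_sent = sum(len(p) for p in partials)      # what the driver receives
dense_entries = num_features * len(partitions)    # dense alternative
print(gradient, entries_sent, dense_entries)
```

In this toy case the driver receives 4 entries rather than 40 million. The saving depends entirely on how sparse each partition's gradient is, so this is only one way to read option 1, not a statement of what MLlib does.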