Xiangrui Meng created SPARK-8520:
------------------------------------

             Summary: Improve GLM's scalability on number of features
                 Key: SPARK-8520
                 URL: https://issues.apache.org/jira/browse/SPARK-8520
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 1.4.0
            Reporter: Xiangrui Meng
            Priority: Critical


MLlib's GLM implementation uses the driver to collect gradient updates. When
there are many features (more than 20 million), the driver becomes the
performance bottleneck. In practice, problems with a large feature dimension
are common, typically as a result of hashing or other feature
transformations, so it is important to improve MLlib's scalability with
respect to the number of features.
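
To give a rough sense of scale, here is a back-of-the-envelope sketch in
Python. It assumes gradients are sent as dense vectors of 8-byte doubles; the
number of partial gradients reaching the driver per iteration (num_partials)
is a hypothetical value chosen for illustration, not a measured one.

```python
# Rough estimate of the data the driver handles per iteration.
# Assumption: gradients are dense vectors of 8-byte doubles.
BYTES_PER_DOUBLE = 8


def dense_gradient_bytes(num_features):
    """Size in bytes of one dense gradient vector."""
    return num_features * BYTES_PER_DOUBLE


def driver_bytes_per_iteration(num_features, num_partials):
    """Bytes received by the driver if num_partials partial gradients
    (a hypothetical count) are sent to it in one iteration."""
    return num_partials * dense_gradient_bytes(num_features)


num_features = 20_000_000  # threshold mentioned in the issue
num_partials = 4           # hypothetical, for illustration only

one_vector_mb = dense_gradient_bytes(num_features) / 1e6
per_iteration_mb = driver_bytes_per_iteration(num_features, num_partials) / 1e6

print(f"one dense gradient: {one_vector_mb:.0f} MB")
print(f"per iteration at the driver: {per_iteration_mb:.0f} MB")
```

Under these assumptions a single gradient is already 160 MB, and the cost is
paid again at every iteration.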

There are a couple of possible solutions:

1. Still use the driver to collect updates, but reduce the amount of data it
collects at each iteration.
2. Apply 2D partitioning to the training data and store the model coefficients
in a distributed fashion (e.g., vector-free L-BFGS).
3. Use a parameter server.
4. ...
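
As a minimal sketch of the idea behind option 2, the Python snippet below
splits a coefficient vector into feature blocks and computes a dot product
from per-block partial results, so that only one scalar per block has to be
exchanged rather than the full vector. This is not MLlib code and not a
proposed implementation: plain Python lists stand in for distributed storage,
and the helper names (split_into_blocks, partial_dot, blocked_dot) are
hypothetical.

```python
# Illustrative only: plain lists stand in for blocks held on different workers.


def split_into_blocks(values, block_size):
    """Split a dense vector into contiguous feature blocks."""
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]


def partial_dot(x_block, w_block):
    """Dot product restricted to one feature block."""
    return sum(x * w for x, w in zip(x_block, w_block))


def blocked_dot(x_blocks, w_blocks):
    """Sum the per-block scalars; only one number per block is combined."""
    return sum(partial_dot(xb, wb) for xb, wb in zip(x_blocks, w_blocks))


weights = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]   # model coefficients
example = [1.0, 2.0, 0.0, 4.0, 2.0, 2.0]     # one training example

w_blocks = split_into_blocks(weights, block_size=2)
x_blocks = split_into_blocks(example, block_size=2)

margin = blocked_dot(x_blocks, w_blocks)
full = sum(x * w for x, w in zip(example, weights))

print(margin, full)
```

The blocked result matches the ordinary dot product, which is why the full
coefficient vector never needs to be gathered in one place for this step.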



