[ https://issues.apache.org/jira/browse/SPARK-14464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-14464.
----------------------------------
Resolution: Incomplete
> Logistic regression performs poorly for very large vectors, even when the
> number of non-zero features is small
> --------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-14464
> URL: https://issues.apache.org/jira/browse/SPARK-14464
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.6.0
> Reporter: Daniel Siegmann
> Priority: Major
> Labels: bulk-closed
>
> When training (a.k.a. fitting)
> org.apache.spark.ml.classification.LogisticRegression, aggregation is done
> using arrays (which are dense structures). This is the case regardless of
> whether the features of each instance are stored in a sparse or a dense
> vector.
> When the feature vectors are very large, performance is poor because there's
> a lot of overhead in transmitting these large arrays across workers.
> However, just because the feature vectors are large doesn't mean all these
> features are actually being used. If the actual features are sparse, very
> large arrays are being allocated unnecessarily.
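>
> For illustration, here is a rough sketch of the mismatch (using the 1.6-era
> org.apache.spark.mllib.linalg API; the dimensions below are made up for
> illustration, not taken from any real job):
> {code:scala}
> import org.apache.spark.mllib.linalg.Vectors
>
> // A sparse feature vector: 20 million dimensions, only 3 non-zero entries.
> val features = Vectors.sparse(20000000, Array(5, 12345, 19999999), Array(1.0, 2.0, 3.0))
> println(features.numNonzeros) // 3 -- the input itself is cheap to store and ship
>
> // During fitting, however, each aggregation buffer is a dense Array[Double]
> // sized to the full feature dimension, regardless of input sparsity:
> val bufferBytes = features.size.toLong * 8L
> println(bufferBytes / (1024 * 1024)) // ~152 MiB for every single buffer
> {code}
>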
> To address this case, there should be an option to aggregate using sparse
> vectors. It should be up to the user to set this explicitly as a parameter on
> the estimator, since the user should have some idea whether it is necessary
> for their particular case.
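>
> Something along these lines, where setSparseAggregation is a purely
> hypothetical parameter name used only to sketch the proposal:
> {code:scala}
> import org.apache.spark.ml.classification.LogisticRegression
>
> val lr = new LogisticRegression()
>   .setMaxIter(100)
>   // Hypothetical parameter -- does not exist in Spark today. When enabled,
>   // the gradient aggregation would use sparse vectors instead of dense
>   // arrays sized to the full feature dimension.
>   .setSparseAggregation(true)
> {code}
>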
> As an example, I have a use case where the feature vector size is around 20
> million, but there are only 7-8 thousand non-zero features. The time to train
> a model with only 10 workers is 1.3 hours on my test cluster. With 100 workers
> this balloons to over 10 hours on the same cluster! Also,
> spark.driver.maxResultSize is set to 112 GB to accommodate the data being
> pulled back to the driver.
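>
> For reference, the driver-side limit mentioned above can be raised like this
> (112g is the value from my test cluster; the default is only 1g, which the
> large dense buffers exceed immediately):
> {code:scala}
> import org.apache.spark.SparkConf
>
> // Raise the cap on the total size of results collected back to the driver.
> val conf = new SparkConf()
>   .set("spark.driver.maxResultSize", "112g")
> {code}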