[ https://issues.apache.org/jira/browse/SPARK-16008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng resolved SPARK-16008.
-----------------------------------
Resolution: Fixed
Fix Version/s: 2.0.0
Issue resolved by pull request 13729
[https://github.com/apache/spark/pull/13729]
> ML Logistic Regression aggregator serializes unnecessary data
> -------------------------------------------------------------
>
> Key: SPARK-16008
> URL: https://issues.apache.org/jira/browse/SPARK-16008
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Seth Hendrickson
> Assignee: Seth Hendrickson
> Fix For: 2.0.0
>
>
> The LogisticRegressionAggregator class is used to collect gradient updates in
> the ML logistic regression algorithm. The class stores a reference to the
> coefficients array, whose length equals the number of features. It also
> stores a reference to an array of standard deviations, which also has length
> numFeatures. When a task completes, it serializes the class, which also
> serializes a copy of the two arrays. These arrays don't need to be serialized
> (only the gradient updates are being aggregated). This causes performance
> issues when the number of features is large and can trigger excess garbage
> collection when the executor doesn't have much spare memory.
> This results in serializing 2*numFeatures excess values. When multiclass
> logistic regression is implemented, the excess will be numFeatures +
> numClasses * numFeatures.
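
The effect described above can be reproduced outside Spark with plain Java
serialization. The sketch below is only an illustration, not Spark's code and
not necessarily how pull request 13729 fixes it: the class names
EagerAggregator and LeanAggregator, the field names, and the use of
`transient` are all assumptions made for the example. It compares the
serialized size of an aggregator that keeps the two input arrays as ordinary
fields with one that excludes them.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

public class AggregatorSerializationSketch {

    /** Hypothetical aggregator: the input arrays are ordinary fields, so they
     *  are written out together with the gradient sum. */
    static class EagerAggregator implements Serializable {
        private static final long serialVersionUID = 1L;
        private final double[] coefficients;
        private final double[] featuresStd;
        private final double[] gradientSum;

        EagerAggregator(double[] coefficients, double[] featuresStd) {
            this.coefficients = coefficients;
            this.featuresStd = featuresStd;
            this.gradientSum = new double[coefficients.length];
        }
    }

    /** Hypothetical aggregator: the input arrays are marked transient, so
     *  only the gradient sum is written out. */
    static class LeanAggregator implements Serializable {
        private static final long serialVersionUID = 1L;
        private final transient double[] coefficients;
        private final transient double[] featuresStd;
        private final double[] gradientSum;

        LeanAggregator(double[] coefficients, double[] featuresStd) {
            this.coefficients = coefficients;
            this.featuresStd = featuresStd;
            this.gradientSum = new double[coefficients.length];
        }
    }

    /** Returns the number of bytes produced by Java serialization of obj. */
    static int serializedSize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int numFeatures = 100_000; // arbitrary size chosen for the demo
        double[] coefficients = new double[numFeatures];
        double[] featuresStd = new double[numFeatures];

        int eager = serializedSize(new EagerAggregator(coefficients, featuresStd));
        int lean = serializedSize(new LeanAggregator(coefficients, featuresStd));

        System.out.println("eager aggregator bytes: " + eager);
        System.out.println("lean aggregator bytes:  " + lean);
        System.out.println("excess bytes:           " + (eager - lean));
    }
}
```

With 8-byte doubles, the excess should be at least 2 * numFeatures * 8 bytes,
which matches the 2*numFeatures figure in the description. A transient field
is null after deserialization, so this approach only works if the merged
result never reads those arrays again.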