Seth Hendrickson created SPARK-16008:
----------------------------------------
Summary: ML Logistic Regression aggregator serializes unnecessary data
Key: SPARK-16008
URL: https://issues.apache.org/jira/browse/SPARK-16008
Project: Spark
Issue Type: Bug
Components: ML
Reporter: Seth Hendrickson
The LogisticRegressionAggregator class is used to collect gradient updates in
the ML logistic regression algorithm. The class stores a reference to the
coefficients array, whose length equals the number of features. It also stores
a reference to an array of standard deviations, which is also of length
numFeatures. When a task completes, it serializes the aggregator instance,
which also serializes a copy of the two arrays. These arrays don't need to be
serialized (only the gradient updates are being aggregated). This causes
performance issues when the number of features is large and can trigger excess
garbage collection when the executor doesn't have much spare memory.
This results in serializing 2 * numFeatures excess values. When multiclass
logistic regression is implemented, the excess will be numFeatures +
numClasses * numFeatures values.
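
The overhead described above can be illustrated with a small stand-alone
sketch. It is not Spark code: Python's pickle stands in for the JVM serializer,
and the class names FatAggregator and LeanAggregator, the helper
serialized_size, and the function excess_values are hypothetical names chosen
for illustration. It assumes 8-byte doubles and compares the serialized size
of an aggregator that keeps references to the two read-only arrays with one
that keeps only the gradient sum.

```python
import pickle
from array import array


class FatAggregator:
    """Hypothetical aggregator that keeps references to the read-only arrays."""

    def __init__(self, coefficients, features_std):
        self.coefficients = coefficients    # read-only, length numFeatures
        self.features_std = features_std    # read-only, length numFeatures
        self.gradient_sum = array("d", [0.0]) * len(coefficients)


class LeanAggregator:
    """Hypothetical aggregator that keeps only the state that gets merged."""

    def __init__(self, num_features):
        self.gradient_sum = array("d", [0.0]) * num_features


def serialized_size(aggregator):
    # Pickle the instance fields; this approximates what a serializer writes
    # when the whole object is shipped back from a task.
    return len(pickle.dumps(vars(aggregator)))


def excess_values(num_features, num_classes=None):
    # Counts taken from the issue text: 2 * numFeatures for the binary case,
    # numFeatures + numClasses * numFeatures for the multiclass case.
    if num_classes is None:
        return 2 * num_features
    return num_features + num_classes * num_features


if __name__ == "__main__":
    n = 100_000
    coefficients = array("d", [0.1]) * n
    features_std = array("d", [1.0]) * n

    fat = serialized_size(FatAggregator(coefficients, features_std))
    lean = serialized_size(LeanAggregator(n))

    print("fat aggregator bytes: ", fat)
    print("lean aggregator bytes:", lean)
    print("measured excess bytes:", fat - lean)
    print("expected excess bytes:", 8 * excess_values(n))
```

The measured excess should be slightly above 8 * 2 * numFeatures bytes, the
small difference being per-array pickle overhead. In the lean variant the
coefficients and standard deviations would have to reach the aggregator some
other way, for example as arguments to its update method; that choice is an
assumption of this sketch, not something the issue prescribes.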