[
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782841#comment-16782841
]
Sean Owen commented on SPARK-25544:
-----------------------------------
I think this is a reasonable change -- you can test it in a PR if you're up for
it
> Slow/failed convergence in Spark ML models due to internal predictor scaling
> ----------------------------------------------------------------------------
>
> Key: SPARK-25544
> URL: https://issues.apache.org/jira/browse/SPARK-25544
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.2
> Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
> Reporter: Andrew Crosby
> Priority: Major
>
> The LinearRegression and LogisticRegression estimators in Spark ML can take a
> large number of iterations to converge, or fail to converge altogether, when
> trained using the l-bfgs method with standardization turned off.
> *Details:*
> LinearRegression and LogisticRegression standardize their input features by
> default. In SPARK-8522 the option to disable standardization was added. This
> is implemented internally by changing the effective strength of
> regularization rather than disabling the feature scaling. Mathematically,
> both changing the effective regularizaiton strength, and disabling feature
> scaling should give the same solution, but they can have very different
> convergence properties.
> The normal justication given for scaling features is that it ensures that all
> covariances are O(1) and should improve numerical convergence, but this
> argument does not account for the regularization term. This doesn't cause any
> issues if standardization is set to true, since all features will have an
> O(1) regularization strength. But it does cause issues when standardization
> is set to false, since the effecive regularization strength of feature i is
> now O(1/ sigma_i^2) where sigma_i is the standard deviation of the feature.
> This means that predictors with small standard deviations (which can occur
> legitimately e.g. via one hot encoding) will have very large effective
> regularization strengths and consequently lead to very large gradients and
> thus poor convergence in the solver.
> *Example code to recreate:*
> To demonstrate just how bad these convergence issues can be, here is a very
> simple test case which builds a linear regression model with a categorical
> feature, a numerical feature and their interaction. When fed the specified
> training data, this model will fail to converge before it hits the maximum
> iteration limit. In this case, it is the interaction between category "2" and
> the numeric feature that leads to a feature with a small standard deviation.
> Training data:
> ||category||numericFeature||label||
> |1|1.0|0.5|
> |1|0.5|1.0|
> |2|0.01|2.0|
>
> {code:java}
> val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2,
> 2.0)).toDF("category", "numericFeature", "label")
> val indexer = new StringIndexer().setInputCol("category")
> .setOutputCol("categoryIndex")
> val encoder = new
> OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
> val interaction = new Interaction().setInputCols(Array("categoryEncoded",
> "numericFeature")).setOutputCol("interaction")
> val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded",
> "interaction")).setOutputCol("features")
> val model = new
> LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
> val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction,
> assembler, model))
> val pipelineModel = pipeline.fit(df)
> val numIterations =
> pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
> *Possible fix:*
> These convergence issues can be fixed by turning off feature scaling when
> standardization is set to false rather than using an effective regularization
> strength. This can be hacked into LinearRegression.scala by simply replacing
> line 423
> {code:java}
> val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
> {code}
> with
> {code:java}
> val featuresStd = if ($(standardization))
> featuresSummarizer.variance.toArray.map(math.sqrt) else
> featuresSummarizer.variance.toArray.map(x => 1.0)
> {code}
> Rerunning the above test code with that hack in place, will lead to
> convergence after just 4 iterations instead of hitting the max iterations
> limit!
> *Impact:*
> I can't speak for other people, but I've personally encountered these
> convergence issues several times when building production scale Spark ML
> models, and have resorted to writing my only implementation of
> LinearRegression with the above hack in place. The issue is made worse by the
> fact that Spark does not raise an error when the maximum number of iterations
> is hit, so the first time you encounter the issue it can take a while to
> figure out what is going on.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]