[ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
----------------------------------
    Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522, the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than by disabling the feature scaling. Mathematically, changing the 
effective regularization strength and disabling feature scaling should give 
the same solution, but they can have very different convergence properties.
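
To make this concrete, here is a sketch of the algebra implied above (illustration only, not the actual Spark source): if feature i is divided by its standard deviation sigma_i internally, a coefficient w_i on the original scale corresponds to w_i * sigma_i on the scaled feature, so the original-scale penalty regParam * w_i^2 becomes (regParam / sigma_i^2) * (w_i * sigma_i)^2. In other words, feature i ends up with an effective regularization strength of regParam / sigma_i^2 in the scaled problem.
{code:java}
// Illustration only (hypothetical helper, not part of Spark ML): the effective
// per-feature regularization strength in the internally scaled problem when the
// user turns standardization off.
def effectiveRegParam(regParam: Double, sigma: Double): Double =
  regParam / (sigma * sigma)
{code}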

The usual justification given for scaling features is that it ensures that all 
covariances are O(1), which should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues when standardization is set to true, since all features then have an O(1) 
regularization strength. But it does cause issues when standardization is set 
to false, since the effective regularization strength of feature i becomes 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This means 
that predictors with small standard deviations get very large effective 
regularization strengths, which lead to very large gradients and thus 
poor convergence in the solver.
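
For a rough sense of scale (illustrative numbers only, not Spark output): with regParam = 1.0, a feature with standard deviation 1.0 keeps an effective strength of 1.0, while a feature with standard deviation 0.01 is effectively regularized with strength 10,000.
{code:java}
// Using the sketch above: four orders of magnitude difference in effective
// regularization strength between a unit-scale and a small-scale feature.
effectiveRegParam(1.0, 1.0)  // = 1.0
effectiveRegParam(1.0, 0.01) // = 10000.0
{code}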

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
// Assumes a SparkSession named `spark` is in scope (as in spark-shell / Databricks).
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{Interaction, OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import spark.implicits._

// Two rows for category "1" and one row for category "2" with a very small numeric value.
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0)).toDF("category", "numericFeature", "label")

// Encode the categorical feature and build its interaction with the numeric feature.
val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction")
val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features")

// L2-regularized linear regression fitted with l-bfgs and standardization turned off.
val model = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setStandardization(false)
  .setSolver("l-bfgs")
  .setRegParam(1.0)
  .setMaxIter(100)
val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model))

val pipelineModel = pipeline.fit(df)

// Hits the maxIter limit (100) instead of converging.
val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code}
*Possible fix:*

These convergence issues can be fixed by actually turning off feature scaling when 
standardization is set to false, rather than by adjusting the effective 
regularization strength. This can be hacked into LinearRegression.scala by simply 
replacing line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd =
  if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt)
  else featuresSummarizer.variance.toArray.map(x => 1.0)
{code}
With featuresStd set to all ones, the internal feature scaling becomes a no-op and 
the regularization is applied at its nominal strength to every coefficient. 
Rerunning the above test code with that hack in place leads to convergence 
after just 4 iterations instead of hitting the maximum iteration limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production-scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.
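
Since no error or warning is raised in this situation, one workaround is to compare the iteration count reported by the training summary against the configured limit after fitting. Here is a minimal sketch using the public LinearRegressionModel API, assuming pipelineModel is the fitted pipeline from the example above:
{code:java}
// Minimal non-convergence check: the solver stops silently at maxIter, so an
// iteration count equal to the limit suggests the coefficients never converged.
val lrModel = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel]
if (lrModel.summary.totalIterations >= lrModel.getMaxIter) {
  println(s"Warning: l-bfgs used all ${lrModel.getMaxIter} iterations; " +
    "the fit may not have converged.")
}
{code}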

 


> Slow/failed convergence in Spark ML models due to internal predictor scaling
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-25544
>                 URL: https://issues.apache.org/jira/browse/SPARK-25544
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.2
>         Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
>            Reporter: Andrew Crosby
>            Priority: Major


