[jira] [Created] (SPARK-26787) Fix standardization error message in WeightedLeastSquares

Brian Scannell (JIRA) Wed, 30 Jan 2019 11:17:51 -0800

Brian Scannell created SPARK-26787:
--------------------------------------

             Summary: Fix standardization error message in WeightedLeastSquares
                 Key: SPARK-26787
                 URL: https://issues.apache.org/jira/browse/SPARK-26787
             Project: Spark
          Issue Type: Documentation
          Components: MLlib
    Affects Versions: 2.4.0, 2.3.1, 2.3.0
         Environment: Tested in Spark 2.4.0 on Databricks Runtime 5.1 ML 
Beta. The following Python code reproduces the error. 
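{code:python}
import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# On Databricks `spark` already exists; getOrCreate() returns that session.
spark = SparkSession.builder.getOrCreate()

# The label is constant, so its standard deviation is zero.
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6], 'label': [1, 1, 1]})
spark_df = spark.createDataFrame(df)

vectorAssembler = VectorAssembler(inputCols=['foo', 'bar'], outputCol='features')
train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])

# Regularized fit without an intercept and with standardization disabled.
lr = LinearRegression(featuresCol='features', labelCol='label',
                      fitIntercept=False, standardization=False, regParam=1e-4)

# Fails with the misleading "standardization=true" message.
lr_model = lr.fit(train_sdf)
{code}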
 

For context, someone might want to do this when fitting a model that estimates 
the components of a fixed total. The label says the total is always 100%, while 
the components vary. For example, estimating the unknown weights of substances 
present in different quantities in a series of full bins. 

            Reporter: Brian Scannell


An error message in WeightedLeastSquares.scala is incorrect and therefore 
unhelpful for diagnosing the problem. It arises when fitting a regularized 
LinearRegression to a constant label. Even when the parameter 
standardization=False, the message falsely states that standardization was 
set to True:

{{The standard deviation of the label is zero. Model cannot be regularized with 
standardization=true}}

This happens because, under the hood, LinearRegression always sets the parameter 
standardizeLabel=True. That value was chosen for consistency with GLMNet. 
WeightedLeastSquares itself is written to work with standardizeLabel set either 
way, but the public LinearRegression API does not expose the option.
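
To make the mismatch concrete, here is a minimal Python sketch of the condition described above. It is an illustration only, not the Scala source: the function names label_std and check_label_scaling are hypothetical, the sample standard deviation is an assumption, and the handling of fitIntercept is left out. The point is that the check depends on standardizeLabel and regParam, while the user-facing standardization flag is never consulted, even though the message names it.

```python
import math


def label_std(labels):
    """Sample standard deviation (n - 1 denominator) of the labels."""
    n = len(labels)
    mean = sum(labels) / n
    return math.sqrt(sum((y - mean) ** 2 for y in labels) / (n - 1))


def check_label_scaling(labels, reg_param, standardize_label):
    """Hypothetical stand-in for the validation described in this issue.

    Raises ValueError when a regularized fit would have to scale the label
    by a zero standard deviation. Note that the user's `standardization`
    setting is not an input here at all.
    """
    if label_std(labels) == 0.0 and reg_param > 0.0 and standardize_label:
        raise ValueError(
            "The standard deviation of the label is zero. "
            "Model cannot be regularized with standardization=true"
        )


# Same setup as the reproduction: constant label, regParam=1e-4.
# LinearRegression is described as always passing standardizeLabel=True.
try:
    check_label_scaling([1, 1, 1], reg_param=1e-4, standardize_label=True)
except ValueError as err:
    print(err)
```

With standardize_label=False or reg_param=0.0 the same call returns without raising, which is why a message that names standardizeLabel (or the label scaling) would point users at the real cause.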

 

I will submit a pull request with my suggested wording.

 

Relevant:

[https://github.com/apache/spark/pull/10702]

[https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

