Brian Scannell created SPARK-26787:
--------------------------------------

             Summary: Fix standardization error message in WeightedLeastSquares
                 Key: SPARK-26787
                 URL: https://issues.apache.org/jira/browse/SPARK-26787
             Project: Spark
          Issue Type: Documentation
          Components: MLlib
    Affects Versions: 2.4.0, 2.3.1, 2.3.0
         Environment: Tested in Spark 2.4.0 on Databricks (5.1 ML Beta).

The following Python code will replicate the error.
{code:python}
import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# `spark` is the SparkSession provided by the notebook environment.
# The label is constant, so its standard deviation is zero.
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6], 'label': [1, 1, 1]})
spark_df = spark.createDataFrame(df)

vectorAssembler = VectorAssembler(inputCols=['foo', 'bar'], outputCol='features')
train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])

# Regularized fit with standardization explicitly disabled.
lr = LinearRegression(featuresCol='features', labelCol='label',
                      fitIntercept=False, standardization=False, regParam=1e-4)
lr_model = lr.fit(train_sdf)
{code}
For context, someone might want to do this when fitting a model to estimate the components of a fixed total. The label indicates that the total is always 100%, but the components vary. For example, estimating the unknown weights of different quantities of substances in a series of full bins.
            Reporter: Brian Scannell


There is an error message in WeightedLeastSquares.scala that is incorrect and therefore not very helpful for diagnosing the issue.

The problem arises when doing regularized LinearRegression on a constant label. Even when the parameter standardization=False, the error falsely states that standardization was set to True:

{{The standard deviation of the label is zero. Model cannot be regularized with standardization=true}}

This is because, under the hood, LinearRegression automatically sets a parameter standardizeLabel=True. This was chosen for consistency with GLMNet. WeightedLeastSquares itself is written to work with standardizeLabel set either way, but the public LinearRegression API does not expose that choice.

I will submit a pull request with my suggested wording.

Relevant:

[https://github.com/apache/spark/pull/10702]

[https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5]
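
To make the mismatch described in this report concrete, here is a minimal pure-Python sketch of the behaviour. It is not Spark's actual code: the function names check_label and fit_linear_regression are hypothetical, the population standard deviation stands in for whatever statistic WeightedLeastSquares computes, and any other conditions Spark applies (such as the fitIntercept setting) are ignored. The only things taken from the report are the message text and the fact that the label is always standardized regardless of the user-facing standardization flag.

```python
from statistics import pstdev

# Message text quoted in the report.
ERROR_MESSAGE = ("The standard deviation of the label is zero. "
                 "Model cannot be regularized with standardization=true")


def check_label(labels, reg_param, standardize_label):
    """Hypothetical stand-in for the check described in the report."""
    if pstdev(labels) == 0.0 and reg_param > 0.0 and standardize_label:
        raise ValueError(ERROR_MESSAGE)


def fit_linear_regression(labels, reg_param, standardization):
    """Hypothetical wrapper mimicking the public API described in the report.

    `standardization` is the user-facing flag. It plays no part in the
    check, because the label is always standardized internally.
    """
    check_label(labels, reg_param, standardize_label=True)
    return "ok"


if __name__ == "__main__":
    try:
        fit_linear_regression([1, 1, 1], reg_param=1e-4, standardization=False)
    except ValueError as err:
        # The message blames standardization=true even though the caller
        # passed standardization=False.
        print(err)
```

In this sketch the error goes away when the label varies or when reg_param is 0, while toggling standardization changes nothing, which is why the message wording is misleading.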