Repository: spark Updated Branches: refs/heads/master 890baaca5 -> 6ecedf39b
[SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference ## What changes were proposed in this pull request? When fitting ```LinearRegressionModel```(by "l-bfgs" solver) and ```LogisticRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce same model as R glmnet but different from LIBSVM. When fitting ```AFTSurvivalRegressionModel``` w/o intercept on dataset with constant nonzero column, spark.ml produce different model compared with R survival::survreg. We should output a warning message and clarify in document for this condition. ## How was this patch tested? Document change, no unit test. cc mengxr Author: Yanbo Liang <yblia...@gmail.com> Closes #12731 from yanboliang/spark-13590. Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ecedf39 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ecedf39 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ecedf39 Branch: refs/heads/master Commit: 6ecedf39b44c9acd58cdddf1a31cf11e8e24428c Parents: 890baac Author: Yanbo Liang <yblia...@gmail.com> Authored: Tue Jun 7 15:25:36 2016 -0700 Committer: Yanbo Liang <yblia...@gmail.com> Committed: Tue Jun 7 15:25:36 2016 -0700 ---------------------------------------------------------------------- docs/ml-classification-regression.md | 6 ++++++ .../apache/spark/ml/classification/LogisticRegression.scala | 7 +++++++ .../apache/spark/ml/regression/AFTSurvivalRegression.scala | 9 ++++++++- .../org/apache/spark/ml/regression/LinearRegression.scala | 7 +++++++ 4 files changed, 28 insertions(+), 1 deletion(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/docs/ml-classification-regression.md ---------------------------------------------------------------------- diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md index ff8dec6..88457d4 100644 --- a/docs/ml-classification-regression.md +++ b/docs/ml-classification-regression.md @@ -62,6 +62,8 @@ For more background and more details about the implementation, refer to the docu > The current implementation of logistic regression in `spark.ml` only supports binary classes. Support for multiclass regression will be added in the future. + > When fitting LogisticRegressionModel without intercept on dataset with constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM. + **Example** The following example shows how to train a logistic regression model @@ -351,6 +353,8 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat The interface for working with linear regression models and model summaries is similar to the logistic regression case. + > When fitting LinearRegressionModel without intercept on dataset with constant nonzero column by "l-bfgs" solver, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM. + **Example** The following @@ -666,6 +670,8 @@ The optimization algorithm underlying the implementation is L-BFGS. The implementation matches the result from R's survival function [survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html) + > When fitting AFTSurvivalRegressionModel without intercept on dataset with constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior is different from R survival::survreg. + **Example** <div class="codetabs"> http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala index 1ea4d90..51ede15 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala @@ -333,6 +333,13 @@ class LogisticRegression @Since("1.2.0") ( val featuresMean = summarizer.mean.toArray val featuresStd = summarizer.variance.toArray.map(math.sqrt) + if (!$(fitIntercept) && (0 until numFeatures).exists { i => + featuresStd(i) == 0.0 && featuresMean(i) != 0.0 }) { + logWarning("Fitting LogisticRegressionModel without intercept on dataset with " + + "constant nonzero column, Spark MLlib outputs zero coefficients for constant " + + "nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.") + } + val regParamL1 = $(elasticNetParam) * $(regParam) val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam) http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala index c440073..e5f23f4 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala @@ -209,11 +209,18 @@ class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: S } val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt) + val numFeatures = featuresStd.size + + if (!$(fitIntercept) && (0 until numFeatures).exists { i => + featuresStd(i) == 0.0 && featuresSummarizer.mean(i) != 0.0 }) { + logWarning("Fitting AFTSurvivalRegressionModel without intercept on dataset with " + + "constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero " + + "columns. This behavior is different from R survival::survreg.") + } val costFun = new AFTCostFun(instances, $(fitIntercept), featuresStd) val optimizer = new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) - val numFeatures = featuresStd.size /* The parameters vector has three parts: the first element: Double, log(sigma), the log of scale parameter http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala index 6be2584..52ec40e 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala @@ -267,6 +267,13 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String val featuresMean = featuresSummarizer.mean.toArray val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt) + if (!$(fitIntercept) && (0 until numFeatures).exists { i => + featuresStd(i) == 0.0 && featuresMean(i) != 0.0 }) { + logWarning("Fitting LinearRegressionModel without intercept on dataset with " + + "constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero " + + "columns. This behavior is the same as R glmnet but different from LIBSVM.") + } + // Since we implicitly do the feature scaling when we compute the cost function // to improve the convergence, the effective regParam will be changed. val effectiveRegParam = $(regParam) / yStd --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org