spark git commit: [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference

yliang Tue, 07 Jun 2016 15:26:21 -0700

Repository: spark
Updated Branches:
  refs/heads/master 890baaca5 -> 6ecedf39b



[SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression 
behavior difference

## What changes were proposed in this pull request?
When fitting ```LinearRegressionModel``` (with the "l-bfgs" solver) and 
```LogisticRegressionModel``` without an intercept on a dataset with a constant 
nonzero column, spark.ml produces the same model as R glmnet but a different 
model from LIBSVM.

When fitting ```AFTSurvivalRegressionModel``` without an intercept on a dataset 
with a constant nonzero column, spark.ml produces a different model from R 
survival::survreg.

In both cases we should log a warning message and document the behavior.
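
For illustration, the condition that triggers the warning can be sketched without Spark. The object and method names below (`ConstantColumnCheck`, `columnStats`, `constantNonzeroColumns`) are hypothetical and not part of this patch; the patch itself reads the mean and variance from the existing feature summarizer. The sketch assumes a small in-memory dataset and uses the same exact floating-point comparison as the patch (`std == 0.0 && mean != 0.0`).

```scala
// Hypothetical standalone helper, not part of the patch.
object ConstantColumnCheck {

  /** Per-column mean and population standard deviation of a row-major dataset. */
  def columnStats(rows: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.size.toDouble
    val numFeatures = rows.head.length
    val mean = Array.tabulate(numFeatures)(j => rows.map(_(j)).sum / n)
    val std = Array.tabulate(numFeatures) { j =>
      math.sqrt(rows.map(r => math.pow(r(j) - mean(j), 2)).sum / n)
    }
    (mean, std)
  }

  /** Indices of columns that are constant (std == 0) and nonzero (mean != 0). */
  def constantNonzeroColumns(rows: Seq[Array[Double]]): Seq[Int] = {
    val (mean, std) = columnStats(rows)
    mean.indices.filter(i => std(i) == 0.0 && mean(i) != 0.0)
  }

  def main(args: Array[String]): Unit = {
    // Column 0 varies, column 1 is constant and nonzero, column 2 is constant zero.
    val rows = Seq(
      Array(1.0, 2.0, 0.0),
      Array(3.0, 2.0, 0.0),
      Array(5.0, 2.0, 0.0))
    println(constantNonzeroColumns(rows).mkString(","))
  }
}
```

Only column 1 meets the condition here: an all-zero column does not trigger the warning, and neither does any column when an intercept is fitted, since the patch also checks `fitIntercept`.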

## How was this patch tested?
Documentation change; no unit tests.

cc mengxr

Author: Yanbo Liang <yblia...@gmail.com>

Closes #12731 from yanboliang/spark-13590.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/6ecedf39
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/6ecedf39
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/6ecedf39

Branch: refs/heads/master
Commit: 6ecedf39b44c9acd58cdddf1a31cf11e8e24428c
Parents: 890baac
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Tue Jun 7 15:25:36 2016 -0700
Committer: Yanbo Liang <yblia...@gmail.com>
Committed: Tue Jun 7 15:25:36 2016 -0700

----------------------------------------------------------------------
 docs/ml-classification-regression.md                        | 6 ++++++
 .../apache/spark/ml/classification/LogisticRegression.scala | 7 +++++++
 .../apache/spark/ml/regression/AFTSurvivalRegression.scala  | 9 ++++++++-
 .../org/apache/spark/ml/regression/LinearRegression.scala   | 7 +++++++
 4 files changed, 28 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/docs/ml-classification-regression.md
----------------------------------------------------------------------
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index ff8dec6..88457d4 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -62,6 +62,8 @@ For more background and more details about the implementation, refer to the docu
 
   > The current implementation of logistic regression in `spark.ml` only supports binary classes. Support for multiclass regression will be added in the future.
 
+  > When fitting LogisticRegressionModel without intercept on dataset with constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.
+
 **Example**
 
 The following example shows how to train a logistic regression model
@@ -351,6 +353,8 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.classificat
 The interface for working with linear regression models and model
 summaries is similar to the logistic regression case.
 
+  > When fitting LinearRegressionModel without intercept on dataset with constant nonzero column by "l-bfgs" solver, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.
+
 **Example**
 
 The following
@@ -666,6 +670,8 @@ The optimization algorithm underlying the implementation is L-BFGS.
 The implementation matches the result from R's survival function
 [survreg](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html)
 
+  > When fitting AFTSurvivalRegressionModel without intercept on dataset with constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior is different from R survival::survreg.
+
 **Example**
 
 <div class="codetabs">

http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
index 1ea4d90..51ede15 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
@@ -333,6 +333,13 @@ class LogisticRegression @Since("1.2.0") (
         val featuresMean = summarizer.mean.toArray
         val featuresStd = summarizer.variance.toArray.map(math.sqrt)
 
+        if (!$(fitIntercept) && (0 until numFeatures).exists { i =>
+          featuresStd(i) == 0.0 && featuresMean(i) != 0.0 }) {
+          logWarning("Fitting LogisticRegressionModel without intercept on dataset with " +
+            "constant nonzero column, Spark MLlib outputs zero coefficients for constant " +
+            "nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.")
+        }
+
         val regParamL1 = $(elasticNetParam) * $(regParam)
         val regParamL2 = (1.0 - $(elasticNetParam)) * $(regParam)
 

http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
index c440073..e5f23f4 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
@@ -209,11 +209,18 @@ class AFTSurvivalRegression @Since("1.6.0") (@Since("1.6.0") override val uid: S
     }
 
     val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
+    val numFeatures = featuresStd.size
+
+    if (!$(fitIntercept) && (0 until numFeatures).exists { i =>
+        featuresStd(i) == 0.0 && featuresSummarizer.mean(i) != 0.0 }) {
+      logWarning("Fitting AFTSurvivalRegressionModel without intercept on dataset with " +
+        "constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero " +
+        "columns. This behavior is different from R survival::survreg.")
+    }
 
     val costFun = new AFTCostFun(instances, $(fitIntercept), featuresStd)
     val optimizer = new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol))
 
-    val numFeatures = featuresStd.size
     /*
        The parameters vector has three parts:
        the first element: Double, log(sigma), the log of scale parameter

http://git-wip-us.apache.org/repos/asf/spark/blob/6ecedf39/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
index 6be2584..52ec40e 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
@@ -267,6 +267,13 @@ class LinearRegression @Since("1.3.0") (@Since("1.3.0") override val uid: String
     val featuresMean = featuresSummarizer.mean.toArray
     val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
 
+    if (!$(fitIntercept) && (0 until numFeatures).exists { i =>
+      featuresStd(i) == 0.0 && featuresMean(i) != 0.0 }) {
+      logWarning("Fitting LinearRegressionModel without intercept on dataset with " +
+        "constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero " +
+        "columns. This behavior is the same as R glmnet but different from LIBSVM.")
+    }
+
     // Since we implicitly do the feature scaling when we compute the cost function
     // to improve the convergence, the effective regParam will be changed.
     val effectiveRegParam = $(regParam) / yStd

