ykerzhner commented on a change in pull request #31657:
URL: https://github.com/apache/spark/pull/31657#discussion_r584895352
##########
File path:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
##########
@@ -842,9 +852,25 @@ class LogisticRegression @Since("1.2.0") (
denseCoefficientMatrix.update(classIndex, featureIndex,
value / featuresStd(featureIndex))
}
+ // It seems that we only adjust coef, intercept is missing?
if (isIntercept) interceptVec.toArray(classIndex) = value
}
+ // adjust intercept
+ if ($(fitIntercept)) {
+ var i = 0
+ while (i < denseCoefficientMatrix.numRows) {
+ var j = 0
+ var adjust = 0.0
+ while (j < numFeatures) {
+ adjust += denseCoefficientMatrix(i, j) * featuresMean(j)
Review comment:
While I understand perfectly well what you are doing here, someone who
hasn't thought about this will not. I would add a comment explaining why
the adjustment equals the dot product of the coefficients and the means.
Something along the lines of: "If one applies the linear transformation
x -> (x - avg(x)) / std(x) to the data and finds a minimum of the
objective function on the transformed data, then moving all of the
averages over to the intercept changes the intercept by the dot product
of the coefficients (in unscaled space) and the averages."
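
For reference, a minimal sketch of that derivation, using generic symbols
b, beta_j, mu_j, sigma_j for the intercept, coefficients, feature means and
feature standard deviations (these are not the variable names in the file):

    m = b + \sum_j \beta_j (x_j - \mu_j) / \sigma_j
      = \big( b - \sum_j (\beta_j / \sigma_j) \mu_j \big)
        + \sum_j (\beta_j / \sigma_j) x_j

Once the coefficients have been rescaled to beta_j / sigma_j (the loop just
above this one), the intercept in the original feature space is the fitted
intercept minus the dot product of the rescaled coefficients and the
feature means, which is exactly what the adjust accumulator computes.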
##########
File path:
mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
##########
@@ -842,9 +852,25 @@ class LogisticRegression @Since("1.2.0") (
denseCoefficientMatrix.update(classIndex, featureIndex,
value / featuresStd(featureIndex))
}
+ // It seems that we only adjust coef, intercept is missing?
Review comment:
Since you added the intercept adjustment below, this comment can probably
be updated.
##########
File path:
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala
##########
@@ -216,28 +217,77 @@ private[ml] class LogisticAggregator(
s"is applied, multinomial loss will produce a result different from
binary loss.")
}
+ // dot of an empty vector (with intercept if any)
+ // With it as an offset, for a sparse vector,
+ // we only need to deal with the non-zero values in prediction.
+ @transient private lazy val binarySumOffset: Double = {
+ val localCoefficients = coefficientsArray
+ val localFeaturesAvg = bcFeaturesAvg.value
+ val localFeaturesStd = bcFeaturesStd.value
+ var sum = if (fitIntercept) localCoefficients(numFeaturesPlusIntercept - 1) else 0.0
+ var i = 0
+ while (i < localFeaturesAvg.length) {
+ if (localFeaturesStd(i) != 0.0) {
+ sum += localCoefficients(i) * (0.0 - localFeaturesAvg(i)) / localFeaturesStd(i)
+ }
+ i += 1
+ }
+ sum
+ }
+
+ // With it as an offset, for a sparse vector,
+ // we only need to deal with the non-zero values in gradient calculation.
+ @transient private lazy val binaryGradOffset: Array[Double] = {
+ bcFeaturesAvg.value.zip(bcFeaturesStd.value).map { case (avg, std) =>
+ if (std != 0) avg / std else 0.0
+ }
+ }
+
/** Update gradient and loss using binary loss function. */
private def binaryUpdateInPlace(features: Vector, weight: Double, label: Double): Unit = {
+ val localFeaturesAvg = bcFeaturesAvg.value
val localFeaturesStd = bcFeaturesStd.value
val localCoefficients = coefficientsArray
val localGradientArray = gradientSumArray
+
+
+ // the following is equivalent to:
+ // val margin = - {
+ // var sum = 0.0
+ // features.foreach { (index, value) =>
+ // if (localFeaturesStd(index) != 0.0) {
+ // sum += localCoefficients(index) *
+ // (value - localFeaturesAvg(index)) / localFeaturesStd(index)
+ // }
Review comment:
I'd add one more line to the comment showing the expansion:
= localCoefficients(index) * value / localFeaturesStd(index)
  - localCoefficients(index) * localFeaturesAvg(index) / localFeaturesStd(index),
and then note that the second terms have already been summed up above into
binarySumOffset.
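
To make that concrete, here is a minimal self-contained sketch of the
offset trick, written in plain Scala rather than against the Spark classes;
the names SparseMarginSketch, precomputeOffset, margin, coef, avg and std
are made up for illustration:

    object SparseMarginSketch {
      // Margin of the all-zeros vector: the intercept plus
      // sum_i coef(i) * (0 - avg(i)) / std(i).
      def precomputeOffset(
          coef: Array[Double],
          avg: Array[Double],
          std: Array[Double],
          intercept: Double): Double = {
        var sum = intercept
        var i = 0
        while (i < avg.length) {
          if (std(i) != 0.0) sum += coef(i) * (0.0 - avg(i)) / std(i)
          i += 1
        }
        sum
      }

      // Margin of a sparse vector given as (indices, values): only the
      // non-zero entries add coef(i) * value / std(i) on top of the
      // precomputed offset, because the "- coef(i) * avg(i) / std(i)"
      // terms for every feature are already folded into the offset.
      def margin(
          indices: Array[Int],
          values: Array[Double],
          coef: Array[Double],
          std: Array[Double],
          offset: Double): Double = {
        var sum = offset
        var k = 0
        while (k < indices.length) {
          val i = indices(k)
          if (std(i) != 0.0) sum += coef(i) * values(k) / std(i)
          k += 1
        }
        sum
      }
    }

That is the point the expanded comment would make explicit: the second
terms of the expansion are accumulated once into binarySumOffset, so the
per-row loop only has to touch the non-zero feature values.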
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]