zhengruifeng commented on a change in pull request #31693:
URL: https://github.com/apache/spark/pull/31693#discussion_r584592008



##########
File path: 
mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala
##########
@@ -606,3 +606,149 @@ private[ml] class BlockLogisticAggregator(
     }
   }
 }
+
+
+/**
+ * BlockBinaryLogisticAggregator computes the gradient and loss used in logistic classification
+ * for blocks of instances stored in a sparse or dense matrix, in an online fashion.
+ *
+ * Two BlockBinaryLogisticAggregators can be merged together to obtain a summary of the loss
+ * and gradient of the corresponding joint dataset.
+ *
+ * NOTE: The feature values are expected to have already been scaled (divided by
+ * [[bcFeaturesStd]], NOT centered) before computation.
+ *
+ * @param bcCoefficients The coefficients corresponding to the features.
+ * @param fitIntercept Whether to fit an intercept term.
+ * @param fitWithMean Whether to center the data with mean before training. If true, we MUST
+ *                    adjust the intercept of both the initial coefficients and the final
+ *                    solution in the caller.
+ */
+private[ml] class BlockBinaryLogisticAggregator(
+    bcFeaturesStd: Broadcast[Array[Double]],
+    bcFeaturesMean: Broadcast[Array[Double]],
+    fitIntercept: Boolean,
+    fitWithMean: Boolean)(bcCoefficients: Broadcast[Vector])
+  extends DifferentiableLossAggregator[InstanceBlock, BlockBinaryLogisticAggregator]
+  with Logging {
+
+  if (fitWithMean) {
+    require(fitIntercept, "cannot center the vectors when training without an intercept")
+  }
+
+  private val numFeatures = bcFeaturesStd.value.length
+  protected override val dim: Int = bcCoefficients.value.size
+
+  @transient private lazy val coefficientsArray = bcCoefficients.value match {
+    case DenseVector(values) => values
+    case _ => throw new IllegalArgumentException("coefficients only supports dense vectors " +
+      s"but got type ${bcCoefficients.value.getClass}.")
+  }
+
+  @transient private lazy val linear = if (fitIntercept) {
+    new DenseVector(coefficientsArray.take(numFeatures))
+  } else {
+    new DenseVector(coefficientsArray)
+  }
+
+  @transient private lazy val scaledMean = if (fitWithMean) {

Review comment:
       An advantage over applying `StandardScaler` before LoR: with two pre-computed variables (`scaledMean` and `emptyPrediction`), we never need to transform a sparse dataset into a dense one, which `StandardScaler` would require in order to center the data.
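To illustrate the point, here is a minimal standalone sketch (not the PR's actual code; the names `marginDense` and `marginSparse` are hypothetical). Since `((x - mu) / std) dot beta + b == (x / std) dot beta + (b - (mu / std) dot beta)`, the centering term can be folded into the intercept once per model, so sparse instances never have to be densified:

```scala
// Hypothetical sketch of the identity behind `scaledMean`:
// centering can be absorbed into an adjusted intercept.
object CenteringSketch {
  // Naive centered-and-scaled margin; requires a dense x.
  def marginDense(x: Array[Double], mu: Array[Double], std: Array[Double],
                  beta: Array[Double], intercept: Double): Double = {
    var s = intercept
    var j = 0
    while (j < x.length) { s += (x(j) - mu(j)) / std(j) * beta(j); j += 1 }
    s
  }

  // Equivalent margin over a sparse x: only nonzero entries are touched,
  // because the constant term `(mu / std) dot beta` is pre-subtracted
  // from the intercept (this plays the role of `scaledMean dot linear`).
  def marginSparse(indices: Array[Int], values: Array[Double],
                   std: Array[Double], beta: Array[Double],
                   adjustedIntercept: Double): Double = {
    var s = adjustedIntercept
    var k = 0
    while (k < indices.length) {
      val j = indices(k)
      s += values(k) / std(j) * beta(j)
      k += 1
    }
    s
  }
}
```

Both paths yield the same margin, but the sparse path never materializes the centered (dense) feature vector.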




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


