srowen commented on a change in pull request #25829: [SPARK-29144][ML] Binarizer handle sparse vectors incorrectly with negative threshold
URL: https://github.com/apache/spark/pull/25829#discussion_r326130567
 
 

 ##########
 File path: mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala
 ##########
 @@ -75,30 +75,40 @@ final class Binarizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)
     val schema = dataset.schema
     val inputType = schema($(inputCol)).dataType
     val td = $(threshold)
+    val metadata = outputSchema($(outputCol)).metadata
 
-    val binarizerDouble = udf { in: Double => if (in > td) 1.0 else 0.0 }
-    val binarizerVector = udf { (data: Vector) =>
-      val indices = ArrayBuilder.make[Int]
-      val values = ArrayBuilder.make[Double]
-
-      data.foreachActive { (index, value) =>
-        if (value > td) {
-          indices += index
-          values +=  1.0
+    val binarizerUDF = inputType match {
+      case DoubleType =>
+        udf { in: Double => if (in > td) 1.0 else 0.0 }
+
+      case _: VectorUDT if td >= 0 =>
+        udf { vector: Vector =>
+          val indices = ArrayBuilder.make[Int]
+          val values = ArrayBuilder.make[Double]
+          vector.foreachActive { (index, value) =>
+            if (value > td) {
+              indices += index
+              values +=  1.0
+            }
+          }
+          Vectors.sparse(vector.size, indices.result(), values.result()).compressed
         }
-      }
 
-      Vectors.sparse(data.size, indices.result(), values.result()).compressed
+      case _: VectorUDT if td < 0 =>
+        this.logWarning(s"Binarization operations on sparse dataset with negative threshold " +
 
 Review comment:
   I think this is OK. With a negative threshold the result will almost always
be dense, though not always. The warning is spurious if the input is already
dense, but a negative threshold is rare, I think. I'm trying to recall whether
this is ever applied to the outputs of classifiers like SVMs, which output
values in [-1, 1].
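
To illustrate why a negative threshold tends to produce dense output, here is a minimal standalone sketch in plain Scala with no Spark dependency. `SparseVec`, `activeOnly`, and `allEntries` are hypothetical names made up for this illustration; they are not Spark APIs and are not the code in this PR. The sketch assumes that only explicitly stored entries are visited in the old code path, as the removed `foreachActive` loop in the diff suggests.

```scala
// Hypothetical stand-in for a sparse vector: only the listed indices are
// stored; every other position is an implicit 0.0.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

object BinarizeSketch {
  // Compares only the stored entries against the threshold, so implicit
  // zeros always map to 0.0. This is wrong when td < 0, because 0.0 > td.
  def activeOnly(v: SparseVec, td: Double): Array[Double] = {
    val out = Array.fill(v.size)(0.0)
    v.indices.zip(v.values).foreach { case (i, x) =>
      if (x > td) out(i) = 1.0
    }
    out
  }

  // Compares every position, including implicit zeros. With td < 0 the
  // implicit zeros become 1.0, so the result is usually dense.
  def allEntries(v: SparseVec, td: Double): Array[Double] = {
    val implicitZeroResult = if (0.0 > td) 1.0 else 0.0
    val out = Array.fill(v.size)(implicitZeroResult)
    v.indices.zip(v.values).foreach { case (i, x) =>
      out(i) = if (x > td) 1.0 else 0.0
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val v = SparseVec(5, Array(1, 3), Array(-2.0, 0.7))
    println(activeOnly(v, -0.5).mkString("[", ", ", "]")) // [0.0, 0.0, 0.0, 1.0, 0.0]
    println(allEntries(v, -0.5).mkString("[", ", ", "]")) // [1.0, 0.0, 1.0, 1.0, 1.0]
  }
}
```

In this example four of the five outputs are 1.0 once implicit zeros are taken into account, which is why the result would usually be better stored densely. The "not always" case arises when most stored values fall at or below the negative threshold, or when the input has few implicit zeros to begin with.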
