srowen commented on a change in pull request #25829: [SPARK-29144][ML]
Binarizer handle sparse vectors incorrectly with negative threshold
URL: https://github.com/apache/spark/pull/25829#discussion_r326130567
##########
File path: mllib/src/main/scala/org/apache/spark/ml/feature/Binarizer.scala
##########
@@ -75,30 +75,40 @@ final class Binarizer @Since("1.4.0") (@Since("1.4.0") override val uid: String)
val schema = dataset.schema
val inputType = schema($(inputCol)).dataType
val td = $(threshold)
+ val metadata = outputSchema($(outputCol)).metadata
- val binarizerDouble = udf { in: Double => if (in > td) 1.0 else 0.0 }
- val binarizerVector = udf { (data: Vector) =>
- val indices = ArrayBuilder.make[Int]
- val values = ArrayBuilder.make[Double]
-
- data.foreachActive { (index, value) =>
- if (value > td) {
- indices += index
- values += 1.0
+ val binarizerUDF = inputType match {
+ case DoubleType =>
+ udf { in: Double => if (in > td) 1.0 else 0.0 }
+
+ case _: VectorUDT if td >= 0 =>
+ udf { vector: Vector =>
+ val indices = ArrayBuilder.make[Int]
+ val values = ArrayBuilder.make[Double]
+ vector.foreachActive { (index, value) =>
+ if (value > td) {
+ indices += index
+ values += 1.0
+ }
+ }
+ Vectors.sparse(vector.size, indices.result(), values.result()).compressed
}
- }
- Vectors.sparse(data.size, indices.result(), values.result()).compressed
+ case _: VectorUDT if td < 0 =>
+ this.logWarning(s"Binarization operations on sparse dataset with negative threshold " +
Review comment:
I think this is OK. The result will almost always be dense, but not always.
The warning is spurious if the input is already dense, but a negative threshold
is rare, I think. I'm trying to recall whether this is ever applied to the
outputs of classifiers like SVMs, whose outputs lie in [-1, 1].
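
For context, here is a minimal plain-Scala sketch of why a negative threshold matters for sparse input. It does not use Spark: `SparseVec`, `binarizeActiveOnly`, and `binarizeAll` are hypothetical names made up for illustration, and the second function only approximates what a fix could look like, not what this PR actually does. The point is that implicit zeros satisfy `0.0 > td` when `td < 0`, so visiting only the stored entries leaves them at 0.0 when they should become 1.0.

```scala
// Hypothetical stand-in for a sparse vector; not Spark's SparseVector.
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double]) {
  // Expand to a dense array; positions without a stored entry are 0.0.
  def toDense: Array[Double] = {
    val out = new Array[Double](size)
    var i = 0
    while (i < indices.length) { out(indices(i)) = values(i); i += 1 }
    out
  }
}

object BinarizerSketch {
  // Mirrors the removed code path: only stored (active) entries are compared
  // with the threshold, so implicit zeros always stay 0.0.
  def binarizeActiveOnly(v: SparseVec, td: Double): Array[Double] = {
    val out = new Array[Double](v.size)
    var i = 0
    while (i < v.indices.length) {
      if (v.values(i) > td) out(v.indices(i)) = 1.0
      i += 1
    }
    out
  }

  // Compares every position, including implicit zeros (assumed fix, for illustration).
  def binarizeAll(v: SparseVec, td: Double): Array[Double] =
    v.toDense.map(x => if (x > td) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    // Dense form of v: [0.0, -2.0, 0.0, 0.5, 0.0]
    val v = SparseVec(5, Array(1, 3), Array(-2.0, 0.5))
    println(binarizeActiveOnly(v, -1.0).mkString("[", ", ", "]")) // [0.0, 0.0, 0.0, 1.0, 0.0]
    println(binarizeAll(v, -1.0).mkString("[", ", ", "]"))        // [1.0, 0.0, 1.0, 1.0, 1.0]
  }
}
```

With `td = -1.0`, four of the five outputs are 1.0, which is why the result is usually dense. It is not always dense, though: if most stored values fall at or below the threshold and there are few implicit zeros, the output can still be mostly zeros.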