Github user jkbradley commented on a diff in the pull request:
https://github.com/apache/spark/pull/20829#discussion_r175913440
--- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/VectorAssemblerSuite.scala ---
@@ -147,4 +149,72 @@ class VectorAssemblerSuite
.filter(vectorUDF($"features") > 1)
.count() == 1)
}
+
+ test("assemble should keep nulls") {
+ import org.apache.spark.ml.feature.VectorAssembler.assemble
+    assert(assemble(Seq(1, 1), true)(1.0, null) === Vectors.dense(1.0, Double.NaN))
+    assert(assemble(Seq(1, 2), true)(1.0, null) === Vectors.dense(1.0, Double.NaN, Double.NaN))
+    assert(assemble(Seq(1), true)(null) === Vectors.dense(Double.NaN))
+    assert(assemble(Seq(2), true)(null) === Vectors.dense(Double.NaN, Double.NaN))
+ }
+
+ test("assemble should throw errors") {
+ import org.apache.spark.ml.feature.VectorAssembler.assemble
+    intercept[SparkException](assemble(Seq(1, 1), false)(1.0, null) ===
+      Vectors.dense(1.0, Double.NaN))
+    intercept[SparkException](assemble(Seq(1, 2), false)(1.0, null) ===
+      Vectors.dense(1.0, Double.NaN, Double.NaN))
+    intercept[SparkException](assemble(Seq(1), false)(null) === Vectors.dense(Double.NaN))
+    intercept[SparkException](assemble(Seq(2), false)(null) ===
+      Vectors.dense(Double.NaN, Double.NaN))
+ }
+
+ test("get lengths function") {
--- End diff ---
It's great that you're testing this carefully, but I recommend we make sure to
surface better exceptions to users. E.g., they won't know what to do with a
NullPointerException, so we could instead tell them something like: "Column x
in the first row of the dataset has a null entry, but VectorAssembler expected
a non-null entry. This can be fixed by explicitly specifying the expected size
using VectorSizeHint."
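
For illustration, building such a message could look roughly like the sketch
below. The object and method names here are hypothetical, purely to show the
shape of the message; this is not Spark's actual internal API:

```scala
// Hypothetical helper sketching a user-facing error message for null entries,
// instead of letting a bare NullPointerException escape to the user.
object AssemblerErrors {
  // `columnName` is whatever input column triggered the failure; the exact
  // wording below is a suggestion, not what VectorAssembler currently emits.
  def nullEntryMessage(columnName: String): String =
    s"""Column $columnName in the first row of the dataset has a null entry, """ +
      "but VectorAssembler expected a non-null entry. This can be fixed by " +
      "explicitly specifying the expected size using VectorSizeHint, or by " +
      """setting handleInvalid to "skip" or "keep"."""
}
```

The point is just that the thrown SparkException should carry actionable text
like this rather than an opaque stack trace.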
---