zhengruifeng created SPARK-22700:
------------------------------------
Summary: Bucketizer.transform incorrectly drops row containing NaN
Key: SPARK-22700
URL: https://issues.apache.org/jira/browse/SPARK-22700
Project: Spark
Issue Type: Improvement
Components: ML
Affects Versions: 2.2.0, 2.3.0
Reporter: zhengruifeng
{code}
import org.apache.spark.ml.feature._
val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, Double.NaN))).toDF("a", "b")
val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
val bucketizer: Bucketizer = new Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
bucketizer.setHandleInvalid("skip")
scala> df.show
+---+---+
|  a|  b|
+---+---+
|2.3|3.0|
|NaN|3.0|
|6.7|NaN|
+---+---+
scala> bucketizer.transform(df).show
+---+---+---+
|  a|  b| aa|
+---+---+---+
|2.3|3.0|0.0|
+---+---+---+
{code}
When {{handleInvalid}} is set to {{skip}}, the last row of the input is incorrectly
dropped, even though column 'b' is not an input column. Only the row whose input
column 'a' is NaN should be skipped.
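
A possible workaround, sketched under assumptions: the observed output is consistent with {{skip}} dropping rows that have NaN in any column, so one option is to drop invalid rows on the input column only and then run the transform with {{handleInvalid}} set to {{error}}. The SparkSession setup and the names {{cleaned}} and {{result}} below are illustrative and not part of the original report.

```scala
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

// In spark-shell this returns the existing session; the local master is an assumption.
val spark = SparkSession.builder().master("local[1]").appName("SPARK-22700-workaround").getOrCreate()

val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, Double.NaN))).toDF("a", "b")
val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)

// Drop rows that are null or NaN in the input column "a" only;
// a NaN in column "b" does not cause a row to be removed here.
val cleaned = df.na.drop(Seq("a"))

// With invalid values already removed from "a", "error" mode does not need to skip anything.
val bucketizer = new Bucketizer()
  .setInputCol("a")
  .setOutputCol("aa")
  .setSplits(splits)
  .setHandleInvalid("error")

val result = bucketizer.transform(cleaned)
result.show()
```

With this approach the row (6.7, NaN) should be retained and assigned to the second bucket, while the row with NaN in 'a' is removed before the transform.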