Maciej Szymkiewicz created SPARK-11569:
------------------------------------------
Summary: StringIndexer transform fails when column contains nulls
Key: SPARK-11569
URL: https://issues.apache.org/jira/browse/SPARK-11569
Project: Spark
Issue Type: Bug
Components: ML, PySpark
Affects Versions: 1.5.0, 1.4.0, 1.6.0
Reporter: Maciej Szymkiewicz

Transforming a column containing {{null}} values using {{StringIndexer}} results in a {{java.lang.NullPointerException}}:

{code}
from pyspark.ml.feature import StringIndexer

df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
df.printSchema()
## root
##  |-- k: string (nullable = true)
##  |-- v: long (nullable = true)

indexer = StringIndexer(inputCol="k", outputCol="kIdx")

indexer.fit(df).transform(df)
## <repr(<pyspark.sql.dataframe.DataFrame at 0x7f4b0d8e7110>) failed: py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
## : java.lang.NullPointerException
{code}

The problem disappears when we drop the rows containing {{nulls}}:

{code}
df1 = df.na.drop()
indexer.fit(df1).transform(df1)
{code}

or replace the {{nulls}} with a placeholder value:

{code}
from pyspark.sql.functions import col, when

k = col("k")
df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
indexer.fit(df2).transform(df2)
{code}

It cannot be reproduced using the Scala API:

{code}
import org.apache.spark.ml.feature.StringIndexer

val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
df.printSchema
// root
//  |-- k: string (nullable = true)
//  |-- v: integer (nullable = false)

val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")

indexer.fit(df).transform(df).count
// 2
{code}
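For illustration only, the effect of the null-replacement workaround can be sketched without Spark. The snippet below is plain Python and is not how {{StringIndexer}} is implemented: {{fill_nulls}}, {{fit_index}} and {{transform}} are hypothetical helpers, the "most frequent label gets index 0" ordering is modelled on the documented {{StringIndexer}} behaviour, and the alphabetical tie-break is an assumption made so that the result is deterministic.

```python
from collections import Counter

NA_LABEL = "__NA__"  # sentinel taken from the workaround in the report


def fill_nulls(rows, sentinel=NA_LABEL):
    """Replace None keys with a sentinel, like when(k.isNull(), ...).otherwise(k)."""
    return [(sentinel if k is None else k, v) for k, v in rows]


def fit_index(rows):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically; that tie-break is an assumption of
    this sketch, not documented StringIndexer behaviour.
    """
    counts = Counter(k for k, _ in rows)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(rows, index):
    """Append the label index to every row."""
    return [(k, v, index[k]) for k, v in rows]


# Same data as the report: one real label and one null.
rows = [("a", 1), (None, 2)]
filled = fill_nulls(rows)           # the null becomes its own category
index = fit_index(filled)
indexed = transform(filled, index)
```

After replacement the former null is an ordinary label, so it receives an index like any other value. Whether that is acceptable depends on the model downstream; dropping the rows, as in the first workaround, avoids introducing an artificial category.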