[ https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001439#comment-15001439 ]
Jia Li commented on SPARK-11569:
--------------------------------

Hi [~josephkb] [~holdenk_amp], I'd like to hear your opinion on the expected behavior for this test case. I can think of these possibilities:

1) the tuple with null gets the last index, as shown below:

+-----+----+---+---+-----+
|   x0|  x1| x2| x3|x0idx|
+-----+----+---+---+-----+
|asd2s|1e1e|1.1|  0|  0.0|
|asd2s|1e1e|0.1|  0|  0.0|
| null|1e3e|1.2|  0|  2.0|
|bd34t|1e1e|5.1|  1|  1.0|
|asd2s|1e3e|0.2|  0|  0.0|
|bd34t|1e2e|4.3|  1|  1.0|
+-----+----+---+---+-----+

2) the tuple with null gets index 0, before everything else
3) eliminate the tuple from the result

Which one do you prefer? Thanks,

> StringIndexer transform fails when column contains nulls
> --------------------------------------------------------
>
>                 Key: SPARK-11569
>                 URL: https://issues.apache.org/jira/browse/SPARK-11569
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0
>            Reporter: Maciej Szymkiewicz
>
> Transforming a column containing {{null}} values using {{StringIndexer}}
> results in {{java.lang.NullPointerException}}:
> {code}
> from pyspark.ml.feature import StringIndexer
> df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
> df.printSchema()
> ## root
> ## |-- k: string (nullable = true)
> ## |-- v: long (nullable = true)
> indexer = StringIndexer(inputCol="k", outputCol="kIdx")
> indexer.fit(df).transform(df)
> ## <repr(<pyspark.sql.dataframe.DataFrame at 0x7f4b0d8e7110>) failed:
> ## py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
> ## : java.lang.NullPointerException
> {code}
> The problem disappears when we drop the nulls:
> {code}
> df1 = df.na.drop()
> indexer.fit(df1).transform(df1)
> {code}
> or replace them:
> {code}
> from pyspark.sql.functions import col, when
> k = col("k")
> df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
> indexer.fit(df2).transform(df2)
> {code}
> and it cannot be reproduced using the Scala API:
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
> df.printSchema
> // root
> // |-- k: string (nullable = true)
> // |-- v: integer (nullable = false)
> val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
> indexer.fit(df).transform(df).count
> // 2
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
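For discussion purposes, the three proposed behaviors can be sketched without Spark by mimicking StringIndexer's frequency-ordered label indexing in plain Python. The {{fit_labels}}/{{transform}} helpers and the {{null_policy}} names ("last", "first", "skip") below are illustrative, not part of any Spark API; the "last" case reproduces the x0idx column from the table in the comment above.

```python
from collections import Counter

def fit_labels(values):
    # StringIndexer orders labels by descending frequency
    # (ties broken by insertion order here, for simplicity).
    counts = Counter(v for v in values if v is not None)
    return [label for label, _ in counts.most_common()]

def transform(values, labels, null_policy):
    """Index values, handling None per the proposed policies:
    'last'  -> null gets the index after all real labels (option 1)
    'first' -> null gets index 0, shifting real labels up (option 2)
    'skip'  -> rows with null are dropped from the result (option 3)
    """
    offset = 1 if null_policy == "first" else 0
    index = {label: float(i + offset) for i, label in enumerate(labels)}
    out = []
    for v in values:
        if v is None:
            if null_policy == "last":
                out.append(float(len(labels)))
            elif null_policy == "first":
                out.append(0.0)
            # 'skip': drop the row entirely
        else:
            out.append(index[v])
    return out

x0 = ["asd2s", "asd2s", None, "bd34t", "asd2s", "bd34t"]
labels = fit_labels(x0)                  # ['asd2s', 'bd34t']
print(transform(x0, labels, "last"))     # [0.0, 0.0, 2.0, 1.0, 0.0, 1.0]
print(transform(x0, labels, "first"))    # [1.0, 1.0, 0.0, 2.0, 1.0, 2.0]
print(transform(x0, labels, "skip"))     # [0.0, 0.0, 1.0, 0.0, 1.0]
```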