[ https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15813712#comment-15813712 ]
Joseph K. Bradley commented on SPARK-11569:
-------------------------------------------

Hi all,

I'm sorry for not following up on this, but I would like us to do this at some point. However, I will insist that we do some research before adding an API based on just a few users' requirements. Have you looked at other libraries?
* scikit-learn
* various R libraries
* pandas
* other more specialized but popular ML libraries

> StringIndexer transform fails when column contains nulls
> --------------------------------------------------------
>
>                 Key: SPARK-11569
>                 URL: https://issues.apache.org/jira/browse/SPARK-11569
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 1.4.0, 1.5.0, 1.6.0
>            Reporter: Maciej Szymkiewicz
>
> Transforming column containing {{null}} values using {{StringIndexer}} results in {{java.lang.NullPointerException}}
> {code}
> from pyspark.ml.feature import StringIndexer
> df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
> df.printSchema()
> ## root
> ##  |-- k: string (nullable = true)
> ##  |-- v: long (nullable = true)
> indexer = StringIndexer(inputCol="k", outputCol="kIdx")
> indexer.fit(df).transform(df)
> ## <repr(<pyspark.sql.dataframe.DataFrame at 0x7f4b0d8e7110>) failed: py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
> ## : java.lang.NullPointerException
> {code}
> Problem disappears when we drop
> {code}
> df1 = df.na.drop()
> indexer.fit(df1).transform(df1)
> {code}
> or replace {{nulls}}
> {code}
> from pyspark.sql.functions import col, when
> k = col("k")
> df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
> indexer.fit(df2).transform(df2)
> {code}
> and cannot be reproduced using Scala API
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
> df.printSchema
> // root
> //  |-- k: string (nullable = true)
> //  |-- v: integer (nullable = false)
> val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
> indexer.fit(df).transform(df).count
> // 2
> {code}
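
For discussion purposes, the two workarounds quoted above (dropping nulls, or replacing them with a placeholder label) can be compared without a Spark cluster using a small pure-Python sketch. It is only an illustration of possible null-handling behaviours, not Spark's implementation or a proposed API: the function names `fit_labels` and `index_values`, the `null_policy` parameter, and its values `"error"`, `"skip"`, and `"keep"` are all hypothetical. The sketch assumes labels are ordered by descending frequency with alphabetical tie-breaking, and that a kept null receives the next unused index.

```python
from collections import Counter

# Hypothetical policy names, chosen for illustration only.
NULL_POLICIES = ("error", "skip", "keep")


def fit_labels(values):
    """Return distinct non-null labels, most frequent first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(v for v in values if v is not None)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


def index_values(values, labels, null_policy="error"):
    """Map each value to the index of its label, handling None per null_policy.

    "error": raise on the first None.
    "skip":  drop None values from the output (like df.na.drop()).
    "keep":  give None its own extra index, len(labels)
             (like replacing nulls with a placeholder label).
    """
    if null_policy not in NULL_POLICIES:
        raise ValueError("unknown null_policy: %r" % (null_policy,))
    mapping = {label: float(i) for i, label in enumerate(labels)}
    null_index = float(len(labels))
    indexed = []
    for value in values:
        if value is None:
            if null_policy == "error":
                raise ValueError("null value in input column")
            if null_policy == "keep":
                indexed.append(null_index)
            # "skip": emit nothing for this row
            continue
        indexed.append(mapping[value])
    return indexed


if __name__ == "__main__":
    column = ["a", None, "b", "a"]
    labels = fit_labels(column)
    print(labels)
    print(index_values(column, labels, null_policy="skip"))
    print(index_values(column, labels, null_policy="keep"))
```

Under the "skip" policy the output has fewer rows than the input, whereas "keep" preserves row count at the cost of an extra category; which of these (if either) matches what scikit-learn, R, or pandas do is exactly the research question raised in the comment.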