Maciej Szymkiewicz created SPARK-11569:
------------------------------------------

             Summary: StringIndexer transform fails when column contains nulls
                 Key: SPARK-11569
                 URL: https://issues.apache.org/jira/browse/SPARK-11569
             Project: Spark
          Issue Type: Bug
          Components: ML, PySpark
    Affects Versions: 1.5.0, 1.4.0, 1.6.0
            Reporter: Maciej Szymkiewicz


Transforming a column containing {{null}} values using {{StringIndexer}} results 
in a {{java.lang.NullPointerException}}:

{code}
from pyspark.ml.feature import StringIndexer

df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
df.printSchema()
## root
##  |-- k: string (nullable = true)
##  |-- v: long (nullable = true)

indexer = StringIndexer(inputCol="k", outputCol="kIdx")

indexer.fit(df).transform(df)
## <repr(<pyspark.sql.dataframe.DataFrame at 0x7f4b0d8e7110>) failed: py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
## : java.lang.NullPointerException
{code}

The problem disappears when we drop the rows containing {{null}} values

{code}
df1 = df.na.drop()
indexer.fit(df1).transform(df1)
{code}

or replace the {{null}} values with a placeholder string

{code}
from pyspark.sql.functions import col, when

k = col("k")
df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
indexer.fit(df2).transform(df2)
{code}

and it cannot be reproduced using the Scala API

{code}
import org.apache.spark.ml.feature.StringIndexer

val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
df.printSchema
// root
//  |-- k: string (nullable = true)
//  |-- v: integer (nullable = false)

val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")

indexer.fit(df).transform(df).count
// 2
{code}
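
For reference, the placeholder workaround above could probably also be written with {{DataFrame.na.fill}}, which would make it easy to apply to several columns at once. This is only a sketch: the helper name {{fill_null_labels}} and the {{"__NA__"}} default are illustrative and not part of the Spark API, and this variant has not been checked against the affected versions.

```python
def fill_null_labels(df, columns, placeholder="__NA__"):
    """Return a DataFrame with nulls in the given string columns replaced.

    fill_null_labels and the "__NA__" default are hypothetical names used
    for illustration; only df.na.fill is an actual PySpark call.
    """
    # na.fill with a string value only affects string columns; subset
    # restricts the replacement to the columns we intend to index.
    return df.na.fill(placeholder, subset=list(columns))


# Usage with the df and indexer from the Python example above:
# df3 = fill_null_labels(df, ["k"])
# indexer.fit(df3).transform(df3)
```

As with the {{when}}/{{otherwise}} version, the placeholder ends up as an ordinary label in the fitted model, so it should be a value that cannot occur in the real data.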



