[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989091#comment-14989091
 ] 

Yanbo Liang commented on SPARK-11478:
-------------------------------------

I have found the cause of this bug, but have not yet figured out how to 
resolve it. In "transformSchema" we use the following code snippet to produce 
the output column schema:
{code}
val attr = NominalAttribute.defaultAttr.withName($(outputCol))
attr.toStructField()
{code}
And Attribute.toStructField() always sets "nullable" to "false", regardless of 
the input.
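
A quick way to observe this in spark-shell is to build the field directly and inspect its flag. This is only a sketch: the column name "labelIndex" is taken from the issue description below, and the expected result follows from the behaviour described above rather than from a fresh run.

```scala
import org.apache.spark.ml.attribute.NominalAttribute

// Build the same kind of field that transformSchema produces.
// "labelIndex" stands in for $(outputCol).
val field = NominalAttribute.defaultAttr.withName("labelIndex").toStructField()

// Per the behaviour described above, this flag is expected to be false.
println(field.nullable)
```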

But in "transform" we use DataFrame operations to produce the new column. We 
have no control over the "nullable" of that column; it is decided by the 
DataFrame's internal implementation.

Most of the other feature transformers have the same problem. I propose not 
checking "nullable", because it is difficult to know the "nullable" value of a 
column before it is generated: "transformSchema" cannot get the right value, 
and if we hard-code a specific value, "transform" cannot be made to follow it. 
[~mengxr]
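
One possible shape for such a check, as a minimal sketch only: the helpers "ignoreNullable" and "sameIgnoringNullable" are hypothetical names, not part of Spark, and they only normalize the top-level "nullable" flag (nested struct, array, or map nullability is left untouched).

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: mark every top-level field as nullable so the flag
// no longer takes part in equality.
def ignoreNullable(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = true)))

// Hypothetical helper: compare names, data types, and metadata only.
def sameIgnoringNullable(a: StructType, b: StructType): Boolean =
  ignoreNullable(a) == ignoreNullable(b)

// Two schemas that differ only in "nullable", as in the issue description.
val fromTransform = StructType(Seq(
  StructField("labelIndex", DoubleType, nullable = true)))
val fromTransformSchema = StructType(Seq(
  StructField("labelIndex", DoubleType, nullable = false)))

println(fromTransform == fromTransformSchema)
println(sameIgnoringNullable(fromTransform, fromTransformSchema))
```

A plain equality check treats the two schemas as different, while the relaxed comparison accepts them, which is the behaviour the proposal asks for.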

> ML StringIndexer return inconsistent schema
> -------------------------------------------
>
>                 Key: SPARK-11478
>                 URL: https://issues.apache.org/jira/browse/SPARK-11478
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>
> ML StringIndexer's transform and transformSchema return inconsistent schemas.
> {code}
> import org.apache.spark.ml.feature.StringIndexer
>
> val data = sc.parallelize(
>   Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")), 2)
> val df = sqlContext.createDataFrame(data).toDF("id", "label")
> val indexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndex")
>   .fit(df)
> val transformed = indexer.transform(df)
> // Schema actually produced by transform
> println(transformed.schema.toString())
> // Schema predicted by transformSchema
> println(indexer.transformSchema(df.schema))
> {code}
> The "nullable" of "labelIndex" differs between the two outputs:
> {code}
> StructType(StructField(id,IntegerType,false),
> StructField(label,StringType,true), StructField(labelIndex,DoubleType,true))
> StructType(StructField(id,IntegerType,false),
> StructField(label,StringType,true), StructField(labelIndex,DoubleType,false))
> {code}


