[ 
https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15483514#comment-15483514
 ] 

Vincent edited comment on SPARK-17498 at 9/12/16 8:43 AM:
----------------------------------------------------------

Here is how we (cc [~qhuang]) understand this issue; please correct me if I
have misunderstood anything, [~miro.balaz].

Suppose a StringIndexer is fitted as follows:

// Column names "id" and "label" are illustrative.
val df = spark.createDataFrame(Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"),
  (4, "a"), (5, "c"))).toDF("id", "label")
val indexer = new StringIndexer().setInputCol("label")
  .setOutputCol("labelIndex").fit(df)

When transform is called on a new DataFrame that contains unseen labels, say

val dfNew = spark.createDataFrame(Seq((0, "a"), (1, "b"), (2, "e"), (3, "d")))
  .toDF("id", "label")
indexer.transform(dfNew)

it should return 3 and 4 for the labels "d" and "e" instead of skipping/deleting
the new incoming labels, and IndexToString should return NaN for these added
indexes 3 and 4.
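
To make the reading above concrete, here is a minimal sketch in plain Scala collections. It does not use the Spark API. The object and method names are hypothetical. Ordering by descending frequency mirrors StringIndexer's default; breaking ties alphabetically and sorting unseen labels before numbering them are assumptions made only so that "d" and "e" land on 3 and 4. Spark emits indices as doubles; the sketch uses Int for brevity.

```scala
object NewLabelSketch {
  // Fit: the most frequent label gets index 0. Ties are broken
  // alphabetically here (an assumption, for determinism).
  def fit(labels: Seq[String]): Map[String, Int] =
    labels.groupBy(identity).toSeq
      .map { case (label, occurrences) => (label, occurrences.size) }
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // Transform with the proposed 'new' handling: each unseen label gets a
  // fresh index after the fitted ones instead of being skipped.
  // Sorting the unseen labels is an assumption of this sketch.
  def transformNew(fitted: Map[String, Int], labels: Seq[String]): Seq[Int] = {
    val unseen = labels.distinct.filterNot(fitted.contains).sorted
    val extended = fitted ++ unseen.zipWithIndex.map {
      case (label, i) => label -> (fitted.size + i)
    }
    labels.map(extended)
  }
}
```

With the data from this comment, fit gives a -> 0, c -> 1, b -> 2, and transformNew on Seq("a", "b", "e", "d") assigns 3 to "d" and 4 to "e".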

[~yanboliang] [~srowen] [~josephkb] what do you think of this issue? Currently
StringIndexer can either skip the unseen label or throw an error in such a
case. Do you think we should add the proposed 'new' handling option to
StringIndexer?


> StringIndexer.setHandleInvalid should have another option 'new'
> ---------------------------------------------------------------
>
>                 Key: SPARK-17498
>                 URL: https://issues.apache.org/jira/browse/SPARK-17498
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>            Reporter: Miroslav Balaz
>
> That would map an unseen label to the maximum known label index + 1, and
> IndexToString would map that index back to "<undef>", or to NA if there is
> something like that in Spark.
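
The mapping described in the issue could look roughly like the following plain-Scala sketch, in which every unseen label shares the single index "maximum known index + 1" and that index maps back to "<undef>". The object and method names are hypothetical and this is not Spark's API; it only illustrates the round trip under those assumptions.

```scala
object UnknownBucketSketch {
  val Undefined = "<undef>"

  // Forward mapping: known labels keep their position in `labels`;
  // any unseen label maps to labels.length, i.e. maximum known index + 1.
  def index(labels: IndexedSeq[String], label: String): Int = {
    val i = labels.indexOf(label)
    if (i >= 0) i else labels.length
  }

  // Inverse mapping: indices outside the known range map to "<undef>".
  def label(labels: IndexedSeq[String], index: Int): String =
    if (index >= 0 && index < labels.length) labels(index) else Undefined
}
```

For example, with known labels Vector("a", "c", "b"), both "d" and "e" map to index 3, and index 3 maps back to "<undef>".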



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
