[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16226448#comment-16226448
 ] 

Nick Pentreath commented on SPARK-13030:
----------------------------------------

I just think it makes sense for OHE to be an Estimator (as it is in sklearn). 
It really should have been from the beginning. The fact that it is not is 
actually a bug, IMO.

The proposal to have a size param could fix the issue, but it is a bit of a 
band-aid fix. It requires the user to specify the size (num categories) 
manually. This doesn't really feel like the right workflow to me; the OHE 
should be able to figure that out itself. So that adds one more "speed bump", 
albeit a small one, in using the component in a pipeline.

It is possible to emulate {{fit}} with a sort of "hack", i.e. during the 
first {{transform}} call, set the param if it is not set already. But that just 
argues for making it an {{Estimator/Model}} pair. Sure, we could wait 
until {{3.0}}, but if the work is already done I don't see a compelling reason 
not to do that now.
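
To make the difference concrete, here is a minimal sketch in plain Python. It 
is not the Spark API: the class names ({{StatelessOneHot}}, {{OneHotEstimator}}, 
{{OneHotModel}}) and the helper {{_one_hot}} are hypothetical, and it ignores 
details of the real component such as dropping the last category. It assumes 
the input is a list of non-negative category indices. The stateless version 
infers the vector size from whatever data it is handed, while the fitted model 
remembers the size it saw at {{fit}} time.

```python
from typing import List, Optional


def _one_hot(index: int, size: int) -> List[int]:
    """Return a dense one-hot vector of length `size` (hypothetical helper)."""
    if not 0 <= index < size:
        raise ValueError(f"category index {index} is outside [0, {size})")
    vec = [0] * size
    vec[index] = 1
    return vec


class StatelessOneHot:
    """Transformer-style encoder with an optional user-supplied size param."""

    def __init__(self, size: Optional[int] = None):
        # `size` stands in for the proposed size param (number of categories).
        self.size = size

    def transform(self, indices: List[int]) -> List[List[int]]:
        # Without a size param, the width is inferred from the data at hand,
        # so it can differ between the training set and the test set.
        size = self.size if self.size is not None else max(indices) + 1
        return [_one_hot(i, size) for i in indices]


class OneHotModel:
    """Fitted model: the number of categories is fixed at fit time."""

    def __init__(self, size: int):
        self.size = size

    def transform(self, indices: List[int]) -> List[List[int]]:
        return [_one_hot(i, self.size) for i in indices]


class OneHotEstimator:
    """Estimator: learns the number of categories from the training data."""

    def fit(self, indices: List[int]) -> OneHotModel:
        return OneHotModel(size=max(indices) + 1)


train = [0, 1, 2, 3]  # four categories seen in training
test = [0, 1]         # only two of them appear in the test set

stateless = StatelessOneHot()
train_width_stateless = len(stateless.transform(train)[0])
test_width_stateless = len(stateless.transform(test)[0])

model = OneHotEstimator().fit(train)
train_width_fitted = len(model.transform(train)[0])
test_width_fitted = len(model.transform(test)[0])

print(train_width_stateless, test_width_stateless)  # widths differ
print(train_width_fitted, test_width_fitted)        # widths match
```

Passing {{size=4}} to the stateless version also gives matching widths, which 
is the size-param proposal; the user just has to know the number up front.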

> Change OneHotEncoder to Estimator
> ---------------------------------
>
>                 Key: SPARK-13030
>                 URL: https://issues.apache.org/jira/browse/SPARK-13030
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.6.0
>            Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories 
> differs between the training dataset and the test dataset.


