[ https://issues.apache.org/jira/browse/SPARK-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352690#comment-15352690 ]

Max Moroz commented on SPARK-11106:
-----------------------------------

Automatic feature transformation creates a lot of implicit rules that are not 
always intuitive.

For example, [~xusen] pointed out that one-hot encoding of strings isn't always 
reasonable. But even StringIndexer might not be the best solution; for 
example, when strings are converted to numbers, it is often recommended to 
first sort them by how frequently they correspond to each class (in the case 
of a binary classifier). Another option is a multi-way categorical split 
(which requires categorical variables and appropriate support from the DT/RF 
classifier, but that may become available in the future). In addition, 
whatever transformation is done pre-training, the inverse has to be done 
post-prediction (at least to emulate sklearn, but also to keep things sane 
from the user's perspective). 
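The fit-time indexing and predict-time inverse mapping described above can be 
sketched in plain Python (this is only an illustration of the bookkeeping 
involved, not Spark API; the frequency-sorted ordering is the heuristic 
mentioned in the paragraph):

```python
# Plain-Python sketch: index string labels before training (most frequent
# label first, as suggested above), and invert the mapping after prediction.
# Nothing here is actual Spark code.
from collections import Counter

def fit_label_index(labels):
    """Map each string label to an integer, most frequent label first."""
    freq = Counter(labels)
    ordered = [label for label, _ in freq.most_common()]
    return {label: i for i, label in enumerate(ordered)}

def invert_index(index):
    """Build the inverse mapping applied after prediction."""
    return {i: label for label, i in index.items()}

labels = ["spam", "ham", "spam", "spam", "ham", "eggs"]
index = fit_label_index(labels)     # {"spam": 0, "ham": 1, "eggs": 2}
inverse = invert_index(index)

encoded = [index[l] for l in labels]
decoded = [inverse[i] for i in encoded]
assert decoded == labels            # the round trip must be lossless
```

The point of the sketch is that the inverse mapping has to be carried from 
fit time to predict time; if each pipeline stage re-derives it, you get the 
repeated string --> number --> string conversions discussed below.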

What would be a clean way to specify exactly which transformation to use?
And how do we make sure the transformation is inverted at the end of the 
pipeline without creating repeated string --> number --> string 
transformations at each node?

Without a good answer to these questions, it might be better to leave the API 
as is. 

By the same token, automatically imputing null values is not too appealing 
either: the median or average is often not a good replacement value. 
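A tiny numeric illustration of that point (toy data, nothing Spark-specific): 
on a bimodal column, both the mean and the median land in a region where 
almost no real values lie, so a blanket fill invents implausible observations.

```python
# Toy illustration: the observed values cluster near 1 and near 100, yet the
# mean falls in the empty middle and the median picks one mode arbitrarily.
values = [1.0, 1.0, 2.0, None, 99.0, 100.0, 100.0]
observed = [v for v in values if v is not None]

mean = sum(observed) / len(observed)            # 50.5, far from both clusters
median = sorted(observed)[len(observed) // 2]   # 99.0, ignores the low cluster
imputed = [mean if v is None else v for v in values]
```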

> Should ML Models contain single models or Pipelines?
> ----------------------------------------------------
>
>                 Key: SPARK-11106
>                 URL: https://issues.apache.org/jira/browse/SPARK-11106
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Critical
>
> This JIRA is for discussing whether ML Estimators should do feature 
> processing.
> h2. Issue
> Currently, almost all ML Estimators require strict input types.  E.g., 
> DecisionTreeClassifier requires that the label column be Double type and have 
> metadata indicating the number of classes.
> This requires users to know how to preprocess data.
> h2. Ideal workflow
> A user should be able to pass any reasonable data to a Transformer or 
> Estimator and have it "do the right thing."
> E.g.:
> * If DecisionTreeClassifier is given a String column for labels, it should 
> know to index the Strings.
> * See [SPARK-10513] for a similar issue with OneHotEncoder.
> h2. Possible solutions
> There are a few solutions I have thought of.  Please comment with feedback or 
> alternative ideas!
> h3. Leave as is
> Pro: The current setup is good in that it forces the user to be very aware of 
> what they are doing.  Feature transformations will not happen silently.
> Con: The user has to write boilerplate code for transformations.  The API is 
> not what some users would expect; e.g., coming from R, a user might expect 
> some automatic transformations.
> h3. All Transformers can contain PipelineModels
> We could allow all Transformers and Models to contain arbitrary 
> PipelineModels.  E.g., if a DecisionTreeClassifier were given a String label 
> column, it might return a Model which contains a simple fitted PipelineModel 
> containing StringIndexer + DecisionTreeClassificationModel.
> The API could present this to the user, or it could be hidden from the user.  
> Ideally, it would be hidden from the beginner user, but accessible for 
> experts.
> The main problem is that we might have to break APIs.  E.g., OneHotEncoder 
> may need to do indexing if given a String input column.  This means it should 
> no longer be a Transformer; it should be an Estimator.
> h3. All Estimators should use RFormula
> The best option I have thought of is to make RFormula the primary method 
> for automatic feature transformation.  We could start adding an RFormula 
> Param to all Estimators, and it could handle most of these feature 
> transformation issues.
> We could maintain old APIs:
> * If a user sets the input column names, then those can be used in the 
> traditional (no automatic transformation) way.
> * If a user sets the RFormula Param, then it can be used instead.  (This 
> should probably take precedence over the old API.)
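One way the precedence rule in the last bullet could look, sketched as plain 
Python with a hypothetical estimator (the `formula`, `label_col`, and 
`features_col` names are illustrative only, not the actual Spark Params):

```python
# Hypothetical sketch of the Param-precedence rule quoted above: if the
# RFormula-style Param is set it takes precedence; otherwise fall back to the
# traditional explicit column names. All names here are made up for the sketch.
class SketchEstimator:
    def __init__(self, label_col="label", features_col="features", formula=None):
        self.label_col = label_col
        self.features_col = features_col
        self.formula = formula  # e.g. "clicked ~ country + hour"

    def resolve_inputs(self):
        if self.formula is not None:
            # The formula implies its own label and feature columns,
            # so it overrides any explicitly set column names.
            return ("formula", self.formula)
        return ("columns", (self.label_col, self.features_col))

assert SketchEstimator(formula="y ~ .").resolve_inputs()[0] == "formula"
assert SketchEstimator(label_col="y").resolve_inputs() == ("columns", ("y", "features"))
```

This keeps the old API intact for users who never touch the formula Param, 
while letting the formula drive all automatic transformations when it is set.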



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
