[jira] [Commented] (SPARK-11106) Should ML Models contains single models or Pipelines?

Xusen Yin (JIRA) Tue, 07 Jun 2016 16:33:06 -0700

    [ https://issues.apache.org/jira/browse/SPARK-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319685#comment-15319685 ]


Xusen Yin commented on SPARK-11106:
-----------------------------------

RFormula is easy to use, but it may not always do the right thing. For example, 
RFormula encodes categorical features with OneHotEncoder, but in some scenarios 
(like RandomForest), a StringIndexer is better.
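
To make the difference concrete, here is a minimal, library-free sketch in plain 
Python. It is not Spark code: the helper names fit_string_index and one_hot are 
hypothetical, and the frequency-descending ordering with alphabetical 
tie-breaking and the drop-last behavior are assumptions made for illustration. 
The point is only that indexing keeps a categorical feature as one column, 
while one-hot encoding spreads it across several binary columns, which is 
usually less convenient for tree-based learners.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an integer index.

    Assumed ordering: most frequent label first, ties broken alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Expand a category index into a 0/1 vector.

    With drop_last=True the last category is encoded as all zeros,
    so the vector has num_categories - 1 entries.
    """
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]


colors = ["red", "blue", "red", "green", "red", "blue"]

# Indexing only: one numeric column that a tree can treat as categorical.
mapping = fit_string_index(colors)
indexed = [mapping[c] for c in colors]

# Indexing followed by one-hot encoding: one binary column per kept category.
encoded = [one_hot(i, len(mapping)) for i in indexed]

print(mapping)
print(indexed)
print(encoded)
```

With three distinct colors, the indexed form is a single column of small 
integers, while the one-hot form has two columns per row. A formula-style API 
that always picks the second form cannot offer the first when it would suit 
the learner better.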

> Should ML Models contains single models or Pipelines?
> -----------------------------------------------------
>
>                 Key: SPARK-11106
>                 URL: https://issues.apache.org/jira/browse/SPARK-11106
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Priority: Critical
>
> This JIRA is for discussing whether ML Estimators should do feature 
> processing.
> h2. Issue
> Currently, almost all ML Estimators require strict input types.  E.g., 
> DecisionTreeClassifier requires that the label column be Double type and have 
> metadata indicating the number of classes.
> This requires users to know how to preprocess data.
> h2. Ideal workflow
> A user should be able to pass any reasonable data to a Transformer or 
> Estimator and have it "do the right thing."
> E.g.:
> * If DecisionTreeClassifier is given a String column for labels, it should 
> know to index the Strings.
> * See [SPARK-10513] for a similar issue with OneHotEncoder.
> h2. Possible solutions
> There are a few solutions I have thought of.  Please comment with feedback or 
> alternative ideas!
> h3. Leave as is
> Pro: The current setup is good in that it forces the user to be very aware of 
> what they are doing.  Feature transformations will not happen silently.
> Con: The user has to write boilerplate code for transformations.  The API is 
> not what some users would expect; e.g., coming from R, a user might expect 
> some automatic transformations.
> h3. All Transformers can contain PipelineModels
> We could allow all Transformers and Models to contain arbitrary 
> PipelineModels.  E.g., if a DecisionTreeClassifier were given a String label 
> column, it might return a Model which contains a simple fitted PipelineModel 
> containing StringIndexer + DecisionTreeClassificationModel.
> The API could present this to the user, or it could be hidden from the user.  
> Ideally, it would be hidden from the beginner user, but accessible for 
> experts.
> The main problem is that we might have to break APIs.  E.g., OneHotEncoder 
> may need to do indexing if given a String input column.  This means it should 
> no longer be a Transformer; it should be an Estimator.
> h3. All Estimators should use RFormula
> The best option I have thought of is to make RFormula be the primary method 
> for automatic feature transformation.  We could start adding an RFormula 
> Param to all Estimators, and it could handle most of these feature 
> transformation issues.
> We could maintain old APIs:
> * If a user sets the input column names, then those can be used in the 
> traditional (no automatic transformation) way.
> * If a user sets the RFormula Param, then it can be used instead.  (This 
> should probably take precedence over the old API.)



