[
https://issues.apache.org/jira/browse/SPARK-11106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319685#comment-15319685
]
Xusen Yin commented on SPARK-11106:
-----------------------------------
RFormula is easy to use, but it may not always do the right thing. For example,
RFormula encodes categorical features with OneHotEncoder, but in some scenarios
(such as RandomForest), a StringIndexer is the better choice.
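To make the difference concrete, here is a minimal plain-Python sketch (not Spark code) of the two encodings. The helper names `fit_index`, `index_encode`, and `one_hot_encode` are hypothetical and exist only for this illustration. The frequency-descending ordering mirrors what StringIndexer does by default; the alphabetical tie-break is an assumption made here to keep the output deterministic.

```python
from collections import Counter


def fit_index(values):
    """Map each distinct string to an integer index, most frequent first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {value: i for i, value in enumerate(ordered)}


def index_encode(values, mapping):
    """StringIndexer-style output: one numeric column."""
    return [float(mapping[v]) for v in values]


def one_hot_encode(values, mapping):
    """One-hot output: one 0/1 column per category."""
    width = len(mapping)
    rows = []
    for v in values:
        row = [0.0] * width
        row[mapping[v]] = 1.0
        rows.append(row)
    return rows


colors = ["red", "blue", "red", "green", "red", "blue"]
mapping = fit_index(colors)
indexed = index_encode(colors, mapping)    # a single feature column
one_hot = one_hot_encode(colors, mapping)  # three feature columns
```

With indexing, a tree-based learner sees a single categorical feature with three values; with one-hot encoding it sees three separate binary features, which is what a linear model usually wants but not necessarily what a tree wants.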
> Should ML Models contain single models or Pipelines?
> ----------------------------------------------------
>
> Key: SPARK-11106
> URL: https://issues.apache.org/jira/browse/SPARK-11106
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Critical
>
> This JIRA is for discussing whether ML Estimators should do feature
> processing.
> h2. Issue
> Currently, almost all ML Estimators require strict input types. E.g.,
> DecisionTreeClassifier requires that the label column be Double type and have
> metadata indicating the number of classes.
> This requires users to know how to preprocess data.
> h2. Ideal workflow
> A user should be able to pass any reasonable data to a Transformer or
> Estimator and have it "do the right thing."
> E.g.:
> * If DecisionTreeClassifier is given a String column for labels, it should
> know to index the Strings.
> * See [SPARK-10513] for a similar issue with OneHotEncoder.
> h2. Possible solutions
> There are a few solutions I have thought of. Please comment with feedback or
> alternative ideas!
> h3. Leave as is
> Pro: The current setup is good in that it forces the user to be very aware of
> what they are doing. Feature transformations will not happen silently.
> Con: The user has to write boilerplate code for transformations. The API is
> not what some users would expect; e.g., coming from R, a user might expect
> some automatic transformations.
> h3. All Transformers can contain PipelineModels
> We could allow all Transformers and Models to contain arbitrary
> PipelineModels. E.g., if a DecisionTreeClassifier were given a String label
> column, it might return a Model which contains a simple fitted PipelineModel
> containing StringIndexer + DecisionTreeClassificationModel.
> The API could present this to the user, or it could be hidden from the user.
> Ideally, it would be hidden from the beginner user, but accessible for
> experts.
> The main problem is that we might have to break APIs. E.g., OneHotEncoder
> may need to do indexing if given a String input column. This means it should
> no longer be a Transformer; it should be an Estimator.
> h3. All Estimators should use RFormula
> The best option I have thought of is to make RFormula the primary method
> for automatic feature transformation. We could start adding an RFormula
> Param to all Estimators, and it could handle most of these feature
> transformation issues.
> We could maintain old APIs:
> * If a user sets the input column names, then those can be used in the
> traditional (no automatic transformation) way.
> * If a user sets the RFormula Param, then it can be used instead. (This
> should probably take precedence over the old API.)
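The precedence rule proposed in the last two bullets can be sketched in plain Python. This is only an illustration of the idea, not Spark's API: `resolve_inputs`, its parameter names, and the returned dictionary are all hypothetical, and the formula parsing handles only the simple `label ~ a + b` form.

```python
from typing import Optional


def resolve_inputs(formula: Optional[str] = None,
                   label_col: str = "label",
                   features_col: str = "features") -> dict:
    """Decide which input specification an estimator should use.

    If a formula is set it takes precedence; otherwise fall back to the
    traditional column names with no automatic transformation.
    """
    if formula is not None:
        lhs, sep, rhs = formula.partition("~")
        terms = [t.strip() for t in rhs.split("+")]
        if not sep or not lhs.strip() or any(not t for t in terms):
            raise ValueError(f"unsupported formula: {formula!r}")
        return {"mode": "formula", "label": lhs.strip(), "terms": terms}
    return {"mode": "columns", "label": label_col, "features": features_col}


# Traditional API: column names are used as given.
traditional = resolve_inputs(label_col="clicked", features_col="vec")

# Formula set: it wins even though column names were also supplied.
with_formula = resolve_inputs(formula="clicked ~ country + hour",
                              label_col="ignored", features_col="ignored")
```

Keeping the column-name branch untouched is what would let existing user code continue to work under this proposal.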