[ https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933903#comment-14933903 ]

Seth Hendrickson commented on SPARK-7129:
-----------------------------------------

Thanks for clearing those up!

One thing I'm concerned about is the assumption that any spark.ml predictor 
should be able to serve as a base learner. Base learners have certain 
requirements that make them different from their strong-learner forms. If 
those requirements cannot be easily imposed under the given spark.ml 
abstractions, then we should consider alternatives to make sure the API is 
extensible moving forward. For instance, weakness may be imposed on a linear 
regression by allowing the model to be trained on only one feature; I'm not 
sure there is an easy or elegant way to do this under the current 
{{Predictor}} abstractions. We may find that there is so little overlap 
between the models spark.ml has (or will have in the near future) and the 
base learners commonly used in boosting that the convenience of leveraging 
the current abstractions is not that high. The design of the base learner 
abstraction merits some consideration, since it will be a major factor in 
the extensibility of the API going forward. Thoughts, [~meihuawu] [~josephkb]?

> Add generic boosting algorithm to spark.ml
> ------------------------------------------
>
>                 Key: SPARK-7129
>                 URL: https://issues.apache.org/jira/browse/SPARK-7129
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * more esoteric variants (e.g., totally corrective boosting, cascaded 
> models), which we should consider but not design too heavily around
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.
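
The generic boosting loop the quoted issue describes, i.e. one that works with any base learner, can be sketched as follows (plain Python with squared loss only, not the proposed spark.ml API; all names are hypothetical). Each round fits the current residuals, so any callable of the form (X, y) -> model can serve as the base learner:

```python
# Hypothetical sketch: generic gradient boosting for squared loss.
# fit_base_learner is any callable (X, y) -> model, where model(row) -> float.

def boost(X, y, fit_base_learner, n_rounds=10, learning_rate=1.0):
    """Additive model: each round fits a base learner to the residuals."""
    models = []
    residuals = list(y)
    for _ in range(n_rounds):
        m = fit_base_learner(X, residuals)
        models.append(m)
        # Subtract this round's (shrunken) predictions from the residuals.
        residuals = [r - learning_rate * m(row)
                     for r, row in zip(residuals, X)]

    def predict(row):
        return sum(learning_rate * m(row) for m in models)
    return predict

# Example base learner: a constant predictor (the mean of the targets).
def fit_mean_learner(X, y):
    c = sum(y) / len(y)
    return lambda row: c
```

Supporting the other loss functions mentioned above (AdaBoost, LogitBoost) would replace the residual computation with the appropriate negative gradient and reweighting, which is where the base-learner abstraction question becomes important.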



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
