[ 
https://issues.apache.org/jira/browse/SPARK-7129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909512#comment-14909512
 ] 

Meihua Wu commented on SPARK-7129:
----------------------------------

[~sethah] Thank you for your comments! I have updated the design doc to 
(hopefully) address your concerns.

* I agree with you that we should have separate classes for the algorithms. 
Different algorithms have different type requirements for the base learner, so 
they cannot be combined into a single class. However, I think we need to keep 
the relevant methods and parameters consistent across the boosting algorithm 
classes; for example, they all use `setBaseLearner` to specify the base 
learner.

* I corrected the design doc to clarify this: `setBaseLearner` should take an 
instance of the type `Classifier[FeatureType, Learner, LearnerModel] with 
HasWeightCol`. This type requirement ensures that the base learner makes use 
of the sample weights during estimation. At the moment, `setBaseLearner` will 
take `LogisticRegression` for `AdaBoostClassifier` and `LinearRegression` for 
`AdaBoostRegression`. Did I answer your question?

* Sure, I will revise `SAMMEClassifier` to `AdaBoostClassifier`.

* `setNumberOfBaseLearners` sets the number of boosting iterations, i.e. the 
number of base learners trained.

Finally, the current proposal supports only a single base learner, the same as 
the AdaBoost implementation in scikit-learn. Supporting multiple base learners 
could be our next step.
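For reference, the per-iteration reweighting that an `AdaBoostClassifier` along 
these lines would perform is the SAMME update (the multiclass AdaBoost variant 
scikit-learn uses). Below is a minimal, framework-independent Python sketch of 
one boosting round; the function name `samme_update` is hypothetical and does 
not correspond to anything in the proposed Spark API -- it only illustrates how 
the sample-weight column the base learner must consume would be recomputed 
between iterations:

```python
import math

def samme_update(weights, y_true, y_pred, n_classes):
    """One SAMME boosting round: compute the base learner's vote
    weight (alpha) and the renormalized sample-weight distribution."""
    # Weighted error of the current base learner.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err = max(err, 1e-10)  # guard against a perfect learner
    # SAMME vote weight; the log(K - 1) term is the multiclass correction.
    alpha = math.log((1.0 - err) / err) + math.log(n_classes - 1)
    # Upweight misclassified samples, then renormalize to sum to 1.
    new_w = [w * math.exp(alpha) if t != p else w
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

# Toy example: 4 samples, 3 classes, one sample misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
alpha, weights = samme_update(weights, [0, 1, 2, 1], [0, 1, 2, 0],
                              n_classes=3)
# err = 0.25, so alpha = ln(3) + ln(2) = ln(6); the misclassified
# sample's weight grows to 2/3 after renormalization.
```

The updated weights would be written back into the weight column required by 
the `HasWeightCol` bound before fitting the next base learner.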



> Add generic boosting algorithm to spark.ml
> ------------------------------------------
>
>                 Key: SPARK-7129
>                 URL: https://issues.apache.org/jira/browse/SPARK-7129
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Joseph K. Bradley
>
> The Pipelines API will make it easier to create a generic Boosting algorithm 
> which can work with any Classifier or Regressor. Creating this feature will 
> require researching the possible variants and extensions of boosting which we 
> may want to support now and/or in the future, and planning an API which will 
> be properly extensible.
> In particular, it will be important to think about supporting:
> * multiple loss functions (for AdaBoost, LogitBoost, gradient boosting, etc.)
> * multiclass variants
> * multilabel variants (which will probably be in a separate class and JIRA)
> * For more esoteric variants, we should consider them but not design too much 
> around them: totally corrective boosting, cascaded models
> Note: This may interact some with the existing tree ensemble methods, but it 
> should be largely separate since the tree ensemble APIs and implementations 
> are specialized for trees.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
