[ https://issues.apache.org/jira/browse/SPARK-4210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14198321#comment-14198321 ]

Vincent Botta edited comment on SPARK-4210 at 11/5/14 12:25 PM:
----------------------------------------------------------------

[~manishamde]: Indeed, it will lead to interesting implementation tradeoffs. 
There are two levels in the split choice:
- *level 1*: for each tested variable, we just pick a valid random threshold 
(one that cannot lead to an empty partition) instead of searching for THE best 
one, which has a positive impact on the algorithm's complexity. I am not yet 
familiar with all the ways this could be done with Spark, but I suppose there 
are several; we will have to evaluate the different strategies and see what 
works best in the different scenarios. I am not sure yet how I plan to do this 
and will need to dig further into the current MLlib code. Suggestions are 
welcome.
- *level 2*: among the thresholds picked at random, keep the one that maximizes 
a given score. I guess that can be done as in the current Random Forest (RF). 
For the moment I propose to rely on what has been done for the RF; we will see 
where that leads us. A small sketch of both levels follows this list.
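
To make the two levels concrete, here is a minimal plain-Scala sketch of the 
idea. It is not MLlib code: the {{FeatureSplit}} case class, the Gini score, 
and the local in-memory data representation are only assumptions for the 
example, and a real distributed implementation would look quite different.

{code:scala}
import scala.util.Random

object ExtraTreesSplitSketch {

  // One candidate split: the feature index, the randomly drawn threshold,
  // and the impurity decrease it achieves on the node's data.
  final case class FeatureSplit(featureIndex: Int, threshold: Double, score: Double)

  // Gini impurity of a set of binary labels (0.0 or 1.0).
  def gini(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val p = labels.count(_ == 1.0).toDouble / labels.size
      2.0 * p * (1.0 - p)
    }

  // Impurity decrease obtained by splitting `data` on feature `f` at `threshold`.
  def splitScore(data: Array[(Array[Double], Double)], f: Int, threshold: Double): Double = {
    val (left, right) = data.partition { case (x, _) => x(f) <= threshold }
    val n = data.length.toDouble
    gini(data.map(_._2).toSeq) -
      (left.length / n) * gini(left.map(_._2).toSeq) -
      (right.length / n) * gini(right.map(_._2).toSeq)
  }

  // Level 1: draw ONE random threshold per candidate feature, uniformly between
  // the observed min and max so that neither side of the split can be empty.
  // Level 2: among those random candidates, keep the one with the highest score.
  def chooseSplit(
      data: Array[(Array[Double], Double)],
      candidateFeatures: Seq[Int],
      rng: Random): Option[FeatureSplit] = {
    val candidates = candidateFeatures.flatMap { f =>
      val values = data.map(_._1(f))
      val (lo, hi) = (values.min, values.max)
      if (lo == hi) None // constant feature at this node: no valid threshold
      else {
        val t = lo + rng.nextDouble() * (hi - lo) // in [lo, hi), both children non-empty
        Some(FeatureSplit(f, t, splitScore(data, f, t)))
      }
    }
    if (candidates.isEmpty) None else Some(candidates.maxBy(_.score))
  }
}
{code}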

Here is a link to the original [Extremely randomized trees 
article|http://www.montefiore.ulg.ac.be/~ernst/uploads/news/id63/extremely-randomized-trees.pdf].
 That said, I see ensembles of decision trees as a more general framework with 
many little boxes that can be fine-tuned. See [this 
flowchart|https://www.dropbox.com/s/ignnt0wqxw4sg9c/flowchart-tree.pdf?dl=0], 
where each box corresponds to a step that can be customized/particularized to 
produce a single decision tree, Random Forests, Extra-Trees, or whatever suits 
your needs.


> Add Extra-Trees algorithm to MLlib
> ----------------------------------
>
>                 Key: SPARK-4210
>                 URL: https://issues.apache.org/jira/browse/SPARK-4210
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Vincent Botta
>
> This task will add Extra-Trees support to Spark MLlib. The implementation 
> could be inspired by the current Random Forest algorithm. Extra-Trees is 
> expected to be particularly well suited here because, unlike the original 
> Random Forest approach, it does not require sorting the attributes, while 
> offering similar and/or better predictive power. 
> The task involves:
> - Code implementation
> - Unit tests
> - Functional tests
> - Performance tests
> - Documentation


