[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15970541#comment-15970541
 ] 

Joseph K. Bradley edited comment on SPARK-9478 at 4/16/17 10:36 PM:
--------------------------------------------------------------------

By the way, one design choice which has come up is whether the current 
minInstancesPerNode Param should take instance weights into account.

Pros of using instance weights with minInstancesPerNode:
* This maintains the semantics of instance weights.  The algorithm should treat 
these 2 datasets identically: (a) {{[(weight 1.0, example A), (weight 1.0, 
example B), (weight 1.0, example B)]}} vs. (b) {{[(weight 1.0, example A), 
(weight 2.0, example B)]}}.
* By maintaining these semantics, we avoid confusion about how RandomForest and 
GBT should treat the instance weights introduced by subsampling.  (Currently, 
these use instance weights with minInstancesPerNode, so this choice is 
consistent with our previous choices.)

Pros of not using instance weights with minInstancesPerNode:
* AFAIK, scikit-learn does not use instance weights with {{min_samples_leaf}}.

I vote for the first choice (taking weights into account).
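
To make the equivalence in the first bullet concrete, here is a minimal Python sketch (not Spark code; the {{weighted_counts}} helper is hypothetical) showing that summing instance weights makes the two datasets indistinguishable, which is the property a weight-aware minInstancesPerNode would preserve:

```python
from collections import Counter

def weighted_counts(dataset):
    """Sum instance weights per example; a weight-aware node count
    would be the sum of weights of instances reaching the node."""
    counts = Counter()
    for weight, example in dataset:
        counts[example] += weight
    return counts

# Dataset (a): example B appears twice, each with weight 1.0.
a = [(1.0, "A"), (1.0, "B"), (1.0, "B")]
# Dataset (b): example B appears once with weight 2.0.
b = [(1.0, "A"), (2.0, "B")]

print(weighted_counts(a) == weighted_counts(b))  # True
```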

This does introduce one small complication:
* If instance weights can be < 1.0, then the current ParamValidator constraint 
requiring minInstancesPerNode >= 1.0 is too strict.
* I propose permitting minInstancesPerNode to be set to 0, and adding a check 
that each leaf node has non-zero total weight (i.e., at least one instance 
with non-zero weight).
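
The proposed leaf check could look like the following sketch (the function name and signature are hypothetical, not the actual Spark internals):

```python
def leaf_is_valid(instance_weights, min_weight_per_node=0.0):
    """Hypothetical split-validity check: a candidate leaf must carry
    total weight >= the threshold AND contain at least one instance
    with non-zero weight, so minInstancesPerNode = 0 cannot produce
    effectively empty leaves."""
    total = sum(instance_weights)
    has_nonzero = any(w > 0.0 for w in instance_weights)
    return total >= min_weight_per_node and has_nonzero

# A leaf of small-weight instances is fine when the threshold is 0 ...
print(leaf_is_valid([0.5, 0.25]))            # True
# ... but a leaf whose instances all have zero weight is rejected.
print(leaf_is_valid([0.0, 0.0]))             # False
```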



> Add sample weights to Random Forest
> -----------------------------------
>
>                 Key: SPARK-9478
>                 URL: https://issues.apache.org/jira/browse/SPARK-9478
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.1
>            Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
