[ https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951685#comment-15951685 ]

Joseph K. Bradley commented on SPARK-9478:
------------------------------------------

[~clamus] The current vote is to *not use* weights during sampling and then to 
*use* weights when growing the trees.  That will simplify the sampling process 
so we hopefully won't have to deal with the complexity you're mentioning.  Note 
that we'll have to weight the trees in the forest to make this approach work.
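The tree-weighting mentioned above might look like the following sketch (the function name and the weighting scheme are illustrative assumptions, not Spark's API): each tree carries a weight, e.g. the total sample weight its bootstrap sample happened to draw, and the forest averages per-tree class probabilities using those weights.

```python
# Hypothetical sketch of weighted forest aggregation (not Spark's
# implementation): average per-tree class-probability vectors, with each
# tree weighted by, say, the total sample weight of its bootstrap sample.
def forest_predict_proba(tree_probas, tree_weights):
    """Weighted average of per-tree class-probability vectors."""
    total = sum(tree_weights)
    n_classes = len(tree_probas[0])
    return [
        sum(w * p[c] for w, p in zip(tree_weights, tree_probas)) / total
        for c in range(n_classes)
    ]

# Two trees: one dominated by the heavy row (predicts class 0 hard),
# one grown from only light rows (predicts class 1 hard).
probas = [[1.0, 0.0], [0.0, 1.0]]
weights = [1003.0, 3.0]  # total bootstrap sample weight per tree
print(forest_predict_proba(probas, weights))  # heavy tree dominates, ~[0.997, 0.003]
```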

I'm also guessing that it will give better calibrated probability estimates in 
the final forest, though this is based on intuition rather than analysis.  
E.g., given the 4-instance dataset in [~sethah]'s example above, I'd imagine:
* If we use weights during sampling but not when growing trees...
** Say we want 10 trees.  We pick 10 sets of 4 rows.  The probability of always 
picking the weight-1000 row is ~0.89.
** So our forest will probably give us 0/1 (poorly calibrated) probabilities.
* If we do not use weights during sampling but use them when growing trees... 
(current proposal)
** Say we want 10 trees.
** The probability of always picking the weight-1 rows is ~1e-5.  This means 
we'll have at least one tree with the weight-1000 row, so it will dominate our 
predictions (giving good accuracy).
** The probability of having at least 1 tree with only weight-1 rows is ~0.98 
(equivalently, the chance that no such tree exists is only ~0.02).  This means 
it's pretty likely we'll have some tree predicting label1, so we'll keep our 
probability predictions away from 0 and 1.
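The three probabilities above can be checked with a quick back-of-the-envelope sketch (plain Python; the setup — one weight-1000 row, three weight-1 rows, 10 trees of 4 bootstrap draws each — is taken from the example):

```python
# Sanity check of the probabilities in the comment above.
# Assumed setup (from the example): 4 rows, one with weight 1000 and
# three with weight 1; 10 trees, each bootstrapping 4 rows with replacement.
n_trees = 10
rows_per_tree = 4
n_draws = n_trees * rows_per_tree  # 40 independent draws in total

# Weighted sampling: chance that every single draw is the weight-1000 row.
p_heavy = 1000 / 1003
p_all_heavy = p_heavy ** n_draws            # ~0.89

# Unweighted sampling: chance that every single draw is a weight-1 row.
p_light = 3 / 4
p_all_light = p_light ** n_draws            # ~1e-5

# Unweighted sampling: chance at least one tree sees only weight-1 rows.
p_tree_all_light = p_light ** rows_per_tree                    # ~0.32 per tree
p_some_tree_all_light = 1 - (1 - p_tree_all_light) ** n_trees  # ~0.98
```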

This is really hand-wavy, but it does alleviate my fears of having extreme log 
losses.  On the other hand, maybe that could be handled by adding smoothing to 
the predictions...

> Add sample weights to Random Forest
> -----------------------------------
>
>                 Key: SPARK-9478
>                 URL: https://issues.apache.org/jira/browse/SPARK-9478
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.4.1
>            Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support sample 
> (instance) weights. Weights are important when there is imbalanced training 
> data or the evaluation metric of a classifier is imbalanced (e.g. true 
> positive rate at some false positive threshold).  Sample weights generalize 
> class weights, so this could be used to add class weights later on.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
