[
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15870490#comment-15870490
]
Camilo Lamus commented on SPARK-9478:
-------------------------------------
[~sethah] It is very exciting that you guys are working on adding a weighted
version of the random forest. I am really looking forward to using it in spark ml
RF and other algorithms. As [~josephkb] mentioned, adding weights to data points
(samples/instances) has a myriad of applications in data analysis.
I have a question about the way you are planning to use the weights. Are you
thinking of using the weights both in the bootstrap sampling step and in
growing the trees? Using them in both steps might make the weights overly
“important”.
If you are using them in the tree-growing process, are you doing something like
what is shown in slide 4 here
(http://www.stat.cmu.edu/~ryantibs/datamining/lectures/25-boost.pdf)?
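The idea in those slides is that each instance contributes its weight, rather than a unit count, to the class proportions used in the split criterion. A minimal sketch of that weighted-impurity idea (a hypothetical helper, not Spark's actual implementation) could look like:

```python
def weighted_gini(labels, weights):
    """Gini impurity where class proportions are computed from instance
    weights instead of raw counts (illustrative sketch only)."""
    total = sum(weights)
    class_weight = {}
    for y, w in zip(labels, weights):
        class_weight[y] = class_weight.get(y, 0.0) + w
    return 1.0 - sum((w / total) ** 2 for w in class_weight.values())

# Unit weights reduce to the usual Gini: balanced binary labels give 0.5.
print(weighted_gini([0, 0, 1, 1], [1, 1, 1, 1]))  # -> 0.5
# Up-weighting the minority class shifts the effective proportions:
# here class 1 gets weight 3, making the classes effectively balanced.
print(weighted_gini([0, 0, 0, 1], [1, 1, 1, 3]))  # -> 0.5
```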
In the case where you would use the weights in constructing the bootstrap
samples, as you mention, the (marginal) distribution of the number of times
each data point is selected in a bootstrap sample is binomial. However, the
joint distribution of the counts is multinomial. Specifically, if you draw N
samples with replacement from the original N data points, selecting each with
probability p_i = 1/N, the joint distribution is Multinomial(N, p_i = 1/N,
i=1,2,…,N), and this is not the same as drawing independently N times from
Binomial(N, 1/N). For one thing, you might end up with more or fewer than N
samples. In regard to the Poisson approximation, I think this might be more
problematic, since I think it requires one of the counts to dominate (i.e.,
happen with high probability) (see here:
http://www.jstor.org/stable/3314676?seq=1#page_scan_tab_contents). This is a
theoretical issue, which might not matter in practice. But who knows, it
might. And after all, it might just be better to get the counts from
Multinomial(N, p_i = w_i / sum(w_j)).
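The difference between the joint multinomial draw and independent per-point draws can be illustrated with a small stdlib-only sketch (hypothetical helper names; uniform weights used for simplicity):

```python
import random
from collections import Counter

def multinomial_counts(weights, rng):
    """Draw N = len(weights) samples with replacement, selecting point i
    with probability w_i / sum(w); return the count vector.
    Jointly Multinomial(N, p), so the counts always sum to exactly N."""
    n = len(weights)
    draws = rng.choices(range(n), weights=weights, k=n)
    tally = Counter(draws)
    return [tally.get(i, 0) for i in range(n)]

rng = random.Random(42)
counts = multinomial_counts([1.0] * 10, rng)  # p_i = 1/N
print(sum(counts))  # always exactly N = 10

# Independent Binomial(N, 1/N) counts per point share no such constraint;
# their total can come out above or below N.
indep = [sum(rng.random() < 1 / 10 for _ in range(10)) for _ in range(10)]
print(sum(indep))
```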
Either way, if the Poisson approximation is good enough, it does make more
sense to use what you suggest at the end, which is to sample from
Poisson(lambda_i = N w_i / sum(w_j)). Sampling from Poisson(1) and then
multiplying by N w_i / sum(w_j) can worsen the Poisson approximation to the
binomial, since the variance of the multiplied version is lambda_i^2, not
lambda_i as it should be for a Poisson rv.
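The variance point can be checked with a small stdlib-only simulation (Knuth's sampler and the value lambda_i = 4 are illustrative choices, not anything from the Spark patch): both schemes have mean lambda_i, but scaling Poisson(1) inflates the variance from lambda_i to lambda_i^2.

```python
import math
import random

def poisson_sample(lam, rng):
    """Knuth's simple Poisson sampler; fine for small lambda."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
lam = 4.0       # stands in for N * w_i / sum(w_j)
n = 100_000
direct = [poisson_sample(lam, rng) for _ in range(n)]
scaled = [lam * poisson_sample(1.0, rng) for _ in range(n)]

# Both have mean close to lam = 4, but the variances differ:
print(variance(direct))  # close to lam   = 4
print(variance(scaled))  # close to lam^2 = 16
```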
> Add sample weights to Random Forest
> -----------------------------------
>
> Key: SPARK-9478
> URL: https://issues.apache.org/jira/browse/SPARK-9478
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 1.4.1
> Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class
> weights. Class weights are important when there is imbalanced training data
> or the evaluation metric of a classifier is imbalanced (e.g. true positive
> rate at some false positive threshold).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)