[ 
https://issues.apache.org/jira/browse/SPARK-9478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15345208#comment-15345208
 ] 

Seth Hendrickson commented on SPARK-9478:
-----------------------------------------

Thanks for your timely feedback! Sample weights have many broadly applicable 
use cases in machine learning algorithms. In regression, it is common to use 
sample weights to account for non-constant variance (heteroscedasticity) in 
the data generation process. Sample weights can also be used in both 
classification and regression to give more weight to recent data points, which 
may better reflect the current data generation model. Handling imbalanced 
datasets with class weights can be seen as a special case of sample weights. 
Upsampling/downsampling can cause unnecessary duplication of the input data 
and also make it more difficult to assign arbitrary weights to samples. 
Furthermore, weighted boosting algorithms such as AdaBoost and LogitBoost 
cannot be implemented without sample weights.
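
To make the duplication point concrete, here is a minimal sketch in plain 
Python (not Spark code; the function name weighted_gini is hypothetical and 
only for illustration). It assumes a Gini impurity in which each sample 
contributes its weight rather than a count of 1, and compares integer weights 
against physically duplicated rows. It also evaluates a fractional weight, 
which duplication cannot express.

```python
from collections import defaultdict


def weighted_gini(labels, weights):
    """Gini impurity where each sample contributes its weight instead of 1."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((t / total) ** 2 for t in totals.values())


# Imbalanced data: three samples of class 0, one sample of class 1.
labels = [0, 0, 0, 1]

# Option A: keep four rows and give the minority sample a weight of 3.
gini_weighted = weighted_gini(labels, [1, 1, 1, 3])

# Option B: upsample by copying the minority row until it appears three times.
upsampled = [0, 0, 0, 1, 1, 1]
gini_upsampled = weighted_gini(upsampled, [1] * len(upsampled))

# Option C: a fractional weight, which row duplication cannot represent.
gini_fractional = weighted_gini(labels, [1, 1, 1, 2.5])
```

In this sketch options A and B yield the same impurity, but A stores four rows 
instead of six.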

Scikit-learn does indeed support sample weights, as you can see 
[here|http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.fit],
 and in fact the algorithms simply convert class weights into sample weights. 
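
As a rough sketch of that conversion (plain Python, not scikit-learn's or 
Spark's actual code; both function names are hypothetical), each sample simply 
inherits the weight of its class. The example assumes the "balanced" heuristic 
n_samples / (n_classes * class_count) for choosing the class weights; any 
other class-to-weight mapping would work the same way.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Assumed heuristic: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def to_sample_weights(labels, class_weights):
    """Expand a per-class weight mapping into one weight per sample."""
    return [class_weights[label] for label in labels]


labels = [0, 0, 0, 0, 1]
class_weights = balanced_class_weights(labels)
sample_weights = to_sample_weights(labels, class_weights)

# Total weight carried by each class after the conversion.
weight_per_class = Counter()
for label, weight in zip(labels, sample_weights):
    weight_per_class[label] += weight
```

With this heuristic every class ends up carrying the same total weight, which 
is the effect class weighting aims for on imbalanced data.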

With this in mind, I think we should support sample weights. We might also want 
to add an API mechanism for class weights so that users don't have to convert 
class weights to sample weights manually; we can open a new JIRA to discuss it. 
[There is an ongoing effort in MLlib to support instance 
weighting|https://issues.apache.org/jira/browse/SPARK-9610] in the various 
algorithms, so I think it is beneficial to add it to trees and forests.

> Add class weights to Random Forest
> ----------------------------------
>
>                 Key: SPARK-9478
>                 URL: https://issues.apache.org/jira/browse/SPARK-9478
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 1.4.1
>            Reporter: Patrick Crenshaw
>
> Currently, this implementation of random forest does not support class 
> weights. Class weights are important when there is imbalanced training data 
> or the evaluation metric of a classifier is imbalanced (e.g. true positive 
> rate at some false positive threshold). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

