[
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692312#comment-14692312
]
Seth Hendrickson commented on SPARK-8971:
-----------------------------------------
I went ahead and created the PR for this issue, even though some of the design
choices still merit discussion. This way, others can at least see the code and
make comments. I did not mark it as WIP, but I can do that if needed.
> Support balanced class labels when splitting train/cross validation sets
> ------------------------------------------------------------------------
>
> Key: SPARK-8971
> URL: https://issues.apache.org/jira/browse/SPARK-8971
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Feynman Liang
> Assignee: Seth Hendrickson
>
> {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are
> Spark classes which partition data into training and evaluation sets for
> performing hyperparameter selection via cross validation.
> Both methods currently perform the split by randomly sampling the datasets.
> However, when class probabilities are highly imbalanced (e.g. detection of
> extremely low-frequency events), random sampling may result in cross
> validation sets not representative of actual out-of-training performance
> (e.g. no positive training examples could be included).
> Mainstream R packages like
> [caret|http://topepo.github.io/caret/splitting.html] already support
> splitting the data based upon the class labels.
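
For illustration only, the label-aware split requested in the description could look roughly like the sketch below. It is plain Python rather than Spark code, it is not the approach taken in the PR, and the function name stratified_k_folds is hypothetical. It assumes the labels fit in memory and are sortable; rows are grouped by class and each class is dealt round-robin across the folds, so a rare class is spread across the folds rather than left to chance.

```python
import random
from collections import defaultdict


def stratified_k_folds(labels, k, seed=0):
    """Assign each row index to one of k folds so that every fold keeps
    roughly the same class proportions as the full dataset.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):  # assumes labels are sortable
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal this class round-robin, continuing from where the previous
        # class stopped so that overall fold sizes stay balanced.
        for i, idx in enumerate(indices):
            folds[(offset + i) % k].append(idx)
        offset += len(indices)
    return folds


# Highly imbalanced toy data: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5
folds = stratified_k_folds(labels, k=5)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```

With purely random assignment, some folds in this toy data could easily end up with no positive rows at all; dealing each class separately gives every fold one positive here.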