[
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658376#comment-14658376
]
Seth Hendrickson commented on SPARK-8971:
-----------------------------------------
[~mengxr] You mentioned that the solution should call {{sampleByKeyExact}},
a function that takes a stratified subsample of m < N elements from a
dataset. One problem with things like train/test splitting and k-fold
creation (which are fundamentally the same as far as sampling goes) is that
we actually need to take random "splits" of the dataset. That is, we need
not only the subsample but also its complement. For k-fold sampling, we need
to split the dataset into k unique, non-overlapping subsamples, which isn't
possible with {{sampleByKeyExact}} in its current state.
I have a pretty coarse prototype which essentially uses the [efficient,
parallel sampling routine|http://jmlr.org/proceedings/papers/v28/meng13a.html]
to find the exact k-1 thresholds needed to split the dataset into k
subsamples. I had to modify the sampling function in
{{org.apache.spark.util.random.StratifiedSamplingUtils}} to compare the
random keys to a range (e.g. x > lb && x <= ub) rather than to a single
number (x < threshold), which only allows for a bisection of the data. Once
you know the exact k-1 thresholds that produce even splits for each stratum,
and you have a sampling function that can compare the random key to a range,
you have what you need for stratified k-fold and train/test splits. Is there
a way to implement this without touching the {{org.apache.spark.util.random}}
package that I'm missing?
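To make the range-comparison idea concrete, here is a rough sketch (not the
prototype itself): {{KFoldSketch}} and {{stratifiedKFold}} are placeholder
names, and the naive i/k bounds stand in for the exact per-stratum
thresholds that the parallel sampling routine would compute.
{code:scala}
import java.util.Random
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

object KFoldSketch {
  // Naive sketch: each record gets a reproducible uniform key in [0, 1),
  // and fold i keeps the records whose key falls in (i/k, (i+1)/k].
  // In the exact version, these naive i/k bounds would be replaced by
  // per-stratum thresholds so that every stratum splits evenly across
  // the k folds.
  def stratifiedKFold[T: ClassTag](rdd: RDD[T], k: Int, seed: Long): Seq[RDD[T]] = {
    // Seed per partition so the random keys are deterministic across
    // recomputations of the lazy RDD lineage.
    val keyed = rdd.mapPartitionsWithIndex { (idx, iter) =>
      val rng = new Random(seed + idx)
      iter.map(t => (rng.nextDouble(), t))
    }
    (0 until k).map { i =>
      val lb = i.toDouble / k
      val ub = (i + 1).toDouble / k
      // Range comparison (x > lb && x <= ub) instead of a single
      // threshold (x < t): this is the change described above.
      keyed.filter { case (x, _) => x > lb && x <= ub }.map(_._2)
    }
  }
}
{code}
Because each fold filters a disjoint range of the same deterministic keys,
the k subsamples are non-overlapping and their union covers the dataset.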
> Support balanced class labels when splitting train/cross validation sets
> ------------------------------------------------------------------------
>
> Key: SPARK-8971
> URL: https://issues.apache.org/jira/browse/SPARK-8971
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Feynman Liang
> Assignee: Seth Hendrickson
>
> {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are
> Spark classes which partition data into training and evaluation sets for
> performing hyperparameter selection via cross validation.
> Both methods currently perform the split by randomly sampling the datasets.
> However, when class probabilities are highly imbalanced (e.g. detection of
> extremely low-frequency events), random sampling may result in cross
> validation sets not representative of actual out-of-training performance
> (e.g. no positive training examples could be included).
> Mainstream R packages like
> [caret|http://topepo.github.io/caret/splitting.html] already support
> splitting the data based upon the class labels.