[ 
https://issues.apache.org/jira/browse/SPARK-8971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261655#comment-15261655
 ] 

Nick Pentreath commented on SPARK-8971:
---------------------------------------

I think it would be good to have something implemented, so if that means doing 
it with RDD initially that's fine by me.

For your questions:
1 - I'd still like to see if this same approach could be used for 
recommendation/ranking style settings, so allowing the user to specify the 
column would be good.
2 / 3 - I agree it makes most sense to respect trainRatio. The idea is to 
maintain the class distribution, rather than effectively allowing a different 
trainRatio in each stratum. So I vote for exact sampling, as you suggest.
4 - for now no, but I would imagine the main use case for this is for class 
labels, in which case we can use column metadata (now or in the future) to get 
the labels?
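
To make the exact-sampling point in 2 / 3 concrete, here is a rough sketch in 
plain Python (not Spark code). The helper name `stratified_split` and its 
`stratified_col` / `train_ratio` parameters are assumptions for illustration 
only, not an existing or proposed API. It applies the same ratio inside every 
stratum by taking a fixed count per stratum instead of flipping a coin per row:

```python
import random
from collections import defaultdict


def stratified_split(rows, stratified_col, train_ratio, seed=42):
    """Split rows (dicts) into (train, validation), stratified by one column.

    Hypothetical helper for illustration only; this is not a Spark API.
    """
    rng = random.Random(seed)

    # Group rows by the value of the user-specified stratification column.
    strata = defaultdict(list)
    for row in rows:
        strata[row[stratified_col]].append(row)

    train, validation = [], []
    # Iterate strata in a deterministic order so the seed fully fixes the split.
    for key in sorted(strata, key=repr):
        group = strata[key]
        rng.shuffle(group)
        # Exact sampling: a fixed number of rows per stratum, so every
        # stratum gets (up to rounding) the same train_ratio.
        n_train = round(len(group) * train_ratio)
        train.extend(group[:n_train])
        validation.extend(group[n_train:])
    return train, validation


# Imbalanced toy data: 10 positives out of 1000 rows.
rows = [{"id": i, "label": 1 if i < 10 else 0} for i in range(1000)]
train, validation = stratified_split(rows, "label", train_ratio=0.8)
```

With this toy data both splits keep the 1% positive rate, which per-row random 
sampling does not guarantee. Note that rounding can still leave one side of a 
very small stratum empty, so a real implementation would need a policy for 
that case.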

As for API design, I'm not sure what you mean by "output column" in your first 
example?

I would go for the `stratifiedCol` approach personally.

> Support balanced class labels when splitting train/cross validation sets
> ------------------------------------------------------------------------
>
>                 Key: SPARK-8971
>                 URL: https://issues.apache.org/jira/browse/SPARK-8971
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Feynman Liang
>            Assignee: Seth Hendrickson
>
> {{CrossValidator}} and the proposed {{TrainValidatorSplit}} (SPARK-8484) are 
> Spark classes which partition data into training and evaluation sets for 
> performing hyperparameter selection via cross validation.
> Both methods currently perform the split by randomly sampling the datasets. 
> However, when class probabilities are highly imbalanced (e.g. detection of 
> extremely low-frequency events), random sampling may result in cross 
> validation sets not representative of actual out-of-training performance 
> (e.g. no positive training examples could be included).
> Mainstream R packages like 
> [caret|http://topepo.github.io/caret/splitting.html] already support 
> splitting the data based upon the class labels.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
