[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15432463#comment-15432463
 ] 

Vincent edited comment on SPARK-17055 at 8/23/16 9:18 AM:
----------------------------------------------------------

Sorry for the late reply. Yes, I just learned that they intend to rename it to 
GroupKFold, or something like that. Personally I think it's fine to keep the 
name as it is, though it could still be confusing to someone using it for the 
first time without understanding the idea behind it.

As for applications, take face recognition as an example. The features are, 
say, eyes, nose, lips, etc., and the training data is obtained from a number of 
different people. This method can create subject-independent folds, so we can 
train the model on features from one group of people and validate on the data 
from the remaining people. This should improve the model's ability to 
generalize to unseen subjects and help avoid over-fitting.
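
To make the subject-independent folds concrete, here is a minimal sketch in 
plain Python. It is not the proposed Spark implementation and not sklearn's 
code; the function name group_k_fold, the greedy balancing strategy, and the 
sample subject names are all assumptions made for illustration.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Yield (train_idx, val_idx) pairs where no group spans both sides.

    groups: one group label per sample (e.g. the person an image came from).
    k: number of folds; must not exceed the number of distinct groups.
    """
    # Collect the sample indices that belong to each group.
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    if k > len(members):
        raise ValueError("k cannot exceed the number of distinct groups")

    # Greedy balancing (an assumption of this sketch): place the largest
    # groups first, each into the currently smallest fold.
    folds = [[] for _ in range(k)]
    for g in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[g])

    # Each fold serves once as the validation set; the rest is training data.
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for other in range(k) if other != f
                       for i in folds[other])
        yield train, val


# Hypothetical data: eight samples taken from four people.
subjects = ["alice", "alice", "bob", "bob", "bob", "carol", "dave", "dave"]
splits = list(group_k_fold(subjects, k=2))
```

Every sample lands in exactly one validation fold, and all samples of a given 
person stay together, so the model is always validated on people it has not 
seen during training.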

It's a useful method, already available in sklearn, and caret is currently in 
the process of adding this feature.


> add labelKFold to CrossValidator
> --------------------------------
>
>                 Key: SPARK-17055
>                 URL: https://issues.apache.org/jira/browse/SPARK-17055
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Vincent
>            Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all 
> the samples into k groups. But when data is gathered from different subjects 
> and we want to avoid over-fitting, we want to hold out samples with certain 
> labels from the training data and put them into the validation fold, i.e. we 
> want to ensure that the same label does not appear in both the testing and 
> training sets.
> Mainstream packages like scikit-learn already support such a 
> cross-validation method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)
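
The problem described in the issue can be illustrated with a short Python 
sketch. It is only an illustration under stated assumptions: plain_k_fold 
assigns samples to folds round-robin rather than randomly (so the result is 
reproducible), and the helper names and the labels s1, s2, s3 are hypothetical.

```python
def plain_k_fold(n_samples, k):
    """Unshuffled k-fold over sample indices, ignoring labels entirely."""
    for f in range(k):
        test = [i for i in range(n_samples) if i % k == f]
        train = [i for i in range(n_samples) if i % k != f]
        yield train, test


def leaked_labels(train, test, labels):
    """Return the labels that occur in both the training and the test set."""
    return {labels[i] for i in train} & {labels[i] for i in test}


# Hypothetical data: three subjects with three samples each.
labels = ["s1"] * 3 + ["s2"] * 3 + ["s3"] * 3
leaks = [leaked_labels(train, test, labels)
         for train, test in plain_k_fold(len(labels), 3)]
```

In this setup every fold's test set contains samples from all three subjects, 
and so does the matching training set, so leaks is non-empty for each fold. A 
label-aware split is meant to make that intersection empty.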



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

