[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15430381#comment-15430381 ]
Vincent edited comment on SPARK-17055 at 8/22/16 9:14 AM:
----------------------------------------------------------

Well, a better model will show better CV performance on validation data with unseen labels, so the model that is finally selected will be relatively better at predicting samples with unseen categories/labels in real-world use.

was (Author: vincexie):
Well, a better model will show better CV performance on data with unseen labels, so the model that is finally selected will be relatively better at predicting samples with unseen categories/labels in real-world use.

> add labelKFold to CrossValidator
> --------------------------------
>
>                 Key: SPARK-17055
>                 URL: https://issues.apache.org/jira/browse/SPARK-17055
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Vincent
>            Priority: Minor
>
> The current CrossValidator only supports k-fold, which randomly divides all the
> samples into k groups. But when data is gathered from different subjects and we
> want to avoid over-fitting, we want to hold out the samples with certain labels
> from the training data and put them into the validation fold, i.e. we want to
> ensure that the same label never appears in both the testing and training sets.
> Mainstream packages like Sklearn already support this cross-validation
> method.
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)
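
For illustration only, the splitting rule described in the issue (all samples that share a label go to the same side of every split) can be sketched in plain Python. This is not Spark's CrossValidator code and not scikit-learn's implementation: the function name `label_k_fold`, its parameters, and the greedy fold-balancing strategy are all assumptions made for this sketch, and "label" is taken to mean the subject or group a sample came from.

```python
from collections import defaultdict


def label_k_fold(labels, k):
    """Yield (train_indices, validation_indices) pairs for k folds.

    Hypothetical helper for illustration; not a Spark or scikit-learn API.
    Samples sharing a label always land in the same fold, so no label
    appears in both the training and the validation side of a split.
    """
    # Group sample indices by their label (e.g. the subject they came from).
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    if k < 2 or k > len(by_label):
        raise ValueError("k must be between 2 and the number of distinct labels")

    # Assumed balancing strategy: place the largest label groups first,
    # each into whichever fold currently holds the fewest samples.
    folds = [[] for _ in range(k)]
    ordered = sorted(by_label, key=lambda lab: (-len(by_label[lab]), str(lab)))
    for label in ordered:
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(by_label[label])

    # Each fold serves once as the validation set; the rest form the training set.
    for f in range(k):
        validation = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, validation


# Usage: eight samples gathered from four subjects, split into two folds.
subjects = ["a", "a", "b", "b", "b", "c", "d", "d"]
for train, validation in label_k_fold(subjects, k=2):
    print("train:", train, "validation:", validation)
```

A plain random k-fold would instead shuffle individual sample indices, so samples from one subject could end up on both sides of a split, which is the leakage the issue wants to avoid.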