If all users are equally important, then the average score should be representative; you shouldn't worry about missing one or two. For stratified sampling, Wikipedia has a paragraph about its disadvantages:
http://en.wikipedia.org/wiki/Stratified_sampling#Disadvantages

It depends on the size of the population. For example, the US Census survey
sampling design uses many (>> 100) strata:
https://www.census.gov/acs/www/Downloads/survey_methodology/Chapter_4_RevisedDec2010.pdf

If you indeed want to do the split per user, you should use groupByKey and
apply reservoir sampling to the ratings from each user. -Xiangrui

On Tue, Nov 18, 2014 at 11:12 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> For the MLlib PR, I will add this logic: "If a user is missing in training
> and appears in test, we can simply ignore it."
>
> I was struggling since users appear in test on whom the model was not
> trained...
>
> For our internal tests we want to cross-validate on every product/user, as
> all of them are equally important, so I have to come up with a sampling
> strategy for every user/product...
>
> In general, for stratified sampling, what's the bound on the number of
> strata? Something like the number of classes in a labeled dataset, ~100?
>
> On Tue, Nov 18, 2014 at 10:31 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> `sampleByKey` with the same fraction per stratum acts the same as
>> `sample`. The operation you want is perhaps `sampleByKeyExact` here.
>> However, when you use stratified sampling, there should not be many
>> strata. My question is why we need to split on each user's ratings. If
>> a user is missing in training and appears in test, we can simply
>> ignore it. -Xiangrui
>>
>> On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>> > Sean,
>> >
>> > I thought sampleByKey (stratified sampling) in 1.1 was designed to
>> > solve the problem that randomSplit can't sample by key...
>> >
>> > Xiangrui,
>> >
>> > What's the expected behavior of sampleByKey? In the dataset sampled
>> > using sampleByKey, the keys should match the input dataset's keys,
>> > right? If it is a bug, I can open up a JIRA and look into it...
>> >
>> > Thanks.
>> > Deb
>> >
>> > On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:
>> >
>> >> I use randomSplit to make a train/CV/test set in one go. It definitely
>> >> produces disjoint data sets and is efficient. The problem is you can't
>> >> do it by key.
>> >>
>> >> I am not sure why your subtract does not work. I suspect it is because
>> >> the values do not partition the same way, or they don't evaluate
>> >> equality in the expected way, but I don't see any reason why. Tuples
>> >> work as expected here.
>> >>
>> >> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I have an RDD whose key is a userId and value is (movieId, rating)...
>> >> >
>> >> > I want to sample 80% of the (movieId, rating) pairs that each userId
>> >> > has seen for train; the rest is for test...
>> >> >
>> >> > val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
>> >> >
>> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>> >> >
>> >> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
>> >> >
>> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >> >
>> >> > val blocks = sc.maxParallelism
>> >> >
>> >> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
>> >> >
>> >> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
>> >> >
>> >> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
>> >> >
>> >> > My expectation was that the printlns would produce the exact same
>> >> > number of keys for keyedRatings, keyedTraining and keyedTest, but
>> >> > this is not the case...
>> >> >
>> >> > On MovieLens, for example, I am noticing the following:
>> >> >
>> >> > Rating keys 3706
>> >> > Training keys 3676
>> >> > Test keys 3470
>> >> >
>> >> > I also tried sampleByKey as follows:
>> >> >
>> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>> >> >
>> >> > val fractions = keyedRatings.map{x => (x._1, 0.8)}.collect.toMap
>> >> >
>> >> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
>> >> >
>> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >> >
>> >> > Still I get the results as:
>> >> >
>> >> > Rating keys 3706
>> >> > Training keys 3682
>> >> > Test keys 3459
>> >> >
>> >> > Any idea what is wrong here?
>> >> >
>> >> > Are my assumptions about the behavior of sample/sampleByKey on a
>> >> > key-value RDD correct? If this is a bug, I can dig deeper...
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Deb
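[Editorial note on the key counts above: with Bernoulli sampling such as `sampleByKey(false, fractions)`, each element is kept independently with probability p, so a key whose stratum has n elements is entirely absent from the sample with probability (1 - p)^n. Small strata therefore drop out often, which would explain training/test key counts falling below the full 3706. A minimal plain-Scala sketch of that probability, with illustrative numbers not taken from the thread:]

```scala
object DropProbability {
  // Probability that a key with n elements is entirely missing from a
  // Bernoulli sample that keeps each element with probability p.
  def pAbsent(n: Int, p: Double): Double = math.pow(1.0 - p, n)

  def main(args: Array[String]): Unit = {
    val p = 0.8
    // A user with only 2 ratings vanishes from the 80% sample 4% of
    // the time; with thousands of such users, dozens of keys disappear.
    for (n <- 1 to 5)
      println(f"n=$n%d ratings -> P(key absent) = ${pAbsent(n, p)}%.4f")
  }
}
```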
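[Editorial note: Xiangrui's suggestion at the top of the thread, groupByKey followed by reservoir sampling of each user's ratings, guarantees every user appears in both splits. A minimal sketch of the per-user step on plain Scala collections; in a Spark job this `split` function would be applied to each value produced by `groupByKey`, and the names here are illustrative, not from the thread:]

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

object PerUserSplit {
  // Classic reservoir sampling: draw k items uniformly without
  // replacement from a sequence in one pass.
  def reservoir[T](items: Seq[T], k: Int, rng: Random): Seq[T] = {
    val res = new ArrayBuffer[T]()
    for ((item, i) <- items.zipWithIndex) {
      if (i < k) res += item
      else {
        val j = rng.nextInt(i + 1)   // uniform in [0, i]
        if (j < k) res(j) = item     // replace with probability k/(i+1)
      }
    }
    res.toSeq
  }

  // Exact 80/20 split of one user's ratings by index, so every user
  // with at least two ratings lands in both train and test.
  def split[T](xs: IndexedSeq[T], frac: Double, rng: Random)
      : (IndexedSeq[T], IndexedSeq[T]) = {
    val k = math.max(1, (xs.size * frac).toInt)
    val trainIdx = reservoir(xs.indices, k, rng).toSet
    val (tr, te) = xs.indices.partition(trainIdx.contains)
    (tr.map(xs), te.map(xs))
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(1L)
    val ratings = (1 to 10).map(m => (m, m.toDouble)) // (movieId, rating)
    val (train, test) = split(ratings, 0.8, rng)
    println(s"train=${train.size} test=${test.size}") // train=8 test=2
  }
}
```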