If all users are equally important, then the average score should be representative; you shouldn't worry about missing one or two. For stratified sampling, Wikipedia has a paragraph about its disadvantages:
http://en.wikipedia.org/wiki/Stratified_sampling#Disadvantages

It depends on the size of the population. For example, the US Census survey
sampling design uses many (>> 100) strata:
https://www.census.gov/acs/www/Downloads/survey_methodology/Chapter_4_RevisedDec2010.pdf

If you indeed want to do the split per user, you should use groupByKey and
apply reservoir sampling to the ratings from each user. -Xiangrui

On Tue, Nov 18, 2014 at 11:12 AM, Debasish Das <debasish.da...@gmail.com> wrote:
> For the MLlib PR, I will add this logic: "If a user is missing in training
> and appears in test, we can simply ignore it."
>
> I was struggling since users appear in test on whom the model was not
> trained...
>
> For our internal tests we want to cross-validate on every product/user, as
> all of them are equally important, so I have to come up with a sampling
> strategy for every user/product...
>
> In general, for stratified sampling, what's the bound on the number of
> strata? Something like the number of classes in a labeled dataset, ~100?
>
> On Tue, Nov 18, 2014 at 10:31 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> `sampleByKey` with the same fraction per stratum acts the same as
>> `sample`. The operation you want is perhaps `sampleByKeyExact` here.
>> However, when you use stratified sampling, there should not be many
>> strata. My question is why we need to split on each user's ratings. If
>> a user is missing in training and appears in test, we can simply
>> ignore it. -Xiangrui
>>
>> On Tue, Nov 18, 2014 at 6:59 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>> > Sean,
>> >
>> > I thought sampleByKey (stratified sampling) in 1.1 was designed to
>> > solve the problem that randomSplit can't sample by key...
>> >
>> > Xiangrui,
>> >
>> > What's the expected behavior of sampleByKey? In the dataset sampled
>> > using sampleByKey, the keys should match the input dataset's keys,
>> > right? If it is a bug, I can open up a JIRA and look into it...
>> >
>> > Thanks.
>> > Deb
>> >
>> > On Tue, Nov 18, 2014 at 1:34 AM, Sean Owen <so...@cloudera.com> wrote:
>> >
>> >> I use randomSplit to make a train/CV/test set in one go. It definitely
>> >> produces disjoint data sets and is efficient. The problem is you can't
>> >> do it by key.
>> >>
>> >> I am not sure why your subtract does not work. I suspect it is because
>> >> the values do not partition the same way, or they don't evaluate
>> >> equality in the expected way, but I don't see any reason why. Tuples
>> >> work as expected here.
>> >>
>> >> On Tue, Nov 18, 2014 at 4:32 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I have an RDD whose key is a userId and value is (movieId, rating)...
>> >> >
>> >> > I want to sample 80% of the (movieId, rating) pairs that each userId
>> >> > has seen for train; the rest is for test...
>> >> >
>> >> > val indexedRating = sc.textFile(...).map{x => Rating(x(0), x(1), x(2))}
>> >> >
>> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>> >> >
>> >> > val keyedTraining = keyedRatings.sample(true, 0.8, 1L)
>> >> >
>> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >> >
>> >> > val blocks = sc.maxParallelism
>> >> >
>> >> > println(s"Rating keys ${keyedRatings.groupByKey(blocks).count()}")
>> >> >
>> >> > println(s"Training keys ${keyedTraining.groupByKey(blocks).count()}")
>> >> >
>> >> > println(s"Test keys ${keyedTest.groupByKey(blocks).count()}")
>> >> >
>> >> > My expectation was that the printlns would produce the exact same
>> >> > number of keys for keyedRatings, keyedTraining and keyedTest, but
>> >> > this is not the case...
>> >> >
>> >> > On MovieLens, for example, I am noticing the following:
>> >> >
>> >> > Rating keys 3706
>> >> > Training keys 3676
>> >> > Test keys 3470
>> >> >
>> >> > I also tried sampleByKey as follows:
>> >> >
>> >> > val keyedRatings = indexedRating.map{x => (x.product, (x.user, x.rating))}
>> >> >
>> >> > val fractions = keyedRatings.map{x => (x._1, 0.8)}.collect.toMap
>> >> >
>> >> > val keyedTraining = keyedRatings.sampleByKey(false, fractions, 1L)
>> >> >
>> >> > val keyedTest = keyedRatings.subtract(keyedTraining)
>> >> >
>> >> > Still I get the results as:
>> >> >
>> >> > Rating keys 3706
>> >> > Training keys 3682
>> >> > Test keys 3459
>> >> >
>> >> > Any idea what is wrong here?
>> >> >
>> >> > Are my assumptions about the behavior of sample/sampleByKey on a
>> >> > key-value RDD correct? If this is a bug, I can dig deeper...
>> >> >
>> >> > Thanks.
>> >> >
>> >> > Deb
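[Editorial note on the key counts above: with Bernoulli sampling such as `sampleByKey(false, fractions)`, each element is kept independently with probability p, so a key whose stratum has n elements is entirely absent from the sample with probability (1 - p)^n. Small strata therefore drop out often, which would explain training/test key counts falling below the full 3706. A minimal plain-Scala sketch of that probability, with illustrative numbers not taken from the thread:]

```scala
object DropProbability {
  // Probability that a key with n elements is entirely missing from a
  // Bernoulli sample that keeps each element with probability p.
  def pAbsent(n: Int, p: Double): Double = math.pow(1.0 - p, n)

  def main(args: Array[String]): Unit = {
    val p = 0.8
    // A user with only 2 ratings vanishes from the 80% sample 4% of
    // the time; with thousands of such users, dozens of keys disappear.
    for (n <- 1 to 5)
      println(f"n=$n%d ratings -> P(key absent) = ${pAbsent(n, p)}%.4f")
  }
}
```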
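[Editorial note: Xiangrui's suggestion at the top of the thread, groupByKey followed by reservoir sampling of each user's ratings, guarantees every user appears in both splits. A minimal sketch of the per-user step on plain Scala collections; in a Spark job this `split` function would be applied to each value produced by `groupByKey`, and the names here are illustrative, not from the thread:]

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

object PerUserSplit {
  // Classic reservoir sampling: draw k items uniformly without
  // replacement from a sequence in one pass.
  def reservoir[T](items: Seq[T], k: Int, rng: Random): Seq[T] = {
    val res = new ArrayBuffer[T]()
    for ((item, i) <- items.zipWithIndex) {
      if (i < k) res += item
      else {
        val j = rng.nextInt(i + 1)   // uniform in [0, i]
        if (j < k) res(j) = item     // replace with probability k/(i+1)
      }
    }
    res.toSeq
  }

  // Exact 80/20 split of one user's ratings by index, so every user
  // with at least two ratings lands in both train and test.
  def split[T](xs: IndexedSeq[T], frac: Double, rng: Random)
      : (IndexedSeq[T], IndexedSeq[T]) = {
    val k = math.max(1, (xs.size * frac).toInt)
    val trainIdx = reservoir(xs.indices, k, rng).toSet
    val (tr, te) = xs.indices.partition(trainIdx.contains)
    (tr.map(xs), te.map(xs))
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(1L)
    val ratings = (1 to 10).map(m => (m, m.toDouble)) // (movieId, rating)
    val (train, test) = split(ratings, 0.8, rng)
    println(s"train=${train.size} test=${test.size}") // train=8 test=2
  }
}
```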