Oh yeah.. this question is not related to the cube sampling stuff we discussed; I just wanted to know the reason behind it, out of curiosity :)
Thanks
-- Prasanth

On Aug 23, 2012, at 11:20 PM, Dmitriy Ryaboy <dvrya...@gmail.com> wrote:

> I think we decided to instead stub in a special loader that reads a
> few records from each underlying split, in a single mapper (by using a
> single wrapping split), right?
>
> On Thu, Aug 23, 2012 at 7:55 PM, Prasanth J <buckeye.prasa...@gmail.com> wrote:
>> I see. Thanks, Alan, for your reply.
>> One more question that I posted earlier:
>>
>> I used RandomSampleLoader and specified a sample size of 100. The number
>> of map tasks executed is 110, so I expect the total number of samples
>> received on the reducer to be 110 * 100 = 11000, but it is always more
>> than that: the actual number of tuples received is between 14000 and
>> 15000. I am not sure if this is a bug or if I am missing something. Is
>> this expected behavior?
>>
>> Thanks
>> -- Prasanth
>>
>> On Aug 23, 2012, at 6:20 PM, Alan Gates <ga...@hortonworks.com> wrote:
>>
>>> Sorry for the very slow response, but here it is, hopefully better late
>>> than never.
>>>
>>> On Jul 25, 2012, at 4:28 PM, Prasanth J wrote:
>>>
>>>> Thanks, Alan.
>>>> My requirement is to load N samples, where N is based on the input file
>>>> size, and perform a naive cube computation to determine the large groups
>>>> that will not fit in the reducer's memory. I need to know the exact
>>>> number of samples in order to calculate the partition factor for the
>>>> large groups.
>>>> Currently I am using RandomSampleLoader to load 1000 tuples from each
>>>> mapper. Without knowing the number of mappers, I cannot find the exact
>>>> number of samples loaded. Also, RandomSampleLoader doesn't attach a
>>>> special marker tuple (as PoissonSampleLoader does) indicating the number
>>>> of samples loaded.
>>>> Is there any other way to know the exact number of samples loaded?
>>> Not that I know of.
>>>
>>>>
>>>> From analyzing the MR plans of order-by and skewed-join, it seems the
>>>> entire dataset is copied to a temp file, and the SampleLoaders then load
>>>> samples from that temp file. Is there a specific reason for this
>>>> redundant copy? Is it because SampleLoaders can only use Pig's internal
>>>> I/O format?
>>> Partly, but also because it allows any operators that need to run before
>>> the sample (like project or filter) to be placed in the pipeline.
>>>
>>> Alan.
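
For concreteness, here is a minimal sketch of the "single wrapping split" idea Dmitriy describes, written against the Hadoop mapreduce API. Everything in it (class names, the RECORDS_PER_SPLIT constant, the assumption that the underlying splits are FileSplits) is illustrative, not Pig's actual implementation:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SamplingInputFormat extends TextInputFormat {

    private static final int RECORDS_PER_SPLIT = 100; // illustrative K

    // Wrap every underlying split into one split, so the whole sample
    // is produced by a single mapper.
    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        return Collections.<InputSplit>singletonList(
                new WrappingSplit(super.getSplits(job)));
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext ctx) {
        return new SamplingReader();
    }

    public static class WrappingSplit extends InputSplit implements Writable {
        FileSplit[] parts;

        public WrappingSplit() {}             // needed for deserialization
        WrappingSplit(List<InputSplit> splits) {
            // Assumes file-based input (true for TextInputFormat).
            parts = splits.toArray(new FileSplit[0]);
        }
        @Override public long getLength() throws IOException {
            long n = 0;
            for (FileSplit s : parts) n += s.getLength();
            return n;
        }
        @Override public String[] getLocations() { return new String[0]; }
        @Override public void write(DataOutput out) throws IOException {
            out.writeInt(parts.length);
            for (FileSplit s : parts) s.write(out);
        }
        @Override public void readFields(DataInput in) throws IOException {
            parts = new FileSplit[in.readInt()];
            for (int i = 0; i < parts.length; i++) {
                parts[i] = new FileSplit();
                parts[i].readFields(in);
            }
        }
    }

    // Reads the first K lines of each wrapped split, then moves on.
    public static class SamplingReader extends RecordReader<LongWritable, Text> {
        private FileSplit[] parts;
        private TaskAttemptContext ctx;
        private LineRecordReader current;
        private int part = -1, taken;

        @Override public void initialize(InputSplit split, TaskAttemptContext ctx)
                throws IOException {
            this.parts = ((WrappingSplit) split).parts;
            this.ctx = ctx;
        }
        @Override public boolean nextKeyValue() throws IOException {
            while (true) {
                if (current != null && taken < RECORDS_PER_SPLIT
                        && current.nextKeyValue()) {
                    taken++;
                    return true;
                }
                if (current != null) current.close();
                if (++part >= parts.length) return false;
                current = new LineRecordReader();
                current.initialize(parts[part], ctx);
                taken = 0;
            }
        }
        @Override public LongWritable getCurrentKey() { return current.getCurrentKey(); }
        @Override public Text getCurrentValue() { return current.getCurrentValue(); }
        @Override public float getProgress() {
            return part < 0 ? 0f : (float) part / parts.length;
        }
        @Override public void close() throws IOException {
            if (current != null) current.close();
        }
    }
}

Because only one mapper runs, the sample size is bounded by parts.length * RECORDS_PER_SPLIT (fewer for short splits), which sidesteps the original problem of not knowing the mapper count.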
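
And a sketch of the marker-tuple mechanism Prasanth mentions, in the spirit of PoissonSampleLoader. The marker string and both method names here are made up for illustration; Pig's actual marker is different:

import java.util.List;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class SampleCountMarker {
    // Illustrative marker value, not Pig's internal one.
    static final String MARKER = "sample.count.marker";
    static final TupleFactory TF = TupleFactory.getInstance();

    // Mapper side: after the last sample, emit one extra tuple that
    // carries the number of samples this mapper produced.
    static Tuple countMarker(long numSamples) throws ExecException {
        Tuple t = TF.newTuple(2);
        t.set(0, MARKER);
        t.set(1, numSamples);
        return t;
    }

    // Reducer side: separate markers from real samples and sum the
    // per-mapper counts to recover the exact total sample size.
    static long totalSamples(Iterable<Tuple> incoming, List<Tuple> samples)
            throws ExecException {
        long total = 0;
        for (Tuple t : incoming) {
            if (MARKER.equals(t.get(0))) {
                total += (Long) t.get(1);
            } else {
                samples.add(t);
            }
        }
        return total;
    }
}

With markers like this, the reducer learns the exact sample count even when the number of mappers isn't known up front, which is exactly the gap hit with RandomSampleLoader above.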