Re: Question about SORT

Vijay Rao Mon, 10 May 2010 12:49:26 -0700

Awesome, thanks. Where can I find more information about how Hadoop
assembles the intermediate output to produce the final reduce output,
given that the data is local to the slaves? Also, I know the input data is
usually stored with 3 copies (the default). Is the output also stored 3
times?


Thanks
VJ

On Mon, May 10, 2010 at 12:35 PM, Thejas Nair <[email protected]> wrote:

> Yes, "order" in pig-latin is sufficient - it will sort the file globally
> (not just within each part file).
>
> An "order" statement results in two MR jobs: the first takes a sample of
> the order-by keys to figure out the distribution and decide how to
> partition the data across reducers in the second MR job, which does the
> sorting.
>
> -Thejas
>
>
>
> On 5/10/10 12:16 PM, "Vijay Rao" <[email protected]> wrote:
>
> > Hello,
> >
> > I am new to Hadoop and Pig and have just been reading whatever I could
> > lay my hands on. If I need to sort a dataset using Pig, is the ORDER
> > syntax alone sufficient?
> >
> > For example, here is what I came up with to sort a dataset of users
> > based on their login count:
> >
> > records = LOAD 'input/sample.txt' AS (username:chararray);
> >
> > grpd = GROUP records BY username;
> >
> > cntd = FOREACH grpd GENERATE
> >           group, COUNT(records) AS cnt;
> >
> > srtd = ORDER cntd BY cnt;
> >
> > STORE srtd INTO 'output';
> >
> > Is this sufficient to sort a dataset? Is there something else that
> > needs to be done? I read about partition/combine for sorting when I was
> > reading about MapReduce and hence was confused.
> >
> > Any help is greatly appreciated.
> >
> > Thanks
> > VJ
>
>
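
The two-job approach Thejas describes (sample the order-by keys, then
range-partition the data across reducers so each reducer sorts one
contiguous key range) can be illustrated with a small sketch. This is
plain Python, not Pig's actual implementation; the function names, the
sample size, and the quantile-based choice of split points are all
assumptions made for illustration.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    """Pick up to num_reducers - 1 boundary keys from a sample (rough quantiles)."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_reducers]
            for i in range(1, num_reducers)]


def partition_for(key, split_points):
    """Return the index of the (hypothetical) reducer whose range holds key."""
    return bisect.bisect_right(split_points, key)


def global_sort(keys, num_reducers=4, sample_size=100, seed=0):
    """Return one sorted list per reducer; concatenated, they are globally sorted."""
    rng = random.Random(seed)
    # "Job 1": sample the keys to estimate their distribution.
    sample = rng.sample(keys, min(sample_size, len(keys)))
    split_points = choose_split_points(sample, num_reducers)
    # "Job 2": route each key to a reducer by range, then sort per reducer.
    parts = [[] for _ in range(num_reducers)]
    for key in keys:
        parts[partition_for(key, split_points)].append(key)
    return [sorted(part) for part in parts]
```

Because every key in one partition is smaller than every key in the next,
reading the part files in order yields a globally sorted result, which is
why no extra merge step is needed after the second job.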
