Yes, ORDER in Pig Latin is sufficient: it sorts the output globally
(not just within each part file).

An ORDER statement results in two MR jobs. The first takes a sample of
the order-by keys to estimate their distribution and decide how to partition
the data across reducers. The second job uses that partitioning to do the
actual sort.
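
To make the idea concrete, here is a rough sketch of sample-based range
partitioning in plain Python. It is not Pig's actual implementation; the
function names, the sample size and the three "reducers" are all made up
for illustration. It only shows why sorting each partition separately still
yields a globally sorted result when the partitions cover disjoint key ranges.

```python
import bisect
import random

def pick_boundaries(keys, num_reducers, sample_size=100, seed=0):
    """Job 1 (sketch): sample the sort keys and choose reducer cut points."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # num_reducers - 1 cut points taken at evenly spaced positions in the sample
    return [sample[(i * len(sample)) // num_reducers]
            for i in range(1, num_reducers)]

def range_partition_sort(records, key, num_reducers=3):
    """Job 2 (sketch): route each record by key range, then sort each partition."""
    boundaries = pick_boundaries([key(r) for r in records], num_reducers)
    parts = [[] for _ in range(num_reducers)]
    for r in records:
        # Partition index = number of cut points that are <= this record's key
        parts[bisect.bisect_right(boundaries, key(r))].append(r)
    # Each "reducer" sorts only its own key range
    return [sorted(p, key=key) for p in parts]

# Hypothetical (username, login count) pairs, like cntd in the script below
counts = [("ann", 7), ("bob", 2), ("cat", 9), ("dan", 4), ("eve", 1), ("fay", 5)]
part_files = range_partition_sort(counts, key=lambda r: r[1])

# Reading the part files in order gives one globally sorted sequence
merged = [r for part in part_files for r in part]
```

Because every key in one partition is smaller than every key in the next,
no merge step is needed across the part files.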

-Thejas



On 5/10/10 12:16 PM, "Vijay Rao" <[email protected]> wrote:

> Hello,
> 
> I am new to Hadoop, Pig and have just been reading whatever I could lay my
> hands on. If I needed to sort a dataset using Pig is just the ORDER syntax
> sufficient?
> 
> For eg here is what I came up with to sort a dataset of users based on their
> login count
> 
> records = LOAD 'input/sample.txt' AS (username:chararray);
> 
> grpd = GROUP records BY username;
> 
> cntd = FOREACH grpd GENERATE
>           group, COUNT(records) AS cnt;
> 
> srtd = ORDER cntd BY cnt;
> 
> STORE srtd INTO 'output';
> 
> Is this sufficient to sort a dataset? Is there something else that needs to
> be done? I read about partition/combine for SORT when I read Mapreduce and
> hence was confused.
> 
> Any help is greatly appreciated.
> 
> Thanks
> VJ
