Yes, "order" in Pig Latin is sufficient - it sorts the output globally (not just within each part file).
An "order" statement results in two MR jobs. The first takes a sample of the order-by keys to figure out their distribution and decide how to partition the data across reducers. The second MR job uses that partitioning to do the sorting.

-Thejas

On 5/10/10 12:16 PM, "Vijay Rao" <[email protected]> wrote:

> Hello,
>
> I am new to Hadoop, Pig and have just been reading whatever I could lay my
> hands on. If I needed to sort a dataset using Pig is just the ORDER syntax
> sufficient?
>
> For eg here is what I came up with to sort a dataset of users based on their
> login count
>
> records = LOAD 'input/sample.txt' AS (username:chararray);
>
> grpd = GROUP records BY username;
>
> cntd = FOREACH grpd GENERATE
> group, COUNT(records) AS cnt;
>
> srtd = ORDER cntd BY cnt;
>
> STORE srtd INTO 'output';
>
> Is this sufficient to sort a dataset? Is there something else that needs to
> be done? I read about partition/combine for SORT when I read Mapreduce and
> hence was confused.
>
> Any help is greatly appreciated.
>
> Thanks
> VJ
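
For anyone who wants to see why sampling plus range partitioning gives a global sort, here is a minimal plain-Python sketch of the idea described in the reply. It is an illustration only, not Pig's actual implementation: the function names, the sample size, the quantile rule for choosing boundaries, and the (username, count) data are all assumptions made up for the example.

```python
import bisect
import random

def pick_split_points(keys, num_reducers, sample_size=100, seed=0):
    """Job 1 (sketch): sample the sort keys and pick num_reducers - 1 boundaries."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    if not sample:
        return []
    # Evenly spaced quantiles of the sample become the partition boundaries.
    return [sample[len(sample) * i // num_reducers] for i in range(1, num_reducers)]

def range_partition_sort(records, key, num_reducers):
    """Job 2 (sketch): route each record to a reducer by key range, then sort locally."""
    splits = pick_split_points([key(r) for r in records], num_reducers)
    parts = [[] for _ in range(num_reducers)]
    for r in records:
        # bisect finds which key range (i.e. which reducer) this record falls into.
        parts[bisect.bisect_right(splits, key(r))].append(r)
    # Each "reducer" sorts only its own range of keys.
    return [sorted(p, key=key) for p in parts]

# Hypothetical (username, login count) pairs standing in for `cntd`.
cntd = [("alice", 7), ("bob", 2), ("carol", 11),
        ("dave", 2), ("erin", 5), ("frank", 9)]
parts = range_partition_sort(cntd, key=lambda r: r[1], num_reducers=3)

# Reading part 0, then part 1, then part 2 gives one globally sorted sequence.
flattened = [r for p in parts for r in p]
```

Because every key in one part is less than or equal to every key in the next, concatenating the parts in order is already sorted; no final merge step is needed. The sample only affects how evenly the work is spread across reducers, not the correctness of the ordering.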
