Awesome, thanks. Where can I find more information about how Hadoop assembles the intermediate output to produce the final reduce output, given that the data is local to the slaves? Also, I know the input data is usually stored with 3 copies (the default). Is the output also stored 3 times?

Thanks
VJ

On Mon, May 10, 2010 at 12:35 PM, Thejas Nair <[email protected]> wrote:

> Yes, "order" in pig-latin is sufficient - it will sort the file globally
> (not just within each part file).
>
> An "order" statement results in two MR jobs, the first one takes sample of
> the order-by keys to figure out the distribution and decide how to partition
> the data across reducers in the 2nd MR job which does the sorting.
>
> -Thejas
>
> On 5/10/10 12:16 PM, "Vijay Rao" <[email protected]> wrote:
>
> > Hello,
> >
> > I am new to Hadoop, Pig and have just been reading whatever I could lay my
> > hands on. If I needed to sort a dataset using Pig is just the ORDER syntax
> > sufficient?
> >
> > For eg here is what I came up with to sort a dataset of users based on their
> > login count
> >
> > records = LOAD 'input/sample.txt' AS (username:chararray);
> >
> > grpd = GROUP records BY username;
> >
> > cntd = FOREACH grpd GENERATE
> >     group, COUNT(records) AS cnt;
> >
> > srtd = ORDER cntd BY cnt;
> >
> > STORE srtd INTO 'output';
> >
> > Is this sufficient to sort a dataset? Is there something else that needs to
> > be done? I read about partition/combine for SORT when I read Mapreduce and
> > hence was confused.
> >
> > Any help is greatly appreciated.
> >
> > Thanks
> > VJ
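
For reference, the two-job idea described in the quoted reply (sample the order-by keys, then range-partition the data across reducers so that the part files are globally ordered) can be sketched roughly as below. This is a plain Python illustration and not Pig's actual implementation: the function names, the sample size, the quantile rule for picking boundaries, and the made-up (username, cnt) records are all assumptions for the sake of the example.

```python
import random
from bisect import bisect_right


def choose_boundaries(keys, num_reducers, sample_size=100, seed=0):
    """Job 1 (sketch): sample the sort keys and pick split points.

    Hypothetical helper; returns num_reducers - 1 boundaries taken at
    evenly spaced positions of the sorted sample.
    """
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[len(sample) * i // num_reducers]
            for i in range(1, num_reducers)]


def range_partition_sort(records, key, boundaries):
    """Job 2 (sketch): send each record to a reducer by key range,
    then sort within each reducer's part."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        # Records with larger keys never land in an earlier part.
        parts[bisect_right(boundaries, key(rec))].append(rec)
    return [sorted(part, key=key) for part in parts]


# Made-up (username, cnt) records standing in for the 'cntd' relation.
counts = [("user%d" % i, (i * 37) % 101) for i in range(200)]

boundaries = choose_boundaries([cnt for _, cnt in counts], num_reducers=4)
parts = range_partition_sort(counts, key=lambda r: r[1], boundaries=boundaries)

# Reading the parts in order yields a globally sorted result.
merged = [rec for part in parts for rec in part]
```

Because every key in part N is less than or equal to every key in part N+1, no final merge step is needed: concatenating the part files in order is already the sorted output.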
