Hi Rekha,

You are right, I mistook org.apache.pig.piggybank.evaluation.math.MAX for the built-in UDF. The default UDF is actually org.apache.pig.builtin.MAX.
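For what it is worth, MAX(D.f2) only yields (id, max) per group, while the original question asks for the whole row that holds the maximum. Below is a rough sketch in plain Java (no Pig classes) of that per-key selection on the sample data from this thread. The names MaxRowPerKey, Row, maxRowPerKey and sampleRows are made up for illustration and are not part of Pig or piggybank; tie-breaking (first row seen wins) is also just an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: plain Java, no Pig classes involved.
final class MaxRowPerKey {

    // One input record (id, myval, name), mirroring the schema in the question.
    static final class Row {
        final int id;
        final int myval;
        final String name;

        Row(int id, int myval, String name) {
            this.id = id;
            this.myval = myval;
            this.name = name;
        }

        @Override
        public String toString() {
            return id + "," + myval + "," + name;
        }
    }

    // For each id, keep the whole row with the largest myval.
    // On ties the first row seen wins (an arbitrary choice for this sketch).
    static List<Row> maxRowPerKey(List<Row> rows) {
        Map<Integer, Row> best = new TreeMap<>(); // sorted by id for stable output
        for (Row row : rows) {
            Row current = best.get(row.id);
            if (current == null || row.myval > current.myval) {
                best.put(row.id, row);
            }
        }
        return new ArrayList<>(best.values());
    }

    // The union of INPUT 1 and INPUT 2 from the original question.
    static List<Row> sampleRows() {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(10, 155, "ABC"));
        rows.add(new Row(20, 100, "DEF"));
        rows.add(new Row(30, 200, "XYZ"));
        rows.add(new Row(40, 100, "XXX"));
        rows.add(new Row(10, 160, "CBA"));
        rows.add(new Row(20, 90, "QQQ"));
        rows.add(new Row(40, 150, "AAA"));
        return rows;
    }

    public static void main(String[] args) {
        // Prints one line per id: the row carrying that id's largest myval.
        for (Row row : maxRowPerKey(sampleRows())) {
            System.out.println(row);
        }
    }
}
```

On the sample data this keeps the CBA, DEF, XYZ and AAA rows, which matches the DESIRED OUTPUT in the original question. A UDF that returns the winning tuple instead of only its value would follow the same comparison loop.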

On Tue, May 11, 2010 at 2:00 PM, Rekha Joshi <[email protected]> wrote:
> Hi Jeff,
>
> Not sure if we are on the same page; but as you disagree, I ran the datasets
> in grunt, and the default MAX works as expected for getting the max.
> Please let me know.
>
> Thanks & Regards,
> /R
>
> grunt> A = load '99.txt' using PigStorage(',') as (f1:int,f2:int, f3:chararray);
> grunt> dump A;
> 2010-05-11 05:55:07,702 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for A
> 2010-05-11 05:55:07,702 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for A
> 2010-05-11 05:55:07,720 [main] WARN org.apache.pig.impl.io.FileLocalizer - FileLocalizer.create: failed to create /tmp/temp1359137963
> 2010-05-11 05:55:07,827 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/tmp/temp1359137963/tmp-2084075282"
> 2010-05-11 05:55:07,827 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 4
> 2010-05-11 05:55:07,827 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 96
> 2010-05-11 05:55:07,827 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
> 2010-05-11 05:55:07,827 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
> (10,155,ABC)
> (20,100,DEF)
> (30,200,XYZ)
> (40,100,XXX)
> grunt> B = load '91.txt' using PigStorage(',') as (f4:int, f5:int, f6:chararray);
> grunt> dump B;
> 2010-05-11 05:55:38,530 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
> 2010-05-11 05:55:38,530 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for B
> 2010-05-11 05:55:38,546 [main] WARN org.apache.pig.impl.io.FileLocalizer - FileLocalizer.create: failed to create /tmp/temp1359137963
> 2010-05-11 05:55:38,604 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/tmp/temp1359137963/tmp511625931"
> 2010-05-11 05:55:38,604 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 3
> 2010-05-11 05:55:38,605 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 72
> 2010-05-11 05:55:38,605 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
> 2010-05-11 05:55:38,605 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
> (10,160,CBA)
> (20,90,QQQ)
> (40,150,AAA)
> grunt> D = union A, B;
> grunt> dump D;
> 2010-05-11 05:56:06,399 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
> 2010-05-11 05:56:06,399 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for B
> 2010-05-11 05:56:06,399 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for A
> 2010-05-11 05:56:06,399 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for A
> 2010-05-11 05:56:06,426 [main] WARN org.apache.pig.impl.io.FileLocalizer - FileLocalizer.create: failed to create /tmp/temp1359137963
> 2010-05-11 05:56:06,573 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/tmp/temp1359137963/tmp-1380306413"
> 2010-05-11 05:56:06,573 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 7
> 2010-05-11 05:56:06,573 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 168
> 2010-05-11 05:56:06,573 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
> 2010-05-11 05:56:06,573 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
> (10,155,ABC)
> (10,160,CBA)
> (20,100,DEF)
> (20,90,QQQ)
> (30,200,XYZ)
> (40,150,AAA)
> (40,100,XXX)
> grunt> E = group D by $0;
> grunt> dump E;
> 2010-05-11 05:56:39,830 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
> 2010-05-11 05:56:39,831 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for B
> 2010-05-11 05:56:39,831 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for A
> 2010-05-11 05:56:39,831 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for A
> 2010-05-11 05:56:39,857 [main] WARN org.apache.pig.impl.io.FileLocalizer - FileLocalizer.create: failed to create /tmp/temp1359137963
> 2010-05-11 05:56:39,995 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/tmp/temp1359137963/tmp-1896759683"
> 2010-05-11 05:56:39,995 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 4
> 2010-05-11 05:56:39,995 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 235
> 2010-05-11 05:56:39,995 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
> 2010-05-11 05:56:39,995 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
> (10,{(10,155,ABC),(10,160,CBA)})
> (20,{(20,100,DEF),(20,90,QQQ)})
> (30,{(30,200,XYZ)})
> (40,{(40,150,AAA),(40,100,XXX)})
> grunt> describe E;
> E: {group: int,D: {f1: int,f2: int,f3: chararray}}
> grunt> F = foreach E generate group, MAX(D.f2);
> grunt> dump F;
> 2010-05-11 05:57:10,378 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
> 2010-05-11 05:57:10,378 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for B
> 2010-05-11 05:57:10,378 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for A
> 2010-05-11 05:57:10,378 [main] INFO org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned for A
> 2010-05-11 05:57:10,430 [main] WARN org.apache.pig.impl.io.FileLocalizer - FileLocalizer.create: failed to create /tmp/temp1359137963
> 2010-05-11 05:57:10,555 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully stored result in: "file:/tmp/temp1359137963/tmp1923083174"
> 2010-05-11 05:57:10,555 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written : 4
> 2010-05-11 05:57:10,555 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 72
> 2010-05-11 05:57:10,555 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
> 2010-05-11 05:57:10,555 [main] INFO org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
> (10,160)
> (20,100)
> (30,200)
> (40,150)
> grunt>
>
>
>
> On 5/11/10 11:21 AM, "Jeff Zhang" <[email protected]> wrote:
>
> Hi Rekha,
>
> I looked at the source code; the built-in MAX UDF accepts a Tuple rather
> than a DataBag.
> And it can only handle two values.
>
>
> On Tue, May 11, 2010 at 1:35 PM, Rekha Joshi <[email protected]> wrote:
>> Hi Moeller,
>>
>> I think the default MAX UDF can get the max of the second element within the
>> group, as the group would be something like below.
>> (10,{(10,155,ABC),(10,160,CBA)})
>> (20,{(20,100,DEF),(20,90,QQQ)})
>> (30,{(30,200,XYZ)})
>> (40,{(40,150,AAA),(40,100,XXX)})
>>
>> Something like: Z = foreach Y generate group, MAX(X.f2);
>> (10,160)
>> (20,100)
>> (30,200)
>> (40,150)
>>
>> Refer to http://hadoop.apache.org/pig/docs/r0.6.0/piglatin_ref2.html
>>
>> You might have to resort to another join or maybe a conditional expression
>> to get the third element, or write your own UDF.
>>
>> Thanks & Regards
>> /
>>
>> On 5/11/10 10:44 AM, "Jeff Zhang" <[email protected]> wrote:
>>
>> myoutput = FOREACH grouped GENERATE group, org.apache.pig.piggybank.myudf.MaxElement($1.$1);
>>
>> And the following is the UDF:
>>
>> package org.apache.pig.piggybank.myudf;
>>
>> import java.io.IOException;
>>
>> import org.apache.pig.EvalFunc;
>> import org.apache.pig.data.DataBag;
>> import org.apache.pig.data.Tuple;
>>
>> // Returns the largest Integer found in the first field of the bag's tuples.
>> public class MaxElement extends EvalFunc<Integer> {
>>
>>     @Override
>>     public Integer exec(Tuple input) throws IOException {
>>         int max = Integer.MIN_VALUE;
>>         DataBag bag = (DataBag) input.get(0);
>>         for (Tuple tuple : bag) {
>>             Integer value = (Integer) tuple.get(0);
>>             // Skip null fields to avoid a NullPointerException on unboxing.
>>             if (value != null && value > max) {
>>                 max = value;
>>             }
>>         }
>>         return max;
>>     }
>>
>> }
>>
>>
>>
>> On Tue, May 11, 2010 at 12:50 PM, Mads Moeller <[email protected]> wrote:
>>> Hi all,
>>>
>>> I am new to Pig/Hadoop and I am trying to figure out how I can merge
>>> two (or more) input files, based on the value in one of the data
>>> fields. E.g. from the below input files (INPUT 1 and INPUT 2), I want
>>> to join on $0 and keep the row containing the highest value in $1 of
>>> each row.
>>>
>>> INPUT 1
>>> 10,155,ABC
>>> 20,100,DEF
>>> 30,200,XYZ
>>> 40,100,XXX
>>>
>>> INPUT 2
>>> 10,160,CBA
>>> 20,90,QQQ
>>> 40,150,AAA
>>>
>>> DESIRED OUTPUT
>>> 10,160,CBA
>>> 20,100,DEF
>>> 30,200,XYZ
>>> 40,150,AAA
>>>
>>> -- pig script start
>>> INPUT1 = LOAD 'file1' USING PigStorage(',') AS (id:int, myval:int, name:chararray);
>>> INPUT2 = LOAD 'file2' USING PigStorage(',') AS (id:int, myval:int, name:chararray);
>>> combined = UNION INPUT1, INPUT2;
>>> grouped = GROUP combined BY id;
>>> -- myoutput = FOREACH grouped GENERATE ??? -- I am stuck :-)
>>>
>>> STORE myoutput INTO 'output.csv' USING PigStorage(',');
>>> -- pig script end
>>>
>>> Any suggestions as to how this can be accomplished?
>>>
>>> Thanks.
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>

--
Best Regards

Jeff Zhang
