Hi Rekha,

I looked at the source code: the built-in MAX UDF accepts a Tuple rather than a DataBag, and it can only handle two values.
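Note that the MaxElement UDF quoted below returns only the largest value of $1, while the original question asks for the whole row that holds it. Here is a minimal sketch of that selection logic in plain Java, with no Pig classes so it can run on its own. Row and maxRowPerKey are names made up for illustration, and keeping the first row on a tie is an assumption, since the thread does not say what should happen then.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch (no Pig classes): for each id, keep the whole row with the largest myval.
class MaxRowPerKey {

    // Hypothetical row type mirroring the (id, myval, name) schema from the thread.
    static final class Row {
        final int id;
        final int myval;
        final String name;

        Row(int id, int myval, String name) {
            this.id = id;
            this.myval = myval;
            this.name = name;
        }

        @Override
        public String toString() {
            return id + "," + myval + "," + name;
        }
    }

    // Returns one row per id, ordered by id. On a tie the first row seen wins (an assumption).
    static Map<Integer, Row> maxRowPerKey(List<Row> rows) {
        Map<Integer, Row> best = new TreeMap<>();
        for (Row row : rows) {
            Row current = best.get(row.id);
            if (current == null || row.myval > current.myval) {
                best.put(row.id, row);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Row> combined = new ArrayList<>();
        // INPUT 1 from the original question
        combined.add(new Row(10, 155, "ABC"));
        combined.add(new Row(20, 100, "DEF"));
        combined.add(new Row(30, 200, "XYZ"));
        combined.add(new Row(40, 100, "XXX"));
        // INPUT 2 from the original question
        combined.add(new Row(10, 160, "CBA"));
        combined.add(new Row(20, 90, "QQQ"));
        combined.add(new Row(40, 150, "AAA"));

        for (Row row : maxRowPerKey(combined).values()) {
            System.out.println(row);
        }
    }
}
```

Inside a real UDF the same loop would presumably iterate over the DataBag and return the winning Tuple instead of an Integer, but I have not tested that variant.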

On Tue, May 11, 2010 at 1:35 PM, Rekha Joshi wrote:
> Hi Moeller,
>
> I think the default MAX udf can get the max of second element within the
> group, as the group would be something like below.
> (10,{(10,155,ABC),(10,160,CBA)})
> (20,{(20,100,DEF),(20,90,QQQ)})
> (30,{(30,200,XYZ)})
> (40,{(40,150,AAA),(40,100,XXX)})
>
> Something like , Z = foreach Y generate group, MAX(X.f2);
> (10,160)
> (20,100)
> (30,200)
> (40,150)
>
> Refer http://hadoop.apache.org/pig/docs/r0.6.0/piglatin_ref2.html
>
> You might have to resort to another join or maybe a conditional expr to get
> third element, or write your udf.
>
> Thanks & Regards
> /
>
> On 5/11/10 10:44 AM, "Jeff Zhang" wrote:
>
> myoutput = FOREACH grouped GENERATE
> group,org.apache.pig.piggybank.myudf.MaxElement($1.$1);
>
> And the following is the udf:
>
> package org.apache.pig.piggybank.myudf;
>
> import java.io.IOException;
>
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.DataBag;
> import org.apache.pig.data.Tuple;
>
> public class MaxElement extends EvalFunc<Integer> {
>
>     @Override
>     public Integer exec(Tuple input) throws IOException {
>         int max = Integer.MIN_VALUE;
>         DataBag bag = (DataBag) input.get(0);
>         for (Tuple tuple : bag) {
>             Integer value = (Integer)tuple.get(0);
>             if (value > max){
>                 max=value;
>             }
>         }
>         return max;
>     }
>
> }
>
>
>
> On Tue, May 11, 2010 at 12:50 PM, Mads Moeller wrote:
>> Hi all,
>>
>> I am new to Pig/Hadoop and I am trying to figure out how I can merge
>> two (or more) input files, based on the value in one of the data
>> fields. E.g. from the below input files (INPUT 1 and INPUT 2), I want
>> to join on $0 and keep the row containing the highest value in $1 of
>> each row.
>>
>> INPUT 1
>> 10,155,ABC
>> 20,100,DEF
>> 30,200,XYZ
>> 40,100,XXX
>>
>> INPUT 2
>> 10,160,CBA
>> 20,90,QQQ
>> 40,150,AAA
>>
>> DESIRED OUTPUT
>> 10,160,CBA
>> 20,100,DEF
>> 30,200,XYZ
>> 40,150,AAA
>>
>> -- pig script start
>> INPUT1 = LOAD 'file1' USING PigStorage(',') AS (id:int, myval:int,
>> name:chararray);
>> INPUT2 = LOAD 'file2' USING PigStorage(',') AS (id:int, myval:int,
>> name:chararray);
>> combined = UNION INPUT1, INPUT2;
>> grouped = GROUP filtered BY id;
>> -- myoutput = FOREACH grouped GENERATE ??? -- I am stuck :-)
>>
>> STORE myoutput INTO 'output.csv' USING PigStorage(',');
>> -- pig script end
>>
>> Any suggestions to how this can be accomplished?
>>
>> Thanks.
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>

--
Best Regards

Jeff Zhang
