myoutput = FOREACH grouped GENERATE
    group, org.apache.pig.piggybank.myudf.MaxElement($1.$1);
And here is the UDF:
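package org.apache.pig.piggybank.myudf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

/** Returns the largest int found in the first field of the tuples in the input bag. */
public class MaxElement extends EvalFunc<Integer> {
    @Override
    public Integer exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null; // nothing to aggregate
        }
        DataBag bag = (DataBag) input.get(0);
        Integer max = null;
        for (Tuple tuple : bag) {
            Integer value = (Integer) tuple.get(0);
            // Skip null fields, which would otherwise fail on unboxing.
            if (value != null && (max == null || value > max)) {
                max = value;
            }
        }
        return max; // null if the bag held no non-null values
    }
}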
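
Note that MaxElement returns only the largest myval for each id, while the desired output keeps the whole row (id, myval, name). As a rough sketch of the per-id selection that would produce those rows, here is the same idea in plain Java, with no Pig classes involved. The class name MaxRowSketch, the Row type and the method maxRowPerId are made up for illustration; a real UDF would do the equivalent comparison on the tuples in the bag and return the winning tuple instead of an Integer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; none of these names come from Pig.
class MaxRowSketch {

    // Hypothetical row type mirroring the (id, myval, name) schema in the script.
    static final class Row {
        final int id;
        final int myval;
        final String name;

        Row(int id, int myval, String name) {
            this.id = id;
            this.myval = myval;
            this.name = name;
        }

        @Override
        public String toString() {
            return id + "," + myval + "," + name;
        }
    }

    // For each id, keep the row with the largest myval (the first row seen wins on ties).
    static List<Row> maxRowPerId(List<Row> rows) {
        Map<Integer, Row> best = new TreeMap<Integer, Row>(); // iterates in id order
        for (Row row : rows) {
            Row current = best.get(row.id);
            if (current == null || row.myval > current.myval) {
                best.put(row.id, row);
            }
        }
        return new ArrayList<Row>(best.values());
    }

    public static void main(String[] args) {
        // The rows of INPUT 1 followed by INPUT 2, i.e. the result of the UNION.
        List<Row> combined = Arrays.asList(
            new Row(10, 155, "ABC"), new Row(20, 100, "DEF"),
            new Row(30, 200, "XYZ"), new Row(40, 100, "XXX"),
            new Row(10, 160, "CBA"), new Row(20, 90, "QQQ"),
            new Row(40, 150, "AAA"));
        for (Row row : maxRowPerId(combined)) {
            System.out.println(row);
        }
    }
}
```

Run on the two sample inputs, this prints the four rows listed under DESIRED OUTPUT.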
On Tue, May 11, 2010 at 12:50 PM, Mads Moeller <[email protected]> wrote:
> Hi all,
>
> I am new to Pig/Hadoop and I am trying to figure out how I can merge
> two (or more) input files, based on the value in one of the data
> fields. E.g. from the below input files (INPUT 1 and INPUT 2), I want
> to join on $0 and keep the row containing the highest value in $1 of
> each row.
>
> INPUT 1
> 10,155,ABC
> 20,100,DEF
> 30,200,XYZ
> 40,100,XXX
>
> INPUT 2
> 10,160,CBA
> 20,90,QQQ
> 40,150,AAA
>
> DESIRED OUTPUT
> 10,160,CBA
> 20,100,DEF
> 30,200,XYZ
> 40,150,AAA
>
> -- pig script start
> INPUT1 = LOAD 'file1' USING PigStorage(',') AS (id:int, myval:int,
> name:chararray);
> INPUT2 = LOAD 'file2' USING PigStorage(',') AS (id:int, myval:int,
> name:chararray);
> combined = UNION INPUT1, INPUT2;
> grouped = GROUP combined BY id;
> -- myoutput = FOREACH grouped GENERATE ??? -- I am stuck :-)
>
> STORE myoutput INTO 'output.csv' USING PigStorage(',');
> -- pig script end
>
> Any suggestions to how this can be accomplished?
>
> Thanks.
>
--
Best Regards
Jeff Zhang