myoutput = FOREACH grouped GENERATE
group,org.apache.pig.piggybank.myudf.MaxElement($1.$1);

The UDF is below. It returns the largest value of the second field (myval)
in each group:

package org.apache.pig.piggybank.myudf;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

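public class MaxElement extends EvalFunc<Integer> {

    @Override
    public Integer exec(Tuple input) throws IOException {
        // Return null for missing input rather than failing the whole job.
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        DataBag bag = (DataBag) input.get(0);
        // Stays null if the bag is empty or holds only null values.
        Integer max = null;
        for (Tuple tuple : bag) {
            Integer value = (Integer) tuple.get(0);
            // Skip null fields; unboxing a null Integer would throw a NullPointerException.
            if (value != null && (max == null || value > max)) {
                max = value;
            }
        }
        return max;
    }

}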
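
To sanity-check the expected result outside of Pig, here is a minimal plain-Java
sketch of the same idea (union the rows, group by id, keep the row with the
largest myval). It does not use any Pig API; the names MaxRowSketch, Row and
keepMaxPerId are hypothetical and exist only for this illustration. Unlike the
UDF, it keeps the whole row rather than only the maximum value, and it assumes
that ties can be resolved by keeping the first row seen.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxRowSketch {

    // One record with the fields from the quoted schema: id, myval, name.
    public static final class Row {
        public final int id;
        public final int myval;
        public final String name;

        public Row(int id, int myval, String name) {
            this.id = id;
            this.myval = myval;
            this.name = name;
        }

        @Override
        public String toString() {
            return id + "," + myval + "," + name;
        }
    }

    // Groups rows by id and keeps, for each id, the row with the largest myval.
    // On a tie the first row seen is kept (an assumption; the thread does not say).
    public static List<String> keepMaxPerId(List<Row> rows) {
        Map<Integer, Row> best = new TreeMap<>(); // TreeMap keeps ids in ascending order
        for (Row row : rows) {
            Row current = best.get(row.id);
            if (current == null || row.myval > current.myval) {
                best.put(row.id, row);
            }
        }
        List<String> out = new ArrayList<>();
        for (Row row : best.values()) {
            out.add(row.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> combined = new ArrayList<>();
        // INPUT 1 from the quoted message
        combined.add(new Row(10, 155, "ABC"));
        combined.add(new Row(20, 100, "DEF"));
        combined.add(new Row(30, 200, "XYZ"));
        combined.add(new Row(40, 100, "XXX"));
        // INPUT 2 from the quoted message
        combined.add(new Row(10, 160, "CBA"));
        combined.add(new Row(20, 90, "QQQ"));
        combined.add(new Row(40, 150, "AAA"));

        for (String line : keepMaxPerId(combined)) {
            System.out.println(line);
        }
    }
}
```

Run against the two sample inputs, this prints the four rows listed under
DESIRED OUTPUT in the quoted message.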



On Tue, May 11, 2010 at 12:50 PM, Mads Moeller <[email protected]> wrote:
> Hi all,
>
> I am new to Pig/Hadoop and I am trying to figure out how I can merge
> two (or more) input files, based on the value in one of the data
> fields. E.g. from the below input files (INPUT 1 and INPUT 2), I want
> to join on $0 and keep the row containing the highest value in $1 of
> each row.
>
> INPUT 1
> 10,155,ABC
> 20,100,DEF
> 30,200,XYZ
> 40,100,XXX
>
> INPUT 2
> 10,160,CBA
> 20,90,QQQ
> 40,150,AAA
>
> DESIRED OUTPUT
> 10,160,CBA
> 20,100,DEF
> 30,200,XYZ
> 40,150,AAA
>
> -- pig script start
> INPUT1 = LOAD 'file1' USING PigStorage(',') AS (id:int, myval:int,
> name:chararray);
> INPUT2 = LOAD 'file2' USING PigStorage(',') AS (id:int, myval:int,
> name:chararray);
> combined = UNION INPUT1, INPUT2;
> grouped = GROUP combined BY id;
> -- myoutput = FOREACH grouped GENERATE ??? -- I am stuck :-)
>
> STORE myoutput INTO 'output.csv' USING PigStorage(',');
> -- pig script end
>
> Any suggestions to how this can be accomplished?
>
> Thanks.
>



-- 
Best Regards

Jeff Zhang
