Re: Merge two or more files

Jeff Zhang Mon, 10 May 2010 22:52:15 -0700

Hi Rekha,

I looked at the source code: the built-in MAX UDF accepts a Tuple rather
than a DataBag, and it can only handle two values.
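
Since the desired output keeps the whole row and not only the largest value, here is a minimal sketch of that selection step. It is plain Java so it runs without Pig on the classpath: the names Row and maxByMyval are hypothetical, and a List stands in for Pig's DataBag. In a real UDF the same loop would presumably sit inside exec and return a Tuple.

```java
import java.util.Arrays;
import java.util.List;

public class MaxRow {

    /** Hypothetical stand-in for one Pig tuple: (id, myval, name). */
    public static final class Row {
        public final int id;
        public final int myval;
        public final String name;

        public Row(int id, int myval, String name) {
            this.id = id;
            this.myval = myval;
            this.name = name;
        }

        @Override
        public String toString() {
            return id + "," + myval + "," + name;
        }
    }

    /** Returns the row with the largest myval, or null for an empty group. */
    public static Row maxByMyval(List<Row> group) {
        Row best = null;
        for (Row row : group) {
            // Keep the first row seen on ties (strictly greater wins).
            if (best == null || row.myval > best.myval) {
                best = row;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The group for id 10 from the example data in this thread.
        List<Row> group10 = Arrays.asList(
                new Row(10, 155, "ABC"),
                new Row(10, 160, "CBA"));
        System.out.println(maxByMyval(group10)); // prints 10,160,CBA
    }
}
```

Returning null for an empty group is only one possible choice; a real UDF might prefer to skip such groups.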



On Tue, May 11, 2010 at 1:35 PM, Rekha Joshi <[email protected]> wrote:
> Hi Moeller,
>
> I think the default MAX udf can get the max of the second element within the
> group, as the group would look something like the below.
> (10,{(10,155,ABC),(10,160,CBA)})
> (20,{(20,100,DEF),(20,90,QQQ)})
> (30,{(30,200,XYZ)})
> (40,{(40,150,AAA),(40,100,XXX)})
>
> Something like: Z = foreach Y generate group, MAX(X.f2);
> (10,160)
> (20,100)
> (30,200)
> (40,150)
>
> Refer http://hadoop.apache.org/pig/docs/r0.6.0/piglatin_ref2.html
>
> You might have to resort to another join or maybe a conditional expr to get
> the third element, or write your own udf.
>
> Thanks & Regards
>
> On 5/11/10 10:44 AM, "Jeff Zhang" <[email protected]> wrote:
>
> myoutput = FOREACH grouped GENERATE
> group,org.apache.pig.piggybank.myudf.MaxElement($1.$1);
>
> And the following is the udf:
>
> package org.apache.pig.piggybank.myudf;
>
> import java.io.IOException;
>
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.DataBag;
> import org.apache.pig.data.Tuple;
>
> public class MaxElement extends EvalFunc<Integer> {
>
>    @Override
>    public Integer exec(Tuple input) throws IOException {
>        int max = Integer.MIN_VALUE;
>        DataBag bag = (DataBag) input.get(0);
>        for (Tuple tuple : bag) {
>            Integer value = (Integer)tuple.get(0);
>            if (value > max){
>                max=value;
>            }
>        }
>        return max;
>    }
>
> }
>
>
>
> On Tue, May 11, 2010 at 12:50 PM, Mads Moeller <[email protected]> wrote:
>> Hi all,
>>
>> I am new to Pig/Hadoop and I am trying to figure out how I can merge
>> two (or more) input files, based on the value in one of the data
>> fields. E.g. from the below input files (INPUT 1 and INPUT 2), I want
>> to join on $0 and keep the row containing the highest value in $1 of
>> each row.
>>
>> INPUT 1
>> 10,155,ABC
>> 20,100,DEF
>> 30,200,XYZ
>> 40,100,XXX
>>
>> INPUT 2
>> 10,160,CBA
>> 20,90,QQQ
>> 40,150,AAA
>>
>> DESIRED OUTPUT
>> 10,160,CBA
>> 20,100,DEF
>> 30,200,XYZ
>> 40,150,AAA
>>
>> -- pig script start
>> INPUT1 = LOAD 'file1' USING PigStorage(',') AS (id:int, myval:int,
>> name:chararray);
>> INPUT2 = LOAD 'file2' USING PigStorage(',') AS (id:int, myval:int,
>> name:chararray);
>> combined = UNION INPUT1, INPUT2;
>> grouped = GROUP combined BY id;
>> -- myoutput = FOREACH grouped GENERATE ??? -- I am stuck :-)
>>
>> STORE myoutput INTO 'output.csv' USING PigStorage(',');
>> -- pig script end
>>
>> Any suggestions as to how this can be accomplished?
>>
>> Thanks.
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>



-- 
Best Regards

Jeff Zhang
