Hi all,

I am new to Pig/Hadoop and am trying to figure out how to merge
two (or more) input files based on the value in one of the data
fields. E.g. from the input files below (INPUT 1 and INPUT 2), I want
to combine the rows on $0 and, for each value of $0, keep only the row
with the highest value in $1.

INPUT 1
10,155,ABC
20,100,DEF
30,200,XYZ
40,100,XXX

INPUT 2
10,160,CBA
20,90,QQQ
40,150,AAA

DESIRED OUTPUT
10,160,CBA
20,100,DEF
30,200,XYZ
40,150,AAA
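
To pin down the semantics I am after, here is a rough plain-Python
sketch of the same logic (not Pig, just an illustration). The function
name keep_highest is made up for this example, and I am assuming that
on a tie in $1 the row seen first wins, which my sample data does not
exercise.

```python
def keep_highest(*inputs):
    """For each id ($0), keep the row with the largest myval ($1)."""
    best = {}
    for rows in inputs:
        for row in rows:
            row_id, myval, _name = row
            # Replace the stored row only when this one has a strictly
            # higher value, so the first row seen wins on ties (assumption).
            if row_id not in best or myval > best[row_id][1]:
                best[row_id] = row
    # Return the surviving rows ordered by id, as in DESIRED OUTPUT.
    return [best[row_id] for row_id in sorted(best)]


input1 = [(10, 155, "ABC"), (20, 100, "DEF"), (30, 200, "XYZ"), (40, 100, "XXX")]
input2 = [(10, 160, "CBA"), (20, 90, "QQQ"), (40, 150, "AAA")]

result = keep_highest(input1, input2)
for row in result:
    print(",".join(str(field) for field in row))
```

What I cannot work out is how to express the "pick the best row per
group" step in Pig.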

-- pig script start
INPUT1 = LOAD 'file1' USING PigStorage(',') AS (id:int, myval:int,
name:chararray);
INPUT2 = LOAD 'file2' USING PigStorage(',') AS (id:int, myval:int,
name:chararray);
combined = UNION INPUT1, INPUT2;
grouped = GROUP combined BY id;
-- myoutput = FOREACH grouped GENERATE ??? -- I am stuck :-)

STORE myoutput INTO 'output.csv' USING PigStorage(',');
-- pig script end

Any suggestions on how this can be accomplished?

Thanks.
