Re: Merge two or more files

Rekha Joshi Mon, 10 May 2010 23:01:50 -0700

Hi Jeff,

Not sure if we are on the same page, but since you disagree I ran the datasets 
in grunt, and the default MAX works as expected for getting the max.
Please let me know.


Thanks & Regards,
/R

grunt> A = load '99.txt' using PigStorage(',') as (f1:int,f2:int, f3:chararray);
grunt> dump A;
2010-05-11 05:55:07,702 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for A
2010-05-11 05:55:07,702 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for A
2010-05-11 05:55:07,720 [main] WARN  org.apache.pig.impl.io.FileLocalizer - 
FileLocalizer.create: failed to create /tmp/temp1359137963
2010-05-11 05:55:07,827 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
stored result in: "file:/tmp/temp1359137963/tmp-2084075282"
2010-05-11 05:55:07,827 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written 
: 4
2010-05-11 05:55:07,827 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 
96
2010-05-11 05:55:07,827 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2010-05-11 05:55:07,827 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(10,155,ABC)
(20,100,DEF)
(30,200,XYZ)
(40,100,XXX)
grunt> B = load '91.txt' using PigStorage(',') as (f4:int, f5:int, f6:chararray);
grunt> dump B;
2010-05-11 05:55:38,530 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
2010-05-11 05:55:38,530 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for B
2010-05-11 05:55:38,546 [main] WARN  org.apache.pig.impl.io.FileLocalizer - 
FileLocalizer.create: failed to create /tmp/temp1359137963
2010-05-11 05:55:38,604 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
stored result in: "file:/tmp/temp1359137963/tmp511625931"
2010-05-11 05:55:38,604 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written 
: 3
2010-05-11 05:55:38,605 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 
72
2010-05-11 05:55:38,605 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2010-05-11 05:55:38,605 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(10,160,CBA)
(20,90,QQQ)
(40,150,AAA)
grunt> D = union A, B;
grunt> dump D;
2010-05-11 05:56:06,399 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
2010-05-11 05:56:06,399 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for B
2010-05-11 05:56:06,399 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for A
2010-05-11 05:56:06,399 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for A
2010-05-11 05:56:06,426 [main] WARN  org.apache.pig.impl.io.FileLocalizer - 
FileLocalizer.create: failed to create /tmp/temp1359137963
2010-05-11 05:56:06,573 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
stored result in: "file:/tmp/temp1359137963/tmp-1380306413"
2010-05-11 05:56:06,573 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written 
: 7
2010-05-11 05:56:06,573 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 
168
2010-05-11 05:56:06,573 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2010-05-11 05:56:06,573 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(10,155,ABC)
(10,160,CBA)
(20,100,DEF)
(20,90,QQQ)
(30,200,XYZ)
(40,150,AAA)
(40,100,XXX)
grunt> E = group D by $0;
grunt> dump E;
2010-05-11 05:56:39,830 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
2010-05-11 05:56:39,831 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for B
2010-05-11 05:56:39,831 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for A
2010-05-11 05:56:39,831 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for A
2010-05-11 05:56:39,857 [main] WARN  org.apache.pig.impl.io.FileLocalizer - 
FileLocalizer.create: failed to create /tmp/temp1359137963
2010-05-11 05:56:39,995 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
stored result in: "file:/tmp/temp1359137963/tmp-1896759683"
2010-05-11 05:56:39,995 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written 
: 4
2010-05-11 05:56:39,995 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 
235
2010-05-11 05:56:39,995 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2010-05-11 05:56:39,995 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(10,{(10,155,ABC),(10,160,CBA)})
(20,{(20,100,DEF),(20,90,QQQ)})
(30,{(30,200,XYZ)})
(40,{(40,150,AAA),(40,100,XXX)})
grunt> describe E;
E: {group: int,D: {f1: int,f2: int,f3: chararray}}
grunt> F = foreach E generate group, MAX(D.f2);
grunt> dump F;
2010-05-11 05:57:10,378 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for B
2010-05-11 05:57:10,378 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for B
2010-05-11 05:57:10,378 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No column pruned for A
2010-05-11 05:57:10,378 [main] INFO  
org.apache.pig.impl.logicalLayer.optimizer.PruneColumns - No map keys pruned 
for A
2010-05-11 05:57:10,430 [main] WARN  org.apache.pig.impl.io.FileLocalizer - 
FileLocalizer.create: failed to create /tmp/temp1359137963
2010-05-11 05:57:10,555 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Successfully 
stored result in: "file:/tmp/temp1359137963/tmp1923083174"
2010-05-11 05:57:10,555 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Records written 
: 4
2010-05-11 05:57:10,555 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Bytes written : 
72
2010-05-11 05:57:10,555 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - 100% complete!
2010-05-11 05:57:10,555 [main] INFO  
org.apache.pig.backend.local.executionengine.LocalPigLauncher - Success!!
(10,160)
(20,100)
(30,200)
(40,150)
grunt>
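
Note that F above only carries (id, max value); the original question also wanted the rest of the winning row (the third field). As a rough illustration of that "keep the whole row with the highest value per key" logic, here is a minimal plain-Java sketch. It does not use the Pig API at all, and the names MaxRowPerKey, Row and maxRowPerKey are made up for this example; the data is the two input files from the thread. Ties keep the first row seen, which is an assumption since the thread does not say how ties should be handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MaxRowPerKey {

    /** One input line such as "10,155,ABC". Field names are illustrative only. */
    static final class Row {
        final int id;
        final int value;
        final String name;

        Row(int id, int value, String name) {
            this.id = id;
            this.value = value;
            this.name = name;
        }

        @Override
        public String toString() {
            return id + "," + value + "," + name;
        }
    }

    /** Unions both inputs, groups by id, and keeps the row with the highest value. */
    static Map<Integer, Row> maxRowPerKey(List<Row> input1, List<Row> input2) {
        List<Row> union = new ArrayList<>(input1);
        union.addAll(input2);

        // TreeMap keeps the ids sorted so the output order is deterministic.
        Map<Integer, Row> best = new TreeMap<>();
        for (Row row : union) {
            // Keep the existing row unless the new one has a strictly higher value.
            best.merge(row.id, row,
                    (oldRow, newRow) -> newRow.value > oldRow.value ? newRow : oldRow);
        }
        return best;
    }

    public static void main(String[] args) {
        List<Row> input1 = Arrays.asList(
                new Row(10, 155, "ABC"), new Row(20, 100, "DEF"),
                new Row(30, 200, "XYZ"), new Row(40, 100, "XXX"));
        List<Row> input2 = Arrays.asList(
                new Row(10, 160, "CBA"), new Row(20, 90, "QQQ"),
                new Row(40, 150, "AAA"));

        for (Row row : maxRowPerKey(input1, input2).values()) {
            System.out.println(row);
        }
    }
}
```

In Pig itself the same effect would presumably need either a join of F back onto D or a custom UDF that returns the whole tuple instead of just the integer.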



On 5/11/10 11:21 AM, "Jeff Zhang" <[email protected]> wrote:

Hi Rekha,

I looked at the source code: the built-in MAX UDF accepts a Tuple rather
than a DataBag, and it can only handle two values.


On Tue, May 11, 2010 at 1:35 PM, Rekha Joshi <[email protected]> wrote:
> Hi Moeller,
>
> I think the default MAX UDF can get the max of the second element within each 
> group, since the grouped output would look something like this:
> (10,{(10,155,ABC),(10,160,CBA)})
> (20,{(20,100,DEF),(20,90,QQQ)})
> (30,{(30,200,XYZ)})
> (40,{(40,150,AAA),(40,100,XXX)})
>
> Something like Z = foreach Y generate group, MAX(X.f2); would give:
> (10,160)
> (20,100)
> (30,200)
> (40,150)
>
> Refer to http://hadoop.apache.org/pig/docs/r0.6.0/piglatin_ref2.html
>
> You might have to resort to another join or maybe a conditional expression to 
> get the third element, or write your own UDF.
>
> Thanks & Regards
> /
>
> On 5/11/10 10:44 AM, "Jeff Zhang" <[email protected]> wrote:
>
> myoutput = FOREACH grouped GENERATE
> group,org.apache.pig.piggybank.myudf.MaxElement($1.$1);
>
> And the following is the udf:
>
> package org.apache.pig.piggybank.myudf;
>
> import java.io.IOException;
>
> import org.apache.pig.EvalFunc;
> import org.apache.pig.data.DataBag;
> import org.apache.pig.data.Tuple;
>
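> public class MaxElement extends EvalFunc<Integer> {
>
>    // Returns the largest Integer found in the first field of the bag's tuples.
>    @Override
>    public Integer exec(Tuple input) throws IOException {
>        DataBag bag = (DataBag) input.get(0);
>        if (bag == null) {
>            return null;
>        }
>        Integer max = null;
>        for (Tuple tuple : bag) {
>            Integer value = (Integer) tuple.get(0);
>            // Skip nulls; unboxing a null Integer would throw.
>            if (value != null && (max == null || value > max)) {
>                max = value;
>            }
>        }
>        // null when the bag holds no non-null values
>        return max;
>    }
>
> }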
>
>
>
> On Tue, May 11, 2010 at 12:50 PM, Mads Moeller <[email protected]> wrote:
>> Hi all,
>>
>> I am new to Pig/Hadoop and I am trying to figure out how I can merge
>> two (or more) input files, based on the value in one of the data
>> fields. E.g. from the below input files (INPUT 1 and INPUT 2), I want
>> to join on $0 and, for each $0 value, keep the row containing the
>> highest value in $1.
>>
>> INPUT 1
>> 10,155,ABC
>> 20,100,DEF
>> 30,200,XYZ
>> 40,100,XXX
>>
>> INPUT 2
>> 10,160,CBA
>> 20,90,QQQ
>> 40,150,AAA
>>
>> DESIRED OUTPUT
>> 10,160,CBA
>> 20,100,DEF
>> 30,200,XYZ
>> 40,150,AAA
>>
>> -- pig script start
>> INPUT1 = LOAD 'file1' USING PigStorage(',') AS (id:int, myval:int,
>> name:chararray);
>> INPUT2 = LOAD 'file2' USING PigStorage(',') AS (id:int, myval:int,
>> name:chararray);
>> combined = UNION INPUT1, INPUT2;
>> grouped = GROUP combined BY id;
>> -- myoutput = FOREACH grouped GENERATE ??? -- I am stuck :-)
>>
>> STORE myoutput INTO 'output.csv' USING PigStorage(',');
>> -- pig script end
>>
>> Any suggestions on how this can be accomplished?
>>
>> Thanks.
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>
>



--
Best Regards

Jeff Zhang
