Tanton Gibbs
Wed, 11 Jun 2008 08:37:00 -0700
I wrote a simple UDF, IdentityTuple, that returns its arguments as a
tuple, which lets me use a constant tuple in the script. If there is a
built-in way to do this, you can use that instead.
Here is the UDF:
import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class IdentityTuple extends EvalFunc<Tuple> {
    @Override
    public void exec(Tuple input, Tuple output) throws IOException {
        // Copy the arguments straight into the output tuple, so that
        // IdentityTuple('', '', '') yields the tuple ('', '', '').
        output.copyFrom(input);
    }
}
Here is the script:
register Identity.jar
A = LOAD 'mytestA.txt' USING PigStorage();
B = LOAD 'mytestB.txt' USING PigStorage();
C = COGROUP A BY $0, B BY $0;
D = FOREACH C GENERATE
    flatten(((COUNT(A) == '0') ? IdentityTuple('', '', '') : A)),
    flatten(((COUNT(B) == '0') ? IdentityTuple('', '', '') : B));
dump D;
Here is the output:
(1, 1, 1, , , )
(2, 1, 1, , , )
(3, 1, 1, , , )
(4, 1, 1, 4, 2, 2)
(5, 1, 1, 5, 2, 2)
(, , , 6, 2, 2)
(, , , 7, 2, 2)
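
For anyone who wants to see why the padding matters, here is a small
plain-Java sketch of what the script above does. It is not Pig code and
does not use the Pig API; the class and method names
(CogroupOuterJoinSketch, row, groupByFirstField, cogroupAndFlatten) are
made up for illustration, and it assumes both inputs have the same number
of fields. Flattening two bags behaves like a cross product, so an empty
bag yields no rows at all; replacing the empty bag with one tuple of
empty strings keeps the unmatched rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only: emulates COGROUP + flatten in plain Java.
public class CogroupOuterJoinSketch {

    // Convenience helper to build one row of string fields.
    static List<String> row(String... fields) {
        return Arrays.asList(fields);
    }

    // Collect rows into bags keyed by their first field, like COGROUP ... BY $0.
    static Map<String, List<List<String>>> groupByFirstField(List<List<String>> rows) {
        Map<String, List<List<String>>> groups = new TreeMap<>();
        for (List<String> r : rows) {
            groups.computeIfAbsent(r.get(0), k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    // Emulates FOREACH C GENERATE flatten(...), flatten(...).
    // padEmptyBags == false: an empty bag yields no rows (inner join).
    // padEmptyBags == true: an empty bag becomes one tuple of `width`
    // empty strings, so unmatched rows survive (full outer join).
    static List<List<String>> cogroupAndFlatten(List<List<String>> a, List<List<String>> b,
                                                int width, boolean padEmptyBags) {
        Map<String, List<List<String>>> groupedA = groupByFirstField(a);
        Map<String, List<List<String>>> groupedB = groupByFirstField(b);
        TreeSet<String> keys = new TreeSet<>(groupedA.keySet());
        keys.addAll(groupedB.keySet());

        List<List<String>> padding = Collections.singletonList(Collections.nCopies(width, ""));
        List<List<String>> out = new ArrayList<>();
        for (String key : keys) {
            List<List<String>> bagA = groupedA.getOrDefault(key, Collections.emptyList());
            List<List<String>> bagB = groupedB.getOrDefault(key, Collections.emptyList());
            if (padEmptyBags) {
                if (bagA.isEmpty()) bagA = padding;
                if (bagB.isEmpty()) bagB = padding;
            }
            // Flattening two bags side by side is a cross product of their tuples.
            for (List<String> ta : bagA) {
                for (List<String> tb : bagB) {
                    List<String> joined = new ArrayList<>(ta);
                    joined.addAll(tb);
                    out.add(joined);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Same shape as the test data in this thread.
        List<List<String>> a = Arrays.asList(row("1", "1", "1"), row("2", "1", "1"),
                row("3", "1", "1"), row("4", "1", "1"), row("5", "1", "1"));
        List<List<String>> b = Arrays.asList(row("4", "2", "2"), row("5", "2", "2"),
                row("6", "2", "2"), row("7", "2", "2"));
        for (List<String> r : cogroupAndFlatten(a, b, 3, true)) {
            System.out.println(r);
        }
    }
}
```

With the data above, padding enabled gives 7 rows (keys 1 through 7),
and padding disabled gives only the 2 matching rows (keys 4 and 5),
which is the difference between the two dumps of D in this thread.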
On Wed, Jun 11, 2008 at 10:07 AM, Tanton Gibbs <[EMAIL PROTECTED]> wrote:
> You can almost do it, but I can't seem to figure out how to generate a
> constant tuple.
>
> Here is code that works and gets close to what you want, but not quite:
>
> $ cat pigscript.test
> A = LOAD 'mytestA.txt' USING PigStorage();
> B = LOAD 'mytestB.txt' USING PigStorage();
> C = COGROUP A BY $0, B BY $0;
> D = FOREACH C GENERATE flatten(((COUNT(A) == '0') ? '' : A)),
> flatten(((COUNT(B) == '0') ? '' : B));
> dump D;
>
> (1, 1, 1, )
> (2, 1, 1, )
> (3, 1, 1, )
> (4, 1, 1, 4, 2, 2)
> (5, 1, 1, 5, 2, 2)
> (, 6, 2, 2)
> (, 7, 2, 2)
>
> Now, if I could figure out how to use a constant ('', '', '') instead
> of the single '', then I would have what you are looking for, but I
> can't seem to get that to work.
>
> Ideas?
>
> On Wed, Jun 11, 2008 at 9:16 AM, Iván de Prado
> <[EMAIL PROTECTED]> wrote:
>> Let's suppose 1.txt is:
>>
>> 1 1 1
>> 2 1 1
>> 3 1 1
>> 4 1 1
>> 5 1 1
>>
>> And 2.txt is:
>>
>> 4 2 2
>> 5 2 2
>> 6 2 2
>> 7 2 2
>>
>> The script:
>>
>> A = LOAD 'ivan/1.txt' USING PigStorage();
>> B = LOAD 'ivan/2.txt' USING PigStorage();
>> C = COGROUP A by $0, B by $0;
>> D = FOREACH C GENERATE flatten(A), flatten(B);
>>
>> dump C:
>>
>> (1, {(1, 1, 1)}, {})
>> (2, {(2, 1, 1)}, {})
>> (3, {(3, 1, 1)}, {})
>> (4, {(4, 1, 1)}, {(4, 2, 2)})
>> (5, {(5, 1, 1)}, {(5, 2, 2)})
>> (6, {}, {(6, 2, 2)})
>> (7, {}, {(7, 2, 2)})
>> (8, {}, {(8, 2, 2)})
>>
>> dump D;
>>
>> (4, 1, 1, 4, 2, 2)
>> (5, 1, 1, 5, 2, 2)
>>
>> But this is not the result that I expected. I would like to obtain this
>> result:
>>
>> (1, 1, 1, '', '', '')
>> (2, 1, 1, '', '', '')
>> (3, 1, 1, '', '', '')
>> (4, 1, 1, 4, 2, 2)
>> (5, 1, 1, 5, 2, 2)
>> ('','','',6, 2, 2)
>> ('','','',7, 2, 2)
>> ('','','',8, 2, 2)
>>
>> This is the result you would get from an outer join in SQL. How can I
>> modify the script to get this result (apart from doing 3 FILTERs over C
>> and then a UNION)?
>>
>> Thanks and regards,
>> Iván de Prado
>> www.ivanprado.es
>>
>>
>