As a follow-up to what Dmitriy described: just add a projection to pick the
columns you need.
c = join a by filename, b by filename PARALLEL $MY_PARALLELISM;
-- Please check this syntax against the Pig Latin docs, though.
d = foreach c generate a::filename; -- Or anything else you want to pick.
If you need to, run a distinct over d to remove duplicates, though this
might result in more MR jobs.
- Mridul
Rob Stewart wrote:
Hi, yeah, I thought so.
The only slightly confusing issue is that the output would be:
bar.dat bar.dat
(i.e. showing you both a.filename and b.filename)?
Rob.
2010/1/12 Dmitriy Ryaboy wrote:
Rob, it's just a join.
a = load 'rel1' using FooStorage() as (id, filename);
b = load 'rel2' using FooStorage() as (id, filename);
c = join a by filename, b by filename;
Rows that don't match won't make it.
If you DO want them in the output, use an outer join (left, right, or
full "outer") for the relations whose non-matching rows you want retained;
the fields coming from the other relation will be filled in with nulls.
Naturally, since Pig can do it, MR can do it.
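To make that last point concrete, here is a minimal sketch of the usual reduce-side join idea, simulated in plain Java with no Hadoop classes at all. It is an illustration under assumptions, not a real MapReduce job: the class name ReduceSideJoinSketch, the tags "T1"/"T2", and the in-memory TreeMap standing in for the shuffle are all made up for this example. Each record is emitted keyed by filename and tagged with the table it came from, and a key is written out only if tags from both tables arrive in the same group. Because each key is emitted once, the result behaves like the join followed by a distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; no Hadoop API is used here.
final class ReduceSideJoinSketch {

    // "Map" step: emit (filename, tag) per record; the tag names the source table.
    static void map(String tag, List<String[]> records, Map<String, List<String>> shuffle) {
        for (String[] record : records) {
            String filename = record[1]; // record layout assumed to be {id, filename}
            shuffle.computeIfAbsent(filename, k -> new ArrayList<>()).add(tag);
        }
    }

    // "Reduce" step: a filename joins only if both tables contributed a tag for it.
    static List<String> reduce(Map<String, List<String>> shuffle) {
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : shuffle.entrySet()) {
            Set<String> tags = new HashSet<>(group.getValue());
            if (tags.contains("T1") && tags.contains("T2")) {
                joined.add(group.getKey()); // one output row per matching key
            }
        }
        return joined;
    }

    // Runs both "map" passes into one grouped structure, then "reduces" it.
    static List<String> join(List<String[]> tableOne, List<String[]> tableTwo) {
        Map<String, List<String>> shuffle = new TreeMap<>(); // stands in for sort/shuffle
        map("T1", tableOne, shuffle);
        map("T2", tableTwo, shuffle);
        return reduce(shuffle);
    }

    public static void main(String[] args) {
        List<String[]> tableOne = Arrays.asList(
            new String[] {"1", "foo.dat"},
            new String[] {"2", "bar.dat"},
            new String[] {"3", "harry.dat"});
        List<String[]> tableTwo = Arrays.asList(
            new String[] {"1", "tom.dat"},
            new String[] {"2", "bar.dat"},
            new String[] {"3", "gamma.dat"});
        System.out.println(join(tableOne, tableTwo)); // prints [bar.dat]
    }
}
```

In a real job the grouping by key would be done by the framework between the map and reduce phases; only the tagging and the "both tags present" check would be your own code.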
-D
On Tue, Jan 12, 2010 at 2:57 PM, Rob Stewart wrote:
Hi folks,
I have a somewhat obvious question that needs asking (for my sake).
Pig can do Joins, I realise that. But take for example:
Table_1
----------------------
| ID | fileName  |
| 1  | foo.dat   |
| 2  | bar.dat   |
| 3  | harry.dat |
Table_2
----------------------
| ID | fileName  |
| 1  | tom.dat   |
| 2  | bar.dat   |
| 3  | gamma.dat |
SQL syntax for the conditional select:
"select t1.fileName from Table_1 t1, Table_2 t2 where t1.fileName = t2.fileName"
Result
--------
bar.dat
How is such a query represented in Pig?
tableOne = LOAD 'input1.dat' USING PigStorage() AS (id:int, filename:chararray);
tableTwo = LOAD 'input2.dat' USING PigStorage() AS (id:int, filename:chararray);
[Now what??]
STORE query INTO 'Output.pig' USING PigStorage();
As a bonus question, can anybody tell me whether this sort of conditional
select query can be written in Java MapReduce?
thanks,
Rob Stewart