Do you want to keep the distinct values separate by input, or mingle
them? The following script will keep them separate.
A = load 'students' as (name);
B = load 'employees' as (name);
C = cogroup A by name, B by name;
D = filter C by IsEmpty(A);
E = foreach D generate flatten(B);
store E into 'only_employees';
F = filter C by IsEmpty(B);
G = foreach F generate flatten(A);
store G into 'only_students';
To mingle them, replace the two store calls with:
H = union E, G;
store H into 'only_employees_or_students';
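
For the sample data in your message, here is a rough sketch of what I
would expect C to contain. It assumes each input file holds one name
per line; the order of the tuples is not guaranteed.

-- Sketch only: expected contents of C for the sample data in the question.
-- Each tuple is (group, bag of A rows, bag of B rows).
-- (Anne, {},       {(Anne)})   A is empty: kept by D, ends up in only_employees
-- (Dave, {(Dave)}, {(Dave)})   neither bag is empty: dropped by both filters
-- (Jane, {(Jane)}, {})         B is empty: kept by F, ends up in only_students
-- (John, {(John)}, {})         B is empty: kept by F, ends up in only_students
-- (Sue,  {},       {(Sue)})    A is empty: kept by D, ends up in only_employees
dump C;

Running dump on C before adding the filters is a quick way to check
that the grouping looks the way you expect.
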
Alan.
On Nov 25, 2009, at 11:41 AM, James Leek wrote:
Hi, I'm trying to do something with hadoop or pig, that I thought
would be pretty straightforward, but it is turning out to be difficult
for me to implement. Of course, I'm very new to this, so I'm
probably missing something obvious.
What I want to do is a set difference. I would like to take 2 bags,
and remove the values they have in common between them. Let's say I
have two bags, 'students' and 'employees'. I want to find which
students are just students, and which employees are just employees.
So, an example:
Students:
(Jane)
(John)
(Dave)
Employees:
(Dave)
(Sue)
(Anne)
If I were to join these, I would get the students who are also
employees, or: (Dave).
However, what I want is the distinct values:
Only_Student:
(Jane)
(John)
Only_Employee:
(Sue)
(Anne)
This should be do-able in a single map-reduce pass, but I found I
was going to have to write a custom inputter for it so I could
remember which values were from the students file and which were
from the employees file. (At least, I wasn't able to figure that
bit out.) I also wasn't sure how to write the output to two
separate files.
So I thought pig might have some quick way to do this, but so far
I've had no luck even expressing set subtraction in pig. (I could
do this less efficiently with set subtraction like so: only_employee
= employees - join(students, employees) )
Does anyone know what I'm missing?
Thanks,
Jim