Re: Diffing two bags?

Alan Gates Wed, 25 Nov 2009 11:57:38 -0800

Do you want to keep the distinct values separate by input, or mingle them? The following script will keep them separate.

A = load 'students' as (name);
B = load 'employees' as (name);
C = cogroup A by name, B by name;
D = filter C by IsEmpty(A);
E = foreach D generate flatten(B);
store E into 'only_employees';
F = filter C by IsEmpty(B);
G = foreach F generate flatten(A);
store G into 'only_students';
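
For reference, here is a minimal sketch of what the cogroup step produces, assuming the two input files hold exactly the sample names from the question quoted below. The dump statement is an added inspection step, not part of the script above, and the row order shown is not guaranteed:

```pig
A = load 'students' as (name);
B = load 'employees' as (name);
C = cogroup A by name, B by name;
-- Added for inspection only; prints the cogrouped relation.
dump C;
-- Each row is (group, bag of matching A tuples, bag of matching B tuples),
-- so the output should look roughly like:
-- (Anne,{},{(Anne)})
-- (Dave,{(Dave)},{(Dave)})
-- (Jane,{(Jane)},{})
-- (John,{(John)},{})
-- (Sue,{},{(Sue)})
```

Rows with an empty A bag hold names found only in employees, and rows with an empty B bag hold names found only in students, which is what the two IsEmpty filters select.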


To mingle them, replace the two store calls with:

H = union E, G;
store H into 'only_employees_or_students';
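
If only the mingled names are needed, a single filter may also work. This is a sketch under assumptions, not a tested replacement: the aliases D2 and H2 and the output name 'in_exactly_one_input' are hypothetical, chosen to avoid clashing with the script above.

```pig
A = load 'students' as (name);
B = load 'employees' as (name);
C = cogroup A by name, B by name;
-- cogroup only emits keys that occur in at least one input, so a row
-- with one empty bag is a name present in exactly one of the two inputs.
D2 = filter C by IsEmpty(A) or IsEmpty(B);
-- Emit the group key itself rather than flattening either bag.
H2 = foreach D2 generate group as name;
store H2 into 'in_exactly_one_input';
```

One difference to be aware of: this emits each name once, whereas flattening the bags keeps any duplicate rows that were in the inputs.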

Alan.

On Nov 25, 2009, at 11:41 AM, James Leek wrote:

Hi, I'm trying to do something with hadoop or pig that I thought would be pretty straightforward, but it is turning out to be difficult for me to implement. Of course, I'm very new to this, so I'm probably missing something obvious.
What I want to do is a set difference. I would like to take 2 bags, and remove the values they have in common between them. Let's say I have two bags, 'students' and 'employees'. I want to find which students are just students, and which employees are just employees. So, an example:
Students:
(Jane)
(John)
(Dave)

Employees:
(Dave)
(Sue)
(Anne)
If I were to join these, I would get the students who are also employees, or: (Dave).
However, what I want is the distinct values:

Only_Student:
(Jane)
(John)

Only_Employee:
(Sue)
(Anne)
This should be do-able in a single map-reduce pass, but I found I was going to have to write a custom inputter for it so I could remember which values were from the students file and which were from the employees file. (At least, I wasn't able to figure that bit out.) I also wasn't sure how to write the output to two separate files.
So I thought pig might have some quick way to do this, but so far I've had no luck even expressing set subtraction in pig. (I could do this less efficiently with set subtraction like so: only_employee = employees - join(students, employees) )
Does anyone know what I'm missing?
Thanks,
Jim
