Diffing two bags?

James Leek Wed, 25 Nov 2009 11:42:30 -0800

Hi, I'm trying to do something with Hadoop or Pig that I thought would be pretty straightforward, but it is turning out to be difficult for me to implement. Of course, I'm very new to this, so I'm probably missing something obvious.

What I want to do is a set difference. I would like to take two bags and remove the values they have in common. Let's say I have two bags, 'students' and 'employees'. I want to find which students are just students, and which employees are just employees. So, an example:


Students:
(Jane)
(John)
(Dave)

Employees:
(Dave)
(Sue)
(Anne)

If I were to join these, I would get the students who are also employees, or: (Dave).


However, what I want is the values unique to each bag:

Only_Student:
(Jane)
(John)

Only_Employee:
(Sue)
(Anne)

This should be doable in a single map-reduce pass, but I found I was going to have to write a custom inputter for it so I could remember which values were from the students file and which were from the employees file. (At least, I wasn't able to figure that bit out.) I also wasn't sure how to write the output to two separate files.
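
To make the single-pass idea concrete, here is a minimal sketch in plain Python rather than actual Hadoop code. It only simulates the steps: the map step tags each name with the file it came from, the grouping stands in for the shuffle, and the reduce step keeps names seen under exactly one tag. The function names and the "student"/"employee" tags are made up for illustration.

```python
from collections import defaultdict

def map_phase(records, tag):
    # Emit (name, tag) so the reducer knows which input each value came from.
    for name in records:
        yield name, tag

def reduce_phase(pairs):
    # Group tags by name (this stands in for the shuffle), then route every
    # name seen under exactly one tag to that tag's output list.
    tags_by_name = defaultdict(set)
    for name, tag in pairs:
        tags_by_name[name].add(tag)
    outputs = defaultdict(list)
    for name, tags in tags_by_name.items():
        if len(tags) == 1:
            outputs[next(iter(tags))].append(name)
    return outputs

students = ["Jane", "John", "Dave"]
employees = ["Dave", "Sue", "Anne"]

pairs = list(map_phase(students, "student")) + list(map_phase(employees, "employee"))
result = reduce_phase(pairs)
only_student = result["student"]
only_employee = result["employee"]
```

In a real job the two output lists would correspond to the two separate output files, which is the part I have not worked out.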

So I thought Pig might have some quick way to do this, but so far I've had no luck even expressing set subtraction in Pig. (I could do this less efficiently with set subtraction like so: only_employee = employees - join(students, employees) )
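
For reference, the subtraction formula above can be written out with Python's built-in sets. This is only an illustration of the intended result, not Pig syntax, and it assumes each bag holds distinct names:

```python
students = {"Jane", "John", "Dave"}
employees = {"Dave", "Sue", "Anne"}

# The join: names present in both bags.
both = students & employees

# Subtract the join from each side to get the names unique to it.
only_student = students - both
only_employee = employees - both
```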


Does anyone know what I'm missing?
Thanks,
Jim
