Hi, I'm trying to do something with Hadoop or Pig that I thought would be pretty straightforward, but it is turning out to be difficult for me to implement. Of course, I'm very new to this, so I'm probably missing something obvious.

What I want to do is a set difference. I would like to take two bags and remove the values they have in common. Let's say I have two bags, 'students' and 'employees'. I want to find which students are just students, and which employees are just employees. So, an example:

Students:
(Jane)
(John)
(Dave)

Employees:
(Dave)
(Sue)
(Anne)

If I were to join these, I would get the students who are also employees, or: (Dave).

However, what I want is the values that appear in only one of the two bags:

Only_Student:
(Jane)
(John)

Only_Employee:
(Sue)
(Anne)
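
To make the goal concrete, here is a small sketch in plain Python (not Hadoop or Pig, just built-in sets) of the result I'm after. The variable names are only for illustration.

```python
# Plain-Python illustration of the desired result (not Hadoop or Pig code).
students = {"Jane", "John", "Dave"}
employees = {"Dave", "Sue", "Anne"}

# What a join on the name would give me: the people in both bags.
both = students & employees

# What I actually want: each bag with the shared values removed.
only_student = students - both
only_employee = employees - both

print(sorted(only_student))
print(sorted(only_employee))
```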

This should be doable in a single map-reduce pass, but I found I would have to write a custom inputter for it so I could remember which values came from the students file and which came from the employees file. (At least, I wasn't able to figure that part out.) I also wasn't sure how to write the output to two separate files.
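
Here is roughly the single pass I have in mind, simulated in plain Python rather than written against the real Hadoop API. All function names here are made up for illustration, and the sketch sidesteps the two parts I'm stuck on: it passes the source label in by hand instead of discovering it from the input file, and it returns two lists instead of writing two output files.

```python
from collections import defaultdict


def map_phase(records, source):
    """Tag every record with the name of the input it came from."""
    for name in records:
        yield name, source


def reduce_phase(name, sources):
    """Emit the name only if it was seen in exactly one input."""
    distinct = set(sources)
    if len(distinct) == 1:
        yield distinct.pop(), name  # (output label, value)


def set_difference_pass(students, employees):
    # Shuffle step: group the source tags by name, as the framework would.
    grouped = defaultdict(list)
    for name, source in map_phase(students, "students"):
        grouped[name].append(source)
    for name, source in map_phase(employees, "employees"):
        grouped[name].append(source)

    # Route each surviving name to the output for its source.
    outputs = {"students": [], "employees": []}
    for name in sorted(grouped):
        for label, value in reduce_phase(name, grouped[name]):
            outputs[label].append(value)
    return outputs


result = set_difference_pass(["Jane", "John", "Dave"], ["Dave", "Sue", "Anne"])
print(result)
```

The idea is that the mapper attaches a source tag to each value, and the reducer keeps a value only when all of its tags come from the same input.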

So I thought Pig might have some quick way to do this, but so far I've had no luck even expressing set subtraction in Pig. (If I could, I would be able to do this, less efficiently, like so: only_employee = employees - join(students, employees).)

Does anyone know what I'm missing?
Thanks,
Jim
