Hi, I'm trying to do something with Hadoop or Pig that I thought would
be pretty straightforward, but it is turning out to be difficult for me
to implement. Of course, I'm very new to this, so I'm probably missing
something obvious.
What I want to do is a set difference. I would like to take two bags and
remove the values they have in common. Let's say I have two bags,
'students' and 'employees'. I want to find which students are just
students, and which employees are just employees. So, an example:
Students:
(Jane)
(John)
(Dave)
Employees:
(Dave)
(Sue)
(Anne)
If I were to join these, I would get the students who are also
employees, or: (Dave).
However, what I want is the values that appear in only one of the bags:
Only_Student:
(Jane)
(John)
Only_Employee:
(Sue)
(Anne)
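To pin down the result I'm after, here is the same example as a minimal
sketch in plain Python sets. This is not Hadoop or Pig code, and the
variable names are just for illustration:

```python
# Plain Python, not Hadoop or Pig: this only shows the result I want.
students = {"Jane", "John", "Dave"}
employees = {"Dave", "Sue", "Anne"}

# Set difference: drop every value that also appears in the other set.
only_student = students - employees
only_employee = employees - students

print(sorted(only_student))
print(sorted(only_employee))
```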
This should be doable in a single map-reduce pass, but I found I was
going to have to write a custom input format for it so I could remember
which values were from the students file and which were from the
employees file. (At least, I wasn't able to figure that bit out.) I
also wasn't sure how to write the output to two separate files.
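For what it's worth, the single pass I have in mind looks roughly like
the sketch below, simulated in plain Python rather than real Hadoop
code. The map_phase and reduce_phase functions and the "students" and
"employees" source tags are hypothetical names I made up for
illustration; the tagging step is exactly the part I could not work out
how to do in Hadoop itself.

```python
from collections import defaultdict


def map_phase(records, source):
    """Tag each name with the (hypothetical) label of the file it came from."""
    for name in records:
        yield name, source


def reduce_phase(tagged_pairs):
    """Group the tags by name and keep names seen in exactly one source."""
    sources_by_name = defaultdict(set)
    for name, source in tagged_pairs:
        sources_by_name[name].add(source)

    # One output list per source, standing in for two separate output files.
    outputs = defaultdict(list)
    for name, sources in sources_by_name.items():
        if len(sources) == 1:
            (source,) = sources
            outputs[source].append(name)
    return outputs


students = ["Jane", "John", "Dave"]
employees = ["Dave", "Sue", "Anne"]

tagged = list(map_phase(students, "students"))
tagged += list(map_phase(employees, "employees"))
result = reduce_phase(tagged)

print(sorted(result["students"]))
print(sorted(result["employees"]))
```

Here the two lists in result stand in for the two output files I would
like to end up with.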
So I thought Pig might have some quick way to do this, but so far I've
had no luck even expressing set subtraction in Pig. (If I had set
subtraction, I could do this less efficiently like so:
only_employee = employees - join(students, employees).)
Does anyone know what I'm missing?
Thanks,
Jim