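Hi Jim,

This sounds like a full outer join of students and employees on the name. In the joined rows, a null on the left (student) side means that person is only an employee, and a null on the right (employee) side means that person is only a student. Rows with both sides filled in are the people who are both, such as Dave in your example, and can be dropped.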
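
To make the idea concrete, here is a minimal sketch of the logic in plain Python rather than Pig syntax. The helper name full_outer_join is made up for illustration, None stands in for a null, and it assumes each bag holds bare names, so duplicates within one bag collapse.

```python
def full_outer_join(left, right):
    """Hypothetical helper: pair up equal values from both inputs.

    Returns (left_value, right_value) tuples, with None marking the side
    that has no match. Assumes bare, hashable values and ignores duplicates.
    """
    left_set, right_set = set(left), set(right)
    rows = []
    for key in sorted(left_set | right_set):
        rows.append((key if key in left_set else None,
                     key if key in right_set else None))
    return rows


students = ["Jane", "John", "Dave"]
employees = ["Dave", "Sue", "Anne"]

joined = full_outer_join(students, employees)

# Null on the right: present only in students.
only_student = [s for s, e in joined if e is None]
# Null on the left: present only in employees.
only_employee = [e for s, e in joined if s is None]
```

Both result lists come out of the same joined relation, so the two inputs are only combined once; splitting on which side is null is what separates the two outputs.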

On Wed, Nov 25, 2009 at 2:41 PM, James Leek <[email protected]> wrote:

> Hi, I'm trying to do something with hadoop or pig, that I thought would be
> pretty straightforward, but it turning out to be difficult for me to
> implement. Of course, I'm very new to this, so I'm probably missing
> something obvious.
>
> What I want to do is a set difference. I would like to take 2 bags, and
> remove the values they have in common between them. Let's say I have two
> bags, 'students' and 'employees'. I want to find which students are just
> students, and which employees are just employees. So, an example:
>
> Students:
> (Jane)
> (John)
> (Dave)
>
> Employees:
> (Dave)
> (Sue)
> (Anne)
>
> If I were to join these, I would get the students who are also employees,
> or: (Dave).
>
> However, what I want is the distinct values:
>
> Only_Student:
> (Jane)
> (John)
>
> Only_Employee:
> (Sue)
> (Anne)
>
> This should be do-able in a single map-reduce pass, but I found I was going
> to have to write a custom inputter for it so I could remember which values
> were from the students file and which were from the employees file. (At
> least, I wasn't able to figure that bit out.) I also wasn't sure how to
> write the output to two separate files.
>
> So I thought pig might have some quick way to do this, but so far I've had
> no luck even expressing set subtraction in pig. (I could do this less
> efficiently with set subtraction like so: only_employee = employees -
> join(students, employees) )
>
> Does anyone know what I'm missing?
> Thanks,
> Jim
