Re: Diffing two bags?

Dmitriy Ryaboy Wed, 25 Nov 2009 11:50:44 -0800

Hi Jim,
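This sounds like a full outer join on the name, with students on the left
and employees on the right. A row with a null on the left means the employee
is just an employee, and a row with a null on the right means the student is
just a student. Rows with values on both sides (Dave, in your example) are
the people who are both, so you can drop them.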
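
Something along these lines might work as a starting point. It is only a
sketch: the file names, the output directories and the single field called
"name" are my assumptions, not anything from your setup, and it assumes a
Pig version that supports the FULL OUTER keyword on JOIN.

```
-- Hypothetical input paths and field name; adjust to your data.
students  = LOAD 'students.txt'  AS (name:chararray);
employees = LOAD 'employees.txt' AS (name:chararray);

-- A full outer join keeps unmatched rows from both sides and pads
-- the missing side with nulls.
joined = JOIN students BY name FULL OUTER, employees BY name;

-- A null on one side means the name appears only in the other relation.
-- Rows that matched on both sides satisfy neither condition and are dropped.
SPLIT joined INTO
    student_rows  IF employees::name IS NULL,
    employee_rows IF students::name IS NULL;

-- Keep just the name from the side that is present.
only_student  = FOREACH student_rows  GENERATE students::name  AS name;
only_employee = FOREACH employee_rows GENERATE employees::name AS name;

-- Hypothetical output directories, one per result.
STORE only_student  INTO 'only_student';
STORE only_employee INTO 'only_employee';
```

On your example data this should leave Jane and John in only_student and
Sue and Anne in only_employee, and the two STORE statements take care of
writing to two separate outputs. One caveat: if a name can itself be null
in the input, the null test can no longer tell "missing side" from "null
name", so you would want to filter those rows out first.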


On Wed, Nov 25, 2009 at 2:41 PM, James Leek <[email protected]> wrote:

> Hi, I'm trying to do something with hadoop or pig, that I thought would be
> pretty straightforward, but it is turning out to be difficult for me to
> implement.  Of course, I'm very new to this, so I'm probably missing
> something obvious.
>
> What I want to do is a set difference.  I would like to take 2 bags, and
> remove the values they have in common between them.  Let's say I have two
> bags, 'students' and 'employees'.  I want to find which students are just
> students, and which employees are just employees.  So, an example:
>
> Students:
> (Jane)
> (John)
> (Dave)
>
> Employees:
> (Dave)
> (Sue)
> (Anne)
>
> If I were to join these, I would get the students who are also employees,
> or: (Dave).
>
> However, what I want is the distinct values:
>
> Only_Student:
> (Jane)
> (John)
>
> Only_Employee:
> (Sue)
> (Anne)
>
> This should be do-able in a single map-reduce pass, but I found I was going
> to have to write a custom inputter for it so I could remember which values
> were from the students file and which were from the employees file.  (At
> least, I wasn't able to figure that bit out.)  I also wasn't sure how to
> write the output to two separate files.
>
> So I thought pig might have some quick way to do this, but so far I've had
> no luck even expressing set subtraction in pig.  (I could do this less
> efficiently with set subtraction like so: only_employee = employees -
> join(students, employees)  )
>
> Does anyone know what I'm missing?
> Thanks,
> Jim
>
