Joe,

What about this approach:

use hash values of your records as the keys your mappers emit. Since map
output is sorted by key, the reducer gets all duplicates together and can
loop through them; as the simplest solution, you just take the first one.
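
Roughly like this (an untested sketch, class names are mine; it assumes
plain-text records, the 0.20 org.apache.hadoop.mapreduce API, and
commons-codec's DigestUtils for the MD5):

import java.io.IOException;

import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupJob {

  // Mapper: key each record by a hash of it, so duplicates share a key.
  public static class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
        throws IOException, InterruptedException {
      // MD5 of the whole record (or of just the fields you aggregate on)
      String hash = DigestUtils.md5Hex(record.toString());
      context.write(new Text(hash), record);
    }
  }

  // Reducer: duplicates arrive grouped under one key; keep only the first.
  public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hash, Iterable<Text> records, Context context)
        throws IOException, InterruptedException {
      context.write(hash, records.iterator().next());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "dedup");
    job.setJarByClass(DedupJob.class);
    job.setMapperClass(DedupMapper.class);
    job.setReducerClass(DedupReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

You could also use the raw record itself as the key; hashing it just keeps
the keys small.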

Sincerely,
Mark

On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <crypt...@gmail.com> wrote:

> I have been researching ways to handle de-dupping data while running a
> map/reduce program (so as to not re-calculate/re-aggregate data that
> we have seen before[possibly months before]).
>
> The data sets we have are littered with repeats of data from mobile
> devices which continue to come in over time (so we may see duplicates
> of data re-posted months after it originally posted...)
>
> I have 2 ways so far I can go about it (one way I do in production
> without Hadoop) and interested to see if others have faced/solved this
> in Hadoop/HDFS and what their experience might be.
>
> 1) handle my own hash filter (where I continually store and look up a
> hash (MD5, bloom, whatever) of the data I am aggregating on as
> existing already).  We do this now without Hadoop; perhaps a variant
> can be ported into HDFS as a map task, reducing the results to files and
> restoring the hash table (maybe in Hive or something, dunno yet)
> 2) push the data into Cassandra (our NoSQL solution of choice) and let
> that hash/map system do it for us.   As I get more into Hadoop, looking
> at HBase is tempting, but that is just one more thing to learn.
>
> I would really like to not have to reinvent a wheel here and even
> contribute if something is going on as it is a use case in our work
> effort.
>
> Thanx in advance =8^)  Apologies, I posted this on common-dev yesterday
> by accident (so this is not repost spam but appropriate for this
> list)
>
> Cheers.
>
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
>

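P.S. On your option (1): Hadoop already ships a Bloom filter
(org.apache.hadoop.util.bloom.BloomFilter), so you may not have to port your
own. A rough, untested sketch (class name and sizing numbers are just
placeholders):

import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class SeenFilter {
  // ~10M bits and 5 hash functions; size these for your expected key count
  // and the false-positive rate you can live with.
  private final BloomFilter seen = new BloomFilter(10000000, 5, Hash.MURMUR_HASH);

  // True the first time a given record hash shows up, false after that
  // (Bloom filters can false-positive, so a rare new record may be dropped).
  public boolean firstTime(String recordHash) throws Exception {
    Key k = new Key(recordHash.getBytes("UTF-8"));
    if (seen.membershipTest(k)) {
      return false;
    }
    seen.add(k);
    return true;
  }
}

If I remember right, the filter is Writable, so you could write it out to
HDFS at the end of a run and read it back in a mapper's setup() for the next
one.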