Thanks everyone. I think we are going to go with HBase and use its hash map structures to keep our values unique (table.exists(value)), and see how it goes.
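The check itself should be tiny. A minimal sketch of what we have in mind, assuming the classic HTable client API; the table name "dedup" and column family "d" below are just placeholders for whatever we end up creating:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupCheck {
  public static void main(String[] args) throws Exception {
    // assumes a "dedup" table with a single column family "d" already exists
    HTable table = new HTable(new HBaseConfiguration(), "dedup");
    byte[] hashKey = Bytes.toBytes(args[0]);   // e.g. the MD5 of the record
    if (table.exists(new Get(hashKey))) {
      System.out.println("duplicate - skip it");
    } else {
      // first time we see this hash: remember it, then run the business logic
      Put p = new Put(hashKey);
      p.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));
      table.put(p);
      System.out.println("new value - aggregate it");
    }
    table.flushCommits();
  }
}

(The exists-then-put is not atomic, so a duplicate could sneak in between the two calls if the same hash arrives twice at the same moment; for our aggregation that is good enough.)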
Appreciate the insights. Being a developer, I do like the extended sort and join ideas in MR and will likely use them for other things. It just seems like a lot of work to get to the point of executing our business logic (time/resources, etc.). As the old saying goes, "sometimes you only need chopsticks to catch a fly" =8^)

Thanks again!!!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

On Fri, Mar 26, 2010 at 3:26 PM, Jeyendran Balakrishnan <jbalakrish...@docomolabs-usa.com> wrote:
> Joe,
>
> This is what I use for a related problem, using pure HDFS [no HBase]:
>
> 1. Run a one-time map-reduce job whose input is your current historical file
> of hashes [say it is of the format <hash-key, hash-value> in some kind of
> flat file], using IdentityMapper, and whose custom reducer outputs a
> <key, value> of <hash-key, hash-value>, or maybe even <hash-key, dummy value>
> to save space. The important thing is to use MapFileOutputFormat for the
> reducer output instead of the typical SequenceFileOutputFormat. Now you have
> a single look-up table which you can use for efficient lookups by hash key.
> Note down the HDFS path where you stored this mapfile; call it dedupMapFile.
>
> 2. In your incremental data update job, pass the HDFS path of dedupMapFile in
> your conf, then open the mapfile in your reducer's configure(), store the
> reference to the mapfile in the class, and close it in close().
> Inside your reduce(), use the mapfile reference to look up your hash key; if
> there is a hit, it is a dup. (See the sketch after 4. below.)
>
> 3. Also, for your reducer in 2. above, you can use a custom multiple-output
> format, in which one of the outputs is your current output, and the other is
> a new dedup output sequencefile in the same key-value format as the
> dedupMapFile. So in reduce(), if the current key value is a dup, discard it;
> else output to both your regular output and the new dedup output.
>
> 4. After each incremental update job, run a new map-reduce job
> [IdentityMapper and IdentityReducer] to merge the new dedup file with your
> old dedupMapFile, resulting in the updated dedupMapFile.
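>
> A rough sketch of the lookup reducer in 2. above, assuming the old
> org.apache.hadoop.mapred API (configure()/close(), as described); the conf
> key "dedup.mapfile.path" and the Text key/value types are just placeholders
> for whatever your step 1 job actually wrote:
>
> import java.io.IOException;
> import java.util.Iterator;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reducer;
> import org.apache.hadoop.mapred.Reporter;
>
> public class DedupLookupReducer extends MapReduceBase
>     implements Reducer<Text, Text, Text, Text> {
>
>   private MapFile.Reader dedupMapFile;   // the lookup table built in step 1
>
>   public void configure(JobConf job) {
>     try {
>       FileSystem fs = FileSystem.get(job);
>       // "dedup.mapfile.path" is an example conf key set by the driver
>       dedupMapFile = new MapFile.Reader(fs, job.get("dedup.mapfile.path"), job);
>     } catch (IOException e) {
>       throw new RuntimeException("cannot open dedup mapfile", e);
>     }
>   }
>
>   public void reduce(Text hashKey, Iterator<Text> values,
>       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
>     Text existing = new Text();
>     if (dedupMapFile.get(hashKey, existing) != null) {
>       return;                           // hit in the mapfile => dup, discard
>     }
>     while (values.hasNext()) {
>       output.collect(hashKey, values.next());   // new hash => keep it
>     }
>   }
>
>   public void close() throws IOException {
>     dedupMapFile.close();
>   }
> }
>
> The step 1 job that builds the mapfile is just an identity map/reduce whose
> driver calls conf.setOutputFormat(MapFileOutputFormat.class) instead of the
> usual SequenceFileOutputFormat.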
>
> Some comments:
> * I didn't read your approach too closely, so I suspect you might be doing
> something essentially like this already.
> * All this stuff is basically what HBase gives you for free: your
> dedupMapFile is now an HBase table, and you don't have to run step 4, since
> you can just write new [non-duplicate] hash keys to the HBase table in step
> 3, and in step 2 you just use table.exists(hash-key) to check whether it is
> a dup. You still need step 1 to populate the table with your historical data.
>
> Hope this helps....
>
> Cheers,
> jp
>
>
> -----Original Message-----
> From: Joseph Stein [mailto:crypt...@gmail.com]
> Sent: Thursday, March 25, 2010 11:35 AM
> To: common-user@hadoop.apache.org
> Subject: Re: DeDuplication Techniques
>
> The thing is I have to check historic data (meaning data I have already
> aggregated against), so I basically need to hold and read from a file of
> hashes.
>
> So within the current data set, yes, this would work, but I then have to
> open a file, loop through the values, and see that each one is not there.
>
> If it is there, I throw it out; if it is not there, I add it to the end.
>
> To me, this opening of a file and checking for dups is a map/reduce task in
> itself.
>
> What I was thinking is having my mapper take the data I want to validate as
> unique. I then loop through the filter files. Each data point has a key that
> allows me to get the file that holds its data; e.g. part of the data
> partitions the hashes, so each file holds one partition. So my map job takes
> the data and breaks it into a key/value pair (the key allows me to look up
> my filter file).
>
> When it gets to the reducer... the key tells me which filter file to open. I
> then open the file... loop through it... if the hash is there, I throw the
> data away; if it is not there, I add the hash of my data to the filter file
> and then output (as the reduce output) the unique value.
>
> This output of unique values is then the data I aggregate on, and it has
> also updated my historic filter so the next job (5 minutes later) sees it,
> etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
>> Joe,
>>
>> What about this approach:
>>
>> use the hash values as your keys in the MR maps. Since map output is sorted
>> by key, in the reducer you will get all duplicates together, so you can
>> loop through them. As the simplest solution, you just take the first one.
>>
>> Sincerely,
>> Mark
>>
>> On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <crypt...@gmail.com> wrote:
>>
>>> I have been researching ways to handle de-duping data while running a
>>> map/reduce program (so as to not re-calculate/re-aggregate data that we
>>> have seen before [possibly months before]).
>>>
>>> The data sets we have are littered with repeats of data from mobile
>>> devices which continue to come in over time (so we may see duplicates of
>>> data re-posted months after it was originally posted...).
>>>
>>> I have 2 ways so far I can go about it (one way I do in production
>>> without Hadoop) and am interested to see if others have faced/solved
>>> this in Hadoop/HDFS and what their experience might be.
>>>
>>> 1) Handle my own hash filter (where I continually store and look up a
>>> hash (MD5, bloom, whatever) of the data I am aggregating on as already
>>> existing). We do this now without Hadoop; perhaps a variant can be
>>> ported into HDFS as a map task, reducing the results to files and
>>> restoring the hash table (maybe in Hive or something, dunno yet).
>>> 2) Push the data into Cassandra (our NoSQL solution of choice) and let
>>> that hash/map system do it for us. As I get more into Hadoop, looking at
>>> HBase is tempting, but then it is just one more thing to learn.
>>>
>>> I would really like to not have to reinvent the wheel here, and I would
>>> even contribute if something is going on, as it is a use case in our
>>> work effort.
>>>
>>> Thanx in advance =8^) Apologies, I posted this on common-dev yesterday
>>> by accident (so this is not repost spam but appropriate for this list).
>>>
>>> Cheers.
>>>
>>> /*
>>> Joe Stein
>>> http://www.linkedin.com/in/charmalloc
>>> */
>>>
>>
>
>
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
>
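For the archives, a bare-bones sketch of the pure map/reduce dedup Mark suggests above: emit the hash as the map key so all duplicates land in the same reduce call, then keep only the first copy. Hashing the whole record with MD5 is just a placeholder; in practice you would hash only the fields that define "the same event".

import java.io.IOException;
import java.util.Iterator;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HashDedup {

  public static class HashMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text record,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      // key = hash of the record, value = the record itself
      out.collect(new Text(DigestUtils.md5Hex(record.toString())), record);
    }
  }

  public static class KeepFirstReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text hash, Iterator<Text> records,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      // all records with the same hash arrive together; keep the first, drop the rest
      out.collect(hash, records.next());
    }
  }
}

This only catches duplicates within the data a single job sees; the months-old repeats still need the historical mapfile or HBase lookup discussed above.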