Thanks everyone. I think we are going to go with HBase and use its hash map structures to keep our values unique (table.exists(value)), and see how it goes.
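The check itself should be tiny. A minimal sketch of what we have in mind, assuming the classic HTable client API; the table name "dedup" and column family "d" below are just placeholders for whatever we end up creating:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupCheck {
  public static void main(String[] args) throws Exception {
    // assumes a "dedup" table with a single column family "d" already exists
    HTable table = new HTable(new HBaseConfiguration(), "dedup");
    byte[] hashKey = Bytes.toBytes(args[0]);   // e.g. the MD5 of the record
    if (table.exists(new Get(hashKey))) {
      System.out.println("duplicate - skip it");
    } else {
      // first time we see this hash: remember it, then run the business logic
      Put p = new Put(hashKey);
      p.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));
      table.put(p);
      System.out.println("new value - aggregate it");
    }
    table.flushCommits();
  }
}

(The exists-then-put is not atomic, so a duplicate could sneak in between the two calls if the same hash arrives twice at the same moment; for our aggregation that is good enough.)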
Appreciate the insights. Being a developer, I do like the extended sort and join ideas in MR and will likely use them for other things. It just seems like a lot of work to get to the point of executing our business logic (time/resources, etc.). As the old saying goes, "sometimes you only need chopsticks to catch a fly" =8^)

Thanks again!!!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

On Fri, Mar 26, 2010 at 3:26 PM, Jeyendran Balakrishnan <jbalakrish...@docomolabs-usa.com> wrote:
> Joe,
>
> This is what I use for a related problem, using pure HDFS [no HBase]:
>
> 1. Run a one-time map-reduce job whose input is your current historical file
> of hashes [say it is of the format <hash-key, hash-value> in some kind of
> flat file], using IdentityMapper, and whose custom reducer outputs a
> <key, value> of <hash-key, hash-value>, or maybe even <hash-key, dummy value>
> to save space. The important thing is to use MapFileOutputFormat for the
> reducer output instead of the typical SequenceFileOutputFormat. Now you have
> a single look-up table which you can use for efficient lookups by hash key.
> Note down the HDFS path where you stored this mapfile; call it dedupMapFile.
>
> 2. In your incremental data update job, pass the HDFS path of dedupMapFile in
> your conf, then open the mapfile in your reducer's configure(), store the
> reference to the mapfile in the class, and close it in close().
> Inside your reduce(), use the mapfile reference to look up your hash key; if
> there is a hit, it is a dup. (See the sketch after 4. below.)
>
> 3. Also, for your reducer in 2. above, you can use a custom multiple-output
> format, in which one of the outputs is your current output, and the other is
> a new dedup output sequencefile in the same key-value format as the
> dedupMapFile. So in reduce(), if the current key value is a dup, discard it;
> else output to both your regular output and the new dedup output.
>
> 4. After each incremental update job, run a new map-reduce job
> [IdentityMapper and IdentityReducer] to merge the new dedup file with your
> old dedupMapFile, resulting in the updated dedupMapFile.
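>
> A rough sketch of the lookup reducer in 2. above, assuming the old
> org.apache.hadoop.mapred API (configure()/close(), as described); the conf
> key "dedup.mapfile.path" and the Text key/value types are just placeholders
> for whatever your step 1 job actually wrote:
>
> import java.io.IOException;
> import java.util.Iterator;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.io.MapFile;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.JobConf;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reducer;
> import org.apache.hadoop.mapred.Reporter;
>
> public class DedupLookupReducer extends MapReduceBase
>     implements Reducer<Text, Text, Text, Text> {
>
>   private MapFile.Reader dedupMapFile;   // the lookup table built in step 1
>
>   public void configure(JobConf job) {
>     try {
>       FileSystem fs = FileSystem.get(job);
>       // "dedup.mapfile.path" is an example conf key set by the driver
>       dedupMapFile = new MapFile.Reader(fs, job.get("dedup.mapfile.path"), job);
>     } catch (IOException e) {
>       throw new RuntimeException("cannot open dedup mapfile", e);
>     }
>   }
>
>   public void reduce(Text hashKey, Iterator<Text> values,
>       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
>     Text existing = new Text();
>     if (dedupMapFile.get(hashKey, existing) != null) {
>       return;                           // hit in the mapfile => dup, discard
>     }
>     while (values.hasNext()) {
>       output.collect(hashKey, values.next());   // new hash => keep it
>     }
>   }
>
>   public void close() throws IOException {
>     dedupMapFile.close();
>   }
> }
>
> The step 1 job that builds the mapfile is just an identity map/reduce whose
> driver calls conf.setOutputFormat(MapFileOutputFormat.class) instead of the
> usual SequenceFileOutputFormat.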
>
> Some comments:
> * I didn't read your approach too closely, so I suspect you might be doing
> something essentially like this already.
> * All this stuff is basically what HBase gives you for free: your
> dedupMapFile is now an HBase table, and you don't have to run step 4, since
> you can just write new [non-duplicate] hash keys to the HBase table in step
> 3, and in step 2 you just use table.exists(hash-key) to check whether it is
> a dup. You still need step 1 to populate the table with your historical data.
>
> Hope this helps....
>
> Cheers,
> jp
>
>
> -----Original Message-----
> From: Joseph Stein [mailto:crypt...@gmail.com]
> Sent: Thursday, March 25, 2010 11:35 AM
> To: common-user@hadoop.apache.org
> Subject: Re: DeDuplication Techniques
>
> The thing is I have to check historic data (meaning data I have already
> aggregated against), so I basically need to hold and read from a file of
> hashes.
>
> So within the current data set, yes, this would work, but I then have to
> open a file, loop through the values, and see that each one is not there.
>
> If it is there, I throw it out; if it is not there, I add it to the end.
>
> To me, this opening of a file and checking for dups is a map/reduce task in
> itself.
>
> What I was thinking is having my mapper take the data I want to validate as
> unique. I then loop through the filter files. Each data point has a key that
> allows me to get the file that holds its data; e.g. part of the data
> partitions the hashes, so each file holds one partition. So my map job takes
> the data and breaks it into a key/value pair (the key allows me to look up
> my filter file).
>
> When it gets to the reducer... the key tells me which filter file to open. I
> then open the file... loop through it... if the hash is there, I throw the
> data away; if it is not there, I add the hash of my data to the filter file
> and then output (as the reduce output) the unique value.
>
> This output of unique values is then the data I aggregate on, and it has
> also updated my historic filter so the next job (5 minutes later) sees it,
> etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <markkerz...@gmail.com> wrote:
>> Joe,
>>
>> What about this approach:
>>
>> use the hash values as your keys in the MR maps. Since map output is sorted
>> by key, in the reducer you will get all duplicates together, so you can
>> loop through them. As the simplest solution, you just take the first one.
>>
>> Sincerely,
>> Mark
>>
>> On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <crypt...@gmail.com> wrote:
>>
>>> I have been researching ways to handle de-duping data while running a
>>> map/reduce program (so as to not re-calculate/re-aggregate data that we
>>> have seen before [possibly months before]).
>>>
>>> The data sets we have are littered with repeats of data from mobile
>>> devices which continue to come in over time (so we may see duplicates of
>>> data re-posted months after it was originally posted...).
>>>
>>> I have 2 ways so far I can go about it (one way I do in production
>>> without Hadoop) and am interested to see if others have faced/solved
>>> this in Hadoop/HDFS and what their experience might be.
>>>
>>> 1) Handle my own hash filter (where I continually store and look up a
>>> hash (MD5, bloom, whatever) of the data I am aggregating on as already
>>> existing). We do this now without Hadoop; perhaps a variant can be
>>> ported into HDFS as a map task, reducing the results to files and
>>> restoring the hash table (maybe in Hive or something, dunno yet).
>>> 2) Push the data into Cassandra (our NoSQL solution of choice) and let
>>> that hash/map system do it for us. As I get more into Hadoop, looking at
>>> HBase is tempting, but then it is just one more thing to learn.
>>>
>>> I would really like to not have to reinvent the wheel here, and I would
>>> even contribute if something is going on, as it is a use case in our
>>> work effort.
>>>
>>> Thanx in advance =8^) Apologies, I posted this on common-dev yesterday
>>> by accident (so this is not repost spam but appropriate for this list).
>>>
>>> Cheers.
>>>
>>> /*
>>> Joe Stein
>>> http://www.linkedin.com/in/charmalloc
>>> */
>>>
>>
>
>
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */
>
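For the archives, a bare-bones sketch of the pure map/reduce dedup Mark suggests above: emit the hash as the map key so all duplicates land in the same reduce call, then keep only the first copy. Hashing the whole record with MD5 is just a placeholder; in practice you would hash only the fields that define "the same event".

import java.io.IOException;
import java.util.Iterator;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HashDedup {

  public static class HashMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text record,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      // key = hash of the record, value = the record itself
      out.collect(new Text(DigestUtils.md5Hex(record.toString())), record);
    }
  }

  public static class KeepFirstReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text hash, Iterator<Text> records,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      // all records with the same hash arrive together; keep the first, drop the rest
      out.collect(hash, records.next());
    }
  }
}

This only catches duplicates within the data a single job sees; the months-old repeats still need the historical mapfile or HBase lookup discussed above.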