Would it be a good optimization to have the historical data (stored in HDFS) sorted by the primary key, and to also sort the new data before joining? I guess in that case the join could be performed more efficiently (in an InputFormat implementation), avoiding the sort/shuffle/copy-to-reducer part. The only doubt I have is how fast the data can be sorted; wouldn't the sorting kill the whole optimization?
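For concreteness, here is a rough sketch of that idea using the join support in the old org.apache.hadoop.mapred.join package. It assumes both inputs are Text key/value files that are already sorted by key and partitioned identically (a hard requirement for this input format); the DedupDriver and NewOnlyMapper names, the argument layout, and the choice of KeyValueTextInputFormat are placeholders of mine, not tested code:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class DedupDriver {

  // Each map() call gets one key plus a tuple: slot 0 = new data, slot 1 = history.
  public static class NewOnlyMapper extends MapReduceBase
      implements Mapper<Text, TupleWritable, Text, Text> {
    public void map(Text key, TupleWritable value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      if (!value.has(1)) {                 // no matching historical record
        out.collect(key, (Text) value.get(0));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(DedupDriver.class);
    conf.setInputFormat(CompositeInputFormat.class);
    // "outer" join of the two pre-sorted, identically partitioned inputs.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "outer", KeyValueTextInputFormat.class,
        new Path(args[0]) /* new data */, new Path(args[1]) /* history */));
    conf.setMapperClass(NewOnlyMapper.class);
    conf.setNumReduceTasks(0);             // map-only: no sort/shuffle/reduce
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}

The open question above still stands, of course: the new batch has to be sorted and partitioned to match the historical layout before a job like this can run at all.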
On Fri, Mar 26, 2010 at 7:20 AM, Ankur C. Goel <gan...@yahoo-inc.com> wrote:
> The kind of need you specified is quite common in ETL-style processing. The
> fastest and most efficient way to do this is when you have all your
> historical data in HDFS itself. In this case you can do a LEFT outer join
> between the two datasets (assuming the new data is your left relation) in
> map-reduce without querying a database or any other persistent store. Then
> you would keep only the records which have fields from the left relation
> and NOT from the right relation (the historical data).
>
> A join can be easily implemented in map-reduce using the secondary sort
> trick. Basically you can specify different mappers for different input data
> in the same M/R job, and in each mapper tag the record key with a relation
> id (0, 1, ...). This makes sure that records from one relation for a
> matching key appear before the records of the other relation in the
> reducer. You then cache them in memory and do a cross of them with each
> record of the new relation you see. This might sound more complicated than
> it really is. Hadoop has sample code under examples for secondary sort, but
> no code for a join.
>
> Another option is to use a high-level language like Pig or Hive that
> provides join operations and also exposes extensions to take care of skew
> in the data, i.e. data getting divided unevenly due to a few keys having
> the majority of the records. This is the simplest and quickest (in terms of
> developer productivity) IMO.
>
> Regards
> -...@nkur
>
>
> On 3/26/10 12:05 AM, "Joseph Stein" <crypt...@gmail.com> wrote:
>
> The thing is I have to check historic data (meaning data I have already
> aggregated against), so I basically need to hold and read from a file of
> hashes.
>
> So within the current data set, yes, this would work, but I then have to
> open a file, loop through the values, and see that it is not there.
>
> If it is there, then throw it out; if it is not there, add it to the end.
>
> To me this opening of a file and checking for dups is a map/reduce task in
> itself.
>
> What I was thinking is having my mapper take the data I want to validate as
> unique. I then loop through the filter files. Each data point has a key
> that then allows me to get the file that has its data, e.g. a part of the
> data partitions the hashes that each file holds. So my map job takes the
> data and breaks it into the key/value pair (the key allows me to look up my
> filter file).
>
> When it gets to the reducer... the key is the file I open up. I then open
> the file... loop through it... if it is there, throw the data away; if it
> is not there, then add the hash of my data to the filter file and output
> (as the reduce output) the value of the unique record.
>
> This output of uniques is then the data I aggregate on, and it has also
> updated my historic filter so the next job (5 minutes later) sees it, etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <markkerz...@gmail.com>
> wrote:
> > Joe,
> >
> > what about this approach:
> >
> > use hash values as your keys in the MR maps. Since they are sorted by
> > key, in the reducer you will get all the duplicates together, so that
> > you can loop through them. As the simplest solution, you just take the
> > first one.
> >
> > Sincerely,
> > Mark
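For reference, a minimal sketch of the pass Mark describes (my illustration, not code from this thread), using the old mapred API and assuming the whole record defines uniqueness; the HashDedup and FirstOnlyReducer names are made up, and the JobConf wiring is omitted:

import java.io.IOException;
import java.security.MessageDigest;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HashDedup {

  // Map: key each record by an MD5 of its bytes so duplicates collide on one key.
  public static class HashMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text record,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      try {
        byte[] digest = MessageDigest.getInstance("MD5")
            .digest(record.toString().getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        out.collect(new Text(hex.toString()), record);
      } catch (Exception e) {
        throw new IOException(e);
      }
    }
  }

  // Reduce: all duplicates share the same hash key; keep only the first record.
  public static class FirstOnlyReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text hash, Iterator<Text> records,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(hash, records.next());
    }
  }
}

As Joe points out above, a pass like this only collapses duplicates inside the current input; it does not catch records that were already aggregated by an earlier run.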
> > On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <crypt...@gmail.com>
> > wrote:
> >
> >> I have been researching ways to handle de-dupping data while running a
> >> map/reduce program (so as to not re-calculate/re-aggregate data that
> >> we have seen before [possibly months before]).
> >>
> >> The data sets we have are littered with repeats of data from mobile
> >> devices which continue to come in over time (so we may see duplicates
> >> of data re-posted months after it was originally posted...).
> >>
> >> I have 2 ways I can go about it so far (one way I do in production
> >> without Hadoop) and I am interested to see if others have faced/solved
> >> this in Hadoop/HDFS and what their experience might be.
> >>
> >> 1) handle my own hash filter (where I continually store and look up a
> >> hash (MD5, Bloom, whatever) of the data I am aggregating on as already
> >> existing). We do this now without Hadoop; perhaps a variant can be
> >> ported into HDFS as a map task, reducing the results to files and
> >> restoring the hash table (maybe in Hive or something, dunno yet)
> >> 2) push the data into Cassandra (our NoSQL solution of choice) and let
> >> that hash/map system do it for us. As I get more into Hadoop, looking
> >> at HBase is tempting, but then it is just one more thing to learn.
> >>
> >> I would really like not to have to reinvent the wheel here, and I would
> >> even contribute if something is going on, as it is a use case in our
> >> work effort.
> >>
> >> Thanks in advance =8^) Apologies, I posted this on common-dev yesterday
> >> by accident (so this is not repost spam but appropriate for this list)
> >>
> >> Cheers.
> >>
> >> /*
> >> Joe Stein
> >> http://www.linkedin.com/in/charmalloc
> >> */
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */

--
Andrew Klochkov
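Coming back to option 1 above (the persistent hash filter): a minimal sketch of one way to keep the "already seen" hashes, using the BloomFilter class that ships with Hadoop (org.apache.hadoop.util.bloom). The SeenFilter name, the sizing constants, and the one-file layout are assumptions of mine, and a Bloom filter can return false positives, so a small fraction of genuinely new records would be dropped:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class SeenFilter {
  private final BloomFilter filter;

  public SeenFilter(BloomFilter filter) { this.filter = filter; }

  /** Returns true the first time a record hash is seen, and remembers it. */
  public boolean markIfNew(String recordHash) {
    Key k = new Key(recordHash.getBytes());
    if (filter.membershipTest(k)) {
      return false;               // probably a duplicate
    }
    filter.add(k);
    return true;
  }

  /** Loads the filter for one partition from HDFS, or starts an empty one. */
  public static SeenFilter load(Configuration conf, Path path) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    BloomFilter f = new BloomFilter(100000000, 7, Hash.MURMUR_HASH);
    if (fs.exists(path)) {
      FSDataInputStream in = fs.open(path);
      f.readFields(in);           // BloomFilter is Writable
      in.close();
    }
    return new SeenFilter(f);
  }

  /** Writes the updated filter back so the next 5-minute job sees it. */
  public void save(Configuration conf, Path path) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(path, true);
    filter.write(out);
    out.close();
  }
}

For this to work from a distributed job, each reducer would need to own exactly one filter file, along the lines of the key-to-filter-file partitioning Joe describes above; otherwise two tasks could overwrite each other's updates.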