Would it be a good optimization to have the historical data (stored in HDFS) sorted by the primary key, and to also sort the new data before joining? I guess in that case the join could be performed more efficiently (in an InputFormat implementation), avoiding the sort/shuffle/copy-to-reducer part. The only doubt I have is how fast the data can be sorted; wouldn't the sorting kill the whole optimization?
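For concreteness, here is a rough sketch of that idea using the join support in the old org.apache.hadoop.mapred.join package. It assumes both inputs are Text key/value files that are already sorted by key and partitioned identically (a hard requirement for this input format); the DedupDriver and NewOnlyMapper names, the argument layout, and the choice of KeyValueTextInputFormat are placeholders of mine, not tested code:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class DedupDriver {

  // Each map() call gets one key plus a tuple: slot 0 = new data, slot 1 = history.
  public static class NewOnlyMapper extends MapReduceBase
      implements Mapper<Text, TupleWritable, Text, Text> {
    public void map(Text key, TupleWritable value,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      if (!value.has(1)) {                 // no matching historical record
        out.collect(key, (Text) value.get(0));
      }
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(DedupDriver.class);
    conf.setInputFormat(CompositeInputFormat.class);
    // "outer" join of the two pre-sorted, identically partitioned inputs.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "outer", KeyValueTextInputFormat.class,
        new Path(args[0]) /* new data */, new Path(args[1]) /* history */));
    conf.setMapperClass(NewOnlyMapper.class);
    conf.setNumReduceTasks(0);             // map-only: no sort/shuffle/reduce
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}

The open question above still stands, of course: the new batch has to be sorted and partitioned to match the historical layout before a job like this can run at all.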
On Fri, Mar 26, 2010 at 7:20 AM, Ankur C. Goel <gan...@yahoo-inc.com> wrote:
> The kind of need you specified is quite common in ETL-style processing. The
> fastest and most efficient way to do this is when you have all your
> historical data in HDFS itself. In this case you can do a LEFT outer join
> between the two datasets (assuming the new data is your left relation) in
> map-reduce without querying a database or any other persistent store. Then
> you would keep only the records which have fields from the left relation
> and NOT from the right relation (the historical data).
>
> A join can be easily implemented in map-reduce using the secondary sort
> trick. Basically you can specify different mappers for different input data
> in the same M/R job, and in each mapper tag the record key with a relation
> id (0, 1, ...). This makes sure that records from one relation for a
> matching key appear before the records of the other relation in the
> reducer. You then cache them in memory and do a cross of them with each
> record of the new relation you see. This might sound more complicated than
> it really is. Hadoop has sample code under examples for secondary sort, but
> no code for a join.
>
> Another option is to use a high-level language like Pig or Hive that
> provides join operations and also exposes extensions to take care of skew
> in the data, i.e. data getting divided unevenly due to a few keys having
> the majority of the records. This is the simplest and quickest (in terms of
> developer productivity) IMO.
>
> Regards
> -...@nkur
>
>
> On 3/26/10 12:05 AM, "Joseph Stein" <crypt...@gmail.com> wrote:
>
> The thing is I have to check historic data (meaning data I have already
> aggregated against), so I basically need to hold and read from a file of
> hashes.
>
> So within the current data set, yes, this would work, but I then have to
> open a file, loop through the values, and see that it is not there.
>
> If it is there, then throw it out; if it is not there, add it to the end.
>
> To me this opening of a file and checking for dups is a map/reduce task in
> itself.
>
> What I was thinking is having my mapper take the data I want to validate as
> unique. I then loop through the filter files. Each data point has a key
> that then allows me to get the file that has its data, e.g. a part of the
> data partitions the hashes that each file holds. So my map job takes the
> data and breaks it into the key/value pair (the key allows me to look up my
> filter file).
>
> When it gets to the reducer... the key is the file I open up. I then open
> the file... loop through it... if it is there, throw the data away; if it
> is not there, then add the hash of my data to the filter file and output
> (as the reduce output) the value of the unique record.
>
> This output of uniques is then the data I aggregate on, and it has also
> updated my historic filter so the next job (5 minutes later) sees it, etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner <markkerz...@gmail.com>
> wrote:
> > Joe,
> >
> > what about this approach:
> >
> > use hash values as your keys in the MR maps. Since they are sorted by
> > key, in the reducer you will get all the duplicates together, so that
> > you can loop through them. As the simplest solution, you just take the
> > first one.
> >
> > Sincerely,
> > Mark
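For reference, a minimal sketch of the pass Mark describes (my illustration, not code from this thread), using the old mapred API and assuming the whole record defines uniqueness; the HashDedup and FirstOnlyReducer names are made up, and the JobConf wiring is omitted:

import java.io.IOException;
import java.security.MessageDigest;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class HashDedup {

  // Map: key each record by an MD5 of its bytes so duplicates collide on one key.
  public static class HashMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text record,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      try {
        byte[] digest = MessageDigest.getInstance("MD5")
            .digest(record.toString().getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
          hex.append(String.format("%02x", b));
        }
        out.collect(new Text(hex.toString()), record);
      } catch (Exception e) {
        throw new IOException(e);
      }
    }
  }

  // Reduce: all duplicates share the same hash key; keep only the first record.
  public static class FirstOnlyReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text hash, Iterator<Text> records,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      out.collect(hash, records.next());
    }
  }
}

As Joe points out above, a pass like this only collapses duplicates inside the current input; it does not catch records that were already aggregated by an earlier run.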
> > On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein <crypt...@gmail.com>
> > wrote:
> >
> >> I have been researching ways to handle de-dupping data while running a
> >> map/reduce program (so as to not re-calculate/re-aggregate data that
> >> we have seen before [possibly months before]).
> >>
> >> The data sets we have are littered with repeats of data from mobile
> >> devices which continue to come in over time (so we may see duplicates
> >> of data re-posted months after it was originally posted...).
> >>
> >> I have 2 ways I can go about it so far (one way I do in production
> >> without Hadoop) and I am interested to see if others have faced/solved
> >> this in Hadoop/HDFS and what their experience might be.
> >>
> >> 1) handle my own hash filter (where I continually store and look up a
> >> hash (MD5, Bloom, whatever) of the data I am aggregating on as already
> >> existing). We do this now without Hadoop; perhaps a variant can be
> >> ported into HDFS as a map task, reducing the results to files and
> >> restoring the hash table (maybe in Hive or something, dunno yet)
> >> 2) push the data into Cassandra (our NoSQL solution of choice) and let
> >> that hash/map system do it for us. As I get more into Hadoop, looking
> >> at HBase is tempting, but then it is just one more thing to learn.
> >>
> >> I would really like not to have to reinvent the wheel here, and I would
> >> even contribute if something is going on, as it is a use case in our
> >> work effort.
> >>
> >> Thanks in advance =8^) Apologies, I posted this on common-dev yesterday
> >> by accident (so this is not repost spam but appropriate for this list)
> >>
> >> Cheers.
> >>
> >> /*
> >> Joe Stein
> >> http://www.linkedin.com/in/charmalloc
> >> */
>
> --
> /*
> Joe Stein
> http://www.linkedin.com/in/charmalloc
> */

--
Andrew Klochkov
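Coming back to option 1 above (the persistent hash filter): a minimal sketch of one way to keep the "already seen" hashes, using the BloomFilter class that ships with Hadoop (org.apache.hadoop.util.bloom). The SeenFilter name, the sizing constants, and the one-file layout are assumptions of mine, and a Bloom filter can return false positives, so a small fraction of genuinely new records would be dropped:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;
import org.apache.hadoop.util.hash.Hash;

public class SeenFilter {
  private final BloomFilter filter;

  public SeenFilter(BloomFilter filter) { this.filter = filter; }

  /** Returns true the first time a record hash is seen, and remembers it. */
  public boolean markIfNew(String recordHash) {
    Key k = new Key(recordHash.getBytes());
    if (filter.membershipTest(k)) {
      return false;               // probably a duplicate
    }
    filter.add(k);
    return true;
  }

  /** Loads the filter for one partition from HDFS, or starts an empty one. */
  public static SeenFilter load(Configuration conf, Path path) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    BloomFilter f = new BloomFilter(100000000, 7, Hash.MURMUR_HASH);
    if (fs.exists(path)) {
      FSDataInputStream in = fs.open(path);
      f.readFields(in);           // BloomFilter is Writable
      in.close();
    }
    return new SeenFilter(f);
  }

  /** Writes the updated filter back so the next 5-minute job sees it. */
  public void save(Configuration conf, Path path) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(path, true);
    filter.write(out);
    out.close();
  }
}

For this to work from a distributed job, each reducer would need to own exactly one filter file, along the lines of the key-to-filter-file partitioning Joe describes above; otherwise two tasks could overwrite each other's updates.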