I think that this might be the way to go. In general, folding updates and deletes into existing datasets is a difficult problem, due to the append-only nature of the underlying files.
Something that might help you here is to partition your tables in Hive based on some well-distributed key. If a relatively small number of partitions is affected by each incremental import (perhaps more recently-imported records are more likely to be updated? In that case, partition the tables by the month/week you imported them), then you need only perform the fold-in of the new deltas on the affected partitions. This should be much faster than a full table scan. (There's a rough sketch of this at the bottom of this message, below the quoted text.)

Have you seen the Sqoop tool? It handles imports and exports between HDFS (and Hive) and RDBMS systems -- but currently it can only import new records (and subsequent INSERTs); it can't handle updates/deletes. Sqoop is available at http://github.com/cloudera/sqoop -- it doesn't run on Apache Hadoop 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and Hadoop 0.21/trunk.

This sort of capability is something I'm really interested in adding to Sqoop. If you've got a process that works well for doing this, I'd really appreciate your help adding this feature :) Send me an email off-list if you're interested. At the very least, I'd urge you to try out the tool.

Cheers,
- Aaron Kimball

On Tue, Jun 8, 2010 at 8:54 PM, atreju <[email protected]> wrote:

> To generate smart output from base data we need to copy some base tables
> from our relational database into Hadoop. Some of them are big. Dumping
> the entire table into Hadoop every day is not an option, since there are
> 30+ tables and each would take several hours.
>
> The methodology we arrived at is to get the entire table dump first.
> Then, each day or every 4-6 hours, get only the inserts/updates/deletes
> made since the last copy from the RDBMS (based on a date field in the
> table). Using Hive, do an outer join + union of the new data with the
> existing data and write the result into a new file. For example, if there
> are 100 rows in Hadoop, and in the RDBMS 3 records were inserted, 2
> updated and 1 deleted since the last Hadoop copy, then the Hive query
> will take the 97 unchanged rows + 3 inserts + 2 updates and write them
> into a new file. The other applications, like Pig or Hive, will pick the
> most recent file to use when selecting/loading data from those base table
> data files.
>
> This logic is working fine in lower environments for small tables. With
> production data, for a table of about 30GB, the incremental re-generation
> of the file in Hadoop still takes several hours. I tried a zipped version
> and it took even longer. I am not convinced that this is the best we can
> do to handle updates and deletes, since we had to re-write the 29GB of
> unchanged data in the 30GB file into a new file. ...and this is not the
> biggest table.
>
> I am thinking that this should be a problem for many companies. What
> other approaches are there for applying updates and deletes on base
> tables to the Hadoop data files?
>
> We have 4 data nodes and are using version 0.20.3.
>
> Thanks!
>
>
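As promised above, here's a rough, untested sketch of the partition-scoped merge. All of the table and column names (customers, customers_delta, customers_merged, id, name, import_month, op) are hypothetical, and it assumes your delta-extraction job tags each changed row with an op flag ('I'/'U'/'D') and with the import_month partition the row originally landed in:

  -- Base and merged tables are partitioned by the month the rows were imported.
  CREATE TABLE customers_merged (id INT, name STRING)
    PARTITIONED BY (import_month STRING);

  -- Rebuild only the partitions touched by this delta batch, one at a time.
  INSERT OVERWRITE TABLE customers_merged PARTITION (import_month = '2010-06')
  SELECT
    -- If a row appears in the delta, take the delta's version; otherwise keep the base row.
    CASE WHEN d.id IS NOT NULL THEN d.id   ELSE b.id   END AS id,
    CASE WHEN d.id IS NOT NULL THEN d.name ELSE b.name END AS name
  FROM
    (SELECT * FROM customers       WHERE import_month = '2010-06') b
  FULL OUTER JOIN
    (SELECT * FROM customers_delta WHERE import_month = '2010-06') d
    ON (b.id = d.id)
  -- Drop rows the delta marks as deleted.
  WHERE d.op IS NULL OR d.op <> 'D';

The point is that only the partitions touched by a delta batch get read and rewritten; the untouched partitions (the bulk of your 30GB) are left alone on every cycle.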
