Hi, I think Hive's join + transform could be helpful here.
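
Roughly, something like the sketch below: a left outer join keeps the base
rows the delta didn't touch, and a UNION ALL adds the inserted/updated rows;
a TRANSFORM script could do the same reconciliation in a custom reducer if
the logic gets more involved. Table and column names here are placeholders,
and the op flag ('I'/'U'/'D') on the delta is an assumption -- deletes have
to be flagged somehow, since a date filter alone won't catch them:

  -- Hypothetical tables: customers_base (current copy in Hive) and
  -- customers_delta (rows changed since the last copy, with an op flag).
  INSERT OVERWRITE TABLE customers_merged
  SELECT u.id, u.name, u.updated_at
  FROM (
      -- base rows whose key does not appear in the delta (the "97")
      SELECT b.id AS id, b.name AS name, b.updated_at AS updated_at
      FROM customers_base b
      LEFT OUTER JOIN customers_delta d ON (b.id = d.id)
      WHERE d.id IS NULL
      UNION ALL
      -- inserted and updated rows from the delta; deletes are dropped here
      SELECT d.id AS id, d.name AS name, d.updated_at AS updated_at
      FROM customers_delta d
      WHERE d.op != 'D'
  ) u;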
Thanks,
Yongqiang

On 6/8/10 3:58 PM, "Aaron Kimball" <[email protected]> wrote:

> I think that this might be the way to go. In general, folding updates and
> deletes into datasets is a difficult problem due to the append-only nature
> of those datasets.
>
> Something that might help you here is to partition your tables in Hive
> based on some well-distributed key. Then, if a relatively small number of
> partitions is affected by an incremental import (perhaps more
> recently-imported records are more likely to be updated? In that case,
> partition the tables by the month/week you imported them), you only need
> to perform the fold-in of the new deltas on the affected partitions. This
> should be much faster than a full table scan.
>
> Have you seen the Sqoop tool? It handles imports and exports between HDFS
> (and Hive) and RDBMS systems -- but currently it can only import new
> records (and subsequent INSERTs); it can't handle updates/deletes. Sqoop
> is available at http://github.com/cloudera/sqoop -- it doesn't run on
> Apache 0.20.3, but works on CDH (Cloudera's Distribution for Hadoop) and
> Hadoop 0.21/trunk.
>
> This sort of capability is something I'm really interested in adding to
> Sqoop. If you've got a well-run process for doing this, I'd really
> appreciate your help adding this feature :) Send me an email off-list if
> you're interested. At the very least, I'd urge you to try out the tool.
>
> Cheers,
> - Aaron Kimball
>
> On Tue, Jun 8, 2010 at 8:54 PM, atreju <[email protected]> wrote:
>> To generate smart output from base data, we need to copy some base tables
>> from a relational database into Hadoop. Some of them are big. Dumping the
>> entire table into Hadoop every day is not an option, since there are 30+
>> tables and each would take several hours.
>>
>> The methodology we settled on is to get a full table dump first. Then,
>> each day or every 4-6 hours, we get only the inserts/updates/deletes made
>> since the last copy from the RDBMS (based on a date field in the table).
>> Using Hive, we outer join + union the new data with the existing data and
>> write the result into a new file. For example, if there are 100 rows in
>> Hadoop, and in the RDBMS 3 records were inserted, 2 updated and 1 deleted
>> since the last Hadoop copy, then the Hive query takes the 97 unchanged
>> rows + 3 inserts + 2 updates and writes them into a new file. Other
>> applications (Pig, Hive, etc.) pick the most recent file when
>> selecting/loading data from those base table data files.
>>
>> This logic works fine in lower environments for small tables. With
>> production data, for a table of about 30GB, the incremental re-generation
>> of the file in Hadoop still takes several hours. I tried a zipped version
>> and it took even longer. I am not convinced this is the best we can do to
>> handle updates and deletes, since we had to re-write the 29GB of
>> unchanged data in the 30GB file into a new file. ...and this is not the
>> biggest table.
>>
>> I am thinking this must be a problem for many companies. What other
>> approaches are there for applying updates and deletes to the base table
>> data files in Hadoop?
>>
>> We have 4 data nodes and are using version 0.20.3.
>>
>> Thanks!
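
Following up on the partition idea above: if the base table is partitioned
(say by a hypothetical import_month column) and each delta row carries the
partition it belongs to, the same fold-in can be limited to just the
affected partitions, so the bulk of the 30GB is never rewritten. A rough
sketch with placeholder names, rewriting a single affected partition of a
separate merged table (the same "write a new file" pattern you use now):

  -- Rewrite only the partition touched by this delta, e.g. '2010-06'.
  INSERT OVERWRITE TABLE customers_merged PARTITION (import_month = '2010-06')
  SELECT u.id, u.name, u.updated_at
  FROM (
      -- rows in this partition that the delta did not touch
      SELECT b.id AS id, b.name AS name, b.updated_at AS updated_at
      FROM (SELECT * FROM customers_base
            WHERE import_month = '2010-06') b   -- partition pruning
      LEFT OUTER JOIN customers_delta d ON (b.id = d.id)
      WHERE d.id IS NULL
      UNION ALL
      -- inserted and updated rows belonging to this partition; deletes dropped
      SELECT d.id AS id, d.name AS name, d.updated_at AS updated_at
      FROM customers_delta d
      WHERE d.import_month = '2010-06'
        AND d.op != 'D'
  ) u;

You would run one such insert per affected partition; all other partitions
are left untouched, so only the changed slices of the table get rewritten.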
