Hi,

I think Hive's join + transform could be helpful here.
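
One way to read that suggestion (a rough sketch only -- the table names, columns, and the merge_rows.py script below are all hypothetical) is to full outer join the existing data against the latest delta and stream the joined rows through a small script that decides which version of each row survives:

    -- merge_rows.py is a hypothetical script: it reads the joined base/delta
    -- columns on stdin and emits one surviving row per key, dropping deletes.
    ADD FILE merge_rows.py;

    INSERT OVERWRITE TABLE base_table_merged
    SELECT TRANSFORM (b.id, b.status, d.id, d.status, d.op)
           USING 'python merge_rows.py'
           AS (id, status)
    FROM base_table b
         FULL OUTER JOIN delta_table d ON (b.id = d.id);
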

Thanks
Yongqiang
On 6/8/10 3:58 PM, "Aaron Kimball" <[email protected]> wrote:

> I think that this might be the way to go. In general, folding updates and
> deletes into existing datasets is a difficult problem, due to the append-only
> nature of data files in HDFS.
> 
> Something that might help you here is to partition your tables in Hive based
> on some well-distributed key. Then, if you have a relatively small number of
> partitions affected by an incremental import (perhaps more recently imported
> records are more likely to be updated? In that case, partition the tables by
> the month/week you imported them), you can perform the fold-in of the new
> deltas only on the affected partitions. This should be much faster than a full
> table scan.
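> 
> To make that concrete, here is a rough HiveQL sketch (the table, column and
> partition names are hypothetical, and it assumes each delta row carries the
> import_month of the partition it belongs to plus an op flag of 'I', 'U' or 'D'
> from the source system):
> 
>     -- Rewrite only the partition touched by this batch of deltas, rather
>     -- than the whole table.
>     INSERT OVERWRITE TABLE orders_base PARTITION (import_month = '2010-06')
>     SELECT COALESCE(d.id, b.id),
>            COALESCE(d.status, b.status),
>            COALESCE(d.updated_at, b.updated_at)
>     FROM (SELECT * FROM orders_base  WHERE import_month = '2010-06') b
>          FULL OUTER JOIN
>          (SELECT * FROM orders_delta WHERE import_month = '2010-06') d
>          ON (b.id = d.id)
>     WHERE d.op IS NULL OR d.op <> 'D';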
> 
> Have you seen the Sqoop tool? It handles imports and exports between HDFS (and
> Hive) and RDBMS systems -- but currently it can only import new records (and
> subsequent INSERTs); it can't handle updates/deletes. Sqoop is available at
> http://github.com/cloudera/sqoop -- it doesn't run on Apache 0.20.3, but works
> on CDH (Cloudera's Distribution for Hadoop) and Hadoop 0.21/trunk.
> 
> This sort of capability is something I'm really interested in adding to Sqoop.
> If you've got a well-run process for doing this, I'd really appreciate your
> help adding this feature :) Send me an email off-list if you're interested. At
> the very least, I'd urge you to try out the tool.
> 
> Cheers,
> - Aaron Kimball
> 
> On Tue, Jun 8, 2010 at 8:54 PM, atreju <[email protected]> wrote:
>> To generate smart output from base data we need to copy some base tables from
>> a relational database into Hadoop. Some of them are big. Dumping the entire
>> table into Hadoop every day is not an option, since there are 30+ tables
>> and each would take several hours.
>>  
>> The approach we took is to get an entire table dump first. Then, each day or
>> every 4-6 hours, get only the inserts/updates/deletes since the last copy from
>> the RDBMS (based on a date field in the table). Using Hive, outer join + union
>> the new data with the existing data and write it into a new file. For example,
>> if there are 100 rows in Hadoop, and in the RDBMS 3 records were inserted,
>> 2 records updated and 1 deleted since the last Hadoop copy, then the Hive
>> query will take the 97 unchanged rows + 3 inserts + 2 updates and write them
>> into a new file. Other applications like Pig or Hive will pick the most
>> recent file to use when selecting/loading data from those base table data
>> files.
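>> 
>> As a rough illustration of the query shape (the table and column names here
>> are made up, and the delta is assumed to carry an op flag of 'I', 'U' or 'D'),
>> the unchanged rows are picked with a left outer join against the delta and
>> then unioned with the new and updated rows:
>> 
>>     -- base_current holds the previous full copy; delta holds the rows that
>>     -- changed in the RDBMS since then.
>>     INSERT OVERWRITE TABLE base_next
>>     SELECT u.id, u.col1, u.col2
>>     FROM (
>>       -- rows that were not inserted, updated or deleted since the last copy
>>       SELECT b.id, b.col1, b.col2
>>       FROM base_current b LEFT OUTER JOIN delta d ON (b.id = d.id)
>>       WHERE d.id IS NULL
>>       UNION ALL
>>       -- the new and updated rows, taken from the delta
>>       SELECT d.id, d.col1, d.col2
>>       FROM delta d
>>       WHERE d.op <> 'D'
>>     ) u;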
>>  
>> This logic is working fine in lower environments for small tables. With
>> production data, for a table of about 30 GB, the incremental regeneration of
>> the file in Hadoop still takes several hours. I tried using a compressed
>> version and it took even longer. I am not convinced that this is the best we
>> can do to handle updates and deletes, since we had to re-write the 29 GB of
>> unchanged data in the 30 GB file into a new file... and this is not the
>> biggest table.
>>  
>> I am thinking that this must be a problem for many companies. What other
>> approaches are there for applying updates and deletes on base tables to the
>> Hadoop data files?
>>  
>> We have 4 data nodes and are using version 0.20.3.
>>  
>> Thanks!
>>  
> 
> 
