On Tue, Jun 8, 2010 at 2:54 PM, atreju <[email protected]> wrote:

> To generate smart output from base data we need to copy some base tables
> from the relational database into Hadoop. Some of them are big. Dumping the
> entire table into Hadoop every day is not an option, since there are 30+
> tables and each would take several hours.
>
> The methodology that we approached is to get the entire table dump first.
> Then each day, or every 4-6 hours, get only the inserts/updates/deletes
> since the last copy from the RDBMS (based on a date field in the table).
> Using Hive, do an outer join + union of the new data with the existing data
> and write it into a new file. For example, if there are 100 rows in Hadoop,
> and in the RDBMS 3 records were inserted, 2 updated and 1 deleted since the
> last Hadoop copy, then the Hive query will take the 97 unchanged rows + the
> 3 inserts + the 2 updates and write them into a new file. Other
> applications like Pig or Hive will pick the most recent file to use when
> selecting/loading data from those base table data files.
>
> This logic is working fine in lower environments for small tables. With
> production data, for a table of about 30GB, the incremental re-generation
> of the file in Hadoop still takes several hours. I tried using a zipped
> version and it took even longer. I am not convinced that this is the best
> we can do to handle updates and deletes, since we had to re-write the 29GB
> of unchanged data in the 30GB file into a new file. ...and this is not the
> biggest table.
>
> I am thinking that this must be a problem for many companies. What other
> approaches are there to apply updates and deletes on base tables to the
> Hadoop data files?
>
> We have 4 data nodes and are using version 20.3.
>
> Thanks!
>
>

Very interesting. An important note about compression is that you have to
carefully choose the codec (gzip, LZO), the block size, and the compression
type (block vs. record). Choosing the wrong compression settings results in
anti-compression: larger files and slower performance.
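For example, when writing compressed output from Hive you would set
something along these lines before the query (property names are from the
0.20-era mapred config, and LZO assumes the codec is actually installed on
your cluster):

    SET hive.exec.compress.output=true;
    -- gzip compresses well but is slow; LZO is faster but needs the extra codec
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
    -- BLOCK vs RECORD only matters for SequenceFiles; BLOCK usually compresses far better
    SET mapred.output.compression.type=BLOCK;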

You said:

> for a table of about 30GB, the incremental re-generation of the file in
> Hadoop still takes several hours.

It sounds like you have worked up a solution that will scale. With the
exception of your backend database (MySQL or whatever), you should be able
to increase the number of DataNodes and TaskTrackers and get roughly linear
performance gains. Add some nodes, re-run, and see if your run time shrinks.
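For reference, the join + union step you describe might look roughly like
this in Hive (the table layout, column names, and the op flag marking
deletes are all made up for illustration):

    -- base_current: last full snapshot; base_delta: rows changed since then,
    -- with an op column ('I', 'U', 'D') and the same primary key (id)
    INSERT OVERWRITE TABLE base_next
    SELECT u.id, u.col1, u.col2
    FROM (
      -- rows untouched since the last pull
      SELECT c.id, c.col1, c.col2
      FROM base_current c
      LEFT OUTER JOIN base_delta d ON (c.id = d.id)
      WHERE d.id IS NULL
      UNION ALL
      -- current versions of inserted and updated rows (deletes dropped)
      SELECT d.id, d.col1, d.col2
      FROM base_delta d
      WHERE d.op <> 'D'
    ) u;

That still rewrites the full 30GB every run, so adding nodes is the main
lever on the wall-clock time.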

I am interested to know. Great work. Very cool.
