On Tue, Jun 8, 2010 at 2:54 PM, atreju <[email protected]> wrote:
> To generate smart output from base data we need to copy some base tables
> from a relational database into Hadoop. Some of them are big. Dumping the
> entire table into Hadoop every day is not an option, since there are 30+
> tables and each would take several hours.
>
> The approach we took is to get a full table dump first. Then, each day or
> every 4-6 hours, get only the inserts/updates/deletes since the last copy
> from the RDBMS (based on a date field in the table). Using Hive, outer join
> + union the new data with the existing data and write it into a new file.
> For example, if there are 100 rows in Hadoop, and in the RDBMS 3 records
> were inserted, 2 updated and 1 deleted since the last Hadoop copy, then the
> Hive query will take the 97 unchanged rows + the 3 inserts + the 2 updates
> and write them into a new file. Other applications like Pig or Hive will
> pick the most recent file when selecting/loading data from those base table
> data files.
>
> This logic is working fine in lower environments for small tables. With
> production data, for a table of about 30GB, the incremental re-generation
> of the file in Hadoop still takes several hours. I tried using a zipped
> version and it took even longer. I am not convinced that this is the best
> we can do to handle updates and deletes, since we had to re-write the 29GB
> of unchanged data in the 30GB file into a new file. ...and this is not the
> biggest table.
>
> I am thinking that this must be a problem for many companies. What other
> approaches are there for applying updates and deletes on base tables to
> the Hadoop data files?
>
> We have 4 data nodes and are using version 20.3.
>
> Thanks!
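
Just to make sure I follow the merge you describe, a rough HiveQL sketch of that outer join + union might look something like the query below. The names here (base_current, base_delta, base_next, the id key, col1/col2, and the op delete flag) are placeholders of mine, not details from your post; I am assuming the delta you pull from the RDBMS carries some flag marking deletes:

    -- base_current: the snapshot currently sitting in Hadoop
    -- base_delta:   rows pulled from the RDBMS since the last copy
    -- op:           'I'/'U'/'D' flag on the delta rows (assumed)
    INSERT OVERWRITE TABLE base_next
    SELECT u.id, u.col1, u.col2
    FROM (
      SELECT c.id, c.col1, c.col2
      FROM base_current c
      LEFT OUTER JOIN base_delta d ON (c.id = d.id)
      WHERE d.id IS NULL              -- the ~97 rows untouched since last copy
      UNION ALL
      SELECT d.id, d.col1, d.col2
      FROM base_delta d
      WHERE d.op <> 'D'               -- inserts and updates; deletes drop out
    ) u;

Every run still rewrites the whole table into base_next, which is exactly the cost you are describing; the query above is just the mechanical shape of it.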
Very interesting. An important note about compression: you have to carefully choose the codec (gzip, LZO), the block size, the compression type (BLOCK or RECORD), etc. Choosing the wrong compression settings results in anti-compression: larger files and slower performance (a sketch of the relevant Hive settings is at the end of this note).

You said: "for a table of about 30GB, the incremental re-generation of the file in Hadoop still takes several hours." It sounds like you have worked up a solution that will scale. With the exception of your backend database (MySQL/whatever), you should be able to increase the number of DataNodes and TaskTrackers and get (linear) performance gains. Add some nodes and see if your run time shrinks. I am interested to know. Great work. Very cool.
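
On the compression settings, here is roughly what I mean for a Hive job writing block-compressed SequenceFile output. Gzip ships with Hadoop; the LZO codec needs the separate hadoop-lzo libraries installed, so treat the codec line as an example rather than a recommendation:

    -- compress what Hive writes out
    SET hive.exec.compress.output=true;
    SET mapred.output.compress=true;
    -- BLOCK groups many records per compression block; RECORD compresses
    -- each record on its own and usually compresses much worse
    SET mapred.output.compression.type=BLOCK;
    SET io.seqfile.compression.type=BLOCK;
    -- built-in gzip codec; com.hadoop.compression.lzo.LzoCodec is the
    -- usual alternative once hadoop-lzo is installed
    SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

One likely reason your zipped run took even longer: a plain gzipped file is not splittable, so a single map task ends up chewing through the whole 30GB. Block-compressed SequenceFiles (or indexed LZO files) keep the data splittable across the cluster.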
