Thanks for sharing sounds like a nice system - I always advise people to avoid direct SQL inserts for batch jobs / large amounts of data and use MySQL's optimized LOAD utility like you did. Same goes for Oracle... Nothing brings a DB server to its knees like a ton of individual inserts on indexed tables..
On Tue, Apr 28, 2009 at 6:46 AM, Ankur Goel <ankur.g...@corp.aol.com> wrote: > > hello hadoop users, > Recently I had a chance to lead a team building a log-processing system > that uses Hadoop and MySQL. The system's goal was to process the incoming > information as quickly as possible (real time or near real time), and make > it available for querying in MySQL. I thought it would be good to share the > experience and the challenges with the community. Couldn't think of a better > place than these mailing lists as I am not much of a blogger :-) > > The information flow in the system looks something like > > [Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools] > > Transferring from Apache-Servers to Hadoop was quite easy as we just had to > organize the data in timely buckets (directories). Once that was running > smooth we had to make sure that map-reduce jobs are fired at regular > intervals and they pick up the right data. The jobs would then > process/aggregate the date and dump the info into MySQL shards from the > reducers [we have our own DB partioning set up]. This is where we hit major > bottlenecks [any surprises? :-)] > > The table engine used was InnoDB as there was a need for fast replication > and writes but only moderate reads (should eventually support high read > rates). The data would take up quite a while to load completely far away > from being near-real time. And so our optimization journey begin. > > 1. We tried to optimize/tune InnoDB parameters like increasing the buffer > pool size to 75 % of available RAM. This helped but only till the time DBs > were lightly loaded i.e. innoDB had sufficient buffer pool to host the data > and indexes. > > 2. We also realized that InnoDB has considerable locking overhead because > of which write concurrency is really bad when you have a large number of > concurrent threads doing writes. The default thread concurrency for us was > set to no_of_cpu * 2 = 8 which is what the official documentation advises as > the optimal limit. So we limited the number of reduce tasks and consequently > the number of concurrent writes and boy the performance improved 4x. We were > almost there :-) > > 3. Next thing we tried is the standard DB optimzation techniques like > de-normalizing the schema and dropping constraints. This gave only a minor > performance improvement, nothing earth shattering. Note that we were already > caching connections in reducers to each MySQL shard and partionining logic > was embedded into reducers. > > 4. Falling still short of our performance objectives, we finally we decided > to get rid of JDBC writes from reducers and work on an alternative that uses > MySQLs LOAD utility. > - The processing would partition the data into MySQL shard specific files > resident in HDFS. > - A script would then spawn processes via ssh on different physical > machines to download this data. > - Each spawned process just downloads the data for the shard it should > upload to. > - All the processes then start uploading data in parallel into their > respective MySQL shards using LOAD DATA infile. > > This proved to be the fastest approach, even in the wake of increasing data > loads. The enitre processing/loading would complete in less than 6 min. The > system has been holding up quite well so far, even though we've had to limit > the number of days for which we keep the data or else the MySQLs get > overwhelmed. > > Hope this is helpful to people. > > Regards > -Ankur > -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch