Hello Hadoop users, 
Recently I had a chance to lead a team building a log-processing system that 
uses Hadoop and MySQL. The system's goal was to process the incoming 
information as quickly as possible (real time or near real time), and make it 
available for querying in MySQL. I thought it would be good to share the 
experience and the challenges with the community. I couldn't think of a better 
place than these mailing lists, as I am not much of a blogger :-) 

The information flow in the system looks something like 

[Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools] 

Transferring from Apache-Servers to Hadoop was quite easy, as we just had to 
organize the data into timely buckets (directories). Once that was running 
smoothly, we had to make sure that map-reduce jobs were fired at regular 
intervals and that they picked up the right data. The jobs would then 
process/aggregate the data and dump the results into the MySQL shards from the 
reducers [we have our own DB partitioning set up]. This is where we hit major 
bottlenecks [any surprises? :-)] 
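
For anyone curious, the original (bottlenecked) write path looked roughly like 
the sketch below. It is a sketch only: the class, table, host and credential 
names are made up for illustration, and the shard selection is simplified to 
the reducer's partition number. 

    import java.io.IOException; 
    import java.sql.Connection; 
    import java.sql.DriverManager; 
    import java.sql.PreparedStatement; 
    import java.sql.SQLException; 
    import java.util.Iterator; 
    import org.apache.hadoop.io.LongWritable; 
    import org.apache.hadoop.io.NullWritable; 
    import org.apache.hadoop.io.Text; 
    import org.apache.hadoop.mapred.JobConf; 
    import org.apache.hadoop.mapred.MapReduceBase; 
    import org.apache.hadoop.mapred.OutputCollector; 
    import org.apache.hadoop.mapred.Reducer; 
    import org.apache.hadoop.mapred.Reporter; 

    // Illustrative only -- not our actual code. 
    public class LogAggregateReducer extends MapReduceBase 
        implements Reducer<Text, LongWritable, NullWritable, NullWritable> { 

      private Connection conn; 
      private PreparedStatement insert; 
      private int pending; 

      public void configure(JobConf job) { 
        try { 
          // Each reducer writes to one shard, picked here from its partition number. 
          int shard = job.getInt("mapred.task.partition", 0); 
          conn = DriverManager.getConnection( 
              "jdbc:mysql://shard" + shard + ".example.com/logs", "writer", "secret"); 
          insert = conn.prepareStatement( 
              "INSERT INTO hourly_counts (url, hits) VALUES (?, ?)"); 
        } catch (SQLException e) { 
          throw new RuntimeException(e); 
        } 
      } 

      public void reduce(Text key, Iterator<LongWritable> values, 
                         OutputCollector<NullWritable, NullWritable> out, 
                         Reporter reporter) throws IOException { 
        long sum = 0; 
        while (values.hasNext()) { 
          sum += values.next().get(); 
        } 
        try { 
          insert.setString(1, key.toString()); 
          insert.setLong(2, sum); 
          insert.addBatch(); 
          if (++pending >= 1000) {      // flush periodically to keep batches bounded 
            insert.executeBatch();      // the JDBC write path that became our bottleneck 
            pending = 0; 
          } 
        } catch (SQLException e) { 
          throw new IOException(e.getMessage()); 
        } 
      } 

      public void close() throws IOException { 
        try { 
          insert.executeBatch();        // flush whatever is left 
          conn.close(); 
        } catch (SQLException e) { 
          throw new IOException(e.getMessage()); 
        } 
      } 
    } 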

The table engine used was InnoDB, as there was a need for fast replication and 
writes but only moderate reads (it should eventually support high read rates). 
The data was taking quite a while to load completely, which put us far from 
near real time. And so our optimization journey began. 

1. We tried to optimize/tune InnoDB parameters, like increasing the buffer pool 
size to 75% of available RAM. This helped, but only as long as the DBs were 
lightly loaded, i.e. while InnoDB had a buffer pool large enough to hold the 
data and indexes. 
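
For reference, that tuning amounts to a line or two in my.cnf; the size below 
is illustrative for a 16 GB box, not a recommendation: 

    # Sketch only -- size to your own hardware. 
    [mysqld] 
    innodb_buffer_pool_size = 12G     # roughly 75% of RAM on a 16 GB box 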

2. We also realized that InnoDB has considerable locking overhead, because of 
which write concurrency is really bad when you have a large number of 
concurrent threads doing writes. Thread concurrency for us was set 
to no_of_cpu * 2 = 8, which is what the official documentation advises as the 
optimal limit. So we limited the number of reduce tasks, and consequently the 
number of concurrent writes, and boy, the performance improved 4x. We were 
almost there :-) 
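
On the Hadoop side the knob is simply the reduce task count, roughly as below. 
The job class name and the value 8 are illustrative; we sized the number 
against what the shards could absorb, not against the cluster's capacity. 

    // Sketch (old mapred API); LogAggregateJob and the value 8 are illustrative. 
    JobConf job = new JobConf(LogAggregateJob.class); 
    job.setNumReduceTasks(8);   // caps the number of concurrent writers hitting MySQL 
    JobClient.runJob(job); 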

3. The next thing we tried was standard DB optimization techniques like 
de-normalizing the schema and dropping constraints. This gave only a minor 
performance improvement, nothing earth-shattering. Note that we were already 
caching connections in the reducers to each MySQL shard, and the partitioning 
logic was embedded in the reducers. 
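
By connection caching I mean something like the per-shard map below, kept for 
the lifetime of the reducer. The names are again made up, and the shard 
function here just stands in for our real partitioning logic. 

    import java.sql.Connection; 
    import java.sql.DriverManager; 
    import java.sql.SQLException; 
    import java.util.HashMap; 
    import java.util.Map; 

    // Sketch: lazily open and reuse one JDBC connection per MySQL shard. 
    public class ShardConnectionCache { 
      private final Map<Integer, Connection> cache = new HashMap<Integer, Connection>(); 
      private final int numShards; 

      public ShardConnectionCache(int numShards) { this.numShards = numShards; } 

      // Stand-in for the partitioning logic embedded in our reducers. 
      public int shardFor(String key) { 
        return (key.hashCode() & Integer.MAX_VALUE) % numShards; 
      } 

      public Connection connectionFor(int shard) throws SQLException { 
        Connection c = cache.get(shard); 
        if (c == null) { 
          c = DriverManager.getConnection( 
              "jdbc:mysql://shard" + shard + ".example.com/logs", "writer", "secret"); 
          cache.put(shard, c); 
        } 
        return c; 
      } 

      public void closeAll() throws SQLException { 
        for (Connection c : cache.values()) c.close(); 
      } 
    } 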

4. Still falling short of our performance objectives, we finally decided to 
get rid of JDBC writes from the reducers and work on an alternative that uses 
MySQL's LOAD DATA utility (a rough sketch of the flow follows after this list): 
- The processing would partition the data into MySQL-shard-specific files 
resident in HDFS. 
- A script would then spawn processes via ssh on the different physical 
machines to download this data. 
- Each spawned process downloads only the data for the shard it should upload 
to. 
- All the processes then load the data in parallel into their respective 
MySQL shards using LOAD DATA INFILE. 
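
Stripped down, the loader script does something like the following; host 
names, paths, database and table are made up, and the real script has retries 
and bookkeeping around it: 

    #!/bin/sh 
    # Sketch of the parallel loader -- hosts, paths, db and table are illustrative. 
    # One ssh per shard host: each host pulls only its own shard file from HDFS 
    # and bulk-loads it locally, so all shards load in parallel. 
    HDFS_DIR=/logs/output/latest    # shard-specific files written by the MR job 
    for HOST in shard1.example.com shard2.example.com shard3.example.com; do 
        ssh "$HOST" "hadoop dfs -get $HDFS_DIR/$HOST.tsv /tmp/load.tsv && \
            mysql logs -e \"LOAD DATA INFILE '/tmp/load.tsv' INTO TABLE hourly_counts\"" & 
    done 
    wait    # block until every shard has finished its LOAD 

The win comes from each shard bulk-loading its own local file in parallel, 
instead of many reducer threads competing over JDBC row by row. 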

This proved to be the fastest approach, even in the face of increasing data 
volumes. The entire processing/loading would complete in less than 6 minutes. 
The system has been holding up quite well so far, even though we've had to 
limit the number of days for which we keep the data, or else the MySQL shards 
get overwhelmed. 

Hope this is helpful to people. 

Regards 
-Ankur 
