Re: Hadoop / MySQL

Peter Skomoroch Tue, 28 Apr 2009 07:21:29 -0700

Thanks for sharing sounds like a nice system - I always advise people to
avoid direct SQL inserts for batch jobs / large amounts of data and use
MySQL's optimized LOAD utility like you did.  Same goes for Oracle...
Nothing brings a DB server to its knees like a ton of individual inserts on
indexed tables..


On Tue, Apr 28, 2009 at 6:46 AM, Ankur Goel <ankur.g...@corp.aol.com> wrote:

>
> hello hadoop users,
> Recently I had a chance to lead a team building a log-processing system
> that uses Hadoop and MySQL. The system's goal was to process the incoming
> information as quickly as possible (real time or near real time), and make
> it available for querying in MySQL. I thought it would be good to share the
> experience and the challenges with the community. Couldn't think of a better
> place than these mailing lists as I am not much of a blogger :-)
>
> The information flow in the system looks something like
>
> [Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools]
>
> Transferring from Apache-Servers to Hadoop was quite easy as we just had to
> organize the data in timely buckets (directories). Once that was running
> smooth we had to make sure that map-reduce jobs are fired at regular
> intervals and they pick up the right data. The jobs would then
> process/aggregate the date and dump the info into MySQL shards from the
> reducers [we have our own DB partioning set up]. This is where we hit major
> bottlenecks [any surprises? :-)]
>
> The table engine used was InnoDB as there was a need for fast replication
> and writes but only moderate reads (should eventually support high read
> rates). The data would take up quite a while to load completely far away
> from being near-real time. And so our optimization journey begin.
>
> 1. We tried to optimize/tune InnoDB parameters like increasing the buffer
> pool size to 75 % of available RAM. This helped but only till the time DBs
> were lightly loaded i.e. innoDB had sufficient buffer pool to host the data
> and indexes.
>
> 2. We also realized that InnoDB has considerable locking overhead because
> of which write concurrency is really bad when you have a large number of
> concurrent threads doing writes. The default thread concurrency for us was
> set to no_of_cpu * 2 = 8 which is what the official documentation advises as
> the optimal limit. So we limited the number of reduce tasks and consequently
> the number of concurrent writes and boy the performance improved 4x. We were
> almost there :-)
>
> 3. Next thing we tried is the standard DB optimzation techniques like
> de-normalizing the schema and dropping constraints. This gave only a minor
> performance improvement, nothing earth shattering. Note that we were already
> caching connections in reducers to each MySQL shard and partionining logic
> was embedded into reducers.
>
> 4. Falling still short of our performance objectives, we finally we decided
> to get rid of JDBC writes from reducers and work on an alternative that uses
> MySQLs LOAD utility.
> - The processing would partition the data into MySQL shard specific files
> resident in HDFS.
> - A script would then spawn processes via ssh on different physical
> machines to download this data.
> - Each spawned process just downloads the data for the shard it should
> upload to.
> - All the processes then start uploading data in parallel into their
> respective MySQL shards using LOAD DATA infile.
>
> This proved to be the fastest approach, even in the wake of increasing data
> loads. The enitre processing/loading would complete in less than 6 min. The
> system has been holding up quite well so far, even though we've had to limit
> the number of days for which we keep the data or else the MySQLs get
> overwhelmed.
>
> Hope this is helpful to people.
>
> Regards
> -Ankur
>



-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Hadoop / MySQL

Reply via email to