Hi users,

This weekend at the second HBase Hackathon (held by StumbleUpon, thx!) we helped someone who was migrating a data-loading MapReduce job from 0.19 to 0.20 and had hit a performance problem: the job was something like 20x slower.
How we solved it, short answer: after instantiating the Put that you give to the TableOutputFormat, call put.writeToWAL(false) (there is a small sketch at the end of this mail).

Long answer: as you may know, HDFS still does not support appends. That means the write-ahead logs (WAL) that HBase uses are only useful once they are synced to disk, so you can lose some data during a region server crash or a kill -9. In 0.19 a log could stay open forever as long as it had fewer than 100,000 edits. In 0.20 we fixed that by capping the WAL at ~62MB, and we also roll the logs after 1 hour. This is all good because it means far less data loss until we are able to append to files in HDFS.

Now to why this may slow down your import: the job I was talking about had huge rows, so the logs got rolled much more often, whereas in 0.19 only the edit count triggered a log roll. Not writing to the WAL has the advantage of using far less disk IO but, as you can guess, it means huge data loss in the case of a region server crash. Then again, in many cases an RS crash already means you must restart your job anyway, because log splitting can take more than 10 minutes and many tasks time out (I am currently working on making that much faster for 0.21, btw).

Hope this helps someone,
J-D
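
P.S. In case a sketch helps, here is roughly what the mapper side of such a load job can look like when it feeds TableOutputFormat with the WAL disabled on every Put. The class name, table layout ("f:q") and tab-separated input are made up for illustration, and I am using the setWriteToWAL(false) setter; if your 0.20 client spells it writeToWAL(false), use that instead. Only the WAL call is the actual fix, the rest is the usual wiring.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical loader mapper: parses "rowkey<TAB>value" lines and emits Puts.
// The job would be hooked up to the table with TableMapReduceUtil /
// TableOutputFormat as usual; only the WAL call below is the fix.
public class LoadMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("f");    // assumed family
  private static final byte[] QUALIFIER = Bytes.toBytes("q"); // assumed qualifier

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t", 2);
    if (fields.length < 2) {
      return; // skip malformed lines
    }

    byte[] row = Bytes.toBytes(fields[0]);
    Put put = new Put(row);
    put.add(FAMILY, QUALIFIER, Bytes.toBytes(fields[1]));

    // Skip the write-ahead log for this edit: much less disk IO during the
    // import, but a region server crash loses whatever was not yet flushed.
    put.setWriteToWAL(false);

    context.write(new ImmutableBytesWritable(row), put);
  }
}

The speed-up comes entirely from skipping the WAL writes and rolls, so if you cannot tolerate the data-loss window at all, keep the WAL on and accept the slower import.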
