Hi users,

This weekend at the second HBase Hackathon (held by StumbleUpon, thx!) we helped someone who was migrating a data-loading MapReduce job from 0.19 to 0.20 and had hit a performance problem: the job was something like 20x slower.
How we solved it, short answer: after instantiating the Put that you give to the TableOutputFormat, call put.writeToWAL(false) (there is a small sketch at the end of this mail).

Long answer: as you may know, HDFS still does not support appends. That means the write-ahead logs (WAL) that HBase uses are only useful once they are synced to disk, so you can lose some data during a region server crash or a kill -9. In 0.19 a log could stay open forever as long as it had fewer than 100,000 edits. In 0.20 we fixed that by capping the WAL at ~62MB, and we also roll the logs after 1 hour. This is all good because it means far less data loss until we are able to append to files in HDFS.

Now to why this may slow down your import: the job I was talking about had huge rows, so the logs got rolled much more often, whereas in 0.19 only the edit count triggered a log roll. Not writing to the WAL has the advantage of using far less disk IO but, as you can guess, it means huge data loss in the case of a region server crash. Then again, in many cases an RS crash already means you must restart your job anyway, because log splitting can take more than 10 minutes and many tasks time out (I am currently working on making that much faster for 0.21, btw).

Hope this helps someone,
J-D
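
P.S. In case a sketch helps, here is roughly what the mapper side of such a load job can look like when it feeds TableOutputFormat with the WAL disabled on every Put. The class name, table layout ("f:q") and tab-separated input are made up for illustration, and I am using the setWriteToWAL(false) setter; if your 0.20 client spells it writeToWAL(false), use that instead. Only the WAL call is the actual fix, the rest is the usual wiring.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical loader mapper: parses "rowkey<TAB>value" lines and emits Puts.
// The job would be hooked up to the table with TableMapReduceUtil /
// TableOutputFormat as usual; only the WAL call below is the fix.
public class LoadMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("f");    // assumed family
  private static final byte[] QUALIFIER = Bytes.toBytes("q"); // assumed qualifier

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] fields = line.toString().split("\t", 2);
    if (fields.length < 2) {
      return; // skip malformed lines
    }

    byte[] row = Bytes.toBytes(fields[0]);
    Put put = new Put(row);
    put.add(FAMILY, QUALIFIER, Bytes.toBytes(fields[1]));

    // Skip the write-ahead log for this edit: much less disk IO during the
    // import, but a region server crash loses whatever was not yet flushed.
    put.setWriteToWAL(false);

    context.write(new ImmutableBytesWritable(row), put);
  }
}

The speed-up comes entirely from skipping the WAL writes and rolls, so if you cannot tolerate the data-loss window at all, keep the WAL on and accept the slower import.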
