Thanks for the responses. I'm running HBase 0.92.1 (Cloudera CDH4).

The program is very simple: it inserts batches of rows into a table from 
multiple threads. I've tried running it with different parameters (column 
count, threads, batch size, etc.), but throughput didn't improve. I've pasted 
the code here: http://pastebin.com/gPXfdkPy
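
In case the pastebin doesn't outlive this thread, here is the rough shape of 
the program. This is a minimal sketch, not the actual code; the table name 
"testtable", column family "f", and the THREADS/BATCH_SIZE/BATCHES_PER_THREAD 
constants are placeholders:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchInsertTest {
        // Placeholder parameters; the real test varied these.
        static final int THREADS = 8;
        static final int BATCH_SIZE = 1000;
        static final int BATCHES_PER_THREAD = 100;
        static final byte[] FAMILY = Bytes.toBytes("f");

        public static void main(String[] args) throws Exception {
            final Configuration conf = HBaseConfiguration.create();
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);
            for (int t = 0; t < THREADS; t++) {
                final int threadId = t;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            // One HTable per thread; HTable is not thread-safe.
                            HTable table = new HTable(conf, "testtable");
                            for (int b = 0; b < BATCHES_PER_THREAD; b++) {
                                List<Put> batch = new ArrayList<Put>(BATCH_SIZE);
                                for (int i = 0; i < BATCH_SIZE; i++) {
                                    Put put = new Put(Bytes.toBytes(
                                        threadId + "-" + b + "-" + i));
                                    put.add(FAMILY, Bytes.toBytes("q"),
                                            Bytes.toBytes("value"));
                                    batch.add(put);
                                }
                                // put(List<Put>) ships the whole batch at once;
                                // with auto-flush on there is no extra
                                // client-side buffering.
                                table.put(batch);
                            }
                            table.close();
                        } catch (IOException e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
        }
    }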

I have auto-flush on (the default), since I'm inserting rows in batches and 
so don't need the internal HTable write buffer.
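
For reference, this is the relevant client setting. It's the default, so the 
sketch above doesn't set it explicitly, and the commented-out buffer size is 
just an example value:

    // Auto-flush is the default in 0.92; shown explicitly for clarity.
    table.setAutoFlush(true);

    // The alternative would be client-side buffering:
    // table.setAutoFlush(false);
    // table.setWriteBufferSize(12 * 1024 * 1024); // e.g. 12 MB (assumption)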

I've posted my config as well: http://pastebin.com/LVG9h6Z4

The regionservers have 12 cores (24 with HT), 128 GB RAM, and 6 SCSI drives; 
max throughput is 90-100 MB/sec per drive. I've also tested this on an EC2 
High I/O instance type with 2 SSDs, 64 GB RAM, and 8 cores (16 with HT). Both 
the EC2 and my colo cluster have the same issue: resources appear 
underutilized.

I measure disk usage with iostat and measured the theoretical max with hdparm 
and dd. I use iftop to monitor network bandwidth usage and used iperf to test 
the theoretical max. For CPU usage I use top and iostat.

The maximum write performance I'm getting is usually around 20 MB/sec per 
drive (on my colo cluster) on each of the 2 data nodes. That's about 20% of 
the max, and even that is sporadic, not a sustained 20 MB/sec per drive. 
Network usage also seems to top out around 20% (200 Mbit/sec) to each node. 
CPU usage on each node is around 10%. The problem is even more pronounced on 
EC2, which has much higher theoretical limits for storage and network I/O.

Copying a 133 GB file to HDFS gives similar performance to HBase (sporadic 
disk usage topping out at 20%, low CPU, 30-40% network I/O), so it seems this 
is more of an HDFS issue than an HBase issue.
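
If anyone wants to reproduce that last test without going through hadoop fs 
-put, here is a minimal sketch of a raw HDFS write test. The output path and 
size are placeholders, and it assumes the CDH4 client config is on the 
classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteTest {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            byte[] chunk = new byte[1024 * 1024]; // 1 MB of zeros
            long totalMb = 10 * 1024;             // 10 GB, placeholder size
            FSDataOutputStream out = fs.create(new Path("/tmp/write-test"));
            long start = System.currentTimeMillis();
            for (long i = 0; i < totalMb; i++) {
                out.write(chunk);
            }
            out.close();
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("Wrote " + totalMb + " MB in "
                + (elapsed / 1000.0) + " s ("
                + (totalMb * 1000.0 / elapsed) + " MB/s)");
        }
    }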
