Hey, I wrote a reply to a different thread which encapsulates most of my recent learning and understanding of how GC and the JVM impact large-scale data import.
At this point, I have a 19-machine cluster, with 30 TB of aggregate storage on RAID0 (2 disks/box). I've devoted them to hbase 0.20 testing, and I've been able to load a massive set of (real) data in. Unlike previous data sets, this one is both (a) huge and (b) made of tiny rows. One thing I am finding is I end up with weird bottlenecks:

- The clients don't always seem to be able to push maximal speed
- GC pauses are death
- The compaction thread limit might be holding things up, but I'm not sure about this one yet
- In-memory complexity and size are stressing the JVM significantly

The bottom line is we are fighting against the JVM now - both with GC problems and with general efficiency. For example, a typical regionserver can carry a memcache load of 1000-1500 MB. That is a lot of outstanding writes.

As for numbers, I generally want to see the following import performance to be happy:

- 100-130k ops/sec across 19 nodes
- 125-200 MB/sec of network traffic across all nodes
- 76 map tasks reading from mysql -> hbase

This is currently sustainable with 3k regions for prolonged periods of time. I have an import that has run for 12 hours at these speeds. Speed problems start to manifest themselves as dips in the network performance graph. The bigger dips (when I was having maximal GC pause problems) would bounce performance between 0 and 175 MB/sec. Smaller ones could be due to io-wait or other inefficiencies.

It's all about the GC pause!

-ryan

On Wed, Apr 29, 2009 at 2:56 PM, Jim Twensky <[email protected]> wrote:
> Hi Ryan,
>
> Have you got your new hardware? I was keeping an eye on your blog for the
> past few days but I haven't seen any updates there so I just decided to ask
> you on the list. If you have some results, would you like to give us some
> numbers along with hardware details?
>
> Thanks,
> Jim
>
> On Thu, Jan 15, 2009 at 2:28 PM, Larry Compton <[email protected]> wrote:
>
> > That explains it. Thanks!
> > On Thu, Jan 15, 2009 at 2:11 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >
> > > Larry,
> > >
> > > This feature was done for 0.19.0, for which a release candidate is on
> > > the way.
> > >
> > > J-D
> > >
> > > On Thu, Jan 15, 2009 at 2:03 PM, Larry Compton <[email protected]> wrote:
> > >
> > > > I'm interested in trying this, but I'm not seeing "setAutoFlush()"
> > > > and "setWriteBufferSize()" in the "HTable" API (I'm using HBase 0.18.1).
> > > >
> > > > Larry
> > > >
> > > > On Sun, Jan 11, 2009 at 5:11 PM, Ryan Rawson <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > New user of hbase here. I've been trolling about in IRC for a few
> > > > > days, and have been getting great help all around so far.
> > > > >
> > > > > The topic turns to importing data into hbase - I have largeish
> > > > > datasets I want to evaluate hbase performance on, so I've been
> > > > > working at importing said data. I've managed to get some impressive
> > > > > performance speedups, and I chronicled them here:
> > > > >
> > > > > http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> > > > >
> > > > > To summarize:
> > > > > - Use the native HBase API in Java or Jython (or presumably any JVM
> > > > > language)
> > > > > - Disable table auto flush, set the write buffer large (12M for me)
> > > > >
> > > > > At this point I can import an 18 GB, 440M-row comma-separated flat
> > > > > file in about 72 minutes using map-reduce. This is on a 3-node
> > > > > cluster, all running hdfs, hbase, and mapred with 12 map tasks
> > > > > (4 per node). This hardware is loaner DB hardware, so once I get my
> > > > > real cluster I'll revise/publish new data.
> > > > >
> > > > > I look forward to meeting some of you next week at the hbase meetup
> > > > > at powerset!
> > > > >
> > > > > -ryan
> > > > >
> >
> > --
> > Larry Compton
> > SRA International
> > 240.373.5312 (APL)
> > 443.742.2762 (cell)
> >
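[Editor's note: the auto-flush/write-buffer setup discussed in the quoted thread looks roughly like the sketch below, against the 0.20-era HTable client API. The table name, column family, and row keys are made up for illustration; treat this as a sketch of the buffered-write technique, not a tested import job.]

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedImport {
    public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        // "mytable" and the "data" family are hypothetical names.
        HTable table = new HTable(conf, "mytable");

        // The two calls the thread is about: buffer edits client-side
        // instead of round-tripping to the regionserver one row at a time.
        table.setAutoFlush(false);
        table.setWriteBufferSize(12 * 1024 * 1024); // 12 MB, per the post

        for (long i = 0; i < 1000000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("data"), Bytes.toBytes("value"),
                    Bytes.toBytes("v" + i));
            table.put(put); // buffered; sent when the write buffer fills
        }

        table.flushCommits(); // push any edits still sitting in the buffer
        table.close();
    }
}
```

The win comes from batching many small rows into one RPC; the trade-off is that buffered edits are lost if the client dies before a flush, which is usually acceptable for a re-runnable bulk import.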
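[Editor's note: Ryan's message doesn't show his JVM settings, but the "GC pauses are death" problem he describes was typically attacked with the concurrent (CMS) collector and GC logging. The flags below are a hypothetical example of that era's tuning for a regionserver, set via HBASE_OPTS in conf/hbase-env.sh - an assumption for illustration, not his actual configuration.]

```shell
# Hypothetical example only -- not the settings from the thread.
# Enable the concurrent collector, start CMS cycles early, and log GC
# activity so pauses can be correlated with dips in import throughput.
export HBASE_OPTS="-Xmx4g \
  -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```

The GC log output is what lets you confirm (as Ryan did via his network graphs) whether stop-the-world pauses line up with the throughput dips.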
