Hey,

I wrote a reply to a different thread which encapsulates most of my recent
learning and understanding of how GC and the JVM impact large-scale data
import.

At this point, I have a 19-machine cluster with 30 TB of aggregate storage
on RAID0 (2 disks/box).  I've devoted them to HBase 0.20 testing, and I've
been able to load a massive set of (real) data in.  Unlike previous data
sets, this one is both (a) huge and (b) made up of tiny rows.

One thing I am finding is that I end up with weird bottlenecks:
- The clients don't always seem to be able to push maximal speed
- GC pauses are death (see the GC-settings sketch below)
- The compaction thread limit might be holding things up, but I'm not sure
about this one yet.
- In-memory complexity and size put significant stress on the JVM.
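
To make the GC angle concrete: the usual starting point is the concurrent
collector plus GC logging in hbase-env.sh, along the lines of the sketch
below.  The heap size, occupancy threshold, and log path are purely
illustrative, not the settings on this cluster.

  export HBASE_OPTS="-Xmx4g -XX:+UseConcMarkSweepGC \
      -XX:CMSInitiatingOccupancyFraction=70 \
      -verbose:gc -XX:+PrintGCDetails -Xloggc:/tmp/hbase-regionserver-gc.log"

The GC log is what tells you how long the pauses actually are.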

The bottom line is we are fighting against the JVM now - both with GC
problems and with general efficiency.  For example, a typical
regionserver can carry a memcache load of 1000-1500 MB.  That is a lot of
outstanding writes.
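
For context, the client-side technique from the thread quoted below
(disable auto-flush, large write buffer) looks roughly like this against
the 0.20-style client API.  This is a minimal sketch - the table and
column names are made up, and exact signatures may differ by version:

  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ImportSketch {
    public static void main(String[] args) throws Exception {
      HTable table = new HTable(new HBaseConfiguration(), "mytable");
      // Buffer puts client-side instead of doing one RPC per row.
      table.setAutoFlush(false);
      table.setWriteBufferSize(12 * 1024 * 1024);  // 12 MB, as in the thread below

      for (long i = 0; i < 1000000; i++) {
        Put put = new Put(Bytes.toBytes("row-" + i));
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value-" + i));
        table.put(put);  // queued in the write buffer, flushed when it fills
      }
      table.flushCommits();  // push whatever is still sitting in the buffer
    }
  }

In 0.19 the same knobs exist on HTable, but the write API is
BatchUpdate/commit rather than Put.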

As for numbers, to be happy I generally want to see the following import
performance:
- 100-130k ops/sec across 19 nodes
- 125-200 MB/sec of network traffic across all nodes
- 76 map tasks reading from MySQL -> HBase

This is currently sustainable with 3k regions for prolonged periods of
time.  I have an import that has run for 12 hours at these speeds.
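
Back of the envelope: 100-130k ops/sec spread over 19 nodes is roughly
5-7k ops/sec per regionserver, and 125-200 MB/sec at that op rate works
out to something like 1-2 KB of network traffic per op - which is
consistent with the tiny-row shape of this data set.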

Speed problems start to manifest themselves as dips in the network
performance graph.  The bigger dips (when I was having maximal GC pause
problems) would bounce performance between 0 and 175 MB/sec.  Smaller ones
could be due to I/O wait or other inefficiencies.

It's all about the GC pause!

-ryan

On Wed, Apr 29, 2009 at 2:56 PM, Jim Twensky <[email protected]> wrote:

> Hi Ryan,
>
> Have you got your new hardware? I was keeping an eye on your blog for the
> past few days but I haven't seen any updates there so I just decided to ask
> you on the list. If you have some results, would you like to give us some
> numbers along with hardware details?
>
> Thanks,
> Jim
>
> > On Thu, Jan 15, 2009 at 2:28 PM, Larry Compton <[email protected]> wrote:
>
> > That explains it. Thanks!
> >
> > On Thu, Jan 15, 2009 at 2:11 PM, Jean-Daniel Cryans <[email protected]> wrote:
> >
> > > Larry,
> > >
> > > This feature was done for 0.19.0 for which a release candidate is on the way.
> > >
> > > J-D
> > >
> > > On Thu, Jan 15, 2009 at 2:03 PM, Larry Compton <[email protected]> wrote:
> > >
> > > > I'm interested in trying this, but I'm not seeing "setAutoFlush()" and
> > > > "setWriteBufferSize()" in the "HTable" API (I'm using HBase 0.18.1).
> > > >
> > > > Larry
> > > >
> > > > On Sun, Jan 11, 2009 at 5:11 PM, Ryan Rawson <[email protected]> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > New user of hbase here. I've been trolling about in IRC for a few
> > > > > days, and been getting great help all around so far.
> > > > >
> > > > > The topic turns to importing data into hbase - I have largeish
> > > > > datasets I want to evaluate hbase performance on, so I've been
> > > > > working at importing said data.  I've managed to get some impressive
> > > > > performance speedups, and I chronicled them here:
> > > > >
> > > > > http://ryantwopointoh.blogspot.com/2009/01/performance-of-hbase-importing.html
> > > > >
> > > > > To summarize:
> > > > > - Use the Native HBASE API in Java or Jython (or presumably any JVM
> > > > > language)
> > > > > - Disable table auto flush, set write buffer large (12M for me)
> > > > >
> > > > > At this point I can import an 18 GB, 440m row comma-separated flat
> > > > > file in about 72 minutes using map-reduce.  This is on a 3 node
> > > > > cluster all running hdfs, hbase, and mapred with 12 map tasks (4
> > > > > per node).  This hardware is loaner DB hardware, so once I get my
> > > > > real cluster I'll revise/publish new data.
> > > > >
> > > > > I look forward to meeting some of you next week at the hbase meetup
> > > > > at powerset!
> > > > >
> > > > > -ryan
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Larry Compton
> > SRA International
> > 240.373.5312 (APL)
> > 443.742.2762 (cell)
> >
>
