Thanks, Jon, for sharing your methodology. I have backported LoadTestTool to 0.92 (HBASE-5124) and added two more config parameters. I ran it against a 5-node cluster. More testing is underway.
Cheers

On Fri, Jan 6, 2012 at 9:43 AM, Jonathan Hsieh <[email protected]> wrote:
> Hey all,
>
> I'm curious about the kinds of testing you all do and which paths you
> exercise before +1'ing a release candidate (or putting it into production).
> Do you just simulate the expected workloads you have at your installations?
> How much testing do you do on error recovery paths, or when HBase gets
> into a stressful place?
>
> Jimmy and I have been doing long-running TestLoadAndVerify from Bigtop
> using different configurations, including a stressful
> (flush/split/compact-heavy; properties below) configuration with the
> recent 0.92 release candidates. TestLoadAndVerify is basically two
> sequentially executed MR jobs -- one that loads data with "dependency
> chains" on previous writes, and one that verifies that all chains are
> satisfied (link below). At the moment we've been manually injecting faults
> (killing meta, root, masters, random RSs, pausing them to simulate GCs),
> but we will likely be injecting faults and exercising recovery paths more
> regularly and systematically.
>
> This approach has surfaced some of the recent distributed log splitting
> deadlocks Jimmy has been working on.
>
> I've encountered a few "transient" missing-data problems that I'm still
> trying to duplicate and isolate. The best I can say now is that they seem
> to happen if/when region servers have a large number of regions (roughly
> 900-2000 regions per range). More specifically, in these particular cases
> the verify job returns a list of sequential rows indicating that a region
> was temporarily unavailable or not returning data. Interestingly, when I
> run just the verify job again later on the same table, all rows are
> present. Since the load and verify jobs are two consecutively run MR jobs,
> my guess is that the cause is something time-delayed (balancing,
> splitting, or compaction?).
>
> Thanks,
> Jon.
> Here's how to set up Bigtop:
> https://cwiki.apache.org/confluence/display/BIGTOP/Setting+up+Bigtop+to+run+HBase+system+tests
>
> Here's the patch I've been using:
> https://issues.apache.org/jira/browse/BIGTOP-321
>
> Here's part of the stress configuration that stresses flushing, splitting,
> and balancing operations.
>
> ----
> <!-- stress settings -->
> <property>
>   <name>io.file.buffer.size</name>
>   <value>131072</value>
>   <description>Hadoop setting</description>
> </property>
> <property>
>   <name>hbase.hregion.max.filesize</name>
>   <value>4194304</value> <!-- 4MB -->
>   <!-- <value>268435456</value> 256MB, for lots of flushes without splits -->
>   <description>
>     Maximum HStoreFile size. If any one of a column families' HStoreFiles
>     has grown to exceed this value, the hosting HRegion is split in two.
>     Default: 256M.
>   </description>
> </property>
> <property>
>   <name>hbase.balancer.period</name>
>   <value>2000</value>
>   <description>Period at which the region balancer runs in the Master.
>   </description>
> </property>
> <property>
>   <name>hbase.hregion.memstore.flush.size</name>
>   <value>262144</value> <!-- 256KB -->
>   <description>
>     Memstore will be flushed to disk if size of the memstore
>     exceeds this number of bytes. Value is checked by a thread that runs
>     every hbase.server.thread.wakefrequency.
>     (normally 64 MB)
>   </description>
> </property>
> ----
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // [email protected]
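As a quick sanity check on why the quoted settings are flush/split-heavy, here is the rough arithmetic (illustrative only; real HBase behavior also depends on compactions, the number of column families, and write distribution):

```python
# Values taken directly from the stress configuration above.
flush_size = 262144          # hbase.hregion.memstore.flush.size: 256 KB
max_filesize = 4194304       # hbase.hregion.max.filesize: 4 MB
balancer_period_ms = 2000    # hbase.balancer.period: every 2 seconds

# Ignoring compaction, a region's store crosses the split threshold after
# roughly max_filesize / flush_size flushes.
flushes_per_split = max_filesize // flush_size
print(flushes_per_split)  # 16

# Compare against the defaults mentioned in the descriptions above
# (64 MB memstore flush, 256 MB max store file):
default_flushes_per_split = (256 * 1024 * 1024) // (64 * 1024 * 1024)
print(default_flushes_per_split)  # 4
```

With 256x-smaller flushes and a balancer running every 2 seconds instead of every few minutes, the cluster spends far more of its time in the flush/split/balance paths this test is meant to exercise.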
