Thanks, Jon, for sharing your methodology. I have backported LoadTestTool to 0.92 (HBASE-5124) and added two more config parameters. I ran it against a 5-node cluster. More testing is underway.
Cheers

On Fri, Jan 6, 2012 at 9:43 AM, Jonathan Hsieh <[email protected]> wrote:
> Hey all,
>
> I'm curious about the kinds of testing you all do and which paths you
> exercise before +1'ing a release candidate (or putting it into production).
> Do you just simulate the expected workloads you have at your installations?
> How much testing do you do on error recovery paths, or when HBase gets
> into a stressful place?
>
> Jimmy and I have been doing long-running TestLoadAndVerify from Bigtop
> using different configurations, including a stressful
> (flush/split/compact-heavy; properties below) configuration with the
> recent 0.92 release candidates. TestLoadAndVerify is basically two
> sequentially executed MR jobs -- one that loads data with "dependency
> chains" on previous writes, and one that verifies that all chains are
> satisfied (link below). At the moment we've been manually injecting faults
> (killing meta, root, masters, random RSs, pausing them to simulate GCs),
> but we will likely be injecting faults and exercising recovery paths more
> regularly and systematically.
>
> This approach has surfaced some of the recent distributed log splitting
> deadlocks Jimmy has been working on.
>
> I've encountered a few "transient" missing-data problems that I'm still
> trying to duplicate and isolate. The best I can say now is that they seem
> to happen if/when region servers have a large number of regions (roughly
> 900-2000 regions per range). More specifically, in these particular cases
> the verify job returns a list of sequential rows indicating that a region
> was temporarily unavailable or not returning data. Interestingly, when I
> run just the verify job again later on the same table, all rows are
> present. Since the load and verify jobs are two consecutively run MR jobs,
> my guess is that the cause is something time-delayed (balancing,
> splitting, or compaction?).
>
> Thanks,
> Jon.
> Here's how to set up Bigtop:
> https://cwiki.apache.org/confluence/display/BIGTOP/Setting+up+Bigtop+to+run+HBase+system+tests
>
> Here's the patch I've been using:
> https://issues.apache.org/jira/browse/BIGTOP-321
>
> Here's part of the stress configuration that stresses flushing, splitting,
> and balancing operations.
>
> ----
> <!-- stress settings -->
> <property>
>   <name>io.file.buffer.size</name>
>   <value>131072</value>
>   <description>Hadoop setting</description>
> </property>
> <property>
>   <name>hbase.hregion.max.filesize</name>
>   <value>4194304</value> <!-- 4MB -->
>   <!-- <value>268435456</value> 256MB, for lots of flushes without splits -->
>   <description>
>     Maximum HStoreFile size. If any one of a column families' HStoreFiles
>     has grown to exceed this value, the hosting HRegion is split in two.
>     Default: 256M.
>   </description>
> </property>
> <property>
>   <name>hbase.balancer.period</name>
>   <value>2000</value>
>   <description>Period at which the region balancer runs in the Master.
>   </description>
> </property>
> <property>
>   <name>hbase.hregion.memstore.flush.size</name>
>   <value>262144</value> <!-- 256KB -->
>   <description>
>     Memstore will be flushed to disk if size of the memstore
>     exceeds this number of bytes. Value is checked by a thread that runs
>     every hbase.server.thread.wakefrequency.
>     (normally 64 MB)
>   </description>
> </property>
> ----
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // [email protected]
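As a quick sanity check on why the quoted settings are flush/split-heavy, here is the rough arithmetic (illustrative only; real HBase behavior also depends on compactions, the number of column families, and write distribution):

```python
# Values taken directly from the stress configuration above.
flush_size = 262144          # hbase.hregion.memstore.flush.size: 256 KB
max_filesize = 4194304       # hbase.hregion.max.filesize: 4 MB
balancer_period_ms = 2000    # hbase.balancer.period: every 2 seconds

# Ignoring compaction, a region's store crosses the split threshold after
# roughly max_filesize / flush_size flushes.
flushes_per_split = max_filesize // flush_size
print(flushes_per_split)  # 16

# Compare against the defaults mentioned in the descriptions above
# (64 MB memstore flush, 256 MB max store file):
default_flushes_per_split = (256 * 1024 * 1024) // (64 * 1024 * 1024)
print(default_flushes_per_split)  # 4
```

With 256x-smaller flushes and a balancer running every 2 seconds instead of every few minutes, the cluster spends far more of its time in the flush/split/balance paths this test is meant to exercise.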
