Hey all, I'm curious about the kinds of testing you all do and which paths you exercise before +1'ing a release candidate (or putting it into production). Do you just simulate the expected workloads you have at your installations? How much testing do you all do on error recovery paths or when HBase gets into a stressful place?
Jimmy and I have been doing long-running runs of TestLoadAndVerify from Bigtop against the recent 0.92 release candidates, using different configurations including a stressful one (flush/split/compact heavy; properties below). TestLoadAndVerify is basically two sequentially executed MR jobs -- one that loads data with "dependency chains" on previous writes, and one that verifies that all chains are satisfied (link below; there's also a rough sketch of the chain idea in the P.S.). At the moment we've been manually injecting faults (killing meta, root, masters, and random region servers, and pausing them to simulate GCs), but we will likely be injecting faults and exercising recovery paths more regularly and systematically. This approach has surfaced some of the recent distributed log splitting deadlocks Jimmy has been working on.

I've also encountered a few "transient" missing-data problems that I'm still trying to reproduce and isolate. The best I can say right now is that they seem to happen if/when region servers host a large number of regions (roughly in the 900-2000 range). More specifically, in these cases the verify job returns a list of sequential rows indicating that a region was temporarily unavailable or not returning data. Interestingly, when I run just the verify job again later on the same table, all rows are present. Since the load and verify jobs are two consecutively run MR jobs, my guess is that the problem is related to something time-delayed (balancing, splitting, or compaction?).

Thanks,
Jon.

Here's how to set up Bigtop:
https://cwiki.apache.org/confluence/display/BIGTOP/Setting+up+Bigtop+to+run+HBase+system+tests

Here's the patch I've been using:
https://issues.apache.org/jira/browse/BIGTOP-321

Here's the part of the stress configuration that stresses flushing, splitting, and balancing operations:

----
<!-- stress settings -->
<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
  <description>Hadoop setting</description>
</property>

<property>
  <name>hbase.hregion.max.filesize</name>
  <value>4194304</value> <!-- 4MB -->
  <!-- <value>268435456</value> 256MB, for lots of flushes without splits -->
  <description>
    Maximum HStoreFile size. If any one of a column families' HStoreFiles has
    grown to exceed this value, the hosting HRegion is split in two.
    Default: 256M.
  </description>
</property>

<property>
  <name>hbase.balancer.period</name>
  <value>2000</value>
  <description>Period at which the region balancer runs in the Master.</description>
</property>

<property>
  <name>hbase.hregion.memstore.flush.size</name>
  <value>262144</value> <!-- 256KB -->
  <description>
    Memstore will be flushed to disk if size of the memstore exceeds this
    number of bytes. Value is checked by a thread that runs every
    hbase.server.thread.wakefrequency. (normally 64 MB)
  </description>
</property>
----

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [email protected]
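
P.S. In case the "dependency chain" idea isn't clear, here's a minimal, in-memory Java sketch of the concept. This is not the Bigtop code (the real test runs as two MR jobs against an actual HBase table, and the class and variable names below are made up for illustration), but it shows the load/verify shape: the load phase writes rows that each reference the previously written row in their chain, and the verify phase flags any row whose referenced key is missing.

----
import java.util.*;

// Toy illustration of the dependency-chain idea behind TestLoadAndVerify.
// NOT the actual test: the real thing loads/verifies via MapReduce against HBase.
public class ChainLoadVerifySketch {

    public static void main(String[] args) {
        Random rng = new Random(42);
        // "table": row key -> referenced row key (null marks a chain head)
        Map<Long, Long> table = new HashMap<>();

        // ---- load phase: build many chains of linked rows ----
        int chains = 1000, chainLength = 25;
        for (int c = 0; c < chains; c++) {
            Long prev = null;
            for (int i = 0; i < chainLength; i++) {
                long key = rng.nextLong();
                table.put(key, prev);   // each row points back at the previously written row
                prev = key;
            }
        }

        // ---- verify phase: every non-null reference must resolve to an existing row ----
        List<Long> broken = new ArrayList<Long>();
        for (Map.Entry<Long, Long> e : table.entrySet()) {
            Long ref = e.getValue();
            if (ref != null && !table.containsKey(ref)) {
                broken.add(e.getKey());   // analogous to the rows the verify job reports
            }
        }

        System.out.println("rows=" + table.size() + " brokenReferences=" + broken.size());
    }
}
----

In the real test, a non-empty "broken references" list is what the verify job emits, which is how the transient unavailability described above showed up.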
