Author: buildbot
Date: Mon Sep 29 00:47:11 2014
New Revision: 923971
Log:
Staging update by buildbot for gora
Modified:
websites/staging/gora/trunk/content/ (props changed)
websites/staging/gora/trunk/content/current/index.html
Propchange: websites/staging/gora/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Sep 29 00:47:11 2014
@@ -1 +1 @@
-1627190
+1628108
Modified: websites/staging/gora/trunk/content/current/index.html
==============================================================================
--- websites/staging/gora/trunk/content/current/index.html (original)
+++ websites/staging/gora/trunk/content/current/index.html Mon Sep 29 00:47:11
2014
@@ -163,10 +163,20 @@ under the License.
<li><a href="#gora-modules">Gora Modules</a></li>
<li><a href="#gora-testing">Gora Testing</a><ul>
<li><a href="#junit-tests">JUnit Tests</a></li>
-<li><a href="#goraci-integration-testsing-suite">GoraCI Integration Testsing
Suite</a></li>
+<li><a href="#goraci-integration-testsing-suite">GoraCI Integration Testing Suite</a><ul>
+<li><a href="#background">Background</a></li>
+<li><a href="#the-anatomy-of-goraci-tests">The Anatomy of GoraCI tests</a></li>
+<li><a href="#building-goraci">Building GoraCI</a></li>
+<li><a href="#java-class-description">Java Class Description</a></li>
</ul>
</li>
</ul>
+</li>
+<li><a href="#gora-and-hadoop">GORA AND HADOOP</a></li>
+<li><a href="#goraci-and-hbase">GORACI AND HBASE</a></li>
+<li><a href="#concurrency">CONCURRENCY</a></li>
+<li><a href="#conclusions">CONCLUSIONS</a></li>
+</ul>
</div>
<p>This is the main entry point for Gora documentation. Here are some pointers
for further info:</p>
<ul>
@@ -217,6 +227,204 @@ modules contain a <code>/src/examples/</
classes can be found. Specifically, there are some classes that are used for
tests
under <a
href="https://github.com/apache/gora/tree/master/gora-core/src/examples">gora-core/src/examples/</a>.</p>
<h3 id="goraci-integration-testsing-suite">GoraCI Integration Testing Suite</h3>
+<h4 id="background">Background</h4>
+<p>Since Gora 0.5, the GoraCI suite has been part of the mainstream Gora
codebase.</p>
+<p>Credit for GoraCI goes to Keith Turner (Gora PMC member), who had the foresight
+to develop it. We have since extended it from gora-accumulo to the entire suite
+of Gora modules.</p>
+<p><a href="http://accumulo.apache.org">Apache Accumulo</a> has a test suite
that verifies that data is not lost
+at scale. This test suite is called
+<a
href="http://svn.apache.org/viewvc/accumulo/tags/1.4.0/test/system/continuous/ScaleTest.odp?view=co">continuous
ingest</a>.<br />
+Essentially the test runs many ingest clients that continually create linked
lists containing <strong>25 million</strong>
+nodes. At some point the clients are stopped and a map reduce job is run to
+ensure no linked list has a hole. A hole indicates data was lost. </p>
+<p>The nodes in the linked list are random. This causes each linked list to
+spread across the table. Therefore if one part of a table loses data, then it
+will be detected by references in another part of the table.</p>
+<p>This project is a version of that test suite written using Apache Gora [1].
+GoraCI has been tested against Accumulo and HBase. </p>
+<h4 id="the-anatomy-of-goraci-tests">The Anatomy of GoraCI tests</h4>
+<p>Below is a rough sketch of how data is written. For specific details, look at the
+<a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Generator.java">Generator code</a>.</p>
+<ol>
+<li>Write out 1 million nodes </li>
+<li>Flush the client </li>
+<li>Write out 1 million nodes that reference the previous million </li>
+<li>If this is the 25th set of 1 million nodes, then update the 1st set of 1 million
+ to point to the last </li>
+<li>Go to step 1</li>
+</ol>
+<p>The key is that nodes only reference flushed nodes. Therefore a node should
+never reference a missing node, even if the ingest client is killed at any
+point in time.</p>
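+<p>The steps above can be illustrated with a small in-memory model. This is only a
+sketch, not the Generator code: the dictionary <code>store</code>, the function name
+<code>generate</code>, and the tiny batch size are assumptions made for illustration, and
+the "flush" is modeled simply by the order in which batches are written.</p>

```python
import random


def generate(store, batches_per_loop=25, batch_size=4, rng=None):
    """Toy model of one generation loop.

    store maps a node key to the key of the node it references (or None).
    Each batch only references nodes from the previous, already written batch.
    """
    rng = rng or random.Random(0)
    first = prev = None
    for _ in range(batches_per_loop):
        # Random 64-bit keys spread each linked list across the whole table.
        current = [rng.getrandbits(64) for _ in range(batch_size)]
        for j, key in enumerate(current):
            # The very first batch has nothing to point at yet.
            store[key] = prev[j] if prev is not None else None
        # The flush would happen here: only now may later nodes reference `current`.
        if first is None:
            first = current
        prev = current
    # Close the loop: the first batch is updated to point at the last batch.
    for j, key in enumerate(first):
        store[key] = prev[j]
    return first
```

+<p>After one loop every node in <code>store</code> references another stored node, so a
+later scan that finds a reference to a missing key has found lost data.</p>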
+<p>When this test suite is run with Accumulo, a script called the Agitator runs in
+parallel and randomly and continuously kills server processes.<br />
+Doing this uncovered many data loss bugs in Accumulo.
+This test suite can also help find bugs that impact uptime and stability when
+run for days or weeks. </p>
+<p>This test suite consists of the following: </p>
+<ul>
+<li>a few Java programs </li>
+<li>a little helper script to run the Java programs</li>
+<li>a Maven script to build it. </li>
+</ul>
+<p>When generating data, it's best to have each map task generate a multiple of 25
+million nodes. The reason for this is that circular linked lists are closed every
+25M nodes. Not generating a multiple of 25M will result in some nodes in the last
+linked list not having references. The loss of an unreferenced node cannot be
+detected.</p>
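+<p>As a small worked example of that rule, the helper below rounds a requested
+per-mapper node count up to the next multiple of 25M. The names
+<code>LOOP_SIZE</code> and <code>nodes_per_mapper</code> are hypothetical and are not part of GoraCI.</p>

```python
LOOP_SIZE = 25_000_000  # nodes per circular linked list, per the text above


def nodes_per_mapper(requested):
    """Round a requested node count up to the next multiple of LOOP_SIZE."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    # Ceiling division without floating point, then scale back up.
    return -(-requested // LOOP_SIZE) * LOOP_SIZE
```

+<p>For instance, a request for 30 million nodes would be rounded up to 50 million,
+so that the second linked list is also closed.</p>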
+<h4 id="building-goraci">Building GoraCI</h4>
+<p>As GoraCI is packaged with the Gora master branch source, it is automatically
+built every time you execute</p>
+<p><code>mvn install</code></p>
+<p>The Maven pom file has some profiles that attempt to make it easier to run
+GoraCI against different Gora backends by copying the jars you need into <code>lib</code>.
+Before packaging, it's important to edit <code>gora.properties</code> and set it correctly
+for your datastore. To run against Accumulo, do the following.</p>
+<div class="codehilite"><pre>vim src/main/resources/gora.properties # set Accumulo properties
+mvn package -Paccumulo-1.4
+</pre></div>
+<p>To run against HBase, do the following.</p>
+<div class="codehilite"><pre>vim src/main/resources/gora.properties # set HBase properties
+mvn package -Phbase-0.92
+</pre></div>
+<p>To run against Cassandra, do the following.</p>
+<div class="codehilite"><pre>vim src/main/resources/gora.properties # set Cassandra properties
+mvn package -Pcassandra-1.1.2
+</pre></div>
+<p>For other datastores mentioned in <code>gora.properties</code>, you will
need to copy the
+appropriate deps into <code>lib</code>. Feel free to update the pom with
other profiles, <a href="https://issues.apache.org/jira/browse/GORA/">open
+a ticket</a> or just <a href="https://github.com/apache/gora/">send us a pull
request</a>.</p>
+<h4 id="java-class-description">Java Class Description</h4>
+<p>Below is a description of the Java programs</p>
+<ul>
+<li><a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Generator.java">org.apache.gora.goraci.Generator</a> - A map-only job that generates data. As stated previously,
+ it's best to generate data in multiples of 25M.</li>
+<li><a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Verify.java">org.apache.gora.goraci.Verify</a> - A map reduce job that looks for holes. Look at the
+ counts after running. REFERENCED and UNREFERENCED are
+ OK; any UNDEFINED counts are bad. Do not run at the
+ same time as the Generator.</li>
+<li><a href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Walker.java">org.apache.gora.goraci.Walker</a> - A standalone program that starts following a linked list
+ and emits timing info. </li>
+<li><a
href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Print.java">org.apache.gora.goraci.Print</a>
- A standalone program that prints nodes in the linked list</li>
+<li><a
href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Delete.java">org.apache.gora.goraci.Delete</a>
- A standalone program that deletes a single node</li>
+<li><a
href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Loop.java">org.apache.gora.goraci.Loop</a>
- Runs generation and verify in a loop</li>
+</ul>
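+<p>To make the three Verify counters concrete, here is a minimal in-memory sketch of
+the classification. It is not the Verify map reduce job itself: the function
+<code>verify</code> and the dictionary layout (node key mapped to the key it references)
+are assumptions made for illustration.</p>

```python
from collections import Counter


def verify(store):
    """Classify nodes. store maps node key -> key of the node it references."""
    referenced = {prev for prev in store.values() if prev is not None}
    counts = Counter(REFERENCED=0, UNREFERENCED=0, UNDEFINED=0)
    for key in store:
        counts["REFERENCED" if key in referenced else "UNREFERENCED"] += 1
    # A reference to a key that was never stored is a hole: data was lost.
    # For each hole, record which nodes pointed at it.
    undefined = {
        ref: [key for key, prev in store.items() if prev == ref]
        for ref in referenced
        if ref not in store
    }
    counts["UNDEFINED"] = len(undefined)
    return counts, undefined
```

+<p>In this model, deleting one node from a closed loop yields one UNDEFINED entry
+(the deleted node) and one UNREFERENCED entry (the node the deleted one used to
+point at), while all remaining nodes stay REFERENCED.</p>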
+<p>org.apache.gora.goraci.sh is a helper script that you can use to run the above programs. It
+assumes all needed jars are in the lib dir. It does not need the package name,
+so you can just run "./org.apache.gora.goraci.sh Generator". Below is an example.</p>
+<div class="codehilite"><pre>$ ./org.apache.gora.goraci.sh Generator
+Usage : Generator &lt;num mappers&gt; &lt;num nodes&gt;
+</pre></div>
+<p>For Gora to work, it needs a gora.properties file and a mapping file on the
+classpath. The contents of both are datastore specific;
+more details can be found here [2]. You can edit the ones in src/main/resources
+and build the org.apache.gora.goraci-${version}-SNAPSHOT.jar with those. Alternatively, remove
+those and put them on the classpath through some other means.</p>
+<h2 id="gora-and-hadoop">GORA AND HADOOP</h2>
+<p>Gora uses Avro, which uses a JSON library that Hadoop has an old version of.
+The two libraries jackson-core and jackson-mapper need to be updated in
+&lt;HADOOP_HOME&gt;/lib and &lt;HADOOP_HOME&gt;/share/hadoop/lib/. I updated these to
+jackson-core-asl-1.4.2.jar and jackson-mapper-asl-1.4.2.jar. For details see
+HADOOP-6945 [3]. </p>
+<h2 id="goraci-and-hbase">GORACI AND HBASE</h2>
+<p>To improve the performance of read jobs such as the Verify step, enable
+scanner caching on the command line. For example:</p>
+<div class="codehilite"><pre>$ <span class="o">./</span><span
class="n">gorachi</span><span class="p">.</span><span class="n">sh</span> <span
class="n">Verify</span> <span class="o">-</span><span
class="n">Dhbase</span><span class="p">.</span><span
class="n">client</span><span class="p">.</span><span
class="n">scanner</span><span class="p">.</span><span
class="n">caching</span><span class="p">=</span>1000 <span class="o">\</span>
+ <span class="o">-</span><span class="n">Dmapred</span><span
class="p">.</span><span class="n">map</span><span class="p">.</span><span
class="n">tasks</span><span class="p">.</span><span
class="n">speculative</span><span class="p">.</span><span
class="n">execution</span><span class="p">=</span><span class="n">false</span>
<span class="n">verify_dir</span> 1000
+</pre></div>
+
+
+<p>Depending on how you have Hadoop and HBase deployed, you may need to
+modify the gorachi.sh script. Here is one suggestion that may help
+when your Hadoop and HBase configuration live somewhere other than under the
+Hadoop and HBase home directories.</p>
+<div class="codehilite"><pre>diff --git a/org.apache.gora.goraci.sh b/org.apache.gora.goraci.sh
+index db1562a..31c3c94 100755
+--- a/org.apache.gora.goraci.sh
++++ b/org.apache.gora.goraci.sh
+@@ -95,6 +95,4 @@ done
+ #run it
+ export HADOOP_CLASSPATH="$CLASSPATH"
+ LIBJARS=`echo $HADOOP_CLASSPATH | tr : ,`
+-hadoop jar "$GORACI_HOME/lib/org.apache.gora.goraci-0.0.1-SNAPSHOT.jar" $CLASS -libjars "$LIBJARS" "$@"
+-
+-
++CLASSPATH="${HBASE_CONF_DIR}" hadoop --config "${HADOOP_CONF_DIR}" jar "$GORACI_HOME/lib/org.apache.gora.goraci-0.0.1-SNAPSHOT.jar" $CLASS -files "${HBASE_CONF_DIR}/hbase-site.xml" -libjars "$LIBJARS" "$@"
+</pre></div>
+<p>You will need to define HBASE_CONF_DIR and HADOOP_CONF_DIR before you run
your
+org.apache.gora.goraci jobs. For example:</p>
+<div class="codehilite"><pre>$ export HADOOP_CONF_DIR=/home/you/hadoop-conf
+$ export HBASE_CONF_DIR=/home/you/hbase-conf
+$ PATH=/home/you/hadoop-1.0.2/bin:$PATH ./org.apache.gora.goraci.sh Generator 1000 1000000
+</pre></div>
+<h2 id="concurrency">CONCURRENCY</h2>
+<p>It's possible to run verification at the same time as generation. To do this,
+supply the -c option to Generator and Verify. This will cause Generator to
+create a secondary table which holds information about what verification can
+safely verify. Running Verify with the -c option will make it run slower
+because more information must be brought back to the client side for filtering
+purposes. The Loop program also supports the -c option, which will cause it to
+run verification concurrently with generation.</p>
+<p>If verification is run at the same time as generation without the -c option,
+then it will inevitably fail. This is because verification mappers read
+different parts of the table at different times, giving an inconsistent view
+of the table. One mapper may read a part of a table before a node is
+written; when that node is later referenced, it will appear to be missing. The
+-c option basically filters out newer information using data written to the
+secondary table.</p>
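+<p>The effect of that filtering can be illustrated with a toy model. Everything
+here is an assumption made for illustration: the text does not describe the layout of the
+secondary table, so the sketch simply supposes that each node carries a client
+id and a counter, and that <code>flushed</code> records the highest counter each client
+had safely flushed when verification started.</p>

```python
def undefined_refs(seen_nodes, flushed=None):
    """Return references that look missing to a verification pass.

    seen_nodes: key -> (prev, client, count), as observed by the mappers.
    flushed: client -> highest count known to be flushed when verification
             started (hypothetical stand-in for the secondary table).
             None models running without the -c option.
    """
    def trusted(client, count):
        return flushed is None or count <= flushed.get(client, -1)

    # With filtering enabled, nodes newer than the recorded flush point are ignored.
    kept = {k: v for k, v in seen_nodes.items() if trusted(v[1], v[2])}
    return sorted(
        prev
        for prev, _, _ in kept.values()
        if prev is not None and prev not in seen_nodes
    )
```

+<p>Suppose one mapper scanned the region that would hold node <code>n3</code> before it
+was written, and another mapper later saw <code>n4</code>, which references <code>n3</code>.
+Without filtering, <code>n3</code> is reported as missing; with the flush point recorded
+before <code>n3</code> and <code>n4</code> were written, <code>n4</code> is ignored and nothing is reported.</p>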
+<h2 id="conclusions">CONCLUSIONS</h2>
+<p>This test suite does not do everything that the Accumulo test suite does.
+Mainly, it does not collect statistics and generate reports. The reports
+are useful for assessing performance.</p>
+<p>The following shows a test of the test: ingest one linked list, delete a node
+in it, and ensure the verification map reduce job notices that the node is missing.
+Not all output is shown, just the important parts.</p>
+<div class="codehilite"><pre>$ ./org.apache.gora.goraci.sh Generator 1 25000000
+$ ./org.apache.gora.goraci.sh Print -s 2000000000000000 -l 1
+2000001f65dbd238:30350f9ae6f6e8f7:000004265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6
+$ ./org.apache.gora.goraci.sh Print -s 30350f9ae6f6e8f7 -l 1
+30350f9ae6f6e8f7:4867fe03de6ea6c8:000003265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6
+$ ./org.apache.gora.goraci.sh Delete 30350f9ae6f6e8f7
+Delete returned true
+$ ./org.apache.gora.goraci.sh Verify gci_verify_1 2
+11/12/20 17:12:31 INFO mapred.JobClient: org.apache.gora.goraci.Verify$Counts
+11/12/20 17:12:31 INFO mapred.JobClient: UNDEFINED=1
+11/12/20 17:12:31 INFO mapred.JobClient: REFERENCED=24999998
+11/12/20 17:12:31 INFO mapred.JobClient: UNREFERENCED=1
+$ hadoop fs -cat gci_verify_1/part*
+30350f9ae6f6e8f7 2000001f65dbd238
+</pre></div>
+<p>The map reduce job found the one undefined node and gave the node that
+referenced it.</p>
+<p>Below are some timing statistics for running org.apache.gora.goraci on a 10
node cluster. </p>
+<table>
+<tr><th>Store</th><th>Task</th><th>Time</th><th>Undef</th><th>Unref</th><th>Ref</th></tr>
+<tr><td>accumulo-1.4.0</td><td>Generator 10 100000000</td><td>40m 16s</td><td>N/A</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>accumulo-1.4.0</td><td>Verify /tmp/goraci1 40</td><td>6m 7s</td><td>0</td><td>0</td><td>1000000000</td></tr>
+<tr><td>hbase-0.92.1</td><td>Generator 10 100000000</td><td>2h 44m</td><td>N/A</td><td>N/A</td><td>N/A</td></tr>
+<tr><td>hbase-0.92.1</td><td>Verify /tmp/goraci2 40</td><td>6m 34s</td><td>0</td><td>0</td><td>1000000000</td></tr>
+</table>
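+<p>The timings above can be turned into rough throughput figures. The short
+calculation below is only a convenience sketch: it takes the 1,000,000,000
+referenced nodes reported by Verify as the node count for every run, and the
+helper name <code>seconds</code> is made up for this example.</p>

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert a wall-clock duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


TOTAL_NODES = 1_000_000_000  # 10 mappers x 100,000,000 nodes, matching the Ref column

runs = {
    ("accumulo-1.4.0", "Generator"): seconds(minutes=40, secs=16),
    ("accumulo-1.4.0", "Verify"): seconds(minutes=6, secs=7),
    ("hbase-0.92.1", "Generator"): seconds(hours=2, minutes=44),
    ("hbase-0.92.1", "Verify"): seconds(minutes=6, secs=34),
}

# Nodes processed per second for each run.
rates = {run: TOTAL_NODES / elapsed for run, elapsed in runs.items()}

for (store, task), rate in rates.items():
    print(f"{store:15} {task:10} {rate:12,.0f} nodes/s")
```

+<p>These are whole-cluster averages over each run and say nothing about
+variation during a run.</p>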
+<p>HBase and Accumulo are configured differently out-of-the-box. We used the Accumulo
+3G, native configuration examples in the conf/examples directory.</p>
+<p>To provide a comparable memory footprint, we increased the HBase JVM heap to "-Xmx4000m",
+and turned on compression for the ci table:</p>
+<p><code>create 'ci', {NAME=&gt;'meta', COMPRESSION=&gt;'GZ'}</code></p>
+<p>We also turned down the replication of write-ahead logs to be comparable to
Accumulo:</p>
+<div class="codehilite"><pre>&lt;property&gt;
+  &lt;name&gt;hbase.regionserver.hlog.replication&lt;/name&gt;
+  &lt;value&gt;2&lt;/value&gt;
+&lt;/property&gt;
+</pre></div>
+<p>For the Accumulo run, we set the split threshold to 512M:</p>
+<p><code>shell&gt; config -t ci -s table.split.threshold=512M</code></p>
+<p>This was done so that Accumulo would end up with 64 tablets, which is the
+number of regions HBase had. The number of tablets/regions determines how
+much parallelism there is in the map phase of the verify step.</p>
+<p>Sometimes, when this test suite is run against HBase, data is lost. This issue
+is being tracked under HBASE-5754 [4].</p>
+<p>[0] http://accumulo.apache.org<br />
+[1] http://gora.apache.org<br />
+[2] http://gora.apache.org/docs/current/gora-conf.html<br />
+[3] https://issues.apache.org/jira/browse/HADOOP-6945<br />
+[4] https://issues.apache.org/jira/browse/HBASE-5754</p>
</div> <!-- /container (main block) -->