Author: buildbot
Date: Sat Oct 24 21:17:13 2015
New Revision: 970116
Log:
Staging update by buildbot for gora
Modified:
websites/staging/gora/trunk/content/ (props changed)
websites/staging/gora/trunk/content/current/gora-core.html
websites/staging/gora/trunk/content/current/index.html
websites/staging/gora/trunk/content/current/tutorial.html
Propchange: websites/staging/gora/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sat Oct 24 21:17:13 2015
@@ -1 +1 @@
-1710368
+1710388
Modified: websites/staging/gora/trunk/content/current/gora-core.html
==============================================================================
--- websites/staging/gora/trunk/content/current/gora-core.html (original)
+++ websites/staging/gora/trunk/content/current/gora-core.html Sat Oct 24
21:17:13 2015
@@ -157,7 +157,18 @@ under the License.
<div class="container" id="Gora_Gora Core Module">
-<h1 id="overview">Overview</h1>
+<style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="overview">Overview<a class="headerlink" href="#overview"
title="Permanent link">¶</a></h1>
<p>This is the main documentation for the DataStores contained within the
<code>gora-core</code> module which (as its name implies)
holds most of the core functionality for the Gora project. </p>
@@ -165,7 +176,7 @@ holds most of the core functionality for
in Gora depends on gora-core; therefore, most of the generic documentation
about the project is gathered here as well as the documentation for
<code>AvroStore</code>,
<code>DataFileAvroStore</code> and <code>MemStore</code>. In addition to this,
gora-core holds all of the
-core <strong>MapReduce</strong>, <strong>Persistency</strong>,
<strong>Query</strong>, <strong>DataStoreBase</strong> and
<strong>Utility</strong> functionality.</p>
+core <strong>MapReduce</strong>, <strong>GoraSparkEngine</strong>,
<strong>Persistency</strong>, <strong>Query</strong>,
<strong>DataStoreBase</strong> and <strong>Utility</strong> functionality.</p>
<div class="toc">
<ul>
<li><a href="#overview">Overview</a></li>
@@ -187,12 +198,16 @@ core <strong>MapReduce</strong>, <strong
<li><a href="#memstore-xml-mappings">MemStore XML mappings</a></li>
</ul>
</li>
+<li><a href="#gorasparkengine">GoraSparkEngine</a><ul>
+<li><a href="#description_3">Description</a></li>
+</ul>
+</li>
</ul>
</div>
-<h1 id="avrostore">AvroStore</h1>
-<h2 id="description">Description</h2>
+<h1 id="avrostore">AvroStore<a class="headerlink" href="#avrostore"
title="Permanent link">¶</a></h1>
+<h2 id="description">Description<a class="headerlink" href="#description"
title="Permanent link">¶</a></h2>
<p>AvroStore can be used for binary-compatible Avro serializations. It
supports Binary and JSON serializations.</p>
-<h2 id="goraproperties">gora.properties</h2>
+<h2 id="goraproperties">gora.properties<a class="headerlink"
href="#goraproperties" title="Permanent link">¶</a></h2>
<table class="table">
<thead>
<tr>
@@ -230,13 +245,13 @@ core <strong>MapReduce</strong>, <strong
</tbody>
</table>
-<h2 id="avrostore-xml-mappings">AvroStore XML mappings</h2>
+<h2 id="avrostore-xml-mappings">AvroStore XML mappings<a class="headerlink"
href="#avrostore-xml-mappings" title="Permanent link">¶</a></h2>
<p>In the stores covered within the gora-core module, no physical mappings are
required.</p>
-<h1 id="datafileavrostore">DataFileAvroStore</h1>
-<h2 id="description_1">Description</h2>
+<h1 id="datafileavrostore">DataFileAvroStore<a class="headerlink"
href="#datafileavrostore" title="Permanent link">¶</a></h1>
+<h2 id="description_1">Description<a class="headerlink" href="#description_1"
title="Permanent link">¶</a></h2>
<p>DataFileAvroStore is a file-based store which extends <code>AvroStore</code>
to use Avro's <code>DataFile{Writer,Reader}</code> as a backend.
This datastore supports MapReduce.</p>
-<h2 id="goraproperties_1">gora.properties</h2>
+<h2 id="goraproperties_1">gora.properties<a class="headerlink"
href="#goraproperties_1" title="Permanent link">¶</a></h2>
<p>DataFileAvroStore is configured exactly the same as AvroStore
above, with the following exception:
<table class="table">
<thead>
@@ -256,15 +271,15 @@ This datastore supports MapReduce.</p>
</tr>
</tbody>
</table></p>
-<h2 id="gora-core-mappings">Gora Core mappings</h2>
+<h2 id="gora-core-mappings">Gora Core mappings<a class="headerlink"
href="#gora-core-mappings" title="Permanent link">¶</a></h2>
<p>In the stores covered within the gora-core module, no physical mappings are
required.</p>
-<h1 id="memstore">MemStore</h1>
-<h2 id="description_2">Description</h2>
+<h1 id="memstore">MemStore<a class="headerlink" href="#memstore"
title="Permanent link">¶</a></h1>
+<h2 id="description_2">Description<a class="headerlink" href="#description_2"
title="Permanent link">¶</a></h2>
<p>Essentially this store is a ConcurrentSkipListMap in which operations run
as follows:</p>
<ul>
<li>put(K key, T Object) - expected average log(n) time cost</li>
<li>get(K key, String [] fields) - expected average log(n) time cost</li>
<li>delete(K key) - expected average log(n) time cost</li>
</ul>
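<p>The three operations above can be illustrated with a minimal sketch built directly on <code>java.util.concurrent.ConcurrentSkipListMap</code>. The class <code>SkipListStoreSketch</code> and its method names are hypothetical and exist only for this illustration; this is not Gora's MemStore implementation, and the key and value types are simplified to <code>Long</code> and <code>String</code>.</p>

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical stand-in for illustration only; this is not Gora's MemStore class.
class SkipListStoreSketch {
    // Sorted, thread-safe map; put/get/remove have expected average log(n) time cost.
    private final ConcurrentSkipListMap<Long, String> map = new ConcurrentSkipListMap<>();

    public void put(long key, String value) {
        map.put(key, value);
    }

    public String get(long key) {
        return map.get(key);
    }

    // Returns true if the key was present and has been removed.
    public boolean delete(long key) {
        return map.remove(key) != null;
    }

    public int size() {
        return map.size();
    }

    public static void main(String[] args) {
        SkipListStoreSketch store = new SkipListStoreSketch();
        store.put(2L, "second");
        store.put(1L, "first");
        System.out.println(store.get(1L));    // first
        System.out.println(store.delete(2L)); // true
        System.out.println(store.size());     // 1
    }
}
```

<p>Because the underlying map keeps its keys sorted, range scans over keys come at no extra bookkeeping cost, which is one plausible reason to prefer a skip list over a hash map for an in-memory store.</p>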
-<h2 id="goraproperties_2">gora.properties</h2>
+<h2 id="goraproperties_2">gora.properties<a class="headerlink"
href="#goraproperties_2" title="Permanent link">¶</a></h2>
<p>MemStore is configured exactly the same as AvroStore above, with
the following exception:
<table class="table">
<thead>
@@ -284,8 +299,49 @@ This datastore supports MapReduce.</p>
</tr>
</tbody>
</table></p>
-<h2 id="memstore-xml-mappings">MemStore XML mappings</h2>
+<h2 id="memstore-xml-mappings">MemStore XML mappings<a class="headerlink"
href="#memstore-xml-mappings" title="Permanent link">¶</a></h2>
<p>In the stores covered within the gora-core module, no physical mappings are
required.</p>
+<h1 id="gorasparkengine">GoraSparkEngine<a class="headerlink"
href="#gorasparkengine" title="Permanent link">¶</a></h1>
+<h2 id="description_3">Description<a class="headerlink" href="#description_3"
title="Permanent link">¶</a></h2>
+<p>GoraSparkEngine is the Spark backend of Apache Gora. Assume that the input and
output data stores are:</p>
+<div class="codehilite"><pre><span class="n">DataStore</span><span
class="o">&lt;</span><span class="n">K1</span><span class="p">,</span> <span
class="n">V1</span><span class="o">&gt;</span> <span
class="n">inStore</span><span class="p">;</span>
+<span class="n">DataStore</span><span class="o">&lt;</span><span
class="n">K2</span><span class="p">,</span> <span class="n">V2</span><span
class="o">&gt;</span> <span class="n">outStore</span><span class="p">;</span>
+</pre></div>
+
+
+<p>The first step in using GoraSparkEngine is to initialize it:</p>
+<div class="codehilite"><pre><span class="n">GoraSparkEngine</span><span
class="o">&lt;</span><span class="n">K1</span><span class="p">,</span> <span
class="n">V1</span><span class="o">&gt;</span> <span
class="n">goraSparkEngine</span> <span class="p">=</span> <span
class="n">new</span> <span class="n">GoraSparkEngine</span><span
class="o">&lt;&gt;</span><span class="p">(</span><span class="n">K1</span><span
class="p">.</span><span class="n">class</span><span class="p">,</span> <span
class="n">V1</span><span class="p">.</span><span class="n">class</span><span
class="p">);</span>
+</pre></div>
+
+
+<p>Construct a <code>JavaSparkContext</code>. Register the input data store's
value class as a Kryo class:</p>
+<div class="codehilite"><pre><span class="n">SparkConf</span> <span
class="n">sparkConf</span> <span class="p">=</span> <span class="n">new</span>
<span class="n">SparkConf</span><span class="p">().</span><span
class="n">setAppName</span><span class="p">(</span>"<span
class="n">Gora</span> <span class="n">Spark</span> <span
class="n">Integration</span> <span class="n">Application</span>"<span
class="p">).</span><span class="n">setMaster</span><span
class="p">(</span>"<span class="n">local</span>"<span
class="p">);</span>
+<span class="n">Class</span><span class="p">[]</span> <span class="n">c</span>
<span class="p">=</span> <span class="n">new</span> <span
class="n">Class</span><span class="p">[</span>1<span class="p">];</span>
+<span class="n">c</span><span class="p">[</span>0<span class="p">]</span>
<span class="p">=</span> <span class="n">inStore</span><span
class="p">.</span><span class="n">getPersistentClass</span><span
class="p">();</span>
+<span class="n">sparkConf</span><span class="p">.</span><span
class="n">registerKryoClasses</span><span class="p">(</span><span
class="n">c</span><span class="p">);</span>
+<span class="n">JavaSparkContext</span> <span class="n">sc</span> <span
class="p">=</span> <span class="n">new</span> <span
class="n">JavaSparkContext</span><span class="p">(</span><span
class="n">sparkConf</span><span class="p">);</span>
+</pre></div>
+
+
+<p>A <code>JavaPairRDD</code> can be retrieved from the input data store:</p>
+<div class="codehilite"><pre><span class="n">JavaPairRDD</span><span
class="o">&lt;</span><span class="n">Long</span><span class="p">,</span> <span
class="n">Pageview</span><span class="o">&gt;</span> <span
class="n">goraRDD</span> <span class="p">=</span> <span
class="n">goraSparkEngine</span><span class="p">.</span><span
class="n">initialize</span><span class="p">(</span><span
class="n">sc</span><span class="p">,</span> <span class="n">inStore</span><span
class="p">);</span>
+</pre></div>
+
+
+<p>After that, all Spark functionality can be applied. For example, a
count can be run as follows:</p>
+<div class="codehilite"><pre><span class="n">long</span> <span
class="n">count</span> <span class="p">=</span> <span
class="n">goraRDD</span><span class="p">.</span><span
class="n">count</span><span class="p">();</span>
+</pre></div>
+
+
+<p>Map and Reduce functions can be run on a <code>JavaPairRDD</code> as well.
Assume that the following variable holds the result after map/reduce is applied:</p>
+<div class="codehilite"><pre><span class="n">JavaPairRDD</span><span
class="o">&lt;</span><span class="n">String</span><span class="p">,</span>
<span class="n">MetricDatum</span><span class="o">&gt;</span> <span
class="n">mapReducedGoraRdd</span><span class="p">;</span>
+</pre></div>
+
+
+<p>The result can be written to the output data store as follows:</p>
+<div class="codehilite"><pre><span class="n">Configuration</span> <span
class="n">sparkHadoopConf</span> <span class="p">=</span> <span
class="n">goraSparkEngine</span><span class="p">.</span><span
class="n">generateOutputConf</span><span class="p">(</span><span
class="n">outStore</span><span class="p">);</span>
+<span class="n">mapReducedGoraRdd</span><span class="p">.</span><span
class="n">saveAsNewAPIHadoopDataset</span><span class="p">(</span><span
class="n">sparkHadoopConf</span><span class="p">);</span>
+</pre></div>
</div> <!-- /container (main block) -->
Modified: websites/staging/gora/trunk/content/current/index.html
==============================================================================
--- websites/staging/gora/trunk/content/current/index.html (original)
+++ websites/staging/gora/trunk/content/current/index.html Sat Oct 24 21:17:13
2015
@@ -157,7 +157,18 @@ under the License.
<div class="container" id="Gora_Gora Module Overview">
-<h2 id="introduction">Introduction</h2>
+<style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h2>
<div class="toc">
<ul>
<li><a href="#introduction">Introduction</a></li>
@@ -188,7 +199,7 @@ under the License.
<li>We are always looking for <a href="../contribute.html">Documentation
contributions</a>.</li>
</ul>
<p>You can find an abstract overview of how to configure Gora <a
href="./gora-conf.html">here</a>.</p>
-<h2 id="gora-modules">Gora Modules</h2>
+<h2 id="gora-modules">Gora Modules<a class="headerlink" href="#gora-modules"
title="Permanent link">¶</a></h2>
<p>Gora source code is organized in a modular architecture. The gora-core
module
is the main module which contains the core of the code. All other modules
depend
on the gora-core module.
@@ -204,7 +215,7 @@ following modules are currently implemen
<li><a href="./gora-shims.html">gora-shims-hadoop-1.x</a>: Module enabling us
to use Gora with Hadoop 1.X;</li>
<li><a href="./gora-shims.html">gora-shims-hadoop-2.x</a>: Module enabling us
to use Gora with Hadoop 2.X;</li>
<li><a href="./gora-shims.html">gora-shims-hadoop-distribution</a>: Packaging
container module enabling easier dependency management whilst working with Gora
Shims;</li>
-<li><a href="./gora-core.html">gora-core</a>: Module containing core
functionality, AvroStore and DataFileAvroStore stores;</li>
+<li><a href="./gora-core.html">gora-core</a>: Module containing core
functionality, the AvroStore and DataFileAvroStore stores, and GoraSparkEngine;</li>
<li><a href="./gora-accumulo.html">gora-accumulo</a>: Module for <a
href="http://accumulo.apache.org">Apache Accumulo</a> backend and AccumuloStore
implementation;</li>
<li><a href="./gora-camel.html">camel-gora</a>: An <a
href="http://camel.apache.org/">Apache Camel</a> component that allows you to
work with NoSQL databases using Gora;</li>
<li><a href="./gora-cassandra.html">gora-cassandra</a>: Module for <a
href="http://cassandra.apache.org">Apache Cassandra</a> backend and
CassandraStore implementation;</li>
@@ -217,11 +228,11 @@ following modules are currently implemen
<li>gora-sources-dist: Packaging module used to build and distribute Gora
sources during project releases;</li>
</ul>
<p>We currently have modules under development for <a
href="http://www.oracle.com/technetwork/database/database-technologies/nosqldb/overview/index.html">Oracle
NoSQL</a> and <a href="http://lucene.apache.org">Apache Lucene</a>.</p>
-<h2 id="gora-testing">Gora Testing</h2>
+<h2 id="gora-testing">Gora Testing<a class="headerlink" href="#gora-testing"
title="Permanent link">¶</a></h2>
<p>Gora currently has two testing mechanisms:</p>
<ul>
<li>JUnit Tests: These are included for every module which provides a DataStore
within Gora.</li>
<li>Integration Tests: A custom testing suite called GoraCI (Continuous
Ingestion) which stress tests Gora functionality at scale.</li>
</ul>
-<h3 id="junit-tests">JUnit Tests</h3>
+<h3 id="junit-tests">JUnit Tests<a class="headerlink" href="#junit-tests"
title="Permanent link">¶</a></h3>
<p>Unit tests in Gora are implemented using the popular <a
href="http://junit.org">JUnit</a> framework.
Each module which implements the <a
href="https://builds.apache.org/view/All/job/gora-trunk/javadoc/index.html?org/apache/gora/store/DataStore.html">DataStore</a>
interface similarly implements a <a
href="https://github.com/apache/gora/blob/master/gora-core/src/test/java/org/apache/gora/store/DataStoreTestBase.java">DataStoreTestBase</a>
API
@@ -235,8 +246,8 @@ API functionality is at the examples dir
modules contain a <code>/src/examples/</code> directory under which some
example
classes can be found. Specifically, there are some classes that are used for
tests
under <a
href="https://github.com/apache/gora/tree/master/gora-core/src/examples">gora-core/src/examples/</a>.</p>
-<h3 id="goraci-integration-testsing-suite">GoraCI Integration Testsing
Suite</h3>
-<h4 id="background">Background</h4>
+<h3 id="goraci-integration-testsing-suite">GoraCI Integration Testing Suite<a
class="headerlink" href="#goraci-integration-testsing-suite" title="Permanent
link">¶</a></h3>
+<h4 id="background">Background<a class="headerlink" href="#background"
title="Permanent link">¶</a></h4>
<p>Since Gora 0.5, the GoraCI suite has been part of the mainstream Gora
codebase.</p>
<p>Credit for GoraCI can be handed to Keith Turner (Gora PMC member) for his
foresight
in developing GoraCI which we have now extended from gora-accumulo to the
entire suite
@@ -252,7 +263,7 @@ spread across the table. Therefore if o
will be detected by references in another part of the table.</p>
<p>This project is a version of the test suite written using Apache Gora [1].
GoraCI has been tested against Accumulo and HBase. </p>
-<h4 id="the-anatomy-of-goraci-tests">The Anatomy of GoraCI tests</h4>
+<h4 id="the-anatomy-of-goraci-tests">The Anatomy of GoraCI tests<a
class="headerlink" href="#the-anatomy-of-goraci-tests" title="Permanent
link">¶</a></h4>
<p>Below is a rough sketch of how data is written. For specific details look at
the
<a
href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Generator.java">Generator
code</a>.</p>
<ol>
@@ -282,7 +293,7 @@ million. The reason for this is that ci
25M. Not generating a multiple of 25M will result in some nodes in the linked
list not having references. The loss of an unreferenced node cannot be
detected.</p>
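<p>The detection idea described above can be sketched in a few lines of plain Java. This is a simplified illustration, not GoraCI's Generator or Verify code: the class <code>LinkedListCheckSketch</code>, its method names, and the in-memory map standing in for the table are all assumptions made for the example.</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; names are hypothetical and this is not GoraCI's API.
class LinkedListCheckSketch {

    // Builds a circular list in which node i stores the key of node i-1 and
    // node 0 points back to node n-1, so that every node is referenced once.
    public static Map<Long, Long> buildCircularList(int n) {
        Map<Long, Long> table = new HashMap<>();
        for (long i = 0; i < n; i++) {
            long prev = (i == 0) ? n - 1 : i - 1;
            table.put(i, prev);
        }
        return table;
    }

    // Returns the keys that some node references but that are absent from the table.
    public static Set<Long> findMissing(Map<Long, Long> table) {
        Set<Long> missing = new TreeSet<>();
        for (Long ref : table.values()) {
            if (!table.containsKey(ref)) {
                missing.add(ref);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Map<Long, Long> table = buildCircularList(5);
        table.remove(3L);                       // simulate losing one row
        System.out.println(findMissing(table)); // [3]
    }
}
```

<p>Closing the list into a circle is what makes every node referenced. If the list were left open, the node nobody points to could vanish without <code>findMissing</code> noticing, which matches the caveat about unreferenced nodes.</p>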
-<h4 id="building-goraci">Building GoraCI</h4>
+<h4 id="building-goraci">Building GoraCI<a class="headerlink"
href="#building-goraci" title="Permanent link">¶</a></h4>
<p>As GoraCI is packaged with the Gora master branch source it is
automatically
built every time you execute</p>
<div class="codehilite"><pre><span class="n">mvn</span> <span
class="n">install</span>
@@ -313,7 +324,7 @@ for your datastore. To run against Accu
<p>For other datastores mentioned in <code>gora.properties</code>, you will
need to copy the
appropriate deps into <code>lib</code>. Feel free to update the pom with
other profiles, <a href="https://issues.apache.org/jira/browse/GORA/">open
a ticket</a> or just <a href="https://github.com/apache/gora/">send us a pull
request</a>.</p>
-<h4 id="java-class-description">Java Class Description</h4>
+<h4 id="java-class-description">Java Class Description<a class="headerlink"
href="#java-class-description" title="Permanent link">¶</a></h4>
<p>Below is a description of the Java programs</p>
<ul>
<li><a
href="https://github.com/apache/gora/blob/master/gora-goraci/src/main/java/org/apache/gora/goraci/Generator.java">org.apache.gora.goraci.Generator</a>
- A map only job that generates data. As stated previously,
@@ -341,13 +352,13 @@ You can just run <code>goraci.sh Generat
more details can be found here [2]. You can edit the ones in src/main/resources
and build the <code>goraci-${version}-SNAPSHOT.jar</code> with those.
Alternatively remove
those and put them on the classpath through some other means.</p>
-<h4 id="gora-and-hadoop">Gora and Hadoop</h4>
+<h4 id="gora-and-hadoop">Gora and Hadoop<a class="headerlink"
href="#gora-and-hadoop" title="Permanent link">¶</a></h4>
<p>Gora uses <a href="http://avro.apache.org">Apache Avro</a>, which uses a
JSON library of which Hadoop ships an old version.
The two libraries jackson-core and jackson-mapper need to be updated in
<code>$HADOOP_HOME/lib</code> and <code>$HADOOP_HOME/share/hadoop/lib/</code>.
Currently these are updated to
jackson-core-asl-1.4.2.jar and jackson-mapper-asl-1.4.2.jar. For details see
<a href="https://issues.apache.org/jira/browse/HADOOP-6945">HADOOP-6945</a>.
</p>
-<h4 id="goraci-and-hbase">GoraCI and HBase</h4>
+<h4 id="goraci-and-hbase">GoraCI and HBase<a class="headerlink"
href="#goraci-and-hbase" title="Permanent link">¶</a></h4>
<p>To improve performance running read jobs such as the Verify step, enable
scanner caching on the command line. For example:</p>
<div class="codehilite"><pre>$ <span class="o">./</span><span
class="n">goraci</span><span class="p">.</span><span class="n">sh</span> <span
class="n">Verify</span> <span class="o">-</span><span
class="n">Dhbase</span><span class="p">.</span><span
class="n">client</span><span class="p">.</span><span
class="n">scanner</span><span class="p">.</span><span
class="n">caching</span><span class="p">=</span>1000 <span class="o">\</span>
@@ -382,7 +393,7 @@ $ <span class="n">PATH</span><span class
</pre></div>
-<h4 id="concurrency">Concurrency</h4>
+<h4 id="concurrency">Concurrency<a class="headerlink" href="#concurrency"
title="Permanent link">¶</a></h4>
<p>It's possible to run verification at the same time as generation. To do this,
supply the -c option to Generator and Verify. This will cause Generator to
create a secondary table which holds information about what verification can
@@ -397,7 +408,7 @@ of the table. So one mapper may read a
written, when the node is later referenced it will appear to be missing. The
<strong>-c</strong> option basically filters out newer information using data
written to the
secondary table.</p>
-<h4 id="conclusions">Conclusions</h4>
+<h4 id="conclusions">Conclusions<a class="headerlink" href="#conclusions"
title="Permanent link">¶</a></h4>
<p>This test suite does not do everything that the Accumulo test suite does;
mainly, it does not collect statistics and generate reports. The reports
are useful for assessing performance.</p>
Modified: websites/staging/gora/trunk/content/current/tutorial.html
==============================================================================
--- websites/staging/gora/trunk/content/current/tutorial.html (original)
+++ websites/staging/gora/trunk/content/current/tutorial.html Sat Oct 24
21:17:13 2015
@@ -157,9 +157,20 @@ under the License.
<div class="container" id="Gora_Gora Tutorial">
-<h1 id="gora-tutorial">Gora Tutorial</h1>
+<style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink,
h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink,
dt:hover > .elementid-permalink { visibility: visible }</style>
+<h1 id="gora-tutorial">Gora Tutorial<a class="headerlink"
href="#gora-tutorial" title="Permanent link">¶</a></h1>
<p>Author : Enis Söztutar, enis [at] apache [dot] org</p>
-<h2 id="introduction">Introduction</h2>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction"
title="Permanent link">¶</a></h2>
<p>This is the official tutorial for Apache Gora. For this tutorial, we
will be implementing a system to store our web server logs in Apache HBase,
and analyze the results using Apache Hadoop and store the results either in
HSQLDB or MySQL.</p>
@@ -170,7 +181,7 @@ Next, we will go over the API of Gora to
fetching and querying objects, and deleting objects. Last, we will go over an
example
program which uses Hadoop MapReduce to analyze the web server logs, and
discuss the Gora
MapReduce API in some detail.</p>
-<h2 id="table-of-content">Table of Content</h2>
+<h2 id="table-of-content">Table of Content<a class="headerlink"
href="#table-of-content" title="Permanent link">¶</a></h2>
<div class="toc">
<ul>
<li><a href="#gora-tutorial">Gora Tutorial</a><ul>
@@ -187,7 +198,7 @@ MapReduce API in some detail.</p>
<li><a href="#hbase-mappings">HBase mappings</a></li>
</ul>
</li>
-<li><a href="#basic-api-wzxhzdk79">Basic API </title></a><ul>
+<li><a href="#basic-api">Basic API</a><ul>
<li><a href="#parsing-the-logs">Parsing the logs</a></li>
<li><a href="#storing-objects-in-the-datastore">Storing objects in the
DataStore</a></li>
<li><a href="#closing-the-datastore">Closing the DataStore</a></li>
@@ -224,7 +235,7 @@ MapReduce API in some detail.</p>
</li>
</ul>
</div>
-<h2 id="introduction-to-gora">Introduction to Gora</h2>
+<h2 id="introduction-to-gora">Introduction to Gora<a class="headerlink"
href="#introduction-to-gora" title="Permanent link">¶</a></h2>
<p>The Apache Gora open source framework provides an in-memory data
model and persistence for big data. Gora supports persisting to
column stores, key value stores, document stores and RDBMSs, and
@@ -241,7 +252,7 @@ has it's own module, such as gora-hbase,
and gora-sql. In your projects, you need to only include
the artifacts from the modules you use. You can consult the <a
href="/current/quickstart.html">quick start</a>
for setting up your project.</p>
-<h2 id="setting-up-gora">Setting up Gora</h2>
+<h2 id="setting-up-gora">Setting up Gora<a class="headerlink"
href="#setting-up-gora" title="Permanent link">¶</a></h2>
<p>As a first step, we need to download and compile the Gora source code. The
source code
for the tutorial is in the gora-tutorial module. If you have
already downloaded Gora, that's cool, otherwise, please go
@@ -274,6 +285,7 @@ $ <span class="n">tree</span>
<span class="o">|</span> <span class="o">|</span>
`<span class="o">--</span> <span class="nb">log</span>
<span class="o">|</span> <span class="o">|</span>
<span class="o">|--</span> <span class="n">KeyValueWritable</span><span
class="p">.</span><span class="n">java</span>
<span class="o">|</span> <span class="o">|</span>
<span class="o">|--</span> <span class="n">LogAnalytics</span><span
class="p">.</span><span class="n">java</span>
+ <span class="o">|</span> <span class="o">|</span>
<span class="o">|--</span> <span class="n">LogAnalyticsSpark</span><span
class="p">.</span><span class="n">java</span>
<span class="o">|</span> <span class="o">|</span>
<span class="o">|--</span> <span class="n">LogManager</span><span
class="p">.</span><span class="n">java</span>
<span class="o">|</span> <span class="o">|</span>
<span class="o">|--</span> <span class="n">TextLong</span><span
class="p">.</span><span class="n">java</span>
<span class="o">|</span> <span class="o">|</span>
`<span class="o">--</span> <span class="n">generated</span>
@@ -290,7 +302,7 @@ $ <span class="n">tree</span>
<p>Since gora-tutorial is a top level module of Gora, it depends on the
directory
structure imposed by Gora's main build scripts (pom.xml for Maven). The Java
source code resides in directory
<code>src/main/java/</code>, Avro schemas in <code>src/main/avro/</code>, and
data in <code>src/main/resources/</code>.</p>
-<h2 id="setting-up-hbase">Setting up HBase</h2>
+<h2 id="setting-up-hbase">Setting up HBase<a class="headerlink"
href="#setting-up-hbase" title="Permanent link">¶</a></h2>
<p>For this tutorial we will be using <a
href="http://hbase.apache.org">HBase</a> to
store the logs. For those of you not familiar with HBase, it is a NoSQL
column store with an architecture very similar to Google's BigTable.</p>
@@ -309,7 +321,7 @@ After extracting the file, cd to the hba
</pre></div>
-<h2 id="configuring-gora">Configuring Gora</h2>
+<h2 id="configuring-gora">Configuring Gora<a class="headerlink"
href="#configuring-gora" title="Permanent link">¶</a></h2>
<p>Gora is configured through a file in the classpath named gora.properties.
We will be using the following file
<code>gora-tutorial/conf/gora.properties</code></p>
<div class="codehilite"><pre> <span class="n">gora</span><span
class="p">.</span><span class="n">datastore</span><span class="p">.</span><span
class="n">default</span><span class="p">=</span><span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">gora</span><span class="p">.</span><span class="n">hbase</span><span
class="p">.</span><span class="n">store</span><span class="p">.</span><span
class="n">HBaseStore</span>
@@ -321,7 +333,7 @@ We will be using the following file <cod
and schemas (tables) should be automatically created.
More information for configuring different settings in
<code>gora.properties</code>
can be found <a href="/current/gora-conf.html">here</a>.</p>
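<p>Since <code>gora.properties</code> is an ordinary Java properties file, its key/value layout can be illustrated with <code>java.util.Properties</code> alone. The sketch below only shows how the <code>gora.datastore.default</code> entry from the file above is read; the class <code>GoraPropertiesSketch</code> is hypothetical, and the way Gora itself locates and loads the file from the classpath is not shown here.</p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Illustrative sketch only; this is not how Gora loads its configuration internally.
class GoraPropertiesSketch {

    // Parses properties text and returns the configured default datastore class name,
    // or null if the key is not set.
    public static String defaultDataStore(String propertiesText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(propertiesText));
        return props.getProperty("gora.datastore.default");
    }

    public static void main(String[] args) throws IOException {
        String text = "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\n";
        System.out.println(defaultDataStore(text));
        // prints org.apache.gora.hbase.store.HBaseStore
    }
}
```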
-<h2 id="modeling-the-data">Modeling the data</h2>
+<h2 id="modeling-the-data">Modeling the data<a class="headerlink"
href="#modeling-the-data" title="Permanent link">¶</a></h2>
<p>For this tutorial, we will be parsing and storing the logs of a web server.
Some example logs are at <code>src/main/resources/access.log.tar.gz</code>,
which
belongs to the (now shut down) server at http://www.buldinle.com/.
@@ -343,7 +355,7 @@ Some example lines from the log are:</p>
<p>The first fields in order are: User's IP, ignored, ignored, Date and
time, HTTP method, URL, HTTP version, HTTP status code, Number of bytes
returned, Referrer, and User Agent.</p>
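<p>The field order just described can be made concrete with a small parser. The sketch below uses a regular expression rather than the StringTokenizer approach of the tutorial's own <code>parseLine()</code>, and it assumes that the third token inside the quoted request is the protocol version. The class <code>LogLineSketch</code> and the sample line are made up for illustration and are not taken from gora-tutorial or its access log.</p>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; LogLineSketch is a hypothetical name, not part of gora-tutorial.
class LogLineSketch {
    // One capture group per field: ip, ignored, ignored, date, method, url,
    // protocol version, status code, bytes, referrer, user agent.
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"(\\S+) (\\S+) (\\S+)\" "
        + "(\\d{3}) (\\d+|-) \"([^\"]*)\" \"([^\"]*)\"$");

    // Returns the 11 fields of a log line, or null if the line does not match.
    public static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample line in the same layout as the log described above.
        String line = "192.0.2.1 - - [10/Mar/2009:20:18:43 +0200] "
            + "\"GET /index.php?a=1 HTTP/1.1\" 200 43 \"http://example.com/\" \"Mozilla/4.0\"";
        String[] f = parse(line);
        System.out.println(f[0] + " " + f[4] + " " + f[5] + " " + f[7]);
    }
}
```

<p>Returning <code>null</code> for a non-matching line keeps the sketch short; a real parser would more likely report or count malformed lines.</p>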
-<h2 id="defining-data-beans">Defining data beans</h2>
+<h2 id="defining-data-beans">Defining data beans<a class="headerlink"
href="#defining-data-beans" title="Permanent link">¶</a></h2>
<p>Data beans are the main way to hold the data in memory and persist in Gora.
Gora
needs to explicitly keep track of the status of the data in memory, so
we use <a href="http://avro.apache.org">Apache Avro</a> for defining the
beans. Using
@@ -376,7 +388,7 @@ single URL access in the logs. Let's go
are defined with type "record", with a name as the name of the class, and a
namespace which is mapped to the package name in Java. The fields
are listed in the "fields" element. Each field is given with its type. </p>
-<h2 id="compiling-avro-schemas">Compiling Avro Schemas</h2>
+<h2 id="compiling-avro-schemas">Compiling Avro Schemas<a class="headerlink"
href="#compiling-avro-schemas" title="Permanent link">¶</a></h2>
<p>The next step after defining the data beans is to compile the schemas
into Java classes. For that we will use the <a
href="/current/compiler.html">GoraCompiler</a>.
Invoke the Gora compiler from the top-level Gora directory with:</p>
@@ -458,7 +470,7 @@ user. Now, let's look at the internals o
class as a placeholder for string fields. We can also see the embedded Avro
Schema declaration and an inner enum named Field. The enum and
the _ALL_FIELDS fields will come in handy when we query the datastore for
specific fields. </p>
-<h2 id="defining-data-store-mappings">Defining data store mappings</h2>
+<h2 id="defining-data-store-mappings">Defining data store mappings<a
class="headerlink" href="#defining-data-store-mappings" title="Permanent
link">¶</a></h2>
<p>Gora is designed to flexibly work with various types of data modeling,
including column stores (such as HBase, Cassandra, etc.), SQL databases, flat
files (binary,
JSON, XML encoded), and key-value stores. The mapping between the data bean
and
@@ -466,7 +478,7 @@ the data store is thus defined in XML ma
mapping format, so that data-store specific settings can be leveraged more
easily.
The mapping files declare how the fields of the classes declared in Avro
schemas
are serialized and persisted to the data store.</p>
-<h3 id="hbase-mappings">HBase mappings</h3>
+<h3 id="hbase-mappings">HBase mappings<a class="headerlink"
href="#hbase-mappings" title="Permanent link">¶</a></h3>
<p>HBase mappings are stored in a file named
<code>gora-hbase-mappings.xml</code>.
For this tutorial we will be using the file
<code>gora-tutorial/conf/gora-hbase-mappings.xml</code>.</p>
<div class="codehilite"><pre><span class="nt">&lt;gora-otd&gt;</span>
@@ -517,8 +529,8 @@ as the column qualifier. Note that map a
families, so the configuration should list unique column families for each
map and
array type, and no qualifier should be given. The exact data model is
discussed further
at the <a href="/current/gora-hbase.html">gora-hbase</a> documentation. </p>
-<h2 id="basic-api-wzxhzdk79">Basic API </title></h2>
-<h3 id="parsing-the-logs">Parsing the logs</h3>
+<h2 id="basic-api">Basic API<a class="headerlink" href="#basic-api"
title="Permanent link">¶</a></h2>
+<h3 id="parsing-the-logs">Parsing the logs<a class="headerlink"
href="#parsing-the-logs" title="Permanent link">¶</a></h3>
<p>Now that we have the basic setup, we can see the Gora API in action. As you
will notice below, the API
is simple to use. We will be using the class LogManager (which is
located at
<code>gora-tutorial/src/main/java/org/apache/gora/tutorial/log/LogManager.java</code>)
for parsing
@@ -656,7 +668,7 @@ defined earlier.</p>
<p>parseLine() uses standard StringTokenizers for the job
and constructs and returns a Pageview object.</p>
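<p>As a rough illustration of the tokenizer-based approach, the sketch below
parses one access-log line with <code>java.util.StringTokenizer</code>. It is
not the tutorial's LogManager code: the class <code>LogLineParser</code>, the
holder <code>SimplePageview</code>, and the assumed Common Log Format layout are
hypothetical stand-ins, and the timestamp is kept as a plain string.</p>

```java
import java.util.StringTokenizer;

// Hypothetical sketch; SimplePageview stands in for the generated Pageview bean.
class LogLineParser {

    static final class SimplePageview {
        final String ip;
        final String timestamp;
        final String httpMethod;
        final String url;
        final int httpStatusCode;
        final int responseSize;

        SimplePageview(String ip, String timestamp, String httpMethod,
                       String url, int httpStatusCode, int responseSize) {
            this.ip = ip;
            this.timestamp = timestamp;
            this.httpMethod = httpMethod;
            this.url = url;
            this.httpStatusCode = httpStatusCode;
            this.responseSize = responseSize;
        }
    }

    // Assumes a line such as:
    // 1.2.3.4 - - [10/Mar/2009:20:21:33 +0200] "GET /index.php HTTP/1.1" 200 43
    static SimplePageview parseLine(String line) {
        StringTokenizer tokens = new StringTokenizer(line);
        String ip = tokens.nextToken();      // whitespace-delimited client address
        tokens.nextToken();                  // discard identity field
        tokens.nextToken();                  // discard user field
        // Read up to ']' and drop the leading " [" of the timestamp.
        String timestamp = tokens.nextToken("]").substring(2);
        tokens.nextToken("\"");              // skip "] " before the quoted request
        String request = tokens.nextToken("\"");
        String[] requestParts = request.split(" ");
        tokens.nextToken(" ");               // skip the closing quote
        int status = Integer.parseInt(tokens.nextToken());
        int size = Integer.parseInt(tokens.nextToken());
        return new SimplePageview(ip, timestamp, requestParts[0],
                requestParts[1], status, size);
    }

    public static void main(String[] args) {
        SimplePageview p = parseLine(
            "1.2.3.4 - - [10/Mar/2009:20:21:33 +0200] \"GET /index.php HTTP/1.1\" 200 43");
        System.out.println(p.httpMethod + " " + p.url + " " + p.httpStatusCode);
    }
}
```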
-<h3 id="storing-objects-in-the-datastore">Storing objects in the DataStore</h3>
+<h3 id="storing-objects-in-the-datastore">Storing objects in the DataStore<a
class="headerlink" href="#storing-objects-in-the-datastore" title="Permanent
link">¶</a></h3>
<p>If we look back at the parse() method above, we can see that the
Pageview objects returned by parseLine() are stored via
the storePageview() method. </p>
@@ -671,7 +683,7 @@ we can see that it is dead simple.</p>
<p>All we need to do is to call the put() method, which expects a long as key
and an instance of Pageview
as a value.</p>
-<h3 id="closing-the-datastore">Closing the DataStore</h3>
+<h3 id="closing-the-datastore">Closing the DataStore<a class="headerlink"
href="#closing-the-datastore" title="Permanent link">¶</a></h3>
<p>DataStore implementations can do a lot of caching for performance.
However, this means that data is not always flushed to persistent storage
immediately.
So once we have finished storing objects, we need to close
the datastore
@@ -693,7 +705,7 @@ semantics can vary by the data store bac
on the JDBC Connection object, whereas in HBase, <code>HTable#flush()</code>
is called.
Also note that even if you call flush() at the end of all data manipulation
operations,
you still need to call close() on the datastore.</p>
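<p>The difference between buffering, flushing and closing can be pictured with
a small stand-alone sketch. The class <code>BufferedStoreSketch</code> below is
hypothetical and does not use any Gora classes; it only mimics a store that
caches writes in memory, persists them on flush(), and flushes and releases
itself on close().</p>

```java
import java.io.Closeable;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a caching data store; not a Gora class.
class BufferedStoreSketch implements Closeable {
    // Stands in for persistent storage.
    private final Map<Long, String> backend = new HashMap<>();
    // Writes that have been accepted but not yet persisted.
    private final Map<Long, String> buffer = new LinkedHashMap<>();
    private boolean open = true;

    void put(long key, String value) {
        if (!open) {
            throw new IllegalStateException("store is closed");
        }
        buffer.put(key, value); // cached only, not yet persisted
    }

    void flush() {
        backend.putAll(buffer); // push cached writes to "storage"
        buffer.clear();
    }

    @Override
    public void close() {
        flush();      // make sure nothing is left in the cache
        open = false; // a real store would also release connections here
    }

    int persistedCount() {
        return backend.size();
    }

    boolean isOpen() {
        return open;
    }

    public static void main(String[] args) {
        BufferedStoreSketch store = new BufferedStoreSketch();
        store.put(1L, "pageview-1");
        System.out.println("persisted before close: " + store.persistedCount());
        store.close();
        System.out.println("persisted after close: " + store.persistedCount());
    }
}
```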
-<h2 id="persisted-data-in-hbase">Persisted data in HBase</h2>
+<h2 id="persisted-data-in-hbase">Persisted data in HBase<a class="headerlink"
href="#persisted-data-in-hbase" title="Permanent link">¶</a></h2>
<p>Now that we have stored the web access log data in HBase, we can look at
how the data is stored in HBase. For that, start the HBase shell.</p>
<div class="codehilite"><pre>$ cd ../hbase-<span class="cp">${</span><span
class="n">version</span><span class="cp">}</span>
@@ -744,7 +756,7 @@ have been stored.</p>
</pre></div>
-<h2 id="fetching-objects-from-data-store">Fetching objects from data store</h2>
+<h2 id="fetching-objects-from-data-store">Fetching objects from data store<a
class="headerlink" href="#fetching-objects-from-data-store" title="Permanent
link">¶</a></h2>
<p>Fetching objects from the data store is as easy as storing them. There are
essentially
two methods for fetching objects. The first is to fetch a single object given
its key. The
second method is to run a query through the data store.</p>
@@ -779,7 +791,7 @@ from the data store and prints the resul
</pre></div>
-<h2 id="querying-objects">Querying objects</h2>
+<h2 id="querying-objects">Querying objects<a class="headerlink"
href="#querying-objects" title="Permanent link">¶</a></h2>
<p>The DataStore API defines a Query interface for querying the objects in the data
store.
Each data store implementation can use a specific implementation of the Query
interface. Queries are
instantiated by calling <code>DataStore#newQuery()</code>. When the query is
run through the datastore, the results
@@ -854,7 +866,7 @@ we can use:</p>
</pre></div>
-<h2 id="deleting-objects">Deleting objects</h2>
+<h2 id="deleting-objects">Deleting objects<a class="headerlink"
href="#deleting-objects" title="Permanent link">¶</a></h2>
<p>Just like fetching objects, there are two main methods to delete
objects from the data store. The first one is to delete objects one by
one using the <code>DataStore#delete(K key)</code> method, which takes the key
of the object.
@@ -888,12 +900,12 @@ Continueing from the LogManager class, t
</pre></div>
-<h2 id="mapreduce-support">MapReduce Support</h2>
+<h2 id="mapreduce-support">MapReduce Support<a class="headerlink"
href="#mapreduce-support" title="Permanent link">¶</a></h2>
<p>Gora has first-class MapReduce support for <a
href="http://hadoop.apache.org">Apache Hadoop</a>.
Gora data stores can be used as inputs and outputs of jobs. Moreover, the
objects can
be serialized, and passed between tasks keeping their persistency state. For
the
serialization, Gora extends Avro DatumWriters.</p>
-<h3 id="log-analytics-in-mapreduce">Log analytics in MapReduce</h3>
+<h3 id="log-analytics-in-mapreduce">Log analytics in MapReduce<a
class="headerlink" href="#log-analytics-in-mapreduce" title="Permanent
link">¶</a></h3>
<p>For this part of the tutorial, we will be analyzing the logs that have been
stored in HBase earlier. Specifically, we will develop a MapReduce program to
calculate the number of daily pageviews for each URL in the site.</p>
@@ -904,12 +916,12 @@ For computing the analytics, the mapper
in which the pageview occurred, so that the daily pageviews are accumulated.
The reducer just sums up the values, and outputs MetricDatum objects
to be sent to the output Gora data store.</p>
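<p>The aggregation itself can be sketched without Hadoop or Gora. The
hypothetical class <code>DailyPageviewCount</code> below rolls each timestamp
down to the start of its day and sums one count per (URL, day) pair, which is
the same arithmetic the mapper and reducer split between them. The key format
<code>url@dayStart</code>, the <code>View</code> holder, and the use of UTC day
boundaries are assumptions made for this illustration.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the daily pageview count; no Hadoop involved.
class DailyPageviewCount {
    static final long DAY_MILLIS = 24L * 60 * 60 * 1000;

    // Minimal stand-in for a stored pageview: only URL and timestamp matter here.
    static final class View {
        final String url;
        final long timestampMillis;

        View(String url, long timestampMillis) {
            this.url = url;
            this.timestampMillis = timestampMillis;
        }
    }

    // "Map" side: roll the timestamp down to the start of its (UTC) day.
    // Assumes non-negative epoch milliseconds.
    static long rollupToDay(long timestampMillis) {
        return (timestampMillis / DAY_MILLIS) * DAY_MILLIS;
    }

    // "Reduce" side: sum a count of 1 for every view sharing the same key.
    static Map<String, Long> countDaily(List<View> views) {
        Map<String, Long> counts = new TreeMap<>();
        for (View v : views) {
            String key = v.url + "@" + rollupToDay(v.timestampMillis);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<View> views = new ArrayList<>();
        views.add(new View("/", 0L));
        views.add(new View("/", 1000L));
        views.add(new View("/", DAY_MILLIS + 5));
        System.out.println(countDaily(views));
    }
}
```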
-<h3 id="setting-up-the-environment">Setting up the environment</h3>
+<h3 id="setting-up-the-environment">Setting up the environment<a
class="headerlink" href="#setting-up-the-environment" title="Permanent
link">¶</a></h3>
<p>We will be using the logs stored in HBase by the LogManager class.
We will push the output of the job to an HSQL database, since it requires no
configuration to
set up. However, you can also use MySQL or HBase for storing the analytics
results.
If you want to continue with HBase, you can skip the next sections. </p>
-<h3 id="setting-up-the-database">Setting up the database</h3>
+<h3 id="setting-up-the-database">Setting up the database<a class="headerlink"
href="#setting-up-the-database" title="Permanent link">¶</a></h3>
<p>First we need to download HSQL dependencies. For that, ensure that the
hsqldb
dependency is available in the Maven pom.xml.
Of course, MySQL users should uncomment the mysql dependency instead. </p>
@@ -924,16 +936,16 @@ Ofcourse MySQL users should uncomment th
<p>If you are using MySQL, you should also set up the database server, create
the database
and grant the necessary permissions to create tables, etc. so that Gora can run
properly.</p>
-<h3 id="configuring-gora_1">Configuring Gora</h3>
+<h3 id="configuring-gora_1">Configuring Gora<a class="headerlink"
href="#configuring-gora_1" title="Permanent link">¶</a></h3>
<p>We will put the configuration necessary to connect to the database in
<code>gora-tutorial/conf/gora.properties</code>.</p>
-<h4 id="jdbc-properties-for-gora-sql-module-using-hsql">JDBC properties for
gora-sql module using HSQL</h4>
+<h4 id="jdbc-properties-for-gora-sql-module-using-hsql">JDBC properties for
gora-sql module using HSQL<a class="headerlink"
href="#jdbc-properties-for-gora-sql-module-using-hsql" title="Permanent
link">¶</a></h4>
<div class="codehilite"><pre><span class="n">gora</span><span
class="p">.</span><span class="n">sqlstore</span><span class="p">.</span><span
class="n">jdbc</span><span class="p">.</span><span class="n">driver</span><span
class="p">=</span><span class="n">org</span><span class="p">.</span><span
class="n">hsqldb</span><span class="p">.</span><span class="n">jdbcDriver</span>
<span class="n">gora</span><span class="p">.</span><span
class="n">sqlstore</span><span class="p">.</span><span
class="n">jdbc</span><span class="p">.</span><span class="n">url</span><span
class="p">=</span><span class="n">jdbc</span><span class="p">:</span><span
class="n">hsqldb</span><span class="p">:</span><span class="n">hsql</span><span
class="p">:</span><span class="o">//</span><span
class="n">localhost</span><span class="o">/</span><span
class="n">goratest</span>
</pre></div>
-<h4 id="jdbc-properties-for-gora-sql-module-using-mysql">JDBC properties for
gora-sql module using MySQL</h4>
+<h4 id="jdbc-properties-for-gora-sql-module-using-mysql">JDBC properties for
gora-sql module using MySQL<a class="headerlink"
href="#jdbc-properties-for-gora-sql-module-using-mysql" title="Permanent
link">¶</a></h4>
<div class="codehilite"><pre><span class="n">gora</span><span
class="p">.</span><span class="n">sqlstore</span><span class="p">.</span><span
class="n">jdbc</span><span class="p">.</span><span class="n">driver</span><span
class="p">=</span><span class="n">com</span><span class="p">.</span><span
class="n">mysql</span><span class="p">.</span><span class="n">jdbc</span><span
class="p">.</span><span class="n">Driver</span>
<span class="n">gora</span><span class="p">.</span><span
class="n">sqlstore</span><span class="p">.</span><span
class="n">jdbc</span><span class="p">.</span><span class="n">url</span><span
class="p">=</span><span class="n">jdbc</span><span class="p">:</span><span
class="n">mysql</span><span class="p">:</span><span class="o">//</span><span
class="n">localhost</span><span class="p">:</span>3306<span
class="o">/</span><span class="n">goratest</span>
<span class="n">gora</span><span class="p">.</span><span
class="n">sqlstore</span><span class="p">.</span><span
class="n">jdbc</span><span class="p">.</span><span class="n">user</span><span
class="p">=</span><span class="n">root</span>
@@ -945,7 +957,7 @@ and give necessary permissions to create
and jdbc.url is the JDBC connection URL. Moreover, jdbc.user
and jdbc.password can be specified if needed. More information on these
parameters can be found in the <a href="/current/gora-sql.html">gora-sql</a>
documentation. </p>
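<p>To see how such settings are read in plain Java, the sketch below loads the
HSQL values shown earlier with <code>java.util.Properties</code>. It does not
show how Gora itself reads <code>gora.properties</code>; the class
<code>GoraJdbcProperties</code> is hypothetical, and treating a missing
<code>gora.sqlstore.jdbc.user</code> as an empty string is an assumption made
for this example.</p>

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper; only demonstrates reading the JDBC keys with java.util.Properties.
class GoraJdbcProperties {

    // Parses properties-file text, such as the contents of gora.properties.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text =
            "gora.sqlstore.jdbc.driver=org.hsqldb.jdbcDriver\n"
          + "gora.sqlstore.jdbc.url=jdbc:hsqldb:hsql://localhost/goratest\n";
        Properties props = parse(text);
        System.out.println("driver = " + props.getProperty("gora.sqlstore.jdbc.driver"));
        System.out.println("url    = " + props.getProperty("gora.sqlstore.jdbc.url"));
        // The user is optional; fall back to an empty string when it is absent (assumption).
        System.out.println("user   = " + props.getProperty("gora.sqlstore.jdbc.user", ""));
    }
}
```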
-<h3 id="modelling-the-data-data-beans-for-analytics">Modelling the data - Data
Beans for Analytics</h3>
+<h3 id="modelling-the-data-data-beans-for-analytics">Modelling the data - Data
Beans for Analytics<a class="headerlink"
href="#modelling-the-data-data-beans-for-analytics" title="Permanent
link">¶</a></h3>
<p>For web site analytics, we will be using a generic MetricDatum
data structure. It holds a string metricDimension, a long
timestamp, and a long metric field. The first two fields
@@ -969,7 +981,7 @@ code at <code>gora-tutorial/src/main/jav
</pre></div>
-<h3 id="data-store-mappings">Data store mappings</h3>
+<h3 id="data-store-mappings">Data store mappings<a class="headerlink"
href="#data-store-mappings" title="Permanent link">¶</a></h3>
<p>We will be using the SQL backend to store the job output data, just to
demonstrate the SQL backend. </p>
<p>Similar to what we have seen with HBase, the gora-sql plugin reads its
configuration from the
@@ -998,7 +1010,7 @@ name attribute contains the name of the
column attribute is the name of the
column in the database. The primaryKey holds the actual key as the primary key
field. Currently,
Gora only supports tables with one primary key.</p>
-<h2 id="constructing-the-job">Constructing the job</h2>
+<h2 id="constructing-the-job">Constructing the job<a class="headerlink"
href="#constructing-the-job" title="Permanent link">¶</a></h2>
<p>In constructing the job object for Hadoop, we need to define whether we
will use
Gora as job input, output or both. Gora defines
its own GoraInputFormat and GoraOutputFormat, which
@@ -1041,7 +1053,7 @@ rather than data store instances.</p>
</pre></div>
-<h3 id="gora-mappers-and-using-gora-an-input">Gora mappers and using Gora as
input</h3>
+<h3 id="gora-mappers-and-using-gora-an-input">Gora mappers and using Gora as
input<a class="headerlink" href="#gora-mappers-and-using-gora-an-input"
title="Permanent link">¶</a></h3>
<p>Typically, if Gora is used as job input, the Mapper class extends<br />
GoraMapper. However, currently this is not enforced by the API so other class
hierarchies can be used instead.
The mapper receives the key-value pairs that are the results of the input
query, and emits
@@ -1071,7 +1083,7 @@ and outputs the key as a tuple of <UR
</pre></div>
-<h3 id="gora-reducers-and-using-gora-as-output">Gora reducers and using Gora
as output</h3>
+<h3 id="gora-reducers-and-using-gora-as-output">Gora reducers and using Gora
as output<a class="headerlink" href="#gora-reducers-and-using-gora-as-output"
title="Permanent link">¶</a></h3>
<p>Similar to the input, typically, if Gora is used as job output, the Reducer
extends
GoraReducer. The values emitted by the reducer are persisted to the output
data store
as a result of the job. </p>
@@ -1103,14 +1115,14 @@ will be stored at the output data store.
</pre></div>
-<h3 id="running-the-job">Running the job</h3>
+<h3 id="running-the-job">Running the job<a class="headerlink"
href="#running-the-job" title="Permanent link">¶</a></h3>
<p>Now that the job is constructed, we can run the Hadoop job as usual. Note
that the run function
of the LogAnalytics class parses the arguments and runs the job. We can run
the program with:</p>
<div class="codehilite"><pre>$ <span class="n">bin</span><span
class="o">/</span><span class="n">gora</span> <span
class="n">loganalytics</span> <span class="p">[</span><span
class="o">&lt;</span><span class="n">input</span> <span class="n">data</span>
<span class="n">store</span><span class="o">&gt;</span> <span
class="p">[</span><span class="o">&lt;</span><span class="n">output</span>
<span class="n">data</span> <span class="n">store</span><span
class="o">&gt;</span><span class="p">]]</span>
</pre></div>
-<h3 id="running-the-job-with-sql">Running the job with SQL</h3>
+<h3 id="running-the-job-with-sql">Running the job with SQL<a
class="headerlink" href="#running-the-job-with-sql" title="Permanent
link">¶</a></h3>
<p>Now, let's run the log analytics tool with the SQL backend (either HSQL or
MySQL). The input data store will be </p>
<div class="codehilite"><pre><span class="n">org</span><span
class="p">.</span><span class="n">apache</span><span class="p">.</span><span
class="n">gora</span><span class="p">.</span><span class="n">hbase</span><span
class="p">.</span><span class="n">store</span><span class="p">.</span><span
class="n">HBaseStore</span>
</pre></div>
@@ -1155,7 +1167,7 @@ To see the most popular pages, run:</p>
<p>As you can see, the home page (/) for various days and some other pages are
listed.
In total, 3033 rows are present in the metrics table. </p>
-<h3 id="running-the-job-with-hbase">Running the job with HBase</h3>
+<h3 id="running-the-job-with-hbase">Running the job with HBase<a
class="headerlink" href="#running-the-job-with-hbase" title="Permanent
link">¶</a></h3>
<p>Since HBaseStore is already defined as the default data store in
<code>gora.properties</code>
we can run the job with HBase as:</p>
<div class="codehilite"><pre>$ <span class="n">bin</span><span
class="o">/</span><span class="n">gora</span> <span
class="n">loganalytics</span>
@@ -1177,7 +1189,7 @@ we can run the job with HBase as:</p>
</pre></div>
-<h2 id="more-examples">More Examples</h2>
+<h2 id="more-examples">More Examples<a class="headerlink"
href="#more-examples" title="Permanent link">¶</a></h2>
<p>Other than this tutorial, there are several places where you can find
examples of Gora in action.</p>
<p>The first place to look at is the examples directories
@@ -1190,7 +1202,7 @@ at <code>gora-core/src/test/</code>. </p
one of the first class users of Gora; so looking into how Nutch uses Gora is
always a good idea. Gora is, however, also in use
in other Apache projects such as <a href="http://giraph.apache.org">Apache
Giraph</a>.</p>
<p>Please feel free to grab our <a
href="http://gora.apache.org/resources/img/powered-by-gora.png">poweredBy</a>
sticker and embed it in anything backed by Apache Gora.</p>
-<h2 id="feedback">Feedback</h2>
+<h2 id="feedback">Feedback<a class="headerlink" href="#feedback"
title="Permanent link">¶</a></h2>
<p>Finally, thanks for trying out Gora. If you find any bugs or you have
suggestions for improvement,
do not hesitate to give feedback on the [email protected] <a
href="../mailing_lists.html">mailing list</a>.</p>