Author: buildbot
Date: Wed Nov 18 11:49:54 2015
New Revision: 972840
Log:
Staging update by buildbot for crunch
Modified:
websites/staging/crunch/trunk/content/ (props changed)
websites/staging/crunch/trunk/content/about.html
websites/staging/crunch/trunk/content/bylaws.html
websites/staging/crunch/trunk/content/download.html
websites/staging/crunch/trunk/content/future-work.html
websites/staging/crunch/trunk/content/getting-started.html
websites/staging/crunch/trunk/content/index.html
websites/staging/crunch/trunk/content/mailing-lists.html
websites/staging/crunch/trunk/content/pipelines.html
websites/staging/crunch/trunk/content/scrunch.html
websites/staging/crunch/trunk/content/source-repository.html
websites/staging/crunch/trunk/content/user-guide.html
Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 18 11:49:54 2015
@@ -1 +1 @@
-1680960
+1714973
Modified: websites/staging/crunch/trunk/content/about.html
==============================================================================
--- websites/staging/crunch/trunk/content/about.html (original)
+++ websites/staging/crunch/trunk/content/about.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>The initial source code of the Apache Crunch project has been written mostly
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>The initial source code of the Apache Crunch project has been written mostly
by Josh Wills at <a href="http://www.cloudera.com/">Cloudera</a> in 2011, based on
Google's FlumeJava library. The project was open sourced at GitHub soon
afterwards where several releases up to and including 0.2.4 were made.</p>
@@ -154,7 +165,7 @@ entered the <a href="http://incubator.ap
the Incubator and three releases (0.3.0-incubating to 0.5.0-incubating), the
Apache Board of Directors established the Apache Crunch project in February
2013 as a new top level project.</p>
-<h2 id="team">Team</h2>
+<h2 id="team">Team<a class="headerlink" href="#team" title="Permanent link">&para;</a></h2>
<!--
Markdown-generated tables don't have the proper CSS classes,
so we use plain HTML tables.
Modified: websites/staging/crunch/trunk/content/bylaws.html
==============================================================================
--- websites/staging/crunch/trunk/content/bylaws.html (original)
+++ websites/staging/crunch/trunk/content/bylaws.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>This document defines the bylaws under which the Apache Crunch
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>This document defines the bylaws under which the Apache Crunch
project operates. It defines the roles and responsibilities of the
project, who may vote, how voting works, how conflicts are resolved, etc. </p>
<p>Crunch is a project of the
@@ -159,11 +170,11 @@ of principles, known collectively as the
Apache development, please refer to the
<a href="http://incubator.apache.org/">Incubator project</a> for more information
on how Apache projects operate. </p>
-<h2 id="roles-and-responsibilities">Roles and Responsibilities</h2>
+<h2 id="roles-and-responsibilities">Roles and Responsibilities<a class="headerlink" href="#roles-and-responsibilities" title="Permanent link">&para;</a></h2>
<p>Apache projects define a set of roles with associated rights and
responsibilities. These roles govern what tasks an individual may
perform within the project. The roles are defined in the following sections.
</p>
-<h3 id="users">Users</h3>
+<h3 id="users">Users<a class="headerlink" href="#users" title="Permanent link">&para;</a></h3>
<p>The most important participants in the project are people who use our
software. The majority of our contributors start out as users and guide
their development efforts from the user's perspective. </p>
@@ -171,13 +182,13 @@ their development efforts from the user'
contributors in the form of bug reports and feature suggestions. As
well, users participate in the Apache community by helping other users
on mailing lists and user support forums. </p>
-<h3 id="contributors">Contributors</h3>
+<h3 id="contributors">Contributors<a class="headerlink" href="#contributors" title="Permanent link">&para;</a></h3>
<p>All of the volunteers who are contributing time, code, documentation, or
resources to the Crunch project. A contributor that makes sustained,
welcome contributions to the project may be invited to become a
committer, though the exact timing of such invitations depends on many
factors. </p>
-<h3 id="committers">Committers</h3>
+<h3 id="committers">Committers<a class="headerlink" href="#committers" title="Permanent link">&para;</a></h3>
<p>The project's committers are responsible for the project's technical
management. They have access to all of the project's code repositories
and may cast binding votes on any technical discussion regarding the
@@ -200,7 +211,7 @@ more details on the requirements for com
invited to become a member of the PMC. The form of contribution is not
limited to code. It can also include code review, helping out users on
the mailing lists, documentation, etc. </p>
-<h3 id="project-management-committee">Project Management Committee</h3>
+<h3 id="project-management-committee">Project Management Committee<a class="headerlink" href="#project-management-committee" title="Permanent link">&para;</a></h3>
<p>The Project Management Committee (PMC) is responsible to the board and
the ASF for the management and oversight of the Apache Crunch codebase.
The responsibilities of the PMC include: </p>
@@ -232,13 +243,13 @@ Crunch project. </p>
the chair resigns before the end of his or her term, the PMC votes to
recommend a new chair using lazy consensus, but the decision must be ratified
by the Apache board. </p>
-<h2 id="decision-making">Decision Making</h2>
+<h2 id="decision-making">Decision Making<a class="headerlink" href="#decision-making" title="Permanent link">&para;</a></h2>
<p>Within the Apache Crunch project, different types of decisions require
different forms of approval. For example, the previous section describes
several decisions which require "lazy consensus" approval. This section
defines how voting is performed, the types of approvals, and which types
of decision require which type of approval. </p>
-<h3 id="voting">Voting</h3>
+<h3 id="voting">Voting<a class="headerlink" href="#voting" title="Permanent link">&para;</a></h3>
<p>Decisions regarding the project are made by votes on the primary project
development mailing list [email protected]. Where necessary, PMC
voting may take place on the private Crunch PMC mailing list
@@ -294,7 +305,7 @@ codebase. These typically take the form
commit message sent when the commit is made. Note that this should be a
rare occurrence. All efforts should be made to discuss issues when they
are still patches before the code is committed. </p>
-<h3 id="approvals">Approvals</h3>
+<h3 id="approvals">Approvals<a class="headerlink" href="#approvals" title="Permanent link">&para;</a></h3>
<p>These are the types of approvals that can be sought. Different actions
require different types of approvals. </p>
<table class="table">
@@ -339,7 +350,7 @@ require different types of approvals. </
</tbody>
</table>
-<h3 id="vetoes">Vetoes</h3>
+<h3 id="vetoes">Vetoes<a class="headerlink" href="#vetoes" title="Permanent link">&para;</a></h3>
<p>A valid, binding veto cannot be overruled. If a veto is cast, it must
be accompanied by a valid reason explaining the reasons for the
veto. The validity of a veto, if challenged, can be confirmed by
@@ -348,7 +359,7 @@ agreement with the veto - merely that th
<p>If you disagree with a valid veto, you must lobby the person casting the
veto to withdraw his or her veto. If a veto is not withdrawn, the action
that has been vetoed must be reversed in a timely manner. </p>
-<h3 id="actions">Actions</h3>
+<h3 id="actions">Actions<a class="headerlink" href="#actions" title="Permanent link">&para;</a></h3>
<p>This section describes the various actions which are undertaken within
the project, the corresponding approval required for that action and
those who have binding votes over the action. It also specifies the
Modified: websites/staging/crunch/trunk/content/download.html
==============================================================================
--- websites/staging/crunch/trunk/content/download.html (original)
+++ websites/staging/crunch/trunk/content/download.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>The Apache Crunch libraries are distributed under the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License 2.0</a>.</p>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>The Apache Crunch libraries are distributed under the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License 2.0</a>.</p>
<p>The link in the Download column takes you to a list of mirrors based on
your location. Checksum and signature are located on Apache's main
distribution site.</p>
Modified: websites/staging/crunch/trunk/content/future-work.html
==============================================================================
--- websites/staging/crunch/trunk/content/future-work.html (original)
+++ websites/staging/crunch/trunk/content/future-work.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>This section contains an almost certainly incomplete list of known limitations and plans for future work.</p>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>This section contains an almost certainly incomplete list of known limitations and plans for future work.</p>
<ul>
<li>We would like to have easy support for reading and writing data from/to
the Hive metastore via the HCatalog
APIs.</li>
Modified: websites/staging/crunch/trunk/content/getting-started.html
==============================================================================
--- websites/staging/crunch/trunk/content/getting-started.html (original)
+++ websites/staging/crunch/trunk/content/getting-started.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,11 +145,22 @@
</h1>
- <p><em>Getting Started</em> will guide you through the process of creating a simple Crunch pipeline to count
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><em>Getting Started</em> will guide you through the process of creating a simple Crunch pipeline to count
the words in a text document, which is the Hello World of distributed
computing. Along the way,
we'll explain the core Crunch concepts and how to use them to create effective
and efficient data
pipelines.</p>
-<h1 id="overview">Overview</h1>
+<h1 id="overview">Overview<a class="headerlink" href="#overview" title="Permanent link">&para;</a></h1>
<p>The Apache Crunch project develops and supports Java APIs that simplify the
process of creating data pipelines on top of Apache Hadoop. The
Crunch APIs are modeled after <a
href="http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf">FlumeJava
(PDF)</a>, which is the library that
Google uses for building data pipelines on top of their own implementation of
MapReduce.</p>
@@ -172,7 +183,7 @@ they represent their data, which makes C
<a
href="http://thunderheadxpler.blogspot.com/2013/05/creating-spatial-crunch-pipelines.html">geospatial</a>
and
<a
href="http://blog.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/">time
series</a> data, and data stored in <a href="http://hbase.apache.org">Apache
HBase</a> tables.</li>
</ol>
-<h1 id="which-version-of-crunch-do-i-need">Which Version of Crunch Do I Need?</h1>
+<h1 id="which-version-of-crunch-do-i-need">Which Version of Crunch Do I Need?<a class="headerlink" href="#which-version-of-crunch-do-i-need" title="Permanent link">&para;</a></h1>
<p>The core libraries are primarily developed against Hadoop 1.1.2, and are
also tested against Hadoop 2.2.0.
They should work with any version of Hadoop 1.x after 1.0.3 and any version of
Hadoop 2.x after 2.0.0-alpha,
although you should note that some of Hadoop 2.x's dependencies changed
between 2.0.4-alpha and 2.2.0 (for example,
@@ -200,7 +211,7 @@ prior versions of crunch-hbase were deve
</tr>
</table>
-<h2 id="maven-dependencies">Maven Dependencies</h2>
+<h2 id="maven-dependencies">Maven Dependencies<a class="headerlink" href="#maven-dependencies" title="Permanent link">&para;</a></h2>
<p>The Crunch project provides Maven artifacts on Maven Central of the
form:</p>
<pre>
<dependency>
@@ -221,7 +232,7 @@ pipelines. Depending on your use case, y
<li><code>crunch-examples</code>: Example MapReduce and HBase pipelines</li>
<li><code>crunch-archetype</code>: A Maven archetype for creating new Crunch
pipeline projects</li>
</ul>
-<h2 id="building-from-source">Building From Source</h2>
+<h2 id="building-from-source">Building From Source<a class="headerlink" href="#building-from-source" title="Permanent link">&para;</a></h2>
<p>You can download the most recently released Crunch libraries from the <a
href="download.html">Download</a> page or from the Maven
Central Repository.</p>
<p>If you prefer, you can also build the Crunch libraries from the source code
using Maven and install
@@ -241,7 +252,7 @@ it in your local repository:</p>
AverageBytesByIP and TotalBytesByIP take as input a file in the Common Log
Format (an example is provided in
<code>crunch-examples/src/main/resources/access_logs.tar.gz</code>.) The
WordAggregationHBase requires an Apache HBase cluster to be
available, but creates tables and loads sample data as part of its run.</p>
-<h1 id="your-first-crunch-pipeline">Your First Crunch Pipeline</h1>
+<h1 id="your-first-crunch-pipeline">Your First Crunch Pipeline<a class="headerlink" href="#your-first-crunch-pipeline" title="Permanent link">&para;</a></h1>
<p>There are a couple of ways to get started with Crunch. If you use Git, you
can
clone this project which contains an <a
href="http://github.com/jwills/crunch-demo">example Crunch pipeline</a>:</p>
<pre>
@@ -318,7 +329,7 @@ files, while <code><out></code> is
Java applications or from unit tests. All required dependencies are on Maven's
classpath so you can run the <code>WordCount</code> class directly without any
additional
setup.</p>
-<h2 id="walking-through-the-wordcount-example">Walking Through The WordCount Example</h2>
+<h2 id="walking-through-the-wordcount-example">Walking Through The WordCount Example<a class="headerlink" href="#walking-through-the-wordcount-example" title="Permanent link">&para;</a></h2>
<p>Let's walk through the <code>run</code> method of the
<code>WordCount</code> example line by line and explain the
data processing concepts we encounter.</p>
<p>Our WordCount application starts out with a <code>main</code> method that
should be familiar to most
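
A minimal sketch of the WordCount pipeline that the getting-started page above walks
through, assuming the MRPipeline-based Java API the page links to; the input and
output paths come from the command line and are placeholders:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        // Plan and run the pipeline as one or more MapReduce jobs.
        Pipeline pipeline = new MRPipeline(WordCount.class);
        // Read the input text file into a PCollection of lines.
        PCollection<String> lines = pipeline.readTextFile(args[0]);
        // Split each line into words.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());
        // Count each distinct word and write the counts out as text.
        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, args[1]);
        pipeline.done();
      }
    }
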
Modified: websites/staging/crunch/trunk/content/index.html
==============================================================================
--- websites/staging/crunch/trunk/content/index.html (original)
+++ websites/staging/crunch/trunk/content/index.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -147,7 +147,18 @@
</h1>
- <hr />
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<hr />
<blockquote>
<p>The <em>Apache Crunch</em> Java library provides a framework for writing,
testing,
and running MapReduce pipelines. Its goal is to make pipelines that are
Modified: websites/staging/crunch/trunk/content/mailing-lists.html
==============================================================================
--- websites/staging/crunch/trunk/content/mailing-lists.html (original)
+++ websites/staging/crunch/trunk/content/mailing-lists.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <!--
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<!--
Markdown-generated tables don't have the proper CSS classes,
so we use plain HTML tables.
-->
Modified: websites/staging/crunch/trunk/content/pipelines.html
==============================================================================
--- websites/staging/crunch/trunk/content/pipelines.html (original)
+++ websites/staging/crunch/trunk/content/pipelines.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,11 +145,22 @@
</h1>
- <p>This section discusses the different steps of creating your own Crunch pipelines in more detail.</p>
-<h2 id="writing-a-dofn">Writing a DoFn</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>This section discusses the different steps of creating your own Crunch pipelines in more detail.</p>
+<h2 id="writing-a-dofn">Writing a DoFn<a class="headerlink" href="#writing-a-dofn" title="Permanent link">&para;</a></h2>
<p>The DoFn class is designed to keep the complexity of the MapReduce APIs out
of your way when you
don't need them while still keeping them accessible when you do.</p>
-<h3 id="serialization">Serialization</h3>
+<h3 id="serialization">Serialization<a class="headerlink" href="#serialization" title="Permanent link">&para;</a></h3>
<p>First, all DoFn instances are required to be
<code>java.io.Serializable</code>. This is a key aspect of the library's design:
once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce
job, all of the state
of that DoFn is serialized so that it may be distributed to all of the nodes
in the Hadoop cluster that
@@ -163,15 +174,15 @@ will be running that task. There are two
such as creating a non-serializable member variable, can be performed before
processing begins. Similarly, all
DoFn instances have a <code>cleanup</code> method that may be called after
processing has finished to perform any required
cleanup tasks.</p>
-<h3 id="scale-factor">Scale Factor</h3>
+<h3 id="scale-factor">Scale Factor<a class="headerlink" href="#scale-factor" title="Permanent link">&para;</a></h3>
<p>The DoFn class defines a <code>scaleFactor</code> method that can be used
to signal to the MapReduce compiler that a particular
DoFn implementation will yield an output PCollection that is larger
(scaleFactor > 1) or smaller (0 < scaleFactor < 1)
than the input PCollection it is applied to. The compiler may use this
information to determine how to optimally
split processing tasks between the Map and Reduce phases of dependent
MapReduce jobs.</p>
-<h3 id="other-utilities">Other Utilities</h3>
+<h3 id="other-utilities">Other Utilities<a class="headerlink" href="#other-utilities" title="Permanent link">&para;</a></h3>
<p>The DoFn base class provides convenience methods for accessing the
<code>Configuration</code> and <code>Counter</code> objects that
are associated with a MapReduce stage, so that they may be accessed during
initialization, processing, and cleanup.</p>
-<h3 id="performing-cogroups-and-joins">Performing Cogroups and Joins</h3>
+<h3 id="performing-cogroups-and-joins">Performing Cogroups and Joins<a class="headerlink" href="#performing-cogroups-and-joins" title="Permanent link">&para;</a></h3>
<p>Cogroups and joins are performed on PTable instances that have the same key
type. This section walks through
the basic flow of a cogroup operation, explaining how this higher-level
operation is composed of the four primitive operations.
In general, these common operations are provided as part of the core library
or in extensions, you do not need
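
To illustrate the DoFn points quoted above (serializability, the initialize/cleanup
lifecycle, and the scaleFactor hint), here is a hypothetical DoFn sketch; the class
name, regular expression, and scale factor value are invented for the example:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    // Hypothetical DoFn that extracts the host name from URL strings.
    public class ExtractHost extends DoFn<String, String> {
      // Per-task state built in initialize(), which runs once before any
      // calls to process(), so it need not be serialized with the DoFn.
      private transient Pattern hostPattern;

      @Override
      public void initialize() {
        hostPattern = Pattern.compile("https?://([^/]+).*");
      }

      @Override
      public void process(String url, Emitter<String> emitter) {
        Matcher m = hostPattern.matcher(url);
        if (m.matches()) {
          emitter.emit(m.group(1));
        }
      }

      @Override
      public float scaleFactor() {
        // Hint to the planner that the output is smaller than the input.
        return 0.5f;
      }
    }
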
Modified: websites/staging/crunch/trunk/content/scrunch.html
==============================================================================
--- websites/staging/crunch/trunk/content/scrunch.html (original)
+++ websites/staging/crunch/trunk/content/scrunch.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -147,11 +147,22 @@
</h1>
- <h2 id="introduction">Introduction</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction" title="Permanent link">&para;</a></h2>
<p>Scrunch is an experimental Scala wrapper for the Apache Crunch Java API,
based on the same ideas as the
<a href="http://days2011.scala-lang.org/node/138/282">Cascade</a> project at
Google, which created a Scala wrapper for
FlumeJava.</p>
-<h2 id="why-scala">Why Scala?</h2>
+<h2 id="why-scala">Why Scala?<a class="headerlink" href="#why-scala" title="Permanent link">&para;</a></h2>
<p>In many ways, Scala is the perfect language for writing MapReduce
pipelines. Scala supports
a mixture of functional and object-oriented programming styles and has
powerful type-inference
capabilities, allowing us to create complex pipelines using very few
keystrokes. Here is
@@ -189,7 +200,7 @@ the second:</p>
</pre></div>
-<h2 id="materializing-job-outputs">Materializing Job Outputs</h2>
+<h2 id="materializing-job-outputs">Materializing Job Outputs<a class="headerlink" href="#materializing-job-outputs" title="Permanent link">&para;</a></h2>
<p>The Scrunch API also incorporates the Java library's
<code>materialize</code> functionality, which allows us to easily read
the output of a MapReduce pipeline into the client:</p>
<div class="codehilite"><pre><span class="n">class</span> <span
class="n">WordCountExample</span> <span class="p">{</span>
@@ -198,7 +209,7 @@ the output of a MapReduce pipeline into
</pre></div>
-<h2 id="notes-and-thanks">Notes and Thanks</h2>
+<h2 id="notes-and-thanks">Notes and Thanks<a class="headerlink" href="#notes-and-thanks" title="Permanent link">&para;</a></h2>
<p>Scrunch emerged out of conversations with <a
href="http://twitter.com/#!/squarecog">Dmitriy Ryaboy</a>,
<a href="http://twitter.com/#!/posco">Oscar Boykin</a>, and <a
href="http://twitter.com/#!/avibryant">Avi Bryant</a> from Twitter.
Many thanks to them for their feedback, guidance, and encouragement. We are
also grateful to
Modified: websites/staging/crunch/trunk/content/source-repository.html
==============================================================================
--- websites/staging/crunch/trunk/content/source-repository.html (original)
+++ websites/staging/crunch/trunk/content/source-repository.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>The Apache Crunch Project uses <a href="http://git-scm.com/">Git</a> for version control. Run the
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>The Apache Crunch Project uses <a href="http://git-scm.com/">Git</a> for version control. Run the
following command to clone the repository:</p>
<div class="codehilite"><pre><span class="n">git</span> <span
class="n">clone</span> <span class="n">https</span><span
class="p">:</span><span class="o">//</span><span class="n">git</span><span
class="o">-</span><span class="n">wip</span><span class="o">-</span><span
class="n">us</span><span class="p">.</span><span class="n">apache</span><span
class="p">.</span><span class="n">org</span><span class="o">/</span><span
class="n">repos</span><span class="o">/</span><span class="n">asf</span><span
class="o">/</span><span class="n">crunch</span><span class="p">.</span><span
class="n">git</span>
</pre></div>
Modified: websites/staging/crunch/trunk/content/user-guide.html
==============================================================================
--- websites/staging/crunch/trunk/content/user-guide.html (original)
+++ websites/staging/crunch/trunk/content/user-guide.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <ol>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<ol>
<li><a href="#intro">Introduction to Crunch</a><ol>
<li><a href="#motivation">Motivation</a></li>
<li><a href="#datamodel">Data Model and Operators</a></li>
@@ -212,9 +223,9 @@
<li><a href="#testing">Unit Testing Pipelines</a></li>
</ol>
<p><a name="intro"></a></p>
-<h2 id="introduction-to-crunch">Introduction to Crunch</h2>
+<h2 id="introduction-to-crunch">Introduction to Crunch<a class="headerlink" href="#introduction-to-crunch" title="Permanent link">&para;</a></h2>
<p><a name="motivation"></a></p>
-<h3 id="motivation">Motivation</h3>
+<h3 id="motivation">Motivation<a class="headerlink" href="#motivation" title="Permanent link">&para;</a></h3>
<p>Let's start with a basic question: why should you use <em>any</em>
high-level tool for writing data pipelines, as opposed to developing against
the MapReduce, Spark, or Tez APIs directly? Doesn't adding another layer of
abstraction just increase the number of moving pieces you
need to worry about, ala the <a
href="http://www.joelonsoftware.com/articles/LeakyAbstractions.html">Law of
Leaky Abstractions</a>?</p>
@@ -305,7 +316,7 @@ top of Apache Hadoop:</p>
<p>In the next section, we'll give a quick overview of Crunch's version of
these abstractions and how they relate to each other before going
into more detail about their usage in the rest of the guide.</p>
<p><a name="datamodel"></a></p>
-<h3 id="data-model-and-operators">Data Model and Operators</h3>
+<h3 id="data-model-and-operators">Data Model and Operators<a class="headerlink" href="#data-model-and-operators" title="Permanent link">&para;</a></h3>
<p>Crunch's Java API is centered around three interfaces that represent
distributed datasets: <a
href="apidocs/0.10.0/org/apache/crunch/PCollection.html">PCollection<T></a>,
<a
href="http://crunch.apache.org/apidocs/0.10.0/org/apache/crunch/PTable.html">PTable<K,
V></a>, and <a
href="apidocs/0.10.0/org/apache/crunch/PGroupedTable.html">PGroupedTable<K,
V></a>.</p>
<p>A <code>PCollection<T></code> represents a distributed, immutable
collection of elements of type T. For example, we represent a text file as a
@@ -336,12 +347,12 @@ that are available for developers to use
<li><a
href="apidocs/0.10.0/org/apache/crunch/impl/spark/SparkPipeline.html">SparkPipeline</a>:
Executes the pipeline by converting it to a series of Spark pipelines.</li>
</ol>
<p><a name="dataproc"></a></p>
-<h2 id="data-processing-with-dofns">Data Processing with DoFns</h2>
+<h2 id="data-processing-with-dofns">Data Processing with DoFns<a class="headerlink" href="#data-processing-with-dofns" title="Permanent link">&para;</a></h2>
<p>DoFns represent the logical computations of your Crunch pipelines. They are
designed to be easy to write, easy to test, and easy to deploy
within the context of a MapReduce job. Much of your work with the Crunch APIs
will be writing DoFns, and so having a good understanding of
how to use them effectively is critical to crafting elegant and efficient
pipelines.</p>
<p><a name="dovsmap"></a></p>
-<h3 id="dofns-vs-mapper-and-reducer-classes">DoFns vs. Mapper and Reducer Classes</h3>
+<h3 id="dofns-vs-mapper-and-reducer-classes">DoFns vs. Mapper and Reducer Classes<a class="headerlink" href="#dofns-vs-mapper-and-reducer-classes" title="Permanent link">&para;</a></h3>
<p>Let's see how DoFns compare to the Mapper and Reducer classes that you're
used to writing when working with Hadoop's MapReduce API. When
you're creating a MapReduce job, you start by declaring an instance of the
<code>Job</code> class and using its methods to declare the implementations
of the <code>Mapper</code> and <code>Reducer</code> classes that you want to
use:</p>
@@ -431,7 +442,7 @@ of <code>static</code> methods on the cl
regardless of whether the outer class is serializable or not. Using static
methods to define your business logic in terms of a series of
DoFns can also make your code easier to test by using in-memory PCollection
implementations in your unit tests.</p>
<p><a name="runproc"></a></p>
-<h3 id="runtime-processing-steps">Runtime Processing Steps</h3>
+<h3 id="runtime-processing-steps">Runtime Processing Steps<a class="headerlink" href="#runtime-processing-steps" title="Permanent link">&para;</a></h3>
<p>After the Crunch runtime loads the serialized DoFns into its map and reduce
tasks, the DoFns are executed on the input data via the following
sequence:</p>
<ol>
@@ -450,7 +461,7 @@ be used to emit the sum of a list of num
other cleanup task that is appropriate once the task has finished
executing.</li>
</ol>
<p><a name="mrapis"></a></p>
-<h3 id="accessing-runtime-mapreduce-apis">Accessing Runtime MapReduce APIs</h3>
+<h3 id="accessing-runtime-mapreduce-apis">Accessing Runtime MapReduce APIs<a class="headerlink" href="#accessing-runtime-mapreduce-apis" title="Permanent link">&para;</a></h3>
<p>DoFns provide direct access to the <code>TaskInputOutputContext</code>
object that is used within a given Map or Reduce task via the
<code>getContext</code>
method. There are also a number of helper methods for working with the objects
associated with the TaskInputOutputContext, including:</p>
<ul>
@@ -475,7 +486,7 @@ objects returned by Crunch at the end of
Counter classes directly in your Crunch pipelines (the two
<code>getCounter</code> methods that were defined in DoFn are both deprecated)
so that you will not be
required to recompile your job jars when you move from a Hadoop 1.0 cluster to
a Hadoop 2.0 cluster.)</p>
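
A small sketch of how a DoFn might use the runtime helper methods described above
(increment for counters, progress for liveness); the counter group and name are
arbitrary examples rather than anything defined by the library:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    // Illustrative DoFn that counts malformed records and reports progress.
    public class DropMalformedRecords extends DoFn<String, String> {
      @Override
      public void process(String record, Emitter<String> emitter) {
        if (record == null || record.isEmpty()) {
          increment("quality", "malformed");   // shows up in the job's counters
          return;
        }
        progress();                            // signal that the task is alive
        emitter.emit(record);
      }
    }
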
<p><a name="doplan"></a></p>
-<h3 id="configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns">Configuring the Crunch Planner and MapReduce Jobs with DoFns</h3>
+<h3 id="configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns">Configuring the Crunch Planner and MapReduce Jobs with DoFns<a class="headerlink" href="#configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns" title="Permanent link">&para;</a></h3>
<p>Although most of the DoFn methods are focused on runtime execution, there
are a handful of methods that are used during the planning phase
before a pipeline is converted into MapReduce jobs. The first of these
functions is <code>float scaleFactor()</code>, which should return a floating
point
value greater than 0.0f. You can override the scaleFactor method in your
custom DoFns in order to provide a hint to the Crunch planner about
@@ -488,7 +499,7 @@ on the client before processing begins b
will require extra memory settings to run, and so you could make sure that the
value of the <code>mapred.child.java.opts</code> argument had a large enough
memory setting for the DoFn's needs before the job was launched on the
cluster.</p>
<p><a name="mapfn"></a></p>
-<h3 id="common-dofn-patterns">Common DoFn Patterns</h3>
+<h3 id="common-dofn-patterns">Common DoFn Patterns<a class="headerlink" href="#common-dofn-patterns" title="Permanent link">&para;</a></h3>
<p>The Crunch APIs contain a number of useful subclasses of DoFn that handle
common data processing scenarios and are easier
to write and test. The top-level <a
href="apidocs/0.10.0/org/apache/crunch/package-summary.html">org.apache.crunch</a>
package contains three
of the most important specializations, which we will discuss now. Each of
these specialized DoFn implementations has associated methods
@@ -519,7 +530,7 @@ interface, which is defined right alongs
interface defined via static factory methods in the <a
href="apidocs/0.10.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class. We will discuss
Aggregators more in the section on <a href="#aggregators">common MapReduce
patterns</a>.</p>
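
A short sketch of two of the specialized DoFn subclasses mentioned above, assuming the
MapFn and FilterFn signatures linked from this page; the class names are illustrative:

    import org.apache.crunch.FilterFn;
    import org.apache.crunch.MapFn;

    // MapFn: exactly one output per input, so only map() needs to be implemented.
    class ToLowerCaseFn extends MapFn<String, String> {
      @Override
      public String map(String input) {
        return input.toLowerCase();
      }
    }

    // FilterFn: keep or drop each input based on a boolean test.
    class NonEmptyFn extends FilterFn<String> {
      @Override
      public boolean accept(String input) {
        return input != null && !input.isEmpty();
      }
    }
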
<p><a name="serde"></a></p>
-<h2 id="serializing-data-with-ptypes">Serializing Data with PTypes</h2>
+<h2 id="serializing-data-with-ptypes">Serializing Data with PTypes<a class="headerlink" href="#serializing-data-with-ptypes" title="Permanent link">&para;</a></h2>
<p>Every <code>PCollection<T></code> has an associated
<code>PType<T></code> that encapsulates the information on how to
serialize and deserialize the contents of that
PCollection. PTypes are necessary because of <a
href="http://docs.oracle.com/javase/tutorial/java/generics/erasure.html">type
erasure</a>; at runtime, when
the Crunch planner is mapping from PCollections to a series of MapReduce jobs,
the type of a PCollection (that is, the <code>T</code> in
<code>PCollection<T></code>)
@@ -548,7 +559,7 @@ to mix-and-match PCollections that use d
read in Writable data, do a shuffle using Avro, and then write the output data
as Writables), but each PCollection's PType must belong to a single
type family; for example, you cannot have a PTable whose key is serialized as
a Writable and whose value is serialized as an Avro record.</p>
<p><a name="corept"></a></p>
-<h3 id="core-ptypes">Core PTypes</h3>
+<h3 id="core-ptypes">Core PTypes<a class="headerlink" href="#core-ptypes" title="Permanent link">&para;</a></h3>
<p>Both type families support a common set of primitive types (strings, longs,
ints, floats, doubles, booleans, and bytes) as well as more complex
PTypes that can be constructed out of other PTypes:</p>
<ol>
@@ -673,7 +684,7 @@ for POJOs using Avro's reflection-based
and easy to test, but the fact that the data is written out as Avro records
means that you can use tools like Hive and Pig
to query intermediate results to aid in debugging pipeline failures.</p>
<p><a name="extendpt"></a></p>
-<h3 id="extending-ptypes">Extending PTypes</h3>
+<h3 id="extending-ptypes">Extending PTypes<a class="headerlink" href="#extending-ptypes" title="Permanent link">&para;</a></h3>
<p>The simplest way to create a new <code>PType<T></code> for a data
object is to create a <em>derived</em> PType from one of the built-in PTypes
from the Avro
and Writable type families. If we have a base <code>PType<S></code>, we
can create a derived <code>PType<T></code> by implementing an input
<code>MapFn<S, T></code> and an
output <code>MapFn<T, S></code> and then calling
<code>PTypeFamily.derived(Class<T>, MapFn<S, T> in, MapFn<T,
S> out, PType<S> base)</code>, which will return
@@ -695,7 +706,7 @@ easy to work with the POJO directly in y
</pre>
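
Following the derived-PType signature quoted above, a sketch of a derived PType that
stores java.util.UUID values as strings in the Writable type family; the helper class
is hypothetical:

    import java.util.UUID;

    import org.apache.crunch.MapFn;
    import org.apache.crunch.types.PType;
    import org.apache.crunch.types.writable.Writables;

    public class UuidTypes {
      public static PType<UUID> uuids() {
        return Writables.strings().getFamily().derived(
            UUID.class,
            new MapFn<String, UUID>() {   // input mapping: stored string -> UUID
              @Override
              public UUID map(String s) { return UUID.fromString(s); }
            },
            new MapFn<UUID, String>() {   // output mapping: UUID -> stored string
              @Override
              public String map(UUID u) { return u.toString(); }
            },
            Writables.strings());
      }
    }
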
<p><a name="rwdata"></a></p>
-<h2 id="reading-and-writing-data">Reading and Writing Data</h2>
+<h2 id="reading-and-writing-data">Reading and Writing Data<a class="headerlink" href="#reading-and-writing-data" title="Permanent link">&para;</a></h2>
<p>In the introduction to this user guide, we noted that all of the major
tools for working with data pipelines on Hadoop include some sort of abstraction
for working with the <code>InputFormat<K, V></code> and
<code>OutputFormat<K, V></code> classes defined in the MapReduce APIs.
For example, Hive includes
SerDes, and Pig requires LoadFuncs and StoreFuncs. Let's take a moment to
explain what functionality these abstractions provide for
@@ -719,13 +730,13 @@ to wrap an InputFormat and its associate
job, even if those Sources have the same InputFormat. On the output side, the
<code>Target</code> interface can be used in the same way to wrap a
Hadoop <code>OutputFormat</code> and its associated key-value pairs in a way
that can be isolated from any other outputs of a pipeline stage.</p>
<p><a name="notethis"></a></p>
-<h3 id="a-note-on-sources-targets-and-hadoop-apis">A Note on Sources, Targets, and Hadoop APIs</h3>
+<h3 id="a-note-on-sources-targets-and-hadoop-apis">A Note on Sources, Targets, and Hadoop APIs<a class="headerlink" href="#a-note-on-sources-targets-and-hadoop-apis" title="Permanent link">&para;</a></h3>
<p>Crunch, like Hive and Pig, is developed against the <a
href="http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/package-summary.html">org.apache.hadoop.mapreduce</a>
API, not the older <a
href="http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/package-summary.html">org.apache.hadoop.mapred</a>
API.
This means that Crunch Sources and Targets expect subclasses of the new <a
href="http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/InputFormat.html">InputFormat</a>
and <a
href="http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/OutputFormat.html">OutputFormat</a>
classes. These new
classes are not 1:1 compatible with the InputFormat and OutputFormat classes
associated with the <code>org.apache.hadoop.mapred</code> APIs, so please be
aware of this difference when considering using existing InputFormats and
OutputFormats with Crunch's Sources and Targets.</p>
<p><a name="sources"></a></p>
-<h3 id="sources">Sources</h3>
+<h3 id="sources">Sources<a class="headerlink" href="#sources" title="Permanent link">&para;</a></h3>
<p>Crunch defines both <code>Source<T></code> and
<code>TableSource<K, V></code> interfaces that allow us to read an input
as a <code>PCollection<T></code> or a <code>PTable<K, V></code>.
You use a Source in conjunction with one of the <code>read</code> methods on
the Pipeline interface:</p>
<pre>
@@ -801,7 +812,7 @@ different files using the NLineInputForm
</table>
<p><a name="targets"></a></p>
-<h3 id="targets">Targets</h3>
+<h3 id="targets">Targets<a class="headerlink" href="#targets" title="Permanent link">&para;</a></h3>
<p>Crunch's <code>Target</code> interface is the analogue of
<code>Source<T></code> for OutputFormats. You create Targets for use with
the <code>write</code> method
defined on the <code>Pipeline</code> interface:</p>
<pre>
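
A brief sketch of the read/write pattern described in the Sources and Targets sections
above, assuming the From and To factory classes mentioned later on the page; the paths
are placeholders:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.io.To;

    public class ReadWriteExample {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(ReadWriteExample.class);
        // Read a text file through a Source created by the From factory class.
        PCollection<String> lines = pipeline.read(From.textFile("/path/to/input"));
        // Write the collection through a Target created by the To factory class.
        pipeline.write(lines, To.textFile("/path/to/output"));
        pipeline.done();
      }
    }
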
@@ -873,7 +884,7 @@ parameters that this Target needs:</p>
</table>
<p><a name="srctargets"></a></p>
-<h3 id="sourcetargets-and-write-modes">SourceTargets and Write Modes</h3>
+<h3 id="sourcetargets-and-write-modes">SourceTargets and Write Modes<a class="headerlink" href="#sourcetargets-and-write-modes" title="Permanent link">&para;</a></h3>
<p>The <code>SourceTarget<T></code> interface extends both the
<code>Source<T></code> and <code>Target</code> interfaces and allows a
Path to act as both a
Target for some PCollections as well as a Source for others. SourceTargets are
convenient for any intermediate outputs within
your pipeline. Just as we have the factory methods in the From and To classes
for Sources and Targets, factory methods for
@@ -904,7 +915,7 @@ WriteModes for Crunch:</p>
</pre>
<p><a name="materialize"></a></p>
-<h3 id="materializing-data-into-the-client">Materializing Data Into the Client</h3>
+<h3 id="materializing-data-into-the-client">Materializing Data Into the Client<a class="headerlink" href="#materializing-data-into-the-client" title="Permanent link">&para;</a></h3>
<p>In many analytical applications, we need to use the output of one phase of
a data pipeline in order to configure subsequent pipeline
stages. For example, many machine learning applications require that we
iterate over a dataset until some convergence criteria is
met. Crunch provides API methods that make it possible to materialize the data
from a PCollection and stream the resulting data into
@@ -930,12 +941,12 @@ interface that has an associated <code>V
of elements contained in that PCollection, but the pipeline tasks required to
compute this value will not run until the <code>Long getValue()</code>
method of the returned PObject is called.</p>
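
A small sketch of the materialize and PObject behavior described above; the pipeline
and path are placeholders, and the jobs only run when a value is actually requested:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.PObject;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;

    public class MaterializeExample {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(MaterializeExample.class);
        PCollection<String> lines = pipeline.readTextFile("/path/to/input");

        // Nothing runs yet: length() just returns a lazy PObject handle.
        PObject<Long> count = lines.length();

        // Calling getValue(), like iterating over materialize(), forces the
        // jobs needed to compute the result to run.
        System.out.println("Lines: " + count.getValue());
        for (String line : lines.materialize()) {
          System.out.println(line);
        }
        pipeline.done();
      }
    }
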
<p><a name="patterns"></a></p>
-<h2 id="data-processing-patterns-in-crunch">Data Processing Patterns in Crunch</h2>
+<h2 id="data-processing-patterns-in-crunch">Data Processing Patterns in Crunch<a class="headerlink" href="#data-processing-patterns-in-crunch" title="Permanent link">&para;</a></h2>
<p>This section describes the various data processing patterns implemented in
Crunch's library APIs,
which are in the <a
href="apidocs/0.10.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
package.</p>
<p><a name="gbk"></a></p>
-<h3 id="groupbykey">groupByKey</h3>
+<h3 id="groupbykey">groupByKey<a class="headerlink" href="#groupbykey" title="Permanent link">&para;</a></h3>
<p>Most of the data processing patterns described in this section rely on
PTable's groupByKey method,
which controls how data is shuffled and aggregated by the underlying execution
engine. The groupByKey
method has three flavors on the PTable interface:</p>
@@ -973,7 +984,7 @@ same classes may also be used with other
options specified that will only be applied to the job that actually executes
that phase of the data
pipeline.</p>
<p><a name="aggregators"></a></p>
-<h3 id="combinevalues">combineValues</h3>
+<h3 id="combinevalues">combineValues<a class="headerlink" href="#combinevalues" title="Permanent link">&para;</a></h3>
<p>Calling one of the groupByKey methods on PTable returns an instance of the
PGroupedTable interface.
PGroupedTable provides a <code>combineValues</code> that can be used to signal
to the planner that we want to perform
associative aggregations on our data both before and after the shuffle.</p>
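
A sketch of the groupByKey/combineValues flow described above, using one of the
prebuilt Aggregators; the PTable of partial counts is an assumed input:

    import org.apache.crunch.PTable;
    import org.apache.crunch.fn.Aggregators;

    public class CombineExample {
      // Sum per-key counts; because the aggregation is declared via combineValues,
      // the planner may also apply it map-side as a combiner before the shuffle.
      public static PTable<String, Long> sumCounts(PTable<String, Long> partialCounts) {
        return partialCounts
            .groupByKey()
            .combineValues(Aggregators.SUM_LONGS());
      }
    }
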
@@ -1018,7 +1029,7 @@ the average of a set of values:</p>
</pre>
<p><a name="simpleagg"></a></p>
-<h3 id="simple-aggregations">Simple Aggregations</h3>
+<h3 id="simple-aggregations">Simple Aggregations<a class="headerlink" href="#simple-aggregations" title="Permanent link">&para;</a></h3>
<p>Many of the most common aggregation patterns in Crunch are provided as
methods on the PCollection
interface, including <code>count</code>, <code>max</code>, <code>min</code>,
and <code>length</code>. The implementations of these methods,
however, are in the <a
href="apidocs/0.10.0/org/apache/crunch/lib/Aggregate.html">Aggregate</a>
library class.
@@ -1040,7 +1051,7 @@ most frequently occuring elements, you w
</pre>
<p><a name="joins"></a></p>
-<h3 id="joining-data">Joining Data</h3>
+<h3 id="joining-data">Joining Data<a class="headerlink" href="#joining-data" title="Permanent link">&para;</a></h3>
<p>Joins in Crunch are based on equal-valued keys in different PTables. Joins
have also evolved
a great deal in Crunch over the lifetime of the project. The <a
href="apidocs/0.10.0/org/apache/crunch/lib/Join.html">Join</a>
API provides simple methods for performing equijoins, left joins, right joins,
and full joins, but modern
@@ -1064,14 +1075,14 @@ a given key in a PCollection, so joining
surprising results. Using a non-null dummy value in your PCollections is a
good idea in
general.</p>
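
A hedged sketch of the JoinStrategy pattern referenced above, assuming the
DefaultJoinStrategy linked from this page and a JoinType enum for selecting the join
flavor; the key and value types are illustrative:

    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.lib.join.DefaultJoinStrategy;
    import org.apache.crunch.lib.join.JoinStrategy;
    import org.apache.crunch.lib.join.JoinType;

    public class JoinExample {
      // Reduce-side inner join of two PTables that share a String key.
      public static PTable<String, Pair<Long, String>> joinById(
          PTable<String, Long> counts, PTable<String, String> names) {
        JoinStrategy<String, Long, String> strategy =
            new DefaultJoinStrategy<String, Long, String>();
        return strategy.join(counts, names, JoinType.INNER_JOIN);
      }
    }
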
<p><a name="reducejoin"></a></p>
-<h4 id="reduce-side-joins">Reduce-side Joins</h4>
+<h4 id="reduce-side-joins">Reduce-side Joins<a class="headerlink" href="#reduce-side-joins" title="Permanent link">&para;</a></h4>
<p>Reduce-side joins are handled by the <a
href="apidocs/0.10.0/org/apache/crunch/lib/join/DefaultJoinStrategy.html">DefaultJoinStrategy</a>.
Reduce-side joins are the simplest and most robust kind of joins in Hadoop;
the keys from the two inputs are
shuffled together to the reducers, where the values from the smaller of the
two collections are collected and then
streamed over the values from the larger of the two collections. You can
control the number of reducers that is used
to perform the join by passing an integer argument to the DefaultJoinStrategy
constructor.</p>
<p><a name="mapjoin"></a></p>
-<h4 id="map-side-joins">Map-side Joins</h4>
+<h4 id="map-side-joins">Map-side Joins<a class="headerlink" href="#map-side-joins" title="Permanent link">&para;</a></h4>
<p>Map-side joins are handled by the <a
href="apidocs/0.10.0/org/apache/crunch/lib/join/MapsideJoinStrategy.html">MapsideJoinStrategy</a>.
Map-side joins require that the smaller of the two input tables is loaded into
memory on the tasks on the cluster, so
there is a requirement that at least one of the tables be relatively small so
that it can comfortably fit into memory within
@@ -1084,7 +1095,7 @@ recommend that you use the <code>Mapside
implementation of the MapsideJoinStrategy in which the left-side PTable is
loaded into
memory instead of the right-side PTable.</p>
<p><a name="shardedjoin"></a></p>
-<h4 id="sharded-joins">Sharded Joins</h4>
+<h4 id="sharded-joins">Sharded Joins<a class="headerlink" href="#sharded-joins" title="Permanent link">&para;</a></h4>
<p>Many distributed joins have skewed data that can cause regular reduce-side
joins to fail due to out-of-memory issues on
the partitions that happen to contain the keys with highest cardinality. To
handle these skew issues, Crunch has the
<a
href="apidocs/0.10.0/org/apache/crunch/lib/join/ShardedJoinStrategy.html">ShardedJoinStrategy</a>
that allows developers to shard
@@ -1092,7 +1103,7 @@ each key to multiple reducers, which pre
in exchange for sending more data over the wire. For problems with significant
skew issues, the ShardedJoinStrategy can
significantly improve performance.</p>
<p><a name="bloomjoin"></a></p>
-<h4 id="bloom-filter-joins">Bloom Filter Joins</h4>
+<h4 id="bloom-filter-joins">Bloom Filter Joins<a class="headerlink" href="#bloom-filter-joins" title="Permanent link">&para;</a></h4>
<p>Last but not least, the <a
href="apidocs/0.10.0/org/apache/crunch/lib/join/BloomFilterJoinStrategy.html">BloomFilterJoinStrategy</a>
builds
a <a href="http://en.wikipedia.org/wiki/Bloom_filter">bloom filter</a> on the
left-hand side table that is used to filter the contents
of the right-hand side table to eliminate entries from the (larger) right-hand
side table that have no hope of being joined
@@ -1100,7 +1111,7 @@ to values in the left-hand side table. T
into memory on the tasks of the job, but is still significantly smaller than
the right-hand side table, and we know that the
vast majority of the keys in the right-hand side table will not match the keys
in the left-hand side of the table.</p>
<p><a name="cogroups"></a></p>
-<h4 id="cogroups">Cogroups</h4>
+<h4 id="cogroups">Cogroups<a class="headerlink" href="#cogroups" title="Permanent link">&para;</a></h4>
<p>Some kinds of joins are richer and more complex than the typical kind of
relational join that is handled by JoinStrategy.
For example, we might want to join two datasets
together and only emit a record if each of the sets had at least two distinct
values associated
@@ -1125,12 +1136,12 @@ PTable whose values are made up of Colle
how they work, you can consult the <a
href="http://chimera.labs.oreilly.com/books/1234000001811/ch06.html">section on
cogroups</a>
in the Apache Pig book.</p>
<p><a name="sorting"></a></p>
-<h3 id="sorting">Sorting</h3>
+<h3 id="sorting">Sorting<a class="headerlink" href="#sorting" title="Permanent link">&para;</a></h3>
<p>After joins and cogroups, sorting data is the most common distributed
computing pattern. The
Crunch APIs have a number of utilities for performing fully distributed sorts
as well as
more advanced patterns like secondary sorts.</p>
<p><a name="stdsort"></a></p>
-<h4 id="standard-and-reverse-sorting">Standard and Reverse Sorting</h4>
+<h4 id="standard-and-reverse-sorting">Standard and Reverse Sorting<a class="headerlink" href="#standard-and-reverse-sorting" title="Permanent link">&para;</a></h4>
<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sort.html">Sort</a> API
methods contain utility functions
for sorting the contents of PCollections and PTables whose contents implement
the <code>Comparable</code>
interface. By default, MapReduce does not perform total sorts on its keys
during a shuffle; instead
@@ -1160,7 +1171,7 @@ the <a href="apidocs/0.10.0/org/apache/c
</pre>
<p><a name="secsort"></a></p>
-<h4 id="secondary-sorts">Secondary Sorts</h4>
+<h4 id="secondary-sorts">Secondary Sorts<a class="headerlink" href="#secondary-sorts" title="Permanent link">&para;</a></h4>
<p>Another pattern that occurs frequently in distributed processing is
<em>secondary sorts</em>, where we
want to group a set of records by one key and sort the records within each
group by a second key.
The <a
href="apidocs/0.10.0/org/apache/crunch/lib/SecondarySort.html">SecondarySort</a>
API provides a set
@@ -1169,11 +1180,11 @@ where <code>K</code> is the primary grou
method will perform the grouping and sorting and will then apply a given DoFn
to process the
grouped and sorted values.</p>
<p><a name="otheropts"></a></p>
-<h3 id="other-operations">Other Operations</h3>
+<h3 id="other-operations">Other Operations<a class="headerlink" href="#other-operations" title="Permanent link">&para;</a></h3>
<p>Crunch provides implementations of a number of other common distributed
processing patterns and
techniques throughout its library APIs.</p>
<p><a name="cartesian"></a></p>
-<h4 id="cartesian-products">Cartesian Products</h4>
+<h4 id="cartesian-products">Cartesian Products<a class="headerlink" href="#cartesian-products" title="Permanent link">&para;</a></h4>
<p>Cartesian products between PCollections are a bit tricky in distributed
processing; we usually want
one of the datasets to be small enough to fit into memory, and then do a pass
over the larger data
set where we emit an element of the smaller data set along with each element
from the larger set.</p>
@@ -1183,7 +1194,7 @@ provides methods for a reduce-side full
this is a pretty expensive operation, and you should go out of your way to
avoid these kinds of processing
steps in your pipelines.</p>
<p><a name="shard"></a></p>
-<h4 id="coalescing">Coalescing</h4>
+<h4 id="coalescing">Coalescing<a class="headerlink" href="#coalescing" title="Permanent link">&para;</a></h4>
<p>Many MapReduce jobs have the potential to generate a large number of small
files that could be used more
effectively by clients if they were all merged together into a small number of
large files. The
<a href="apidocs/0.10.0/org/apache/crunch/lib/Shard.html">Shard</a> API
provides a single method, <code>shard</code>, that allows
@@ -1196,7 +1207,7 @@ you to coalesce a given PCollection into
<p>This has the effect of running a no-op MapReduce job that shuffles the data
into the given number of
partitions. This is often a useful step at the end of a long pipeline run.</p>
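
A one-method sketch of the coalescing step described above, assuming the Shard API
linked from this page; the partition count is an arbitrary example:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.lib.Shard;

    public class ShardExample {
      // Rewrites a PCollection into a fixed number of partitions via a
      // shuffle-only job; useful at the end of a pipeline that would otherwise
      // leave behind many small files.
      public static <T> PCollection<T> coalesce(PCollection<T> data) {
        return Shard.shard(data, 10);
      }
    }
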
<p><a name="distinct"></a></p>
-<h4 id="distinct">Distinct</h4>
+<h4 id="distinct">Distinct<a class="headerlink" href="#distinct" title="Permanent link">&para;</a></h4>
<p>Crunch's <a
href="apidocs/0.10.0/org/apache/crunch/lib/Distinct.html">Distinct</a> API has
a method, <code>distinct</code>, that
returns one copy of each unique element in a given PCollection:</p>
<pre>
@@ -1218,7 +1229,7 @@ with another method in Distinct:</p>
value for your own pipelines. The optimal value will depend on some
combination of the size of the objects (and
thus the amount of memory they consume) and the number of unique elements in
the data.</p>
<p><a name="sampling"></a></p>
-<h4 id="sampling">Sampling</h4>
+<h4 id="sampling">Sampling<a class="headerlink" href="#sampling" title="Permanent link">&para;</a></h4>
<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sample.html">Sample</a>
API provides methods for two sorts of PCollection
sampling: random and reservoir.</p>
<p>Random sampling is where you include each record in the sample with a fixed
probability, and is probably what you're
@@ -1244,11 +1255,11 @@ collection! You can read more about how
random number generators. Note that all of the sampling algorithms Crunch
provides, both random and reservoir,
only require a single pass over the data.</p>
<p><a name="sets"></a></p>
-<h4 id="set-operations">Set Operations</h4>
+<h4 id="set-operations">Set Operations<a class="headerlink" href="#set-operations" title="Permanent link">&para;</a></h4>
<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Set.html">Set</a> API
methods complement Crunch's built-in <code>union</code> methods and
provide support for finding the intersection, the difference, or the <a
href="http://en.wikipedia.org/wiki/Comm">comm</a> of two PCollections.</p>
<p><a name="splits"></a></p>
-<h4 id="splits">Splits</h4>
+<h4 id="splits">Splits<a class="headerlink" href="#splits" title="Permanent
link">¶</a></h4>
<p>Sometimes, you want to write two different outputs from the same DoFn into
different PCollections. An example of this would
be a pipeline in which you wanted to write good records to one file and bad or
corrupted records to a different file for
further examination. The <a
href="apidocs/0.10.0/org/apache/crunch/lib/Channels.html">Channels</a> class
provides a method that allows
@@ -1261,7 +1272,7 @@ you to split an input PCollection of Pai
</pre>
<p><a name="objectreuse"></a></p>
-<h3 id="retaining-objects-within-dofns">Retaining objects within DoFns</h3>
+<h3 id="retaining-objects-within-dofns">Retaining objects within DoFns<a
class="headerlink" href="#retaining-objects-within-dofns" title="Permanent
link">¶</a></h3>
<p>For reasons of efficiency, Hadoop MapReduce repeatedly passes the <a
href="https://issues.apache.org/jira/browse/HADOOP-2399">same references as
keys and values to Mappers and Reducers</a> instead of passing in new objects
for each call.
The state of the singleton key and value objects is updated between each call
to <code>Mapper.map()</code> and <code>Reducer.reduce()</code>, as well as
between each
@@ -1316,7 +1327,7 @@ the maximum value encountered would be i
<p><a name="hbase"></a></p>
-<h2 id="crunch-for-hbase">Crunch for HBase</h2>
+<h2 id="crunch-for-hbase">Crunch for HBase<a class="headerlink"
href="#crunch-for-hbase" title="Permanent link">¶</a></h2>
<p>Crunch is an excellent platform for creating pipelines that involve
processing data from HBase tables. Because of Crunch's
flexible schemas for PCollections and PTables, you can write pipelines that
operate directly on HBase API classes like
<code>Put</code>, <code>KeyValue</code>, and <code>Result</code>.</p>
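<p>As a rough sketch only (the table name and column family are invented, and the
source class and constructor shown here are assumptions that may differ across Crunch
and HBase versions), reading an HBase table into a PTable of <code>Result</code>
objects looks something like this:</p>
<pre>
import org.apache.crunch.PTable;
import org.apache.crunch.io.hbase.HBaseSourceTarget;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf"));
// Assumed constructor: HBaseSourceTarget(String tableName, Scan scan).
PTable&lt;ImmutableBytesWritable, Result&gt; rows =
    pipeline.read(new HBaseSourceTarget("my_table", scan));
</pre>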
@@ -1334,7 +1345,7 @@ hfiles directly, which is much faster th
into HBase tables. See the utility methods in the <a
href="apidocs/0.10.0/org/apache/crunch/io/hbase/HFileUtils.html">HFileUtils</a>
class for
more details on how to work with PCollections against hfiles.</p>
<p><a name="exec"></a></p>
-<h2 id="managing-pipeline-execution">Managing Pipeline Execution</h2>
+<h2 id="managing-pipeline-execution">Managing Pipeline Execution<a
class="headerlink" href="#managing-pipeline-execution" title="Permanent
link">¶</a></h2>
<p>Crunch uses a lazy execution model. No jobs are run or outputs created
until the user explicitly invokes one of the methods on the
Pipeline interface that control job planning and execution. The simplest of
these methods is the <code>PipelineResult run()</code> method,
which analyzes the current graph of PCollections and Target outputs and comes
up with a plan to ensure that each of the outputs is
@@ -1356,11 +1367,11 @@ If the planner detects a materialized or
PCollection to its own choice. The implementations of materialize and cache
vary slightly between the MapReduce-based and Spark-based
execution pipelines in a way that is explained in the subsequent section of
the guide.</p>
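<p>To make the lazy-execution model concrete, here is a small sketch (the class name,
filter function, and paths are placeholders): nothing runs while the PCollections are
being defined, and jobs are only planned and executed when <code>run()</code> or
<code>done()</code> is invoked:</p>
<pre>
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;

Pipeline pipeline = new MRPipeline(MyApp.class);

// MyFilterFn is a placeholder FilterFn&lt;String&gt; for this example.
PCollection&lt;String&gt; lines = pipeline.readTextFile("/in/events");   // no job yet
PCollection&lt;String&gt; cleaned = lines.filter(new MyFilterFn());      // still no job
pipeline.writeTextFile(cleaned, "/out/cleaned");                    // still no job

PipelineResult result = pipeline.run();   // plan and execute the MapReduce job(s)
pipeline.done();                          // run anything outstanding and clean up
</pre>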
<p><a name="pipelines"></a></p>
-<h2
id="the-different-pipeline-implementations-properties-and-configuration-options">The
Different Pipeline Implementations (Properties and Configuration options)</h2>
+<h2
id="the-different-pipeline-implementations-properties-and-configuration-options">The
Different Pipeline Implementations (Properties and Configuration options)<a
class="headerlink"
href="#the-different-pipeline-implementations-properties-and-configuration-options"
title="Permanent link">¶</a></h2>
<p>This section provides additional details about the implementation and
configuration options available for each of
the different execution engines.</p>
<p><a name="mrpipeline"></a></p>
-<h3 id="mrpipeline">MRPipeline</h3>
+<h3 id="mrpipeline">MRPipeline<a class="headerlink" href="#mrpipeline"
title="Permanent link">¶</a></h3>
<p>The <a
href="apidocs/0.10.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a>
is the oldest implementation of the Pipeline interface and
compiles and executes the DAG of PCollections into a series of MapReduce jobs.
MRPipeline has three constructors that are commonly
used:</p>
@@ -1420,7 +1431,7 @@ aware of:</p>
</table>
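<p>As a quick sketch (the class name and the configuration setting are placeholders,
and the three-argument constructor form is assumed here), creating an MRPipeline with
a custom Configuration looks like this:</p>
<pre>
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.set("mapreduce.job.queuename", "etl");   // illustrative Hadoop setting only

// The jar class tells Hadoop which jar to ship to the cluster.
Pipeline pipeline = new MRPipeline(MyPipelineApp.class, "my-app", conf);
</pre>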
<p><a name="sparkpipeline"></a></p>
-<h3 id="sparkpipeline">SparkPipeline</h3>
+<h3 id="sparkpipeline">SparkPipeline<a class="headerlink"
href="#sparkpipeline" title="Permanent link">¶</a></h3>
<p>The <code>SparkPipeline</code> is the newest implementation of the Pipeline
interface, and was added in Crunch 0.10.0. It has two commonly used constructors:</p>
<ol>
<li><code>SparkPipeline(String sparkConnection, String appName)</code> which
takes a Spark connection string, which is of the form
<code>local[numThreads]</code> for
@@ -1446,7 +1457,7 @@ get strange and unpredictable failures i
be a little rough around the edges and may not handle all of the use cases
that MRPipeline can handle, although the Crunch community is
actively working to ensure complete compatibility between the two
implementations.</p>
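<p>For example (a sketch; the connection string, app name, and paths are placeholders),
a SparkPipeline can be pointed at an in-process Spark master for testing or at a
cluster master URL for real runs:</p>
<pre>
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.spark.SparkPipeline;

// "local[4]" runs Spark in-process with 4 threads; a spark:// master URL
// would be used against a real cluster.
Pipeline pipeline = new SparkPipeline("local[4]", "my-spark-app");

PCollection&lt;String&gt; lines = pipeline.readTextFile("/in/events");
pipeline.writeTextFile(lines, "/out/copy");
pipeline.done();   // execute the work and clean up
</pre>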
<p><a name="mempipeline"></a></p>
-<h3 id="mempipeline">MemPipeline</h3>
+<h3 id="mempipeline">MemPipeline<a class="headerlink" href="#mempipeline"
title="Permanent link">¶</a></h3>
<p>The <a
href="apidocs/0.10.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a>
implementation of Pipeline has a few interesting
properties. First, unlike MRPipeline, MemPipeline is a singleton; you don't
create a MemPipeline, you just get a reference to it
via the static <code>MemPipeline.getInstance()</code> method. Second, all of
the operations in the MemPipeline are executed completely in-memory,
@@ -1479,10 +1490,10 @@ on the read side. Often the best way to
<code>materialize()</code> method to get a reference to the contents of the
in-memory collection and then verify them directly,
without writing them out to disk.</p>
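<p>A small sketch of this in-memory testing style (the data and function are made up
for the example):</p>
<pre>
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

PCollection&lt;String&gt; words = MemPipeline.typedCollectionOf(
    Writables.strings(), "apple", "banana", "apple");

PCollection&lt;String&gt; upper = words.parallelDo(new MapFn&lt;String, String&gt;() {
  @Override
  public String map(String input) {
    return input.toUpperCase();
  }
}, Writables.strings());

// Verify the contents directly instead of writing them out to disk.
for (String word : upper.materialize()) {
  System.out.println(word);
}
</pre>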
<p><a name="testing"></a></p>
-<h2 id="unit-testing-pipelines">Unit Testing Pipelines</h2>
+<h2 id="unit-testing-pipelines">Unit Testing Pipelines<a class="headerlink"
href="#unit-testing-pipelines" title="Permanent link">¶</a></h2>
<p>For production data pipelines, unit tests are an absolute must. The <a
href="#mempipeline">MemPipeline</a> implementation of the Pipeline
interface has several tools to help developers create effective unit tests,
which will be detailed in this section.</p>
-<h3 id="unit-testing-dofns">Unit Testing DoFns</h3>
+<h3 id="unit-testing-dofns">Unit Testing DoFns<a class="headerlink"
href="#unit-testing-dofns" title="Permanent link">¶</a></h3>
<p>Many of the DoFn implementations, such as <code>MapFn</code> and
<code>FilterFn</code>, are very easy to test, since they accept a single input
and return a single output. For general purpose DoFns, we need an instance of
the <a href="apidocs/0.10.0/org/apache/crunch/Emitter.html">Emitter</a>
interface that we can pass to the DoFn's <code>process</code> method and then
read in the values that are written by the function. Support
@@ -1497,7 +1508,7 @@ has a <code>List<T> getOutput()</c
</pre></div>
-<h3 id="testing-complex-dofns-and-pipelines">Testing Complex DoFns and
Pipelines</h3>
+<h3 id="testing-complex-dofns-and-pipelines">Testing Complex DoFns and
Pipelines<a class="headerlink" href="#testing-complex-dofns-and-pipelines"
title="Permanent link">¶</a></h3>
<p>Many of the DoFns we write involve more complex processing that requires our
DoFn to be initialized and cleaned up, or they define Counters that we use to
track the inputs that we receive. In order to
ensure that our DoFns are working properly across
their entire lifecycle, it's best to use the <a
href="#mempipeline">MemPipeline</a> implementation to create in-memory
instances of
@@ -1532,7 +1543,7 @@ those Counters between test runs by call
</pre></div>
-<h3 id="designing-testable-data-pipelines">Designing Testable Data
Pipelines</h3>
+<h3 id="designing-testable-data-pipelines">Designing Testable Data Pipelines<a
class="headerlink" href="#designing-testable-data-pipelines" title="Permanent
link">¶</a></h3>
<p>In the same way that we try to <a
href="http://misko.hevery.com/code-reviewers-guide/">write testable code</a>,
we want to ensure that
our data pipelines are written in a way that makes them easy to test. In
general, you should try to break up complex pipelines
into a number of function calls that perform a small set of operations on
input PCollections and return one or more PCollections
@@ -1576,7 +1587,7 @@ is taken from one of Crunch's integratio
computations that combine custom DoFns with Crunch's built-in
<code>cogroup</code> operation by using the <a
href="#mempipeline">MemPipeline</a>
implementation to create test data sets that we can easily verify by hand, and
then this same logic can be executed on
a distributed data set using either the <a href="#mrpipeline">MRPipeline</a>
or <a href="#sparkpipeline">SparkPipeline</a> implementations.</p>
-<h3 id="pipeline-execution-plan-visualizations">Pipeline execution plan
visualizations</h3>
+<h3 id="pipeline-execution-plan-visualizations">Pipeline execution plan
visualizations<a class="headerlink"
href="#pipeline-execution-plan-visualizations" title="Permanent
link">¶</a></h3>
<p>Crunch provides tools to visualize the pipeline execution plan. The <a
href="apidocs/0.10.0/org/apache/crunch/PipelineExecution.html">PipelineExecution</a>
<code>String getPlanDotFile()</code> method returns a DOT format
visualization of the execution plan. Furthermore, if the output folder is set,
Crunch will save the dotfile diagram on each pipeline execution: </p>
<div class="codehilite"><pre> <span class="n">Configuration</span> <span
class="n">conf</span> <span class="p">=...;</span>
<span class="n">String</span> <span class="n">dotfileDir</span> <span
class="p">=...;</span>