Author: buildbot
Date: Wed Nov 18 11:49:54 2015
New Revision: 972840
Log:
Staging update by buildbot for crunch
Modified:
websites/staging/crunch/trunk/content/ (props changed)
websites/staging/crunch/trunk/content/about.html
websites/staging/crunch/trunk/content/bylaws.html
websites/staging/crunch/trunk/content/download.html
websites/staging/crunch/trunk/content/future-work.html
websites/staging/crunch/trunk/content/getting-started.html
websites/staging/crunch/trunk/content/index.html
websites/staging/crunch/trunk/content/mailing-lists.html
websites/staging/crunch/trunk/content/pipelines.html
websites/staging/crunch/trunk/content/scrunch.html
websites/staging/crunch/trunk/content/source-repository.html
websites/staging/crunch/trunk/content/user-guide.html
Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Wed Nov 18 11:49:54 2015
@@ -1 +1 @@
-1680960
+1714973
Modified: websites/staging/crunch/trunk/content/about.html
==============================================================================
--- websites/staging/crunch/trunk/content/about.html (original)
+++ websites/staging/crunch/trunk/content/about.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>The initial source code of the Apache Crunch project has been written mostly
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>The initial source code of the Apache Crunch project has been written mostly
by Josh Wills at <a href="http://www.cloudera.com/">Cloudera</a> in 2011, based on
Google's FlumeJava library. The project was open sourced at GitHub soon
afterwards where several releases up to and including 0.2.4 were made.</p>
@@ -154,7 +165,7 @@ entered the <a href="http://incubator.ap
the Incubator and three releases (0.3.0-incubating to 0.5.0-incubating), the
Apache Board of Directors established the Apache Crunch project in February
2013 as a new top level project.</p>
-<h2 id="team">Team</h2>
+<h2 id="team">Team<a class="headerlink" href="#team" title="Permanent link">&para;</a></h2>
<!--
Markdown-generated tables don't have the proper CSS classes,
so we use plain HTML tables.
Modified: websites/staging/crunch/trunk/content/bylaws.html
==============================================================================
--- websites/staging/crunch/trunk/content/bylaws.html (original)
+++ websites/staging/crunch/trunk/content/bylaws.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>This document defines the bylaws under which the Apache Crunch
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>This document defines the bylaws under which the Apache Crunch
project operates. It defines the roles and responsibilities of the
project, who may vote, how voting works, how conflicts are resolved, etc. </p>
<p>Crunch is a project of the
@@ -159,11 +170,11 @@ of principles, known collectively as the
Apache development, please refer to the
<a href="http://incubator.apache.org/">Incubator project</a> for more information
on how Apache projects operate. </p>
-<h2 id="roles-and-responsibilities">Roles and Responsibilities</h2>
+<h2 id="roles-and-responsibilities">Roles and Responsibilities<a class="headerlink" href="#roles-and-responsibilities" title="Permanent link">&para;</a></h2>
<p>Apache projects define a set of roles with associated rights and
responsibilities. These roles govern what tasks an individual may
perform within the project. The roles are defined in the following sections.
</p>
-<h3 id="users">Users</h3>
+<h3 id="users">Users<a class="headerlink" href="#users" title="Permanent link">&para;</a></h3>
<p>The most important participants in the project are people who use our
software. The majority of our contributors start out as users and guide
their development efforts from the user's perspective. </p>
@@ -171,13 +182,13 @@ their development efforts from the user'
contributors in the form of bug reports and feature suggestions. As
well, users participate in the Apache community by helping other users
on mailing lists and user support forums. </p>
-<h3 id="contributors">Contributors</h3>
+<h3 id="contributors">Contributors<a class="headerlink" href="#contributors" title="Permanent link">&para;</a></h3>
<p>All of the volunteers who are contributing time, code, documentation, or
resources to the Crunch project. A contributor that makes sustained,
welcome contributions to the project may be invited to become a
committer, though the exact timing of such invitations depends on many
factors. </p>
-<h3 id="committers">Committers</h3>
+<h3 id="committers">Committers<a class="headerlink" href="#committers" title="Permanent link">&para;</a></h3>
<p>The project's committers are responsible for the project's technical
management. They have access to all of the project's code repositories
and may cast binding votes on any technical discussion regarding the
@@ -200,7 +211,7 @@ more details on the requirements for com
invited to become a member of the PMC. The form of contribution is not
limited to code. It can also include code review, helping out users on
the mailing lists, documentation, etc. </p>
-<h3 id="project-management-committee">Project Management Committee</h3>
+<h3 id="project-management-committee">Project Management Committee<a class="headerlink" href="#project-management-committee" title="Permanent link">&para;</a></h3>
<p>The Project Management Committee (PMC) is responsible to the board and
the ASF for the management and oversight of the Apache Crunch codebase.
The responsibilities of the PMC include: </p>
@@ -232,13 +243,13 @@ Crunch project. </p>
the chair resigns before the end of his or her term, the PMC votes to
recommend a new chair using lazy consensus, but the decision must be ratified
by the Apache board. </p>
-<h2 id="decision-making">Decision Making</h2>
+<h2 id="decision-making">Decision Making<a class="headerlink" href="#decision-making" title="Permanent link">&para;</a></h2>
<p>Within the Apache Crunch project, different types of decisions require
different forms of approval. For example, the previous section describes
several decisions which require "lazy consensus" approval. This section
defines how voting is performed, the types of approvals, and which types
of decision require which type of approval. </p>
-<h3 id="voting">Voting</h3>
+<h3 id="voting">Voting<a class="headerlink" href="#voting" title="Permanent link">&para;</a></h3>
<p>Decisions regarding the project are made by votes on the primary project
development mailing list [email protected]. Where necessary, PMC
voting may take place on the private Crunch PMC mailing list
@@ -294,7 +305,7 @@ codebase. These typically take the form
commit message sent when the commit is made. Note that this should be a
rare occurrence. All efforts should be made to discuss issues when they
are still patches before the code is committed. </p>
-<h3 id="approvals">Approvals</h3>
+<h3 id="approvals">Approvals<a class="headerlink" href="#approvals" title="Permanent link">&para;</a></h3>
<p>These are the types of approvals that can be sought. Different actions
require different types of approvals. </p>
<table class="table">
@@ -339,7 +350,7 @@ require different types of approvals. </
</tbody>
</table>
-<h3 id="vetoes">Vetoes</h3>
+<h3 id="vetoes">Vetoes<a class="headerlink" href="#vetoes" title="Permanent link">&para;</a></h3>
<p>A valid, binding veto cannot be overruled. If a veto is cast, it must
be accompanied by a valid reason explaining the reasons for the
veto. The validity of a veto, if challenged, can be confirmed by
@@ -348,7 +359,7 @@ agreement with the veto - merely that th
<p>If you disagree with a valid veto, you must lobby the person casting the
veto to withdraw his or her veto. If a veto is not withdrawn, the action
that has been vetoed must be reversed in a timely manner. </p>
-<h3 id="actions">Actions</h3>
+<h3 id="actions">Actions<a class="headerlink" href="#actions" title="Permanent link">&para;</a></h3>
<p>This section describes the various actions which are undertaken within
the project, the corresponding approval required for that action and
those who have binding votes over the action. It also specifies the
Modified: websites/staging/crunch/trunk/content/download.html
==============================================================================
--- websites/staging/crunch/trunk/content/download.html (original)
+++ websites/staging/crunch/trunk/content/download.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>The Apache Crunch libraries are distributed under the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License 2.0</a>.</p>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>The Apache Crunch libraries are distributed under the <a href="http://www.apache.org/licenses/LICENSE-2.0.html">Apache License 2.0</a>.</p>
<p>The link in the Download column takes you to a list of mirrors based on
your location. Checksum and signature are located on Apache's main
distribution site.</p>
Modified: websites/staging/crunch/trunk/content/future-work.html
==============================================================================
--- websites/staging/crunch/trunk/content/future-work.html (original)
+++ websites/staging/crunch/trunk/content/future-work.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>This section contains an almost certainly incomplete list of known limitations and plans for future work.</p>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>This section contains an almost certainly incomplete list of known limitations and plans for future work.</p>
<ul>
<li>We would like to have easy support for reading and writing data from/to
the Hive metastore via the HCatalog
APIs.</li>
Modified: websites/staging/crunch/trunk/content/getting-started.html
==============================================================================
--- websites/staging/crunch/trunk/content/getting-started.html (original)
+++ websites/staging/crunch/trunk/content/getting-started.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,11 +145,22 @@
</h1>
- <p><em>Getting Started</em> will guide you through the process of creating a simple Crunch pipeline to count
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p><em>Getting Started</em> will guide you through the process of creating a simple Crunch pipeline to count
the words in a text document, which is the Hello World of distributed
computing. Along the way,
we'll explain the core Crunch concepts and how to use them to create effective
and efficient data
pipelines.</p>
-<h1 id="overview">Overview</h1>
+<h1 id="overview">Overview<a class="headerlink" href="#overview" title="Permanent link">&para;</a></h1>
<p>The Apache Crunch project develops and supports Java APIs that simplify the
process of creating data pipelines on top of Apache Hadoop. The
Crunch APIs are modeled after <a
href="http://pages.cs.wisc.edu/~akella/CS838/F12/838-CloudPapers/FlumeJava.pdf">FlumeJava
(PDF)</a>, which is the library that
Google uses for building data pipelines on top of their own implementation of
MapReduce.</p>
@@ -172,7 +183,7 @@ they represent their data, which makes C
<a
href="http://thunderheadxpler.blogspot.com/2013/05/creating-spatial-crunch-pipelines.html">geospatial</a>
and
<a
href="http://blog.cloudera.com/blog/2012/01/seismic-data-science-hadoop-use-case/">time
series</a> data, and data stored in <a href="http://hbase.apache.org">Apache
HBase</a> tables.</li>
</ol>
-<h1 id="which-version-of-crunch-do-i-need">Which Version of Crunch Do I Need?</h1>
+<h1 id="which-version-of-crunch-do-i-need">Which Version of Crunch Do I Need?<a class="headerlink" href="#which-version-of-crunch-do-i-need" title="Permanent link">&para;</a></h1>
<p>The core libraries are primarily developed against Hadoop 1.1.2, and are
also tested against Hadoop 2.2.0.
They should work with any version of Hadoop 1.x after 1.0.3 and any version of
Hadoop 2.x after 2.0.0-alpha,
although you should note that some of Hadoop 2.x's dependencies changed
between 2.0.4-alpha and 2.2.0 (for example,
@@ -200,7 +211,7 @@ prior versions of crunch-hbase were deve
</tr>
</table>
-<h2 id="maven-dependencies">Maven Dependencies</h2>
+<h2 id="maven-dependencies">Maven Dependencies<a class="headerlink" href="#maven-dependencies" title="Permanent link">&para;</a></h2>
<p>The Crunch project provides Maven artifacts on Maven Central of the
form:</p>
<pre>
<dependency>
@@ -221,7 +232,7 @@ pipelines. Depending on your use case, y
<li><code>crunch-examples</code>: Example MapReduce and HBase pipelines</li>
<li><code>crunch-archetype</code>: A Maven archetype for creating new Crunch
pipeline projects</li>
</ul>
-<h2 id="building-from-source">Building From Source</h2>
+<h2 id="building-from-source">Building From Source<a class="headerlink" href="#building-from-source" title="Permanent link">&para;</a></h2>
<p>You can download the most recently released Crunch libraries from the <a
href="download.html">Download</a> page or from the Maven
Central Repository.</p>
<p>If you prefer, you can also build the Crunch libraries from the source code
using Maven and install
@@ -241,7 +252,7 @@ it in your local repository:</p>
AverageBytesByIP and TotalBytesByIP take as input a file in the Common Log
Format (an example is provided in
<code>crunch-examples/src/main/resources/access_logs.tar.gz</code>.) The
WordAggregationHBase requires an Apache HBase cluster to be
available, but creates tables and loads sample data as part of its run.</p>
-<h1 id="your-first-crunch-pipeline">Your First Crunch Pipeline</h1>
+<h1 id="your-first-crunch-pipeline">Your First Crunch Pipeline<a class="headerlink" href="#your-first-crunch-pipeline" title="Permanent link">&para;</a></h1>
<p>There are a couple of ways to get started with Crunch. If you use Git, you
can
clone this project which contains an <a
href="http://github.com/jwills/crunch-demo">example Crunch pipeline</a>:</p>
<pre>
@@ -318,7 +329,7 @@ files, while <code><out></code> is
Java applications or from unit tests. All required dependencies are on Maven's
classpath so you can run the <code>WordCount</code> class directly without any
additional
setup.</p>
-<h2 id="walking-through-the-wordcount-example">Walking Through The WordCount Example</h2>
+<h2 id="walking-through-the-wordcount-example">Walking Through The WordCount Example<a class="headerlink" href="#walking-through-the-wordcount-example" title="Permanent link">&para;</a></h2>
<p>Let's walk through the <code>run</code> method of the
<code>WordCount</code> example line by line and explain the
data processing concepts we encounter.</p>
<p>Our WordCount application starts out with a <code>main</code> method that
should be familiar to most
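
A minimal sketch of the WordCount pipeline that the getting-started page above walks
through, assuming the MRPipeline-based Java API the page links to; the input and
output paths come from the command line and are placeholders:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;

    public class WordCount {
      public static void main(String[] args) throws Exception {
        // Plan and run the pipeline as one or more MapReduce jobs.
        Pipeline pipeline = new MRPipeline(WordCount.class);
        // Read the input text file into a PCollection of lines.
        PCollection<String> lines = pipeline.readTextFile(args[0]);
        // Split each line into words.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());
        // Count each distinct word and write the counts out as text.
        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, args[1]);
        pipeline.done();
      }
    }
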
Modified: websites/staging/crunch/trunk/content/index.html
==============================================================================
--- websites/staging/crunch/trunk/content/index.html (original)
+++ websites/staging/crunch/trunk/content/index.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -147,7 +147,18 @@
</h1>
- <hr />
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<hr />
<blockquote>
<p>The <em>Apache Crunch</em> Java library provides a framework for writing,
testing,
and running MapReduce pipelines. Its goal is to make pipelines that are
Modified: websites/staging/crunch/trunk/content/mailing-lists.html
==============================================================================
--- websites/staging/crunch/trunk/content/mailing-lists.html (original)
+++ websites/staging/crunch/trunk/content/mailing-lists.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <!--
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<!--
Markdown-generated tables don't have the proper CSS classes,
so we use plain HTML tables.
-->
Modified: websites/staging/crunch/trunk/content/pipelines.html
==============================================================================
--- websites/staging/crunch/trunk/content/pipelines.html (original)
+++ websites/staging/crunch/trunk/content/pipelines.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,11 +145,22 @@
</h1>
- <p>This section discusses the different steps of creating your own Crunch pipelines in more detail.</p>
-<h2 id="writing-a-dofn">Writing a DoFn</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>This section discusses the different steps of creating your own Crunch pipelines in more detail.</p>
+<h2 id="writing-a-dofn">Writing a DoFn<a class="headerlink" href="#writing-a-dofn" title="Permanent link">&para;</a></h2>
<p>The DoFn class is designed to keep the complexity of the MapReduce APIs out
of your way when you
don't need them while still keeping them accessible when you do.</p>
-<h3 id="serialization">Serialization</h3>
+<h3 id="serialization">Serialization<a class="headerlink" href="#serialization" title="Permanent link">&para;</a></h3>
<p>First, all DoFn instances are required to be
<code>java.io.Serializable</code>. This is a key aspect of the library's design:
once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce
job, all of the state
of that DoFn is serialized so that it may be distributed to all of the nodes
in the Hadoop cluster that
@@ -163,15 +174,15 @@ will be running that task. There are two
such as creating a non-serializable member variable, can be performed before
processing begins. Similarly, all
DoFn instances have a <code>cleanup</code> method that may be called after
processing has finished to perform any required
cleanup tasks.</p>
-<h3 id="scale-factor">Scale Factor</h3>
+<h3 id="scale-factor">Scale Factor<a class="headerlink" href="#scale-factor" title="Permanent link">&para;</a></h3>
<p>The DoFn class defines a <code>scaleFactor</code> method that can be used
to signal to the MapReduce compiler that a particular
DoFn implementation will yield an output PCollection that is larger
(scaleFactor > 1) or smaller (0 < scaleFactor < 1)
than the input PCollection it is applied to. The compiler may use this
information to determine how to optimally
split processing tasks between the Map and Reduce phases of dependent
MapReduce jobs.</p>
-<h3 id="other-utilities">Other Utilities</h3>
+<h3 id="other-utilities">Other Utilities<a class="headerlink" href="#other-utilities" title="Permanent link">&para;</a></h3>
<p>The DoFn base class provides convenience methods for accessing the
<code>Configuration</code> and <code>Counter</code> objects that
are associated with a MapReduce stage, so that they may be accessed during
initialization, processing, and cleanup.</p>
-<h3 id="performing-cogroups-and-joins">Performing Cogroups and Joins</h3>
+<h3 id="performing-cogroups-and-joins">Performing Cogroups and Joins<a class="headerlink" href="#performing-cogroups-and-joins" title="Permanent link">&para;</a></h3>
<p>Cogroups and joins are performed on PTable instances that have the same key
type. This section walks through
the basic flow of a cogroup operation, explaining how this higher-level
operation is composed of the four primitive operations.
In general, these common operations are provided as part of the core library
or in extensions, you do not need
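
To illustrate the DoFn points quoted above (serializability, the initialize/cleanup
lifecycle, and the scaleFactor hint), here is a hypothetical DoFn sketch; the class
name, regular expression, and scale factor value are invented for the example:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    // Hypothetical DoFn that extracts the host name from URL strings.
    public class ExtractHost extends DoFn<String, String> {
      // Per-task state built in initialize(), which runs once before any
      // calls to process(), so it need not be serialized with the DoFn.
      private transient Pattern hostPattern;

      @Override
      public void initialize() {
        hostPattern = Pattern.compile("https?://([^/]+).*");
      }

      @Override
      public void process(String url, Emitter<String> emitter) {
        Matcher m = hostPattern.matcher(url);
        if (m.matches()) {
          emitter.emit(m.group(1));
        }
      }

      @Override
      public float scaleFactor() {
        // Hint to the planner that the output is smaller than the input.
        return 0.5f;
      }
    }
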
Modified: websites/staging/crunch/trunk/content/scrunch.html
==============================================================================
--- websites/staging/crunch/trunk/content/scrunch.html (original)
+++ websites/staging/crunch/trunk/content/scrunch.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -147,11 +147,22 @@
</h1>
- <h2 id="introduction">Introduction</h2>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<h2 id="introduction">Introduction<a class="headerlink" href="#introduction" title="Permanent link">&para;</a></h2>
<p>Scrunch is an experimental Scala wrapper for the Apache Crunch Java API,
based on the same ideas as the
<a href="http://days2011.scala-lang.org/node/138/282">Cascade</a> project at
Google, which created a Scala wrapper for
FlumeJava.</p>
-<h2 id="why-scala">Why Scala?</h2>
+<h2 id="why-scala">Why Scala?<a class="headerlink" href="#why-scala" title="Permanent link">&para;</a></h2>
<p>In many ways, Scala is the perfect language for writing MapReduce
pipelines. Scala supports
a mixture of functional and object-oriented programming styles and has
powerful type-inference
capabilities, allowing us to create complex pipelines using very few
keystrokes. Here is
@@ -189,7 +200,7 @@ the second:</p>
</pre></div>
-<h2 id="materializing-job-outputs">Materializing Job Outputs</h2>
+<h2 id="materializing-job-outputs">Materializing Job Outputs<a class="headerlink" href="#materializing-job-outputs" title="Permanent link">&para;</a></h2>
<p>The Scrunch API also incorporates the Java library's
<code>materialize</code> functionality, which allows us to easily read
the output of a MapReduce pipeline into the client:</p>
<div class="codehilite"><pre><span class="n">class</span> <span
class="n">WordCountExample</span> <span class="p">{</span>
@@ -198,7 +209,7 @@ the output of a MapReduce pipeline into
</pre></div>
-<h2 id="notes-and-thanks">Notes and Thanks</h2>
+<h2 id="notes-and-thanks">Notes and Thanks<a class="headerlink" href="#notes-and-thanks" title="Permanent link">&para;</a></h2>
<p>Scrunch emerged out of conversations with <a
href="http://twitter.com/#!/squarecog">Dmitriy Ryaboy</a>,
<a href="http://twitter.com/#!/posco">Oscar Boykin</a>, and <a
href="http://twitter.com/#!/avibryant">Avi Bryant</a> from Twitter.
Many thanks to them for their feedback, guidance, and encouragement. We are
also grateful to
Modified: websites/staging/crunch/trunk/content/source-repository.html
==============================================================================
--- websites/staging/crunch/trunk/content/source-repository.html (original)
+++ websites/staging/crunch/trunk/content/source-repository.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <p>The Apache Crunch Project uses <a href="http://git-scm.com/">Git</a> for version control. Run the
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<p>The Apache Crunch Project uses <a href="http://git-scm.com/">Git</a> for version control. Run the
following command to clone the repository:</p>
<div class="codehilite"><pre><span class="n">git</span> <span
class="n">clone</span> <span class="n">https</span><span
class="p">:</span><span class="o">//</span><span class="n">git</span><span
class="o">-</span><span class="n">wip</span><span class="o">-</span><span
class="n">us</span><span class="p">.</span><span class="n">apache</span><span
class="p">.</span><span class="n">org</span><span class="o">/</span><span
class="n">repos</span><span class="o">/</span><span class="n">asf</span><span
class="o">/</span><span class="n">crunch</span><span class="p">.</span><span
class="n">git</span>
</pre></div>
Modified: websites/staging/crunch/trunk/content/user-guide.html
==============================================================================
--- websites/staging/crunch/trunk/content/user-guide.html (original)
+++ websites/staging/crunch/trunk/content/user-guide.html Wed Nov 18 11:49:54 2015
@@ -80,7 +80,7 @@
- <li><a href="/apidocs/0.12.0/">API (supporting HBase 0.96.x)</a></li>
+ <li><a href="/apidocs/0.12.0/">API Documentation</a></li>
@@ -145,7 +145,18 @@
</h1>
- <ol>
+ <style type="text/css">
+/* The following code is added by mdx_elementid.py
+ It was originally lifted from http://subversion.apache.org/style/site.css */
+/*
+ * Hide class="elementid-permalink", except when an enclosing heading
+ * has the :hover property.
+ */
+.headerlink, .elementid-permalink {
+ visibility: hidden;
+}
+h2:hover > .headerlink, h3:hover > .headerlink, h1:hover > .headerlink, h6:hover > .headerlink, h4:hover > .headerlink, h5:hover > .headerlink, dt:hover > .elementid-permalink { visibility: visible }</style>
+<ol>
<li><a href="#intro">Introduction to Crunch</a><ol>
<li><a href="#motivation">Motivation</a></li>
<li><a href="#datamodel">Data Model and Operators</a></li>
@@ -212,9 +223,9 @@
<li><a href="#testing">Unit Testing Pipelines</a></li>
</ol>
<p><a name="intro"></a></p>
-<h2 id="introduction-to-crunch">Introduction to Crunch</h2>
+<h2 id="introduction-to-crunch">Introduction to Crunch<a class="headerlink" href="#introduction-to-crunch" title="Permanent link">&para;</a></h2>
<p><a name="motivation"></a></p>
-<h3 id="motivation">Motivation</h3>
+<h3 id="motivation">Motivation<a class="headerlink" href="#motivation" title="Permanent link">&para;</a></h3>
<p>Let's start with a basic question: why should you use <em>any</em>
high-level tool for writing data pipelines, as opposed to developing against
the MapReduce, Spark, or Tez APIs directly? Doesn't adding another layer of
abstraction just increase the number of moving pieces you
need to worry about, ala the <a
href="http://www.joelonsoftware.com/articles/LeakyAbstractions.html">Law of
Leaky Abstractions</a>?</p>
@@ -305,7 +316,7 @@ top of Apache Hadoop:</p>
<p>In the next section, we'll give a quick overview of Crunch's version of
these abstractions and how they relate to each other before going
into more detail about their usage in the rest of the guide.</p>
<p><a name="datamodel"></a></p>
-<h3 id="data-model-and-operators">Data Model and Operators</h3>
+<h3 id="data-model-and-operators">Data Model and Operators<a class="headerlink" href="#data-model-and-operators" title="Permanent link">&para;</a></h3>
<p>Crunch's Java API is centered around three interfaces that represent
distributed datasets: <a
href="apidocs/0.10.0/org/apache/crunch/PCollection.html">PCollection<T></a>,
<a
href="http://crunch.apache.org/apidocs/0.10.0/org/apache/crunch/PTable.html">PTable<K,
V></a>, and <a
href="apidocs/0.10.0/org/apache/crunch/PGroupedTable.html">PGroupedTable<K,
V></a>.</p>
<p>A <code>PCollection<T></code> represents a distributed, immutable
collection of elements of type T. For example, we represent a text file as a
@@ -336,12 +347,12 @@ that are available for developers to use
<li><a
href="apidocs/0.10.0/org/apache/crunch/impl/spark/SparkPipeline.html">SparkPipeline</a>:
Executes the pipeline by converting it to a series of Spark pipelines.</li>
</ol>
<p><a name="dataproc"></a></p>
-<h2 id="data-processing-with-dofns">Data Processing with DoFns</h2>
+<h2 id="data-processing-with-dofns">Data Processing with DoFns<a class="headerlink" href="#data-processing-with-dofns" title="Permanent link">&para;</a></h2>
<p>DoFns represent the logical computations of your Crunch pipelines. They are
designed to be easy to write, easy to test, and easy to deploy
within the context of a MapReduce job. Much of your work with the Crunch APIs
will be writing DoFns, and so having a good understanding of
how to use them effectively is critical to crafting elegant and efficient
pipelines.</p>
<p><a name="dovsmap"></a></p>
-<h3 id="dofns-vs-mapper-and-reducer-classes">DoFns vs. Mapper and Reducer Classes</h3>
+<h3 id="dofns-vs-mapper-and-reducer-classes">DoFns vs. Mapper and Reducer Classes<a class="headerlink" href="#dofns-vs-mapper-and-reducer-classes" title="Permanent link">&para;</a></h3>
<p>Let's see how DoFns compare to the Mapper and Reducer classes that you're
used to writing when working with Hadoop's MapReduce API. When
you're creating a MapReduce job, you start by declaring an instance of the
<code>Job</code> class and using its methods to declare the implementations
of the <code>Mapper</code> and <code>Reducer</code> classes that you want to
use:</p>
@@ -431,7 +442,7 @@ of <code>static</code> methods on the cl
regardless of whether the outer class is serializable or not. Using static
methods to define your business logic in terms of a series of
DoFns can also make your code easier to test by using in-memory PCollection
implementations in your unit tests.</p>
<p><a name="runproc"></a></p>
-<h3 id="runtime-processing-steps">Runtime Processing Steps</h3>
+<h3 id="runtime-processing-steps">Runtime Processing Steps<a class="headerlink" href="#runtime-processing-steps" title="Permanent link">&para;</a></h3>
<p>After the Crunch runtime loads the serialized DoFns into its map and reduce
tasks, the DoFns are executed on the input data via the following
sequence:</p>
<ol>
@@ -450,7 +461,7 @@ be used to emit the sum of a list of num
other cleanup task that is appropriate once the task has finished
executing.</li>
</ol>
<p><a name="mrapis"></a></p>
-<h3 id="accessing-runtime-mapreduce-apis">Accessing Runtime MapReduce APIs</h3>
+<h3 id="accessing-runtime-mapreduce-apis">Accessing Runtime MapReduce APIs<a class="headerlink" href="#accessing-runtime-mapreduce-apis" title="Permanent link">&para;</a></h3>
<p>DoFns provide direct access to the <code>TaskInputOutputContext</code>
object that is used within a given Map or Reduce task via the
<code>getContext</code>
method. There are also a number of helper methods for working with the objects
associated with the TaskInputOutputContext, including:</p>
<ul>
@@ -475,7 +486,7 @@ objects returned by Crunch at the end of
Counter classes directly in your Crunch pipelines (the two
<code>getCounter</code> methods that were defined in DoFn are both deprecated)
so that you will not be
required to recompile your job jars when you move from a Hadoop 1.0 cluster to
a Hadoop 2.0 cluster.)</p>
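
A small sketch of how a DoFn might use the runtime helper methods described above
(increment for counters, progress for liveness); the counter group and name are
arbitrary examples rather than anything defined by the library:

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;

    // Illustrative DoFn that counts malformed records and reports progress.
    public class DropMalformedRecords extends DoFn<String, String> {
      @Override
      public void process(String record, Emitter<String> emitter) {
        if (record == null || record.isEmpty()) {
          increment("quality", "malformed");   // shows up in the job's counters
          return;
        }
        progress();                            // signal that the task is alive
        emitter.emit(record);
      }
    }
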
<p><a name="doplan"></a></p>
-<h3 id="configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns">Configuring the Crunch Planner and MapReduce Jobs with DoFns</h3>
+<h3 id="configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns">Configuring the Crunch Planner and MapReduce Jobs with DoFns<a class="headerlink" href="#configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns" title="Permanent link">&para;</a></h3>
<p>Although most of the DoFn methods are focused on runtime execution, there
are a handful of methods that are used during the planning phase
before a pipeline is converted into MapReduce jobs. The first of these
functions is <code>float scaleFactor()</code>, which should return a floating
point
value greater than 0.0f. You can override the scaleFactor method in your
custom DoFns in order to provide a hint to the Crunch planner about
@@ -488,7 +499,7 @@ on the client before processing begins b
will require extra memory settings to run, and so you could make sure that the
value of the <code>mapred.child.java.opts</code> argument had a large enough
memory setting for the DoFn's needs before the job was launched on the
cluster.</p>
<p><a name="mapfn"></a></p>
-<h3 id="common-dofn-patterns">Common DoFn Patterns</h3>
+<h3 id="common-dofn-patterns">Common DoFn Patterns<a class="headerlink" href="#common-dofn-patterns" title="Permanent link">&para;</a></h3>
<p>The Crunch APIs contain a number of useful subclasses of DoFn that handle
common data processing scenarios and are easier
to write and test. The top-level <a
href="apidocs/0.10.0/org/apache/crunch/package-summary.html">org.apache.crunch</a>
package contains three
of the most important specializations, which we will discuss now. Each of
these specialized DoFn implementations has associated methods
@@ -519,7 +530,7 @@ interface, which is defined right alongs
interface defined via static factory methods in the <a
href="apidocs/0.10.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class. We will discuss
Aggregators more in the section on <a href="#aggregators">common MapReduce
patterns</a>.</p>
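
A short sketch of two of the specialized DoFn subclasses mentioned above, assuming the
MapFn and FilterFn signatures linked from this page; the class names are illustrative:

    import org.apache.crunch.FilterFn;
    import org.apache.crunch.MapFn;

    // MapFn: exactly one output per input, so only map() needs to be implemented.
    class ToLowerCaseFn extends MapFn<String, String> {
      @Override
      public String map(String input) {
        return input.toLowerCase();
      }
    }

    // FilterFn: keep or drop each input based on a boolean test.
    class NonEmptyFn extends FilterFn<String> {
      @Override
      public boolean accept(String input) {
        return input != null && !input.isEmpty();
      }
    }
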
<p><a name="serde"></a></p>
-<h2 id="serializing-data-with-ptypes">Serializing Data with PTypes</h2>
+<h2 id="serializing-data-with-ptypes">Serializing Data with PTypes<a class="headerlink" href="#serializing-data-with-ptypes" title="Permanent link">&para;</a></h2>
<p>Every <code>PCollection<T></code> has an associated
<code>PType<T></code> that encapsulates the information on how to
serialize and deserialize the contents of that
PCollection. PTypes are necessary because of <a
href="http://docs.oracle.com/javase/tutorial/java/generics/erasure.html">type
erasure</a>; at runtime, when
the Crunch planner is mapping from PCollections to a series of MapReduce jobs,
the type of a PCollection (that is, the <code>T</code> in
<code>PCollection<T></code>)
@@ -548,7 +559,7 @@ to mix-and-match PCollections that use d
read in Writable data, do a shuffle using Avro, and then write the output data
as Writables), but each PCollection's PType must belong to a single
type family; for example, you cannot have a PTable whose key is serialized as
a Writable and whose value is serialized as an Avro record.</p>
<p><a name="corept"></a></p>
-<h3 id="core-ptypes">Core PTypes</h3>
+<h3 id="core-ptypes">Core PTypes<a class="headerlink" href="#core-ptypes" title="Permanent link">&para;</a></h3>
<p>Both type families support a common set of primitive types (strings, longs,
ints, floats, doubles, booleans, and bytes) as well as more complex
PTypes that can be constructed out of other PTypes:</p>
<ol>
@@ -673,7 +684,7 @@ for POJOs using Avro's reflection-based
and easy to test, but the fact that the data is written out as Avro records
means that you can use tools like Hive and Pig
to query intermediate results to aid in debugging pipeline failures.</p>
<p><a name="extendpt"></a></p>
-<h3 id="extending-ptypes">Extending PTypes</h3>
+<h3 id="extending-ptypes">Extending PTypes<a class="headerlink" href="#extending-ptypes" title="Permanent link">&para;</a></h3>
<p>The simplest way to create a new <code>PType<T></code> for a data
object is to create a <em>derived</em> PType from one of the built-in PTypes
from the Avro
and Writable type families. If we have a base <code>PType<S></code>, we
can create a derived <code>PType<T></code> by implementing an input
<code>MapFn<S, T></code> and an
output <code>MapFn<T, S></code> and then calling
<code>PTypeFamily.derived(Class<T>, MapFn<S, T> in, MapFn<T,
S> out, PType<S> base)</code>, which will return
@@ -695,7 +706,7 @@ easy to work with the POJO directly in y
</pre>
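
Following the derived-PType signature quoted above, a sketch of a derived PType that
stores java.util.UUID values as strings in the Writable type family; the helper class
is hypothetical:

    import java.util.UUID;

    import org.apache.crunch.MapFn;
    import org.apache.crunch.types.PType;
    import org.apache.crunch.types.writable.Writables;

    public class UuidTypes {
      public static PType<UUID> uuids() {
        return Writables.strings().getFamily().derived(
            UUID.class,
            new MapFn<String, UUID>() {   // input mapping: stored string -> UUID
              @Override
              public UUID map(String s) { return UUID.fromString(s); }
            },
            new MapFn<UUID, String>() {   // output mapping: UUID -> stored string
              @Override
              public String map(UUID u) { return u.toString(); }
            },
            Writables.strings());
      }
    }
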
<p><a name="rwdata"></a></p>
-<h2 id="reading-and-writing-data">Reading and Writing Data</h2>
+<h2 id="reading-and-writing-data">Reading and Writing Data<a class="headerlink" href="#reading-and-writing-data" title="Permanent link">&para;</a></h2>
<p>In the introduction to this user guide, we noted that all of the major
tools for working with data pipelines on Hadoop include some sort of abstraction
for working with the <code>InputFormat<K, V></code> and
<code>OutputFormat<K, V></code> classes defined in the MapReduce APIs.
For example, Hive includes
SerDes, and Pig requires LoadFuncs and StoreFuncs. Let's take a moment to
explain what functionality these abstractions provide for
@@ -719,13 +730,13 @@ to wrap an InputFormat and its associate
job, even if those Sources have the same InputFormat. On the output side, the
<code>Target</code> interface can be used in the same way to wrap a
Hadoop <code>OutputFormat</code> and its associated key-value pairs in a way
that can be isolated from any other outputs of a pipeline stage.</p>
<p><a name="notethis"></a></p>
-<h3 id="a-note-on-sources-targets-and-hadoop-apis">A Note on Sources, Targets, and Hadoop APIs</h3>
+<h3 id="a-note-on-sources-targets-and-hadoop-apis">A Note on Sources, Targets, and Hadoop APIs<a class="headerlink" href="#a-note-on-sources-targets-and-hadoop-apis" title="Permanent link">&para;</a></h3>
<p>Crunch, like Hive and Pig, is developed against the <a
href="http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/package-summary.html">org.apache.hadoop.mapreduce</a>
API, not the older <a
href="http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapred/package-summary.html">org.apache.hadoop.mapred</a>
API.
This means that Crunch Sources and Targets expect subclasses of the new <a
href="http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/InputFormat.html">InputFormat</a>
and <a
href="http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/OutputFormat.html">OutputFormat</a>
classes. These new
classes are not 1:1 compatible with the InputFormat and OutputFormat classes
associated with the <code>org.apache.hadoop.mapred</code> APIs, so please be
aware of this difference when considering using existing InputFormats and
OutputFormats with Crunch's Sources and Targets.</p>
<p><a name="sources"></a></p>
-<h3 id="sources">Sources</h3>
+<h3 id="sources">Sources<a class="headerlink" href="#sources" title="Permanent link">&para;</a></h3>
<p>Crunch defines both <code>Source<T></code> and
<code>TableSource<K, V></code> interfaces that allow us to read an input
as a <code>PCollection<T></code> or a <code>PTable<K, V></code>.
You use a Source in conjunction with one of the <code>read</code> methods on
the Pipeline interface:</p>
<pre>
@@ -801,7 +812,7 @@ different files using the NLineInputForm
</table>
<p><a name="targets"></a></p>
-<h3 id="targets">Targets</h3>
+<h3 id="targets">Targets<a class="headerlink" href="#targets" title="Permanent link">&para;</a></h3>
<p>Crunch's <code>Target</code> interface is the analogue of
<code>Source<T></code> for OutputFormats. You create Targets for use with
the <code>write</code> method
defined on the <code>Pipeline</code> interface:</p>
<pre>
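
A brief sketch of the read/write pattern described in the Sources and Targets sections
above, assuming the From and To factory classes mentioned later on the page; the paths
are placeholders:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.From;
    import org.apache.crunch.io.To;

    public class ReadWriteExample {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(ReadWriteExample.class);
        // Read a text file through a Source created by the From factory class.
        PCollection<String> lines = pipeline.read(From.textFile("/path/to/input"));
        // Write the collection through a Target created by the To factory class.
        pipeline.write(lines, To.textFile("/path/to/output"));
        pipeline.done();
      }
    }
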
@@ -873,7 +884,7 @@ parameters that this Target needs:</p>
</table>
<p><a name="srctargets"></a></p>
-<h3 id="sourcetargets-and-write-modes">SourceTargets and Write Modes</h3>
+<h3 id="sourcetargets-and-write-modes">SourceTargets and Write Modes<a class="headerlink" href="#sourcetargets-and-write-modes" title="Permanent link">&para;</a></h3>
<p>The <code>SourceTarget<T></code> interface extends both the
<code>Source<T></code> and <code>Target</code> interfaces and allows a
Path to act as both a
Target for some PCollections as well as a Source for others. SourceTargets are
convenient for any intermediate outputs within
your pipeline. Just as we have the factory methods in the From and To classes
for Sources and Targets, factory methods for
@@ -904,7 +915,7 @@ WriteModes for Crunch:</p>
</pre>
<p><a name="materialize"></a></p>
-<h3 id="materializing-data-into-the-client">Materializing Data Into the Client</h3>
+<h3 id="materializing-data-into-the-client">Materializing Data Into the Client<a class="headerlink" href="#materializing-data-into-the-client" title="Permanent link">&para;</a></h3>
<p>In many analytical applications, we need to use the output of one phase of
a data pipeline in order to configure subsequent pipeline
stages. For example, many machine learning applications require that we
iterate over a dataset until some convergence criteria is
met. Crunch provides API methods that make it possible to materialize the data
from a PCollection and stream the resulting data into
@@ -930,12 +941,12 @@ interface that has an associated <code>V
of elements contained in that PCollection, but the pipeline tasks required to
compute this value will not run until the <code>Long getValue()</code>
method of the returned PObject is called.</p>
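
A small sketch of the materialize and PObject behavior described above; the pipeline
and path are placeholders, and the jobs only run when a value is actually requested:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.PObject;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;

    public class MaterializeExample {
      public static void main(String[] args) throws Exception {
        Pipeline pipeline = new MRPipeline(MaterializeExample.class);
        PCollection<String> lines = pipeline.readTextFile("/path/to/input");

        // Nothing runs yet: length() just returns a lazy PObject handle.
        PObject<Long> count = lines.length();

        // Calling getValue(), like iterating over materialize(), forces the
        // jobs needed to compute the result to run.
        System.out.println("Lines: " + count.getValue());
        for (String line : lines.materialize()) {
          System.out.println(line);
        }
        pipeline.done();
      }
    }
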
<p><a name="patterns"></a></p>
-<h2 id="data-processing-patterns-in-crunch">Data Processing Patterns in Crunch</h2>
+<h2 id="data-processing-patterns-in-crunch">Data Processing Patterns in Crunch<a class="headerlink" href="#data-processing-patterns-in-crunch" title="Permanent link">&para;</a></h2>
<p>This section describes the various data processing patterns implemented in
Crunch's library APIs,
which are in the <a
href="apidocs/0.10.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
package.</p>
<p><a name="gbk"></a></p>
-<h3 id="groupbykey">groupByKey</h3>
+<h3 id="groupbykey">groupByKey<a class="headerlink" href="#groupbykey" title="Permanent link">&para;</a></h3>
<p>Most of the data processing patterns described in this section rely on
PTable's groupByKey method,
which controls how data is shuffled and aggregated by the underlying execution
engine. The groupByKey
method has three flavors on the PTable interface:</p>
@@ -973,7 +984,7 @@ same classes may also be used with other
options specified that will only be applied to the job that actually executes
that phase of the data
pipeline.</p>
<p><a name="aggregators"></a></p>
-<h3 id="combinevalues">combineValues</h3>
+<h3 id="combinevalues">combineValues<a class="headerlink" href="#combinevalues" title="Permanent link">&para;</a></h3>
<p>Calling one of the groupByKey methods on PTable returns an instance of the
PGroupedTable interface.
PGroupedTable provides a <code>combineValues</code> that can be used to signal
to the planner that we want to perform
associative aggregations on our data both before and after the shuffle.</p>
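
A sketch of the groupByKey/combineValues flow described above, using one of the
prebuilt Aggregators; the PTable of partial counts is an assumed input:

    import org.apache.crunch.PTable;
    import org.apache.crunch.fn.Aggregators;

    public class CombineExample {
      // Sum per-key counts; because the aggregation is declared via combineValues,
      // the planner may also apply it map-side as a combiner before the shuffle.
      public static PTable<String, Long> sumCounts(PTable<String, Long> partialCounts) {
        return partialCounts
            .groupByKey()
            .combineValues(Aggregators.SUM_LONGS());
      }
    }
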
@@ -1018,7 +1029,7 @@ the average of a set of values:</p>
</pre>
<p><a name="simpleagg"></a></p>
-<h3 id="simple-aggregations">Simple Aggregations</h3>
+<h3 id="simple-aggregations">Simple Aggregations<a class="headerlink" href="#simple-aggregations" title="Permanent link">&para;</a></h3>
<p>Many of the most common aggregation patterns in Crunch are provided as
methods on the PCollection
interface, including <code>count</code>, <code>max</code>, <code>min</code>,
and <code>length</code>. The implementations of these methods,
however, are in the <a
href="apidocs/0.10.0/org/apache/crunch/lib/Aggregate.html">Aggregate</a>
library class.
@@ -1040,7 +1051,7 @@ most frequently occuring elements, you w
</pre>
<p><a name="joins"></a></p>
-<h3 id="joining-data">Joining Data</h3>
+<h3 id="joining-data">Joining Data<a class="headerlink" href="#joining-data" title="Permanent link">&para;</a></h3>
<p>Joins in Crunch are based on equal-valued keys in different PTables. Joins
have also evolved
a great deal in Crunch over the lifetime of the project. The <a
href="apidocs/0.10.0/org/apache/crunch/lib/Join.html">Join</a>
API provides simple methods for performing equijoins, left joins, right joins,
and full joins, but modern
@@ -1064,14 +1075,14 @@ a given key in a PCollection, so joining
surprising results. Using a non-null dummy value in your PCollections is a
good idea in
general.</p>
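
A hedged sketch of the JoinStrategy pattern referenced above, assuming the
DefaultJoinStrategy linked from this page and a JoinType enum for selecting the join
flavor; the key and value types are illustrative:

    import org.apache.crunch.PTable;
    import org.apache.crunch.Pair;
    import org.apache.crunch.lib.join.DefaultJoinStrategy;
    import org.apache.crunch.lib.join.JoinStrategy;
    import org.apache.crunch.lib.join.JoinType;

    public class JoinExample {
      // Reduce-side inner join of two PTables that share a String key.
      public static PTable<String, Pair<Long, String>> joinById(
          PTable<String, Long> counts, PTable<String, String> names) {
        JoinStrategy<String, Long, String> strategy =
            new DefaultJoinStrategy<String, Long, String>();
        return strategy.join(counts, names, JoinType.INNER_JOIN);
      }
    }
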
<p><a name="reducejoin"></a></p>
-<h4 id="reduce-side-joins">Reduce-side Joins</h4>
+<h4 id="reduce-side-joins">Reduce-side Joins<a class="headerlink" href="#reduce-side-joins" title="Permanent link">&para;</a></h4>
<p>Reduce-side joins are handled by the <a
href="apidocs/0.10.0/org/apache/crunch/lib/join/DefaultJoinStrategy.html">DefaultJoinStrategy</a>.
Reduce-side joins are the simplest and most robust kind of joins in Hadoop;
the keys from the two inputs are
shuffled together to the reducers, where the values from the smaller of the
two collections are collected and then
streamed over the values from the larger of the two collections. You can
control the number of reducers that is used
to perform the join by passing an integer argument to the DefaultJoinStrategy
constructor.</p>
<p><a name="mapjoin"></a></p>
-<h4 id="map-side-joins">Map-side Joins</h4>
+<h4 id="map-side-joins">Map-side Joins<a class="headerlink" href="#map-side-joins" title="Permanent link">&para;</a></h4>
<p>Map-side joins are handled by the <a
href="apidocs/0.10.0/org/apache/crunch/lib/join/MapsideJoinStrategy.html">MapsideJoinStrategy</a>.
Map-side joins require that the smaller of the two input tables is loaded into
memory on the tasks on the cluster, so
there is a requirement that at least one of the tables be relatively small so
that it can comfortably fit into memory within
@@ -1084,7 +1095,7 @@ recommend that you use the <code>Mapside
implementation of the MapsideJoinStrategy in which the left-side PTable is
loaded into
memory instead of the right-side PTable.</p>
<p><a name="shardedjoin"></a></p>
-<h4 id="sharded-joins">Sharded Joins</h4>
+<h4 id="sharded-joins">Sharded Joins<a class="headerlink" href="#sharded-joins" title="Permanent link">&para;</a></h4>
<p>Many distributed joins have skewed data that can cause regular reduce-side
joins to fail due to out-of-memory issues on
the partitions that happen to contain the keys with highest cardinality. To
handle these skew issues, Crunch has the
<a
href="apidocs/0.10.0/org/apache/crunch/lib/join/ShardedJoinStrategy.html">ShardedJoinStrategy</a>
that allows developers to shard
@@ -1092,7 +1103,7 @@ each key to multiple reducers, which pre
in exchange for sending more data over the wire. For problems with significant
skew issues, the ShardedJoinStrategy can
significantly improve performance.</p>
<p><a name="bloomjoin"></a></p>
-<h4 id="bloom-filter-joins">Bloom Filter Joins</h4>
+<h4 id="bloom-filter-joins">Bloom Filter Joins<a class="headerlink" href="#bloom-filter-joins" title="Permanent link">&para;</a></h4>
<p>Last but not least, the <a
href="apidocs/0.10.0/org/apache/crunch/lib/join/BloomFilterJoinStrategy.html">BloomFilterJoinStrategy</a>
builds
a <a href="http://en.wikipedia.org/wiki/Bloom_filter">bloom filter</a> on the
left-hand side table that is used to filter the contents
of the right-hand side table to eliminate entries from the (larger) right-hand
side table that have no hope of being joined
@@ -1100,7 +1111,7 @@ to values in the left-hand side table. T
into memory on the tasks of the job, but is still significantly smaller than
the right-hand side table, and we know that the
vast majority of the keys in the right-hand side table will not match the keys
in the left-hand side of the table.</p>
<p><a name="cogroups"></a></p>
-<h4 id="cogroups">Cogroups</h4>
+<h4 id="cogroups">Cogroups<a class="headerlink" href="#cogroups" title="Permanent link">&para;</a></h4>
<p>Some kinds of joins are richer and more complex than the typical kind of
relational join that is handled by JoinStrategy.
For example, we might want to join two datasets
together and only emit a record if each of the sets had at least two distinct
values associated
@@ -1125,12 +1136,12 @@ PTable whose values are made up of Colle
how they work, you can consult the <a
href="http://chimera.labs.oreilly.com/books/1234000001811/ch06.html">section on
cogroups</a>
in the Apache Pig book.</p>
<p><a name="sorting"></a></p>
-<h3 id="sorting">Sorting</h3>
+<h3 id="sorting">Sorting<a class="headerlink" href="#sorting" title="Permanent link">&para;</a></h3>
<p>After joins and cogroups, sorting data is the most common distributed
computing pattern. The
Crunch APIs have a number of utilities for performing fully distributed sorts
as well as
more advanced patterns like secondary sorts.</p>
<p><a name="stdsort"></a></p>
-<h4 id="standard-and-reverse-sorting">Standard and Reverse Sorting</h4>
+<h4 id="standard-and-reverse-sorting">Standard and Reverse Sorting<a class="headerlink" href="#standard-and-reverse-sorting" title="Permanent link">&para;</a></h4>
<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sort.html">Sort</a> API
methods contain utility functions
for sorting the contents of PCollections and PTables whose contents implement
the <code>Comparable</code>
interface. By default, MapReduce does not perform total sorts on its keys
during a shuffle; instead
@@ -1160,7 +1171,7 @@ the <a href="apidocs/0.10.0/org/apache/c
</pre>
<p><a name="secsort"></a></p>
-<h4 id="secondary-sorts">Secondary Sorts</h4>
+<h4 id="secondary-sorts">Secondary Sorts<a class="headerlink" href="#secondary-sorts" title="Permanent link">&para;</a></h4>
<p>Another pattern that occurs frequently in distributed processing is
<em>secondary sorts</em>, where we
want to group a set of records by one key and sort the records within each
group by a second key.
The <a
href="apidocs/0.10.0/org/apache/crunch/lib/SecondarySort.html">SecondarySort</a>
API provides a set
@@ -1169,11 +1180,11 @@ where <code>K</code> is the primary grou
method will perform the grouping and sorting and will then apply a given DoFn
to process the
grouped and sorted values.</p>
<p><a name="otheropts"></a></p>
-<h3 id="other-operations">Other Operations</h3>
+<h3 id="other-operations">Other Operations<a class="headerlink" href="#other-operations" title="Permanent link">&para;</a></h3>
<p>Crunch provides implementations of a number of other common distributed
processing patterns and
techniques throughout its library APIs.</p>
<p><a name="cartesian"></a></p>
-<h4 id="cartesian-products">Cartesian Products</h4>
+<h4 id="cartesian-products">Cartesian Products<a class="headerlink" href="#cartesian-products" title="Permanent link">&para;</a></h4>
<p>Cartesian products between PCollections are a bit tricky in distributed
processing; we usually want
one of the datasets to be small enough to fit into memory, and then do a pass
over the larger data
set where we emit an element of the smaller data set along with each element
from the larger set.</p>
@@ -1183,7 +1194,7 @@ provides methods for a reduce-side full
this is a pretty expensive operation, and you should go out of your way to
avoid these kinds of processing
steps in your pipelines.</p>
<p><a name="shard"></a></p>
-<h4 id="coalescing">Coalescing</h4>
+<h4 id="coalescing">Coalescing<a class="headerlink" href="#coalescing" title="Permanent link">&para;</a></h4>
<p>Many MapReduce jobs have the potential to generate a large number of small
files that could be used more
effectively by clients if they were all merged together into a small number of
large files. The
<a href="apidocs/0.10.0/org/apache/crunch/lib/Shard.html">Shard</a> API
provides a single method, <code>shard</code>, that allows
@@ -1196,7 +1207,7 @@ you to coalesce a given PCollection into
<p>This has the effect of running a no-op MapReduce job that shuffles the data
into the given number of
partitions. This is often a useful step at the end of a long pipeline run.</p>
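
A one-method sketch of the coalescing step described above, assuming the Shard API
linked from this page; the partition count is an arbitrary example:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.lib.Shard;

    public class ShardExample {
      // Rewrites a PCollection into a fixed number of partitions via a
      // shuffle-only job; useful at the end of a pipeline that would otherwise
      // leave behind many small files.
      public static <T> PCollection<T> coalesce(PCollection<T> data) {
        return Shard.shard(data, 10);
      }
    }
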
<p><a name="distinct"></a></p>
-<h4 id="distinct">Distinct</h4>
+<h4 id="distinct">Distinct<a class="headerlink" href="#distinct" title="Permanent link">&para;</a></h4>
<p>Crunch's <a
href="apidocs/0.10.0/org/apache/crunch/lib/Distinct.html">Distinct</a> API has
a method, <code>distinct</code>, that
returns one copy of each unique element in a given PCollection:</p>
<pre>
@@ -1218,7 +1229,7 @@ with another method in Distinct:</p>
value for your own pipelines. The optimal value will depend on some
combination of the size of the objects (and
thus the amount of memory they consume) and the number of unique elements in
the data.</p>
<p><a name="sampling"></a></p>
-<h4 id="sampling">Sampling</h4>
+<h4 id="sampling">Sampling<a class="headerlink" href="#sampling" title="Permanent link">&para;</a></h4>
<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sample.html">Sample</a>
API provides methods for two sorts of PCollection
sampling: random and reservoir.</p>
<p>Random sampling is where you include each record in the sample with a fixed
probability, and is probably what you're
@@ -1244,11 +1255,11 @@ collection! You can read more about how
random number generators. Note that all of the sampling algorithms Crunch
provides, both random and reservoir,
only require a single pass over the data.</p>
<p><a name="sets"></a></p>
-<h4 id="set-operations">Set Operations</h4>
+<h4 id="set-operations">Set Operations<a class="headerlink" href="#set-operations" title="Permanent link">&para;</a></h4>
<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Set.html">Set</a> API
methods complement Crunch's built-in <code>union</code> methods and
provide support for finding the intersection, the difference, or the <a
href="http://en.wikipedia.org/wiki/Comm">comm</a> of two PCollections.</p>
<p><a name="splits"></a></p>
-<h4 id="splits">Splits</h4>
+<h4 id="splits">Splits<a class="headerlink" href="#splits" title="Permanent
link">¶</a></h4>
<p>Sometimes, you want to write two different outputs from the same DoFn into
different PCollections. An example of this would
be a pipeline in which you wanted to write good records to one file and bad or
corrupted records to a different file for
further examination. The <a
href="apidocs/0.10.0/org/apache/crunch/lib/Channels.html">Channels</a> class
provides a method that allows
@@ -1261,7 +1272,7 @@ you to split an input PCollection of Pai
</pre>
<p><a name="objectreuse"></a></p>
-<h3 id="retaining-objects-within-dofns">Retaining objects within DoFns</h3>
+<h3 id="retaining-objects-within-dofns">Retaining objects within DoFns<a
class="headerlink" href="#retaining-objects-within-dofns" title="Permanent
link">¶</a></h3>
<p>For reasons of efficiency, Hadoop MapReduce repeatedly passes the <a
href="https://issues.apache.org/jira/browse/HADOOP-2399">same references as
keys and values to Mappers and Reducers</a> instead of passing in new objects
for each call.
The state of the singleton key and value objects is updated between each call
to <code>Mapper.map()</code> and <code>Reducer.reduce()</code>, as well as
between each
@@ -1316,7 +1327,7 @@ the maximum value encountered would be i
<p><a name="hbase"></a></p>
-<h2 id="crunch-for-hbase">Crunch for HBase</h2>
+<h2 id="crunch-for-hbase">Crunch for HBase<a class="headerlink"
href="#crunch-for-hbase" title="Permanent link">¶</a></h2>
<p>Crunch is an excellent platform for creating pipelines that involve
processing data from HBase tables. Because of Crunch's
flexible schemas for PCollections and PTables, you can write pipelines that
operate directly on HBase API classes like
<code>Put</code>, <code>KeyValue</code>, and <code>Result</code>.</p>
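<p>As a rough sketch only (the table name and column family are invented, and the
source class and constructor shown here are assumptions that may differ across Crunch
and HBase versions), reading an HBase table into a PTable of <code>Result</code>
objects looks something like this:</p>
<pre>
import org.apache.crunch.PTable;
import org.apache.crunch.io.hbase.HBaseSourceTarget;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("cf"));
// Assumed constructor: HBaseSourceTarget(String tableName, Scan scan).
PTable&lt;ImmutableBytesWritable, Result&gt; rows =
    pipeline.read(new HBaseSourceTarget("my_table", scan));
</pre>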
@@ -1334,7 +1345,7 @@ hfiles directly, which is much faster th
into HBase tables. See the utility methods in the <a
href="apidocs/0.10.0/org/apache/crunch/io/hbase/HFileUtils.html">HFileUtils</a>
class for
more details on how to work with PCollections against hfiles.</p>
<p><a name="exec"></a></p>
-<h2 id="managing-pipeline-execution">Managing Pipeline Execution</h2>
+<h2 id="managing-pipeline-execution">Managing Pipeline Execution<a
class="headerlink" href="#managing-pipeline-execution" title="Permanent
link">¶</a></h2>
<p>Crunch uses a lazy execution model. No jobs are run or outputs created
until the user explicitly invokes one of the methods on the
Pipeline interface that control job planning and execution. The simplest of
these methods is the <code>PipelineResult run()</code> method,
which analyzes the current graph of PCollections and Target outputs and comes
up with a plan to ensure that each of the outputs is
@@ -1356,11 +1367,11 @@ If the planner detects a materialized or
PCollection to its own choice. The implementations of materialize and cache
vary slightly between the MapReduce-based and Spark-based
execution pipelines in a way that is explained in the subsequent section of
the guide.</p>
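<p>To make the lazy-execution model concrete, here is a small sketch (the class name,
filter function, and paths are placeholders): nothing runs while the PCollections are
being defined, and jobs are only planned and executed when <code>run()</code> or
<code>done()</code> is invoked:</p>
<pre>
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.PipelineResult;
import org.apache.crunch.impl.mr.MRPipeline;

Pipeline pipeline = new MRPipeline(MyApp.class);

// MyFilterFn is a placeholder FilterFn&lt;String&gt; for this example.
PCollection&lt;String&gt; lines = pipeline.readTextFile("/in/events");   // no job yet
PCollection&lt;String&gt; cleaned = lines.filter(new MyFilterFn());      // still no job
pipeline.writeTextFile(cleaned, "/out/cleaned");                    // still no job

PipelineResult result = pipeline.run();   // plan and execute the MapReduce job(s)
pipeline.done();                          // run anything outstanding and clean up
</pre>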
<p><a name="pipelines"></a></p>
-<h2
id="the-different-pipeline-implementations-properties-and-configuration-options">The
Different Pipeline Implementations (Properties and Configuration options)</h2>
+<h2
id="the-different-pipeline-implementations-properties-and-configuration-options">The
Different Pipeline Implementations (Properties and Configuration options)<a
class="headerlink"
href="#the-different-pipeline-implementations-properties-and-configuration-options"
title="Permanent link">¶</a></h2>
<p>This section provides additional details about the implementation and
configuration options available for each of
the different execution engines.</p>
<p><a name="mrpipeline"></a></p>
-<h3 id="mrpipeline">MRPipeline</h3>
+<h3 id="mrpipeline">MRPipeline<a class="headerlink" href="#mrpipeline"
title="Permanent link">¶</a></h3>
<p>The <a
href="apidocs/0.10.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a>
is the oldest implementation of the Pipeline interface and
compiles and executes the DAG of PCollections into a series of MapReduce jobs.
MRPipeline has three constructors that are commonly
used:</p>
@@ -1420,7 +1431,7 @@ aware of:</p>
</table>
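<p>As a quick sketch (the class name and the configuration setting are placeholders,
and the three-argument constructor form is assumed here), creating an MRPipeline with
a custom Configuration looks like this:</p>
<pre>
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.set("mapreduce.job.queuename", "etl");   // illustrative Hadoop setting only

// The jar class tells Hadoop which jar to ship to the cluster.
Pipeline pipeline = new MRPipeline(MyPipelineApp.class, "my-app", conf);
</pre>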
<p><a name="sparkpipeline"></a></p>
-<h3 id="sparkpipeline">SparkPipeline</h3>
+<h3 id="sparkpipeline">SparkPipeline<a class="headerlink"
href="#sparkpipeline" title="Permanent link">¶</a></h3>
<p>The <code>SparkPipeline</code> is the newest implementation of the Pipeline
interface, and was added in Crunch 0.10.0. It has two commonly used constructors:</p>
<ol>
<li><code>SparkPipeline(String sparkConnection, String appName)</code> which
takes a Spark connection string, which is of the form
<code>local[numThreads]</code> for
@@ -1446,7 +1457,7 @@ get strange and unpredictable failures i
be a little rough around the edges and may not handle all of the use cases
that MRPipeline can handle, although the Crunch community is
actively working to ensure complete compatibility between the two
implementations.</p>
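<p>For example (a sketch; the connection string, app name, and paths are placeholders),
a SparkPipeline can be pointed at an in-process Spark master for testing or at a
cluster master URL for real runs:</p>
<pre>
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.spark.SparkPipeline;

// "local[4]" runs Spark in-process with 4 threads; a spark:// master URL
// would be used against a real cluster.
Pipeline pipeline = new SparkPipeline("local[4]", "my-spark-app");

PCollection&lt;String&gt; lines = pipeline.readTextFile("/in/events");
pipeline.writeTextFile(lines, "/out/copy");
pipeline.done();   // execute the work and clean up
</pre>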
<p><a name="mempipeline"></a></p>
-<h3 id="mempipeline">MemPipeline</h3>
+<h3 id="mempipeline">MemPipeline<a class="headerlink" href="#mempipeline"
title="Permanent link">¶</a></h3>
<p>The <a
href="apidocs/0.10.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a>
implementation of Pipeline has a few interesting
properties. First, unlike MRPipeline, MemPipeline is a singleton; you don't
create a MemPipeline, you just get a reference to it
via the static <code>MemPipeline.getInstance()</code> method. Second, all of
the operations in the MemPipeline are executed completely in-memory,
@@ -1479,10 +1490,10 @@ on the read side. Often the best way to
<code>materialize()</code> method to get a reference to the contents of the
in-memory collection and then verify them directly,
without writing them out to disk.</p>
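<p>A small sketch of this in-memory testing style (the data and function are made up
for the example):</p>
<pre>
import org.apache.crunch.MapFn;
import org.apache.crunch.PCollection;
import org.apache.crunch.impl.mem.MemPipeline;
import org.apache.crunch.types.writable.Writables;

PCollection&lt;String&gt; words = MemPipeline.typedCollectionOf(
    Writables.strings(), "apple", "banana", "apple");

PCollection&lt;String&gt; upper = words.parallelDo(new MapFn&lt;String, String&gt;() {
  @Override
  public String map(String input) {
    return input.toUpperCase();
  }
}, Writables.strings());

// Verify the contents directly instead of writing them out to disk.
for (String word : upper.materialize()) {
  System.out.println(word);
}
</pre>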
<p><a name="testing"></a></p>
-<h2 id="unit-testing-pipelines">Unit Testing Pipelines</h2>
+<h2 id="unit-testing-pipelines">Unit Testing Pipelines<a class="headerlink"
href="#unit-testing-pipelines" title="Permanent link">¶</a></h2>
<p>For production data pipelines, unit tests are an absolute must. The <a
href="#mempipeline">MemPipeline</a> implementation of the Pipeline
interface has several tools to help developers create effective unit tests,
which will be detailed in this section.</p>
-<h3 id="unit-testing-dofns">Unit Testing DoFns</h3>
+<h3 id="unit-testing-dofns">Unit Testing DoFns<a class="headerlink"
href="#unit-testing-dofns" title="Permanent link">¶</a></h3>
<p>Many of the DoFn implementations, such as <code>MapFn</code> and
<code>FilterFn</code>, are very easy to test, since they accept a single input
and return a single output. For general purpose DoFns, we need an instance of
the <a href="apidocs/0.10.0/org/apache/crunch/Emitter.html">Emitter</a>
interface that we can pass to the DoFn's <code>process</code> method and then
read in the values that are written by the function. Support
@@ -1497,7 +1508,7 @@ has a <code>List<T> getOutput()</c
</pre></div>
-<h3 id="testing-complex-dofns-and-pipelines">Testing Complex DoFns and
Pipelines</h3>
+<h3 id="testing-complex-dofns-and-pipelines">Testing Complex DoFns and
Pipelines<a class="headerlink" href="#testing-complex-dofns-and-pipelines"
title="Permanent link">¶</a></h3>
<p>Many of the DoFns we write involve more complex processing that requires our
DoFn to be initialized and cleaned up, or they define Counters that we use to
track the inputs that we receive. In order to
ensure that our DoFns are working properly across
their entire lifecycle, it's best to use the <a
href="#mempipeline">MemPipeline</a> implementation to create in-memory
instances of
@@ -1532,7 +1543,7 @@ those Counters between test runs by call
</pre></div>
-<h3 id="designing-testable-data-pipelines">Designing Testable Data
Pipelines</h3>
+<h3 id="designing-testable-data-pipelines">Designing Testable Data Pipelines<a
class="headerlink" href="#designing-testable-data-pipelines" title="Permanent
link">¶</a></h3>
<p>In the same way that we try to <a
href="http://misko.hevery.com/code-reviewers-guide/">write testable code</a>,
we want to ensure that
our data pipelines are written in a way that makes them easy to test. In
general, you should try to break up complex pipelines
into a number of function calls that perform a small set of operations on
input PCollections and return one or more PCollections
@@ -1576,7 +1587,7 @@ is taken from one of Crunch's integratio
computations that combine custom DoFns with Crunch's built-in
<code>cogroup</code> operation by using the <a
href="#mempipeline">MemPipeline</a>
implementation to create test data sets that we can easily verify by hand, and
then this same logic can be executed on
a distributed data set using either the <a href="#mrpipeline">MRPipeline</a>
or <a href="#sparkpipeline">SparkPipeline</a> implementations.</p>
-<h3 id="pipeline-execution-plan-visualizations">Pipeline execution plan
visualizations</h3>
+<h3 id="pipeline-execution-plan-visualizations">Pipeline execution plan
visualizations<a class="headerlink"
href="#pipeline-execution-plan-visualizations" title="Permanent
link">¶</a></h3>
<p>Crunch provides tools to visualize the pipeline execution plan. The <a
href="apidocs/0.10.0/org/apache/crunch/PipelineExecution.html">PipelineExecution</a>
<code>String getPlanDotFile()</code> method returns a DOT format
visualization of the execution plan. Furthermore, if the output folder is set,
Crunch will save the dotfile diagram on each pipeline execution: </p>
<div class="codehilite"><pre> <span class="n">Configuration</span> <span
class="n">conf</span> <span class="p">=...;</span>
<span class="n">String</span> <span class="n">dotfileDir</span> <span
class="p">=...;</span>