Added: websites/staging/crunch/trunk/content/crunch/css/bootstrap-2.1.0.min.css
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/css/bootstrap-2.1.0.min.css
(added)
+++ websites/staging/crunch/trunk/content/crunch/css/bootstrap-2.1.0.min.css
Sun Sep 16 18:50:04 2012
@@ -0,0 +1,9 @@
+/*!
+ * Bootstrap v2.1.0
+ *
+ * Copyright 2012 Twitter, Inc
+ * Licensed under the Apache License v2.0
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Designed and built with all the love in the world @twitter by @mdo and @fat.
[... 2 lines stripped ...]
Added: websites/staging/crunch/trunk/content/crunch/css/crunch.css
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/css/crunch.css (added)
+++ websites/staging/crunch/trunk/content/crunch/css/crunch.css Sun Sep 16
18:50:04 2012
@@ -0,0 +1,4 @@
+.nav-list {
+ padding-left: 5px;
+ padding-right: 5px;
+}
Added: websites/staging/crunch/trunk/content/crunch/future-work.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/future-work.html (added)
+++ websites/staging/crunch/trunk/content/crunch/future-work.html Sun Sep 16
18:50:04 2012
@@ -0,0 +1,141 @@
+<!DOCTYPE html>
+
+
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+ <head>
+ <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+ <meta http-equiv="Content-Language" content="en" />
+
+ <title>Apache Crunch - Current Limitations and Future Work</title>
+
+ <link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
+ <link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
+ <script type="text/javascript"
src="/crunch/js/bootstrap-2.1.0.min.js"></script>
+ </head>
+ <body>
+
+ <div class="navbar navbar-inverse navbar-static-top">
+
+ <div class="container-fluid">
+
+ <a class="nav pull-right brand" href="http://incubator.apache.org">
+ <img src="http://incubator.apache.org/images/egg-logo.png"
alt="Apache Incubator Logo" />
+ </a>
+
+ </div>
+
+ </div>
+
+ <ul class="breadcrumb">
+ <li>
+ <a href="/">Incubator</a>
+ <span class="divider">»</span>
+ </li>
+ <li>
+ <a href="/crunch/">Crunch</a>
+ </li>
+
+ </ul>
+
+ <div class="container-fluid">
+ <div class="row-fluid">
+
+ <!-- SIDEBAR AREA -->
+ <div class="span2">
+ <div class="sidebar-nav">
+ <ul class="nav nav-list">
+
+
+ <li class="nav-header">Apache Crunch</li>
+
+
+
+
+ <li><a href="/crunch/index.html">Overview</a></li>
+
+
+
+
+
+ <li><a href="/crunch/apidocs/">API</a></li>
+
+
+
+
+
+ <li><a
href="https://cwiki.apache.org/confluence/display/CRUNCH/">Wiki</a></li>
+
+
+
+
+ <li class="nav-header">Project</li>
+
+
+
+
+ <li><a href="/crunch/source-repository.html">Source
Code</a></li>
+
+
+
+
+
+ <li><a href="/crunch/mailing-lists.html">Mailing
Lists</a></li>
+
+
+
+
+
+ <li><a
href="http://issues.apache.org/jira/browse/CRUNCH">Issue Tracking</a></li>
+
+
+
+
+
+ <li><a
href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+
+
+
+ </ul>
+ </div> <!-- /well -->
+ </div> <!-- /span -->
+
+ <!-- CONTENT AREA -->
+ <div class="span10">
+ <h1 class="title">
+ Current Limitations and Future Work
+
+ </h1>
+
+ <p>This section contains an almost certainly incomplete list of
known limitations of Crunch and plans for future work.</p>
+<ul>
+<li>We would like to have easy support for reading data from and writing data to
HCatalog.</li>
+<li>The decision of how to split up processing tasks between dependent
MapReduce jobs is very naive right now: we simply
+delegate all of the work to the reduce stage of the predecessor job. We should
take advantage of information about the
+expected size of different PCollections to optimize this processing.</li>
+<li>The Crunch optimizer does not yet merge different groupByKey operations
that run over the same input data into a single
+MapReduce job. Implementing this optimization will provide a major performance
benefit for a number of problems.</li>
+</ul>
+ </div> <!-- /span -->
+
+ </div> <!-- /row-fluid -->
+
+ </div>
+
+ <hr/>
+
+ <footer>
+ <div class="container-fluid">
+ <div class="row span12">Copyright © 2012
+ <a href="http://www.apache.org/">The Apache Software Foundation</a>,
+ licensed under the <a
href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version
2.0</a>.
+ <p><small>Apache Incubator, Apache Hadoop, Hadoop, Apache, and the
+ Apache feather logo are trademarks of The Apache Software Foundation.
+ Other names appearing on the site may be trademarks of their
+ respective owners.</small></p>
+ </div>
+ </div>
+ </footer>
+
+ </body>
+</html>
Modified: websites/staging/crunch/trunk/content/crunch/index.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/index.html (original)
+++ websites/staging/crunch/trunk/content/crunch/index.html Sun Sep 16 18:50:04
2012
@@ -1,56 +1,161 @@
-<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
-<html lang="en">
- <head>
- <title>Home Page</title>
+<!DOCTYPE html>
- <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
- <meta property="og:image"
content="http://www.apache.org/images/asf_logo.gif" />
- <link rel="stylesheet" type="text/css" media="screen"
href="http://www.apache.org/css/style.css">
- <link rel="stylesheet" type="text/css" media="screen"
href="http://www.apache.org/css/code.css">
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+ <head>
+ <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+ <meta http-equiv="Content-Language" content="en" />
+
+ <title>Apache Crunch - Apache Crunch</title>
+
+ <link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
+ <link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
+ <script type="text/javascript"
src="/crunch/js/bootstrap-2.1.0.min.js"></script>
+ </head>
+ <body>
-
+ <div class="navbar navbar-inverse navbar-static-top">
+
+ <div class="container-fluid">
+
+ <a class="nav pull-right brand" href="http://incubator.apache.org">
+ <img src="http://incubator.apache.org/images/egg-logo.png"
alt="Apache Incubator Logo" />
+ </a>
-
- <!-- Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with this work
for additional information regarding copyright ownership. The ASF licenses
this file to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance with the
License. You may obtain a copy of the License at .
http://www.apache.org/licenses/LICENSE-2.0 . Unless required by applicable law
or agreed to in writing, software distributed under the License is distributed
on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied. See the License for the specific language governing
permissions and limitations under the License. -->
- </head>
+ </div>
+
+ </div>
- <body>
- <div id="page" class="container_16">
- <div id="header" class="grid_8">
- <img src="http://www.apache.org/images/feather-small.gif" alt="The
Apache Software Foundation">
- <h1>The Apache Software Foundation</h1>
- <h2>Home Page</h2>
- </div>
- <div id="nav" class="grid_8">
- <ul>
- <!-- <li><a href="/" title="Welcome!">Home</a></li> -->
- <li><a href="http://www.apache.org/foundation/" title="The
Foundation">Foundation</a></li>
- <li><a href="http://projects.apache.org" title="The
Projects">Projects</a></li>
- <li><a href="http://people.apache.org" title="The
People">People</a></li>
- <li><a href="http://www.apache.org/foundation/getinvolved.html"
title="Get Involved">Get Involved</a></li>
- <li><a href="http://www.apache.org/dyn/closer.cgi"
title="Download">Download</a></li>
- <li><a href="http://www.apache.org/foundation/sponsorship.html"
title="Support Apache">Support Apache</a></li>
- </ul>
- <p><a href="/">Home</a> » <a
href="/crunch/">Crunch</a></p>
- <form name="search" id="search" action="http://www.google.com/search"
method="get">
- <input value="*.apache.org" name="sitesearch" type="hidden"/>
- <input type="text" name="q" id="query">
- <input type="submit" id="submit" value="Search">
- </form>
- </div>
- <div class="clear"></div>
- <div id="content" class="grid_16"><div class="section-content"><h1
id="welcome">Welcome</h1>
-<p>Welcome to the Apache CMS. Please see the following resources for further
help:</p>
+ <ul class="breadcrumb">
+ <li>
+ <a href="/">Incubator</a>
+ <span class="divider">»</span>
+ </li>
+ <li>
+ <a href="/crunch/">Crunch</a>
+ </li>
+
+ </ul>
+
+ <div class="container-fluid">
+ <div class="row-fluid">
+
+ <!-- SIDEBAR AREA -->
+ <div class="span2">
+ <div class="sidebar-nav">
+ <ul class="nav nav-list">
+
+
+ <li class="nav-header">Apache Crunch</li>
+
+
+
+
+ <li><b>Overview</b></li>
+
+
+
+
+
+ <li><a href="/crunch/apidocs/">API</a></li>
+
+
+
+
+
+ <li><a
href="https://cwiki.apache.org/confluence/display/CRUNCH/">Wiki</a></li>
+
+
+
+
+ <li class="nav-header">Project</li>
+
+
+
+
+ <li><a href="/crunch/source-repository.html">Source
Code</a></li>
+
+
+
+
+
+ <li><a href="/crunch/mailing-lists.html">Mailing
Lists</a></li>
+
+
+
+
+
+ <li><a
href="http://issues.apache.org/jira/browse/CRUNCH">Issue Tracking</a></li>
+
+
+
+
+
+ <li><a
href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+
+
+
+ </ul>
+ </div> <!-- /well -->
+ </div> <!-- /span -->
+
+ <!-- CONTENT AREA -->
+ <div class="span10">
+ <h1 class="title">
+ Apache Crunch
+
+ <small>Simple and Efficient MapReduce Pipelines</small>
+
+ </h1>
+
+ <hr />
+<blockquote>
+<p><em>Apache Crunch (incubating)</em> is a Java library for writing, testing,
and
+running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make
+pipelines that are composed of many user-defined functions simple to write,
+easy to test, and efficient to run.</p>
+</blockquote>
+<hr />
+<p>Running on top of <a href="http://hadoop.apache.org/mapreduce/">Hadoop
MapReduce</a>, Apache
+Crunch provides a simple Java API for tasks like joining and data aggregation
+that are tedious to implement on plain MapReduce. For Scala users, there is
also
+Scrunch, an idiomatic Scala API to Crunch.</p>
+<h2 id="documentation">Documentation</h2>
<ul>
-<li><a
href="http://www.apache.org/dev/cmsref.html">http://www.apache.org/dev/cmsref.html</a></li>
-<li><a
href="http://wiki.apache.org/general/ApacheCms2010">http://wiki.apache.org/general/ApacheCms2010</a></li>
-</ul></div></div>
- <div class="clear"></div>
- </div>
+<li><a href="intro.html">Introduction to Apache Crunch</a></li>
+<li><a href="scrunch.html">Introduction to Scrunch</a></li>
+<li><a href="future-work.html">Current Limitations and Future Work</a></li>
+</ul>
+<h2 id="disclaimer">Disclaimer</h2>
+<p>Apache Crunch is an effort undergoing incubation at <a
href="http://apache.org/">The Apache Software Foundation
+(ASF)</a> sponsored by the <a href="http://incubator.apache.org/">Apache
Incubator PMC</a>.
+Incubation is required of all newly accepted projects until a further review
+indicates that the infrastructure, communications, and decision making process
+have stabilized in a manner consistent with other successful ASF projects.
+While incubation status is not necessarily a reflection of the completeness or
+stability of the code, it does indicate that the project has yet to be fully
+endorsed by the ASF.</p>
+ </div> <!-- /span -->
+
+ </div> <!-- /row-fluid -->
- <div id="copyright" class="container_16">
- <p>Copyright © 2011 The Apache Software Foundation, Licensed under
the <a href="http://www.apache.org/licenses/LICENSE-2.0">Apache License,
Version 2.0</a>.<br/>Apache and the Apache feather logo are trademarks of The
Apache Software Foundation.</p>
</div>
+
+ <hr/>
+
+ <footer>
+ <div class="container-fluid">
+ <div class="row span12">Copyright © 2012
+ <a href="http://www.apache.org/">The Apache Software Foundation</a>,
+ licensed under the <a
href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version
2.0</a>.
+ <p><small>Apache Incubator, Apache Hadoop, Hadoop, Apache, and the
+ Apache feather logo are trademarks of The Apache Software Foundation.
+ Other names appearing on the site may be trademarks of their
+ respective owners.</small></p>
+ </div>
+ </div>
+ </footer>
+
</body>
</html>
Added: websites/staging/crunch/trunk/content/crunch/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/crunch/intro.html (added)
+++ websites/staging/crunch/trunk/content/crunch/intro.html Sun Sep 16 18:50:04
2012
@@ -0,0 +1,298 @@
+<!DOCTYPE html>
+
+
+<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
+ <head>
+ <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
+ <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+ <meta http-equiv="Content-Language" content="en" />
+
+ <title>Apache Crunch - Introduction to Apache Crunch</title>
+
+ <link rel="stylesheet" href="/crunch/css/bootstrap-2.1.0.min.css" />
+ <link rel="stylesheet" href="/crunch/css/crunch.css" type="text/css">
+ <script type="text/javascript"
src="/crunch/js/bootstrap-2.1.0.min.js"></script>
+ </head>
+ <body>
+
+ <div class="navbar navbar-inverse navbar-static-top">
+
+ <div class="container-fluid">
+
+ <a class="nav pull-right brand" href="http://incubator.apache.org">
+ <img src="http://incubator.apache.org/images/egg-logo.png"
alt="Apache Incubator Logo" />
+ </a>
+
+ </div>
+
+ </div>
+
+ <ul class="breadcrumb">
+ <li>
+ <a href="/">Incubator</a>
+ <span class="divider">»</span>
+ </li>
+ <li>
+ <a href="/crunch/">Crunch</a>
+ </li>
+
+ </ul>
+
+ <div class="container-fluid">
+ <div class="row-fluid">
+
+ <!-- SIDEBAR AREA -->
+ <div class="span2">
+ <div class="sidebar-nav">
+ <ul class="nav nav-list">
+
+
+ <li class="nav-header">Apache Crunch</li>
+
+
+
+
+ <li><a href="/crunch/index.html">Overview</a></li>
+
+
+
+
+
+ <li><a href="/crunch/apidocs/">API</a></li>
+
+
+
+
+
+ <li><a
href="https://cwiki.apache.org/confluence/display/CRUNCH/">Wiki</a></li>
+
+
+
+
+ <li class="nav-header">Project</li>
+
+
+
+
+ <li><a href="/crunch/source-repository.html">Source
Code</a></li>
+
+
+
+
+
+ <li><a href="/crunch/mailing-lists.html">Mailing
Lists</a></li>
+
+
+
+
+
+ <li><a
href="http://issues.apache.org/jira/browse/CRUNCH">Issue Tracking</a></li>
+
+
+
+
+
+ <li><a
href="http://apache.org/licenses/LICENSE-2.0.html">License</a></li>
+
+
+
+ </ul>
+ </div> <!-- /well -->
+ </div> <!-- /span -->
+
+ <!-- CONTENT AREA -->
+ <div class="span10">
+ <h1 class="title">
+ Introduction to Apache Crunch
+
+ </h1>
+
+ <h2 id="build-and-installation">Build and Installation</h2>
+<p>To use Crunch you first have to build the source code using Maven and
install
+it in your local repository:</p>
+<div class="codehilite"><pre><span class="n">mvn</span> <span
class="n">clean</span> <span class="n">install</span>
+</pre></div>
+
+
+<p>This also runs the integration test suite, which takes a while.
Afterwards
+you can run the bundled example applications:</p>
+<div class="codehilite"><pre><span class="n">hadoop</span> <span
class="n">jar</span> <span class="n">examples</span><span
class="sr">/target/c</span><span class="n">runch</span><span
class="o">-</span><span class="n">examples</span><span
class="o">-*-</span><span class="n">job</span><span class="o">.</span><span
class="n">jar</span> <span class="n">org</span><span class="o">.</span><span
class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">examples</span><span class="o">.</span><span
class="n">WordCount</span> <span class="sr">&lt;inputfile&gt;</span> <span
class="sr">&lt;outputdir&gt;</span>
+</pre></div>
+
+
+<h2 id="high-level-concepts">High Level Concepts</h2>
+<h3 id="data-model-and-operators">Data Model and Operators</h3>
+<p>Crunch is centered around three interfaces that represent distributed
datasets: <code>PCollection&lt;T&gt;</code>, <code>PTable&lt;K, V&gt;</code>,
and <code>PGroupedTable&lt;K, V&gt;</code>.</p>
+<p>A <code>PCollection&lt;T&gt;</code> represents a distributed, unordered
collection of elements of type T. For example, we represent a text file in
Crunch as a
+<code>PCollection&lt;String&gt;</code> object. PCollection provides a method,
<code>parallelDo</code>, that applies a function to each element in a
PCollection in parallel,
+and returns a new PCollection as its result.</p>
+<p>A <code>PTable&lt;K, V&gt;</code> is a sub-interface of PCollection that
represents a distributed, unordered multimap of its key type K to its value
type V.
+In addition to the parallelDo operation, PTable provides a
<code>groupByKey</code> operation that aggregates all of the values in the
PTable that
+have the same key into a single record. It is the groupByKey operation that
triggers the sort phase of a MapReduce job.</p>
+<p>The result of a groupByKey operation is a <code>PGroupedTable&lt;K,
V&gt;</code> object, which is a distributed, sorted map of keys of type K to an
Iterable
+collection of values of type V. In addition to parallelDo, the PGroupedTable
provides a <code>combineValues</code> operation, which allows for
+a commutative and associative aggregation operator to be applied to the values
of the PGroupedTable instance on both the map side and the
+reduce side of a MapReduce job.</p>
+<p>Finally, PCollection, PTable, and PGroupedTable all support a
<code>union</code> operation, which takes a series of distinct PCollections and
treats
+them as a single, virtual PCollection. The union operator is required for
operations that combine multiple inputs, such as cogroups and
+joins.</p>
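+<p>To illustrate why combineValues asks for a commutative and associative
+operator, here is a minimal plain-Java sketch. It uses no Crunch classes, and
+<code>CombinerSketch</code> and its methods are hypothetical names chosen for
+this illustration. It sums the values of one key in two ways: in a single pass,
+as a reducer without a combiner would, and as per-partition partial sums that
+are summed again, as a map-side combiner followed by a reducer would.</p>

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; this is not Crunch code.
final class CombinerSketch {

  // Sum every value in one pass, as a reducer with no combiner would.
  static long sumAll(List<Long> values) {
    long total = 0L;
    for (long v : values) {
      total += v;
    }
    return total;
  }

  // Sum each partition separately (the "map side"), then sum the partial
  // results (the "reduce side").
  static long sumInTwoStages(List<List<Long>> partitions) {
    long total = 0L;
    for (List<Long> partition : partitions) {
      total += sumAll(partition);
    }
    return total;
  }

  public static void main(String[] args) {
    List<Long> values = Arrays.asList(1L, 1L, 1L, 1L, 1L);
    List<List<Long>> partitions = Arrays.asList(
        Arrays.asList(1L, 1L), Arrays.asList(1L, 1L, 1L));
    System.out.println(sumAll(values));              // prints 5
    System.out.println(sumInTwoStages(partitions));  // prints 5
  }
}
```

+<p>Because addition is commutative and associative, the two paths agree however
+the values are partitioned. An operator without those properties, such as
+subtraction, would give results that depend on the partitioning.</p>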
+<h3 id="pipeline-building-and-execution">Pipeline Building and Execution</h3>
+<p>Every Crunch pipeline starts with a <code>Pipeline</code> object that is
used to coordinate building the pipeline and executing the underlying MapReduce
+jobs. For efficiency, Crunch uses lazy evaluation, so it will only construct
MapReduce jobs from the different stages of the pipelines when
+the Pipeline object's <code>run</code> or <code>done</code> methods are
called.</p>
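+<p>The idea of lazy evaluation can be sketched in plain Java. The class below is
+a hypothetical stand-in, not the Crunch <code>Pipeline</code> interface: adding a
+stage only records it, and nothing executes until <code>run</code> is called.
+All names in it are assumptions made for this illustration.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of deferred execution; this is not Crunch code.
final class LazyPipelineSketch {
  private final List<Runnable> plannedStages = new ArrayList<Runnable>();
  private final List<String> log = new ArrayList<String>();

  // Record a stage without executing it.
  LazyPipelineSketch addStage(final String name) {
    plannedStages.add(new Runnable() {
      public void run() {
        log.add("ran " + name);
      }
    });
    return this;
  }

  // Execute every recorded stage in order, then clear the plan.
  List<String> run() {
    for (Runnable stage : plannedStages) {
      stage.run();
    }
    plannedStages.clear();
    return log;
  }

  // Expose what has actually been executed so far.
  List<String> log() {
    return log;
  }

  public static void main(String[] args) {
    LazyPipelineSketch pipeline = new LazyPipelineSketch()
        .addStage("split")
        .addStage("count");
    System.out.println(pipeline.log().size()); // prints 0
    System.out.println(pipeline.run());        // prints [ran split, ran count]
  }
}
```

+<p>Deferring the work this way gives a planner the whole graph of stages to
+look at before anything is executed.</p>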
+<h2 id="a-detailed-example">A Detailed Example</h2>
+<p>Here is the classic WordCount application using Crunch:</p>
+<div class="codehilite"><pre><span class="nb">import</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">DoFn</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">Emitter</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">PCollection</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">PTable</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">Pipeline</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span class="n">impl</span><span
class="o">.</span><span class="n">mr</span><span class="o">.</span><span
class="n">MRPipeline</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span class="n">lib</span><span
class="o">.</span><span class="n">Aggregate</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">types</span><span class="o">.</span><span
class="n">writable</span><span class="o">.</span><span
class="n">Writables</span><span class="p">;</span>
+
+<span class="n">public</span> <span class="n">class</span> <span
class="n">WordCount</span> <span class="p">{</span>
+ <span class="n">public</span> <span class="n">static</span> <span
class="n">void</span> <span class="n">main</span><span class="p">(</span><span
class="n">String</span><span class="o">[]</span> <span
class="n">args</span><span class="p">)</span> <span class="n">throws</span>
<span class="n">Exception</span> <span class="p">{</span>
+ <span class="n">Pipeline</span> <span class="n">pipeline</span> <span
class="o">=</span> <span class="k">new</span> <span
class="n">MRPipeline</span><span class="p">(</span><span
class="n">WordCount</span><span class="o">.</span><span
class="n">class</span><span class="p">);</span>
+ <span class="n">PCollection</span><span class="sr"><String></span>
<span class="n">lines</span> <span class="o">=</span> <span
class="n">pipeline</span><span class="o">.</span><span
class="n">readTextFile</span><span class="p">(</span><span
class="n">args</span><span class="p">[</span><span class="mi">0</span><span
class="p">]);</span>
+
+ <span class="n">PCollection</span><span class="sr"><String></span>
<span class="n">words</span> <span class="o">=</span> <span
class="n">lines</span><span class="o">.</span><span
class="n">parallelDo</span><span class="p">(</span><span class="s">"my
splitter"</span><span class="p">,</span> <span class="k">new</span> <span
class="n">DoFn</span><span class="o"><</span><span
class="n">String</span><span class="p">,</span> <span
class="n">String</span><span class="o">></span><span class="p">()</span>
<span class="p">{</span>
+ <span class="n">public</span> <span class="n">void</span> <span
class="n">process</span><span class="p">(</span><span class="n">String</span>
<span class="n">line</span><span class="p">,</span> <span
class="n">Emitter</span><span class="sr"><String></span> <span
class="n">emitter</span><span class="p">)</span> <span class="p">{</span>
+ <span class="k">for</span> <span class="p">(</span><span
class="n">String</span> <span class="n">word</span> <span class="p">:</span>
<span class="n">line</span><span class="o">.</span><span
class="nb">split</span><span class="p">(</span><span
class="s">"\\s+"</span><span class="p">))</span> <span
class="p">{</span>
+ <span class="n">emitter</span><span class="o">.</span><span
class="n">emit</span><span class="p">(</span><span class="n">word</span><span
class="p">);</span>
+ <span class="p">}</span>
+ <span class="p">}</span>
+ <span class="p">},</span> <span class="n">Writables</span><span
class="o">.</span><span class="n">strings</span><span class="p">());</span>
+
+ <span class="n">PTable</span><span class="o"><</span><span
class="n">String</span><span class="p">,</span> <span
class="n">Long</span><span class="o">></span> <span class="n">counts</span>
<span class="o">=</span> <span class="n">Aggregate</span><span
class="o">.</span><span class="n">count</span><span class="p">(</span><span
class="n">words</span><span class="p">);</span>
+
+ <span class="n">pipeline</span><span class="o">.</span><span
class="n">writeTextFile</span><span class="p">(</span><span
class="n">counts</span><span class="p">,</span> <span
class="n">args</span><span class="p">[</span><span class="mi">1</span><span
class="p">]);</span>
+ <span class="n">pipeline</span><span class="o">.</span><span
class="n">run</span><span class="p">();</span>
+ <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>Let's walk through the example line by line.</p>
+<h3 id="step-1-creating-a-pipeline-and-referencing-a-text-file">Step 1:
Creating a Pipeline and referencing a text file</h3>
+<p>The <code>MRPipeline</code> implementation of the Pipeline interface
compiles the individual stages of a
+pipeline into a series of MapReduce jobs. The MRPipeline constructor takes a
class argument
+that is used to tell Hadoop where to find the code that is used in the
pipeline execution.</p>
+<p>We now need to tell the Pipeline about the inputs it will be consuming. The
Pipeline interface
+defines a <code>readTextFile</code> method that takes in a String and returns
a PCollection of Strings.
+In addition to text files, Crunch supports reading data from SequenceFiles and
Avro container files,
+via the <code>SequenceFileSource</code> and <code>AvroFileSource</code>
classes defined in the org.apache.crunch.io package.</p>
+<p>Note that each PCollection is a <em>reference</em> to a source of data; no
data is actually loaded into a
+PCollection on the client machine.</p>
+<h3 id="step-2-splitting-the-lines-of-text-into-words">Step 2: Splitting the
lines of text into words</h3>
+<p>Crunch defines a small set of primitive operations that can be composed in
order to build complex data
+pipelines. The first of these primitives is the <code>parallelDo</code>
function, which applies a function (defined
+by a subclass of <code>DoFn</code>) to every record in a PCollection, and
returns a new PCollection that contains
+the results.</p>
+<p>The first argument to parallelDo is a string that is used to identify this
step in the pipeline. When
+a pipeline is compiled into a series of MapReduce jobs, it is often the case
that multiple stages will
+run within the same Mapper or Reducer. Having a string that identifies each
processing step is useful
+for debugging errors that occur in a running pipeline.</p>
+<p>The second argument to parallelDo is an anonymous subclass of DoFn. Each
DoFn subclass must override
+the <code>process</code> method, which takes in a record from the input
PCollection and an <code>Emitter</code> object that
+may have any number of output values written to it. In this case, our DoFn
splits each line into
+words, using runs of whitespace as the separator, and emits the words from the split
to the output PCollection.</p>
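+<p>One detail of splitting on the <code>\\s+</code> pattern is worth seeing in
+isolation: when a line starts with whitespace, Java's
+<code>String.split</code> returns an empty string as the first token. The
+plain-Java sketch below (the class <code>SplitSketch</code> and its empty-token
+filter are assumptions added for illustration, not part of the WordCount
+example) skips such tokens before they would be emitted.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; this is not Crunch code.
final class SplitSketch {

  // Split a line on runs of whitespace, skipping empty tokens such as the
  // one String.split produces for a line with leading whitespace.
  static List<String> words(String line) {
    List<String> result = new ArrayList<String>();
    for (String word : line.split("\\s+")) {
      if (!word.isEmpty()) {
        result.add(word);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(words("  to be  or\tnot")); // prints [to, be, or, not]
  }
}
```

+<p>Inside a DoFn, the same check would sit in front of the call to
+<code>emit</code>; whether empty tokens matter depends on the input data.</p>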
+<p>The last argument to parallelDo is an instance of the <code>PType</code>
interface, which specifies how the data
+in the output PCollection is serialized. While Crunch takes advantage of Java
Generics to provide
+compile-time type safety, the generic type information is not available at
runtime. Crunch needs to know
+how to map the records stored in each PCollection into a Hadoop-supported
serialization format in order
+to read and write data to disk. Two serialization implementations are
supported in Crunch via the
+<code>PTypeFamily</code> interface: a Writable-based system that is defined in
the org.apache.crunch.types.writable
+package, and an Avro-based system that is defined in the
org.apache.crunch.types.avro package. Each
+implementation provides convenience methods for working with the common PTypes
(Strings, longs, bytes, etc.)
+as well as utility methods for creating PTypes from existing Writable classes
or Avro schemas.</p>
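+<p>The remark about generic type information being unavailable at runtime can
+be checked with a few lines of plain Java. The sketch below uses only
+<code>java.util</code>; the class name <code>ErasureSketch</code> is a
+hypothetical name for this illustration. It shows that a list of strings and a
+list of longs share the same runtime class, so the element type has to be
+supplied some other way.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of type erasure; this is not Crunch code.
final class ErasureSketch {

  // Compare the runtime classes of two lists, ignoring their declared
  // element types (which the compiler has erased).
  static boolean sameRuntimeClass(List<?> a, List<?> b) {
    return a.getClass() == b.getClass();
  }

  public static void main(String[] args) {
    List<String> strings = new ArrayList<String>();
    List<Long> longs = new ArrayList<Long>();
    System.out.println(sameRuntimeClass(strings, longs)); // prints true
    System.out.println(strings.getClass().getName());     // prints java.util.ArrayList
  }
}
```

+<p>An explicit type descriptor passed alongside the function, which is the role
+the PType argument plays in parallelDo, carries the information the compiler
+discards.</p>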
+<h3 id="step-3-counting-the-words">Step 3: Counting the words</h3>
+<p>Out of Crunch's simple primitive operations, we can build arbitrarily
complex chains of operations in order
+to perform higher-level operations, like aggregations and joins, that can work
on any type of input data.
+Let's look at the implementation of the <code>Aggregate.count</code>
function:</p>
+<div class="codehilite"><pre><span class="nb">package</span> <span
class="n">org</span><span class="o">.</span><span class="n">apache</span><span
class="o">.</span><span class="n">crunch</span><span class="o">.</span><span
class="n">lib</span><span class="p">;</span>
+
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">CombineFn</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">MapFn</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">PCollection</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">PGroupedTable</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">PTable</span><span class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span class="n">Pair</span><span
class="p">;</span>
+<span class="nb">import</span> <span class="n">org</span><span
class="o">.</span><span class="n">apache</span><span class="o">.</span><span
class="n">crunch</span><span class="o">.</span><span
class="n">types</span><span class="o">.</span><span
class="n">PTypeFamily</span><span class="p">;</span>
+
+<span class="n">public</span> <span class="n">class</span> <span
class="n">Aggregate</span> <span class="p">{</span>
+
+ <span class="n">private</span> <span class="n">static</span> <span
class="n">class</span> <span class="n">Counter</span><span
class="sr"><S></span> <span class="n">extends</span> <span
class="n">MapFn</span><span class="o"><</span><span class="n">S</span><span
class="p">,</span> <span class="n">Pair</span><span class="o"><</span><span
class="n">S</span><span class="p">,</span> <span class="n">Long</span><span
class="o">>></span> <span class="p">{</span>
+ <span class="n">public</span> <span class="n">Pair</span><span
class="o"><</span><span class="n">S</span><span class="p">,</span> <span
class="n">Long</span><span class="o">></span> <span
class="nb">map</span><span class="p">(</span><span class="n">S</span> <span
class="n">input</span><span class="p">)</span> <span class="p">{</span>
+ <span class="k">return</span> <span class="n">Pair</span><span
class="o">.</span><span class="n">of</span><span class="p">(</span><span
class="n">input</span><span class="p">,</span> <span class="mi">1</span><span
class="n">L</span><span class="p">);</span>
+ <span class="p">}</span>
+ <span class="p">}</span>
+
+ <span class="n">public</span> <span class="n">static</span> <span
class="sr"><S></span> <span class="n">PTable</span><span
class="o"><</span><span class="n">S</span><span class="p">,</span> <span
class="n">Long</span><span class="o">></span> <span
class="n">count</span><span class="p">(</span><span
class="n">PCollection</span><span class="sr"><S></span> <span
class="n">collect</span><span class="p">)</span> <span class="p">{</span>
+ <span class="n">PTypeFamily</span> <span class="n">tf</span> <span
class="o">=</span> <span class="n">collect</span><span class="o">.</span><span
class="n">getTypeFamily</span><span class="p">();</span>
+
+ <span class="sr">//</span> <span class="n">Create</span> <span
class="n">a</span> <span class="n">PTable</span> <span class="n">from</span>
<span class="n">the</span> <span class="n">PCollection</span> <span
class="n">by</span> <span class="n">mapping</span> <span class="nb">each</span>
<span class="n">element</span>
+ <span class="sr">//</span> <span class="n">to</span> <span
class="n">a</span> <span class="n">key</span> <span class="n">of</span> <span
class="n">the</span> <span class="n">PTable</span> <span class="n">with</span>
<span class="n">the</span> <span class="n">value</span> <span
class="n">equal</span> <span class="n">to</span> <span class="mi">1</span><span
class="n">L</span>
+ <span class="n">PTable</span><span class="o"><</span><span
class="n">S</span><span class="p">,</span> <span class="n">Long</span><span
class="o">></span> <span class="n">withCounts</span> <span
class="o">=</span> <span class="n">collect</span><span class="o">.</span><span
class="n">parallelDo</span><span class="p">(</span><span
class="s">"count:"</span> <span class="o">+</span> <span
class="n">collect</span><span class="o">.</span><span
class="n">getName</span><span class="p">(),</span>
+ <span class="k">new</span> <span class="n">Counter</span><span
class="sr"><S></span><span class="p">(),</span> <span
class="n">tf</span><span class="o">.</span><span class="n">tableOf</span><span
class="p">(</span><span class="n">collect</span><span class="o">.</span><span
class="n">getPType</span><span class="p">(),</span> <span
class="n">tf</span><span class="o">.</span><span class="n">longs</span><span
class="p">()));</span>
+
+ <span class="sr">//</span> <span class="n">Group</span> <span
class="n">the</span> <span class="n">records</span> <span class="n">of</span>
<span class="n">the</span> <span class="n">PTable</span> <span
class="n">based</span> <span class="n">on</span> <span class="n">their</span>
<span class="n">key</span><span class="o">.</span>
+ <span class="n">PGroupedTable</span><span class="o"><</span><span
class="n">S</span><span class="p">,</span> <span class="n">Long</span><span
class="o">></span> <span class="n">grouped</span> <span class="o">=</span>
<span class="n">withCounts</span><span class="o">.</span><span
class="n">groupByKey</span><span class="p">();</span>
+
+ <span class="sr">//</span> <span class="n">Sum</span> <span
class="n">the</span> <span class="mi">1</span><span class="n">L</span> <span
class="nb">values</span> <span class="n">associated</span> <span
class="n">with</span> <span class="n">the</span> <span class="nb">keys</span>
<span class="n">to</span> <span class="n">get</span> <span class="n">the</span>
+ <span class="sr">//</span> <span class="n">count</span> <span
class="n">of</span> <span class="nb">each</span> <span class="n">element</span>
<span class="n">in</span> <span class="n">this</span> <span
class="n">PCollection</span><span class="p">,</span> <span
class="ow">and</span> <span class="k">return</span> <span class="n">it</span>
+ <span class="sr">//</span> <span class="n">as</span> <span
class="n">a</span> <span class="n">PTable</span> <span class="n">so</span>
<span class="n">that</span> <span class="n">it</span> <span
class="n">may</span> <span class="n">be</span> <span class="n">processed</span>
<span class="n">further</span> <span class="ow">or</span> <span
class="n">written</span>
+ <span class="sr">//</span> <span class="n">out</span> <span
class="k">for</span> <span class="n">storage</span><span class="o">.</span>
+ <span class="k">return</span> <span class="n">grouped</span><span
class="o">.</span><span class="n">combineValues</span><span
class="p">(</span><span class="n">CombineFn</span><span class="o">.</span><span
class="sr"><S></span><span class="n">SUM_LONGS</span><span
class="p">());</span>
+ <span class="p">}</span>
+<span class="p">}</span>
+</pre></div>
+
+
+<p>First, we get the PTypeFamily associated with the collection's PType. The call to
+<code>parallelDo</code> then converts each record in this PCollection into a Pair of the input
+record and the number one. It does so with <code>Counter</code>, which extends <code>MapFn</code>,
+a convenience subclass of DoFn. The <code>tableOf</code> method of the PTypeFamily specifies that
+the returned PCollection should be a PTable instance, with the key being the PType of the
+PCollection and the value being the Long implementation for this PTypeFamily.</p>
+<p>The next line features the second of Crunch's four operations, <code>groupByKey</code>. The
+groupByKey operation may only be applied to a PTable, and it returns an instance of the
+<code>PGroupedTable</code> interface, which represents the grouping of all of the values in the
+PTable that have the same key. The groupByKey operation is what triggers the reduce phase of a
+MapReduce job within Crunch.</p>
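+<p>The effect of pairing each record with <code>1L</code> and then grouping by key can be
+illustrated with a minimal in-memory sketch in plain Java. This is not the Crunch API: the class
+<code>GroupByKeySketch</code> and the method <code>groupOnes</code> are hypothetical names chosen
+for illustration, and a <code>TreeMap</code> stands in for the shuffle that a real MapReduce job
+would perform.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupByKeySketch {
    // Hypothetical helper, not part of Crunch. It pairs each element with 1L
    // (the role played by Counter above) and collects the values for each key
    // into one list (the role played by groupByKey).
    public static <S extends Comparable<S>> Map<S, List<Long>> groupOnes(List<S> input) {
        Map<S, List<Long>> grouped = new TreeMap<>();
        for (S element : input) {
            grouped.computeIfAbsent(element, k -> new ArrayList<>()).add(1L);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Prints {a=[1, 1], b=[1]}
        System.out.println(groupOnes(Arrays.asList("a", "b", "a")));
    }
}
```

+<p>Each key ends up with the list of all values that were emitted for it, which is the shape of
+data that the next operation consumes.</p>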
+<p>The last line in the function returns the output of the third of Crunch's four operations,
+<code>combineValues</code>. The combineValues operator takes as its argument a
+<code>CombineFn</code>, a specialized subclass of DoFn that operates on an implementation of
+Java's Iterable interface. Using combineValues (as opposed to parallelDo) signals to Crunch that
+the CombineFn may be used to aggregate values for the same key on the map side of a MapReduce job
+as well as on the reduce side.</p>
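+<p>Why the same function can run on both sides can be seen in a small plain-Java sketch. It
+assumes two input partitions standing in for two map tasks. The names <code>CombinerSketch</code>,
+<code>combinePartition</code> and <code>reduce</code> are hypothetical and are not Crunch classes
+or methods. Summation is associative and commutative, so summing within each partition first and
+then summing the partial results gives the same totals as summing everything at once.</p>

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {
    // Map side (hypothetical): emit 1L per record and sum the values per key
    // within a single partition, as a combiner would.
    public static Map<String, Long> combinePartition(List<String> partition) {
        Map<String, Long> partial = new TreeMap<>();
        for (String key : partition) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // Reduce side (hypothetical): merge the partial sums from every partition
    // using the same summing function.
    public static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            for (Map.Entry<String, Long> entry : partial.entrySet()) {
                totals.merge(entry.getKey(), entry.getValue(), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Long> first = combinePartition(Arrays.asList("a", "b", "a"));
        Map<String, Long> second = combinePartition(Arrays.asList("b", "a"));
        // Prints {a=3, b=2}
        System.out.println(reduce(Arrays.asList(first, second)));
    }
}
```

+<p>In this sketch only one partial sum per key leaves each partition, rather than one record per
+input element.</p>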
+<h3 id="step-4-writing-the-output-and-running-the-pipeline">Step 4: Writing
the output and running the pipeline</h3>
+<p>The Pipeline object also provides a <code>writeTextFile</code> convenience
method for indicating that a
+PCollection should be written to a text file. There are also output targets
for SequenceFiles and
+Avro container files, available in the org.apache.crunch.io package.</p>
+<p>Once you have finished constructing the pipeline and specifying its output destinations, call
+the pipeline's blocking <code>run</code> method to compile the pipeline into one or more MapReduce
+jobs and execute them.</p>
+<h2 id="more-information">More Information</h2>
+<p><a href="pipelines.html">Writing Your Own Pipelines</a></p>
+ </div> <!-- /span -->
+
+ </div> <!-- /row-fluid -->
+
+ </div>
+
+ <hr/>
+
+ <footer>
+ <div class="container-fluid">
+ <div class="row span12">Copyright © 2012
+ <a href="http://www.apache.org/">The Apache Software Foundation</a>,
+ licensed under the <a
href="http://www.apache.org/licenses/LICENSE-2.0">Apache License, Version
2.0</a>.
+ <p><small>Apache Incubator, Apache Hadoop, Hadoop, Apache, and the
+ Apache feather logo are trademarks of The Apache Software Foundation.
+ Other names appearing on the site may be trademarks of their
+ respective owners.</small></p>
+ </div>
+ </div>
+ </footer>
+
+ </body>
+</html>