Author: buildbot
Date: Mon Aug  4 17:50:29 2014
New Revision: 918395

Log:
Staging update by buildbot for crunch

Modified:
    websites/staging/crunch/trunk/content/   (props changed)
    websites/staging/crunch/trunk/content/user-guide.html

Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Aug  4 17:50:29 2014
@@ -1 +1 @@
-1602067
+1615711

Modified: websites/staging/crunch/trunk/content/user-guide.html
==============================================================================
--- websites/staging/crunch/trunk/content/user-guide.html (original)
+++ websites/staging/crunch/trunk/content/user-guide.html Mon Aug  4 17:50:29 
2014
@@ -187,7 +187,7 @@
 </ol>
 </li>
 <li><a href="#sorting">Sorting</a><ol>
-<li><a href="#stdsort">Standard and Reveserse Sorting</a></li>
+<li><a href="#stdsort">Standard and Reverse Sorting</a></li>
 <li><a href="#secsort">Secondary Sorts</a></li>
 </ol>
 </li>
@@ -308,34 +308,34 @@ top of Apache Hadoop:</p>
 into more detail about their usage in the rest of the guide.</p>
 <p><a name="datamodel"></a></p>
 <h3 id="data-model-and-operators">Data Model and Operators</h3>
-<p>Crunch's Java API is centered around three interfaces that represent 
distributed datasets: <a 
href="apidocs/0.9.0/org/apache/crunch/PCollection.html">PCollection<T></a>,
-<a 
href="http://crunch.apache.org/apidocs/0.9.0/org/apache/crunch/PTable.html";>PTable<K,
 V></a>, and <a 
href="apidocs/0.9.0/org/apache/crunch/PGroupedTable.html">PGroupedTable<K, 
V></a>.</p>
+<p>Crunch's Java API is centered around three interfaces that represent 
distributed datasets: <a 
href="apidocs/0.10.0/org/apache/crunch/PCollection.html">PCollection<T></a>,
+<a 
href="http://crunch.apache.org/apidocs/0.10.0/org/apache/crunch/PTable.html";>PTable<K,
 V></a>, and <a 
href="apidocs/0.10.0/org/apache/crunch/PGroupedTable.html">PGroupedTable<K, 
V></a>.</p>
 <p>A <code>PCollection&lt;T&gt;</code> represents a distributed, immutable 
collection of elements of type T. For example, we represent a text file as a
-<code>PCollection&lt;String&gt;</code> object. 
<code>PCollection&lt;T&gt;</code> provides a method, <em>parallelDo</em>, that 
applies a <a href="apidocs/0.9.0/org/apache/crunch/DoFn.html">DoFn<T, U></a>
+<code>PCollection&lt;String&gt;</code> object. 
<code>PCollection&lt;T&gt;</code> provides a method, <em>parallelDo</em>, that 
applies a <a href="apidocs/0.10.0/org/apache/crunch/DoFn.html">DoFn<T, U></a>
 to each element in the <code>PCollection&lt;T&gt;</code> in parallel, and 
returns a new <code>PCollection&lt;U&gt;</code> as its result.</p>
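 <p>For example, a minimal sketch (assuming the Writable type family for the
 result's PType) of a DoFn that emits the length of each input line:</p>
 <pre>
   PCollection&lt;String&gt; lines = ...;
   PCollection&lt;Integer&gt; lengths = lines.parallelDo(new DoFn&lt;String, Integer&gt;() {
     @Override
     public void process(String input, Emitter&lt;Integer&gt; emitter) {
       // emit one output value per input line
       emitter.emit(input.length());
     }
   }, Writables.ints());
 </pre>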
 <p>A <code>PTable&lt;K, V&gt;</code> is a sub-interface of 
<code>PCollection&lt;Pair&lt;K, V&gt;&gt;</code> that represents a distributed, 
unordered multimap of its key type K to its value type V.
 In addition to the parallelDo operation, PTable provides a <em>groupByKey</em> 
operation that aggregates all of the values in the PTable that
 have the same key into a single record. It is the groupByKey operation that 
triggers the sort phase of a MapReduce job. Developers can exercise
 fine-grained control over the number of reducers and the partitioning, 
grouping, and sorting strategies used during the shuffle by providing an 
instance
-of the <a 
href="apidocs/0.9.0/org/apache/crunch/GroupingOptions.html">GroupingOptions</a> 
class to the <code>groupByKey</code> function.</p>
+of the <a 
href="apidocs/0.10.0/org/apache/crunch/GroupingOptions.html">GroupingOptions</a>
 class to the <code>groupByKey</code> function.</p>
 <p>The result of a groupByKey operation is a <code>PGroupedTable&lt;K, 
V&gt;</code> object, which is a distributed, sorted map of keys of type K to an 
Iterable<V> that may
 be iterated over exactly once. In addition to <code>parallelDo</code> 
processing via DoFns, PGroupedTable provides a <em>combineValues</em> operation 
that allows a
-commutative and associative <a 
href="apidocs/0.9.0/org/apache/crunch/Aggregator.html">Aggregator<V></a> to be 
applied to the values of the PGroupedTable
+commutative and associative <a 
href="apidocs/0.10.0/org/apache/crunch/Aggregator.html">Aggregator<V></a> to be 
applied to the values of the PGroupedTable
 instance on both the map and reduce sides of the shuffle. A number of common 
<code>Aggregator&lt;V&gt;</code> implementations are provided in the
-<a href="apidocs/0.9.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> 
class.</p>
+<a href="apidocs/0.10.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> 
class.</p>
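 <p>For instance, a sketch that sums the values for each key using one of the
 built-in Aggregators:</p>
 <pre>
   PTable&lt;String, Long&gt; counts = ...;
   // combining happens on both the map and reduce sides of the shuffle
   PTable&lt;String, Long&gt; totals = counts.groupByKey().combineValues(Aggregators.SUM_LONGS());
 </pre>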
 <p>Finally, PCollection, PTable, and PGroupedTable all support a 
<em>union</em> operation, which takes a series of distinct PCollections that 
all have
 the same data type and treats them as a single virtual PCollection.</p>
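 <p>For example (a sketch):</p>
 <pre>
   PCollection&lt;String&gt; first = ...;
   PCollection&lt;String&gt; second = ...;
   // treated as a single virtual PCollection by downstream operations
   PCollection&lt;String&gt; both = first.union(second);
 </pre>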
 <p>All of the other data transformation operations supported by the Crunch 
APIs (aggregations, joins, sorts, secondary sorts, and cogrouping) are 
implemented
-in terms of these four primitives. The patterns themselves are defined in the 
<a 
href="apidocs/0.9.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
+in terms of these four primitives. The patterns themselves are defined in the 
<a 
href="apidocs/0.10.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
 package and its children, and a few of the most common patterns have 
convenience functions defined on the PCollection and PTable interfaces.</p>
-<p>Every Crunch data pipeline is coordinated by an instance of the <a 
href="apidocs/0.9.0/org/apache/crunch/Pipeline.html">Pipeline</a> interface, 
which defines
-methods for reading data into a pipeline via <a 
href="apidocs/0.9.0/org/apache/crunch/Source.html">Source<T></a> instances and 
writing data out from a
-pipeline to <a href="apidocs/0.9.0/org/apache/crunch/Target.html">Target</a> 
instances. There are currently three implementations of the Pipeline interface
+<p>Every Crunch data pipeline is coordinated by an instance of the <a 
href="apidocs/0.10.0/org/apache/crunch/Pipeline.html">Pipeline</a> interface, 
which defines
+methods for reading data into a pipeline via <a 
href="apidocs/0.10.0/org/apache/crunch/Source.html">Source<T></a> instances and 
writing data out from a
+pipeline to <a href="apidocs/0.10.0/org/apache/crunch/Target.html">Target</a> 
instances. There are currently three implementations of the Pipeline interface
 that are available for developers to use:</p>
 <ol>
-<li><a 
href="apidocs/0.9.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a>: 
Executes the pipeline as a series of MapReduce jobs.</li>
-<li><a 
href="apidocs/0.9.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a>:
 Executes the pipeline in-memory on the client.</li>
-<li><a 
href="apidocs/0.9.0/org/apache/crunch/impl/spark/SparkPipeline.html">SparkPipeline</a>:
 Executes the pipeline by converting it to a series of Spark pipelines.</li>
+<li><a 
href="apidocs/0.10.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a>: 
Executes the pipeline as a series of MapReduce jobs.</li>
+<li><a 
href="apidocs/0.10.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a>:
 Executes the pipeline in-memory on the client.</li>
+<li><a 
href="apidocs/0.10.0/org/apache/crunch/impl/spark/SparkPipeline.html">SparkPipeline</a>:
 Executes the pipeline by converting it to a series of Spark pipelines.</li>
 </ol>
 <p><a name="dataproc"></a></p>
 <h2 id="data-processing-with-dofns">Data Processing with DoFns</h2>
@@ -465,7 +465,7 @@ framework won't kill it,</li>
 </ul>
 <p>DoFns also have a number of helper methods for working with <a 
href="http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html";>Hadoop
 Counters</a>, all named <code>increment</code>. Counters are an incredibly 
useful way of keeping track of the state of long-running data pipelines and 
detecting any exceptional conditions that
 occur during processing, and they are supported in both the MapReduce-based 
and in-memory Crunch pipeline contexts. You can retrieve the value of the 
Counters
-in your client code at the end of a MapReduce pipeline by getting them from 
the <a 
href="apidocs/0.9.0/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
+in your client code at the end of a MapReduce pipeline by getting them from 
the <a 
href="apidocs/0.10.0/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
 objects returned by Crunch at the end of a run.</p>
 <ul>
 <li><code>increment(String groupName, String counterName)</code> increments 
the value of the given counter by 1.</li>
@@ -492,16 +492,16 @@ memory setting for the DoFn's needs befo
 <p><a name="mapfn"></a></p>
 <h3 id="common-dofn-patterns">Common DoFn Patterns</h3>
 <p>The Crunch APIs contain a number of useful subclasses of DoFn that handle 
common data processing scenarios and are easier
-to write and test. The top-level <a 
href="apidocs/0.9.0/org/apache/crunch/package-summary.html">org.apache.crunch</a>
 package contains three
+to write and test. The top-level <a 
href="apidocs/0.10.0/org/apache/crunch/package-summary.html">org.apache.crunch</a>
 package contains three
 of the most important specializations, which we will discuss now. Each of 
these specialized DoFn implementations has associated methods
 on the PCollection, PTable, and PGroupedTable interfaces to support common 
data processing steps.</p>
-<p>The simplest extension is the <a 
href="apidocs/0.9.0/org/apache/crunch/FilterFn.html">FilterFn<T></a> class, 
which defines a single abstract method, <code>boolean accept(T input)</code>.
+<p>The simplest extension is the <a 
href="apidocs/0.10.0/org/apache/crunch/FilterFn.html">FilterFn<T></a> class, 
which defines a single abstract method, <code>boolean accept(T input)</code>.
 The FilterFn can be applied to a <code>PCollection&lt;T&gt;</code> by calling 
the <code>filter(FilterFn&lt;T&gt; fn)</code> method, and will return a new 
<code>PCollection&lt;T&gt;</code> that only contains
 the elements of the input PCollection for which the accept method returned 
true. Note that the filter function does not include a PType argument in its
 signature, because there is no change in the data type of the PCollection when 
the FilterFn is applied. It is possible to compose new FilterFn
 instances by combining multiple FilterFns together using the <code>and</code>, 
<code>or</code>, and <code>not</code> factory methods defined in the
-<a href="apidocs/0.9.0/org/apache/crunch/fn/FilterFns.html">FilterFns</a> 
helper class.</p>
-<p>The second extension is the <a 
href="apidocs/0.9.0/org/apache/crunch/MapFn.html">MapFn<S, T></a> class, which 
defines a single abstract method, <code>T map(S input)</code>.
+<a href="apidocs/0.10.0/org/apache/crunch/fn/FilterFns.html">FilterFns</a> 
helper class.</p>
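 <p>A minimal FilterFn sketch that keeps only the non-empty strings in a
 PCollection:</p>
 <pre>
   PCollection&lt;String&gt; lines = ...;
   PCollection&lt;String&gt; nonEmpty = lines.filter(new FilterFn&lt;String&gt;() {
     @Override
     public boolean accept(String input) {
       return !input.isEmpty();
     }
   });
 </pre>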
+<p>The second extension is the <a 
href="apidocs/0.10.0/org/apache/crunch/MapFn.html">MapFn<S, T></a> class, which 
defines a single abstract method, <code>T map(S input)</code>.
 For simple transform tasks in which every input record will have exactly one 
output, it's easy to test a MapFn by verifying that a given input returns a
 given output.</p>
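 <p>A sketch of a simple MapFn that upper-cases its input:</p>
 <pre>
   PCollection&lt;String&gt; words = ...;
   PCollection&lt;String&gt; upper = words.parallelDo(new MapFn&lt;String, String&gt;() {
     @Override
     public String map(String input) {
       return input.toUpperCase();
     }
   }, Writables.strings());
 </pre>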
 <p>MapFns are also used in specialized methods on the PCollection and PTable 
interfaces. <code>PCollection&lt;V&gt;</code> defines the method
@@ -510,15 +510,15 @@ function that extracts the key (of type 
 the key be given and constructs a <code>PTableType&lt;K, V&gt;</code> from the 
given key type and the PCollection's existing value type. <code>PTable&lt;K, 
V&gt;</code>, in turn,
 has methods <code>PTable&lt;K1, V&gt; mapKeys(MapFn&lt;K, K1&gt; mapFn)</code> 
and <code>PTable&lt;K, V2&gt; mapValues(MapFn&lt;V, V2&gt;)</code> that handle 
the common case of converting
 just one of the paired values in a PTable instance from one type to another 
while leaving the other type the same.</p>
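 <p>For example, a sketch of mapValues that converts the values of a PTable to
 strings (shown here passing the target PType explicitly, following the
 type-family pattern used elsewhere in this guide):</p>
 <pre>
   PTable&lt;String, Long&gt; counts = ...;
   PTable&lt;String, String&gt; asText = counts.mapValues(new MapFn&lt;Long, String&gt;() {
     @Override
     public String map(Long input) {
       return input.toString();
     }
   }, Writables.strings());
 </pre>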
-<p>The final top-level extension to DoFn is the <a 
href="apidocs/0.9.0/org/apache/crunch/CombineFn.html">CombineFn<K, V></a> 
class, which is used in conjunction with
+<p>The final top-level extension to DoFn is the <a 
href="apidocs/0.10.0/org/apache/crunch/CombineFn.html">CombineFn<K, V></a> 
class, which is used in conjunction with
 the <code>combineValues</code> method defined on the PGroupedTable interface. 
CombineFns are used to represent the associative operations that can be applied 
using
 the MapReduce Combiner concept in order to reduce the amount of data that is 
shipped over the network during a shuffle.</p>
 <p>The CombineFn extension is different from the FilterFn and MapFn classes in 
that it does not define an abstract method for handling data
 beyond the default <code>process</code> method that any other DoFn would use; 
rather, extending the CombineFn class signals to the Crunch planner that the 
logic
 contained in this class satisfies the conditions required for use with the 
MapReduce combiner.</p>
-<p>Crunch supports many types of these associative patterns, such as sums, 
counts, and set unions, via the <a 
href="apidocs/0.9.0/org/apache/crunch/Aggregator.html">Aggregator<V></a>
+<p>Crunch supports many types of these associative patterns, such as sums, 
counts, and set unions, via the <a 
href="apidocs/0.10.0/org/apache/crunch/Aggregator.html">Aggregator<V></a>
 interface, which is defined right alongside the CombineFn class in the 
top-level <code>org.apache.crunch</code> package. There are a number of 
implementations of the Aggregator
-interface defined via static factory methods in the <a 
href="apidocs/0.9.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> 
class. We will discuss
+interface defined via static factory methods in the <a 
href="apidocs/0.10.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a> 
class. We will discuss
 Aggregators more in the section on <a href="#aggregators">common MapReduce 
patterns</a>.</p>
 <p><a name="serde"></a></p>
 <h2 id="serializing-data-with-ptypes">Serializing Data with PTypes</h2>
@@ -539,11 +539,11 @@ against an existing PCollection, <strong
   }
 </pre>
 
-<p>Crunch supports two different <em>type families</em>, which each implement 
the <a 
href="apidocs/0.9.0/org/apache/crunch/types/PTypeFamily.html">PTypeFamily</a> 
interface:
-one for Hadoop's <a 
href="apidocs/0.9.0/org/apache/crunch/types/writable/WritableTypeFamily.html">Writable
 interface</a> and another based on
-<a 
href="apidocs/0.9.0/org/apache/crunch/types/avro/AvroTypeFamily.html">Apache 
Avro</a>. There are also classes that contain static factory methods for
-each PTypeFamily to allow for easy import and usage: one for <a 
href="apidocs/0.9.0/org/apache/crunch/types/writable/Writables.html">Writables</a>
 and one for
-<a href="apidocs/0.9.0/org/apache/crunch/types/avro/Avros.html">Avros</a>.</p>
+<p>Crunch supports two different <em>type families</em>, which each implement 
the <a 
href="apidocs/0.10.0/org/apache/crunch/types/PTypeFamily.html">PTypeFamily</a> 
interface:
+one for Hadoop's <a 
href="apidocs/0.10.0/org/apache/crunch/types/writable/WritableTypeFamily.html">Writable
 interface</a> and another based on
+<a 
href="apidocs/0.10.0/org/apache/crunch/types/avro/AvroTypeFamily.html">Apache 
Avro</a>. There are also classes that contain static factory methods for
+each PTypeFamily to allow for easy import and usage: one for <a 
href="apidocs/0.10.0/org/apache/crunch/types/writable/Writables.html">Writables</a>
 and one for
+<a href="apidocs/0.10.0/org/apache/crunch/types/avro/Avros.html">Avros</a>.</p>
 <p>The two different type families exist for historical reasons: Writables 
have long been the standard form for representing serializable data in Hadoop,
 but the Avro-based serialization scheme is very compact, fast, and allows for 
complex record schemas to evolve over time. It's fine (and even encouraged)
 to mix-and-match PCollections that use different PTypes in the same Crunch 
pipeline (e.g., you could
@@ -580,7 +580,7 @@ can be used to kick off a shuffle on the
 </pre>
 
 <p>If you find yourself in a situation where you have a PCollection&lt;Pair&lt;K, 
V&gt;&gt; and you need a PTable&lt;K, V&gt;, the
-<a href="apidocs/0.9.0/org/apache/crunch/lib/PTables.html">PTables</a> library 
class has methods that will do the conversion for you.</p>
+<a href="apidocs/0.10.0/org/apache/crunch/lib/PTables.html">PTables</a> 
library class has methods that will do the conversion for you.</p>
 <p>Let's look at some more example PTypes created using the common primitive 
and collection types. For most of your pipelines,
 you will use one type family exclusively, and so you can cut down on some of 
the boilerplate in your classes by importing
 all of the methods from the <code>Writables</code> or <code>Avros</code> 
classes into your class:</p>
@@ -648,7 +648,7 @@ includes both Avro generic and specific 
   PType<Record> avroGenericType = Avros.generics(schema);
 </pre>
 
-<p>The <a 
href="apidocs/0.9.0/org/apache/crunch/types/avro/Avros.html">Avros</a> class 
also has a <code>reflects</code> method for creating PTypes
+<p>The <a 
href="apidocs/0.10.0/org/apache/crunch/types/avro/Avros.html">Avros</a> class 
also has a <code>reflects</code> method for creating PTypes
 for POJOs using Avro's reflection-based serialization mechanism. There are a 
couple of restrictions on the structure of
 the POJO:</p>
 <ol>
@@ -685,7 +685,7 @@ to query intermediate results to aid in 
 <p>The simplest way to create a new <code>PType&lt;T&gt;</code> for a data 
object is to create a <em>derived</em> PType from one of the built-in PTypes 
from the Avro
 and Writable type families. If we have a base <code>PType&lt;S&gt;</code>, we 
can create a derived <code>PType&lt;T&gt;</code> by implementing an input 
<code>MapFn&lt;S, T&gt;</code> and an
 output <code>MapFn&lt;T, S&gt;</code> and then calling 
<code>PTypeFamily.derived(Class&lt;T&gt;, MapFn&lt;S, T&gt; in, MapFn&lt;T, 
S&gt; out, PType&lt;S&gt; base)</code>, which will return
-a new <code>PType&lt;T&gt;</code>. There are examples of derived PTypes in the 
<a href="apidocs/0.9.0/org/apache/crunch/types/PTypes.html">PTypes</a> class, 
including
+a new <code>PType&lt;T&gt;</code>. There are examples of derived PTypes in the 
<a href="apidocs/0.10.0/org/apache/crunch/types/PTypes.html">PTypes</a> class, 
including
 serialization support for protocol buffers, Thrift records, Java Enums, 
BigInteger, and UUIDs. The <a 
href="https://github.com/kevinweil/elephant-bird/tree/master/crunch";>crunch 
module</a> of <a href="https://github.com/kevinweil/elephant-bird/";>Twitter's 
ElephantBird</a> project also defines PTypes for working with
 protocol buffers and Thrift records that are serialized using ElephantBird's 
<code>BinaryWritable&lt;T&gt;</code>.</p>
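 <p>As a sketch of the derived pattern (using the Writables family's derived
 factory, assuming it mirrors the PTypeFamily.derived signature given above), a
 PType that stores a java.util.UUID as its string form might look like:</p>
 <pre>
   PType&lt;UUID&gt; uuidType = Writables.derived(
       UUID.class,
       new MapFn&lt;String, UUID&gt;() {
         public UUID map(String input) { return UUID.fromString(input); }
       },
       new MapFn&lt;UUID, String&gt;() {
         public String map(UUID input) { return input.toString(); }
       },
       Writables.strings());
 </pre>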
 <p>A common pattern in MapReduce programs is to define a Writable type that 
wraps a regular Java POJO. You can use derived PTypes to make it
@@ -744,7 +744,7 @@ You use a Source in conjunction with one
       Writables.tableOf(Writables.longs(), Writables.bytes())));
 </pre>
 
-<p>Note that Sources usually require a PType to be specified when they are 
created. The <a href="apidocs/0.9.0/org/apache/crunch/io/From.html">From</a>
+<p>Note that Sources usually require a PType to be specified when they are 
created. The <a href="apidocs/0.10.0/org/apache/crunch/io/From.html">From</a>
 class provides a number of factory methods for literate Source creation:</p>
 <pre>
   // Note that we are passing a String "/user/crunch/text", not a Path.
@@ -780,28 +780,28 @@ different files using the NLineInputForm
   </tr>
   <tr>
     <td>Text</td>
-    <td><a 
href="apidocs/0.9.0/org/apache/crunch/io/text/TextFileSource.html">org.apache.crunch.io.text.TextFileSource</a></td>
+    <td><a 
href="apidocs/0.10.0/org/apache/crunch/io/text/TextFileSource.html">org.apache.crunch.io.text.TextFileSource</a></td>
     <td>PCollection&lt;String&gt;</td>
     <td>textFile</td>
     <td>Works for both TextInputFormat and AvroUtf8InputFormat</td>
   </tr>
   <tr>
     <td>Sequence</td>
-    <td><a 
href="apidocs/0.9.0/org/apache/crunch/io/seq/SeqFileTableSource.html">org.apache.crunch.io.seq.SeqFileTableSource</a></td>
+    <td><a 
href="apidocs/0.10.0/org/apache/crunch/io/seq/SeqFileTableSource.html">org.apache.crunch.io.seq.SeqFileTableSource</a></td>
     <td>PTable&lt;K, V&gt;</td>
     <td>sequenceFile</td>
-    <td>Also has a <a 
href="apidocs/0.9.0/org/apache/crunch/io/seq/SeqFileSource.html">SeqFileSource</a>
 which reads the value and ignores the key.</td>
+    <td>Also has a <a 
href="apidocs/0.10.0/org/apache/crunch/io/seq/SeqFileSource.html">SeqFileSource</a>
 which reads the value and ignores the key.</td>
   </tr>
   <tr>
     <td>Avro</td>
-    <td><a 
href="apidocs/0.9.0/org/apache/crunch/io/avro/AvroFileSource.html">org.apache.crunch.io.avro.AvroFileSource</a></td>
+    <td><a 
href="apidocs/0.10.0/org/apache/crunch/io/avro/AvroFileSource.html">org.apache.crunch.io.avro.AvroFileSource</a></td>
     <td>PCollection&lt;V&gt;</td>
     <td>avroFile</td>
     <td>No PTable analogue for Avro records.</td>
   </tr>
   <tr>
     <td>Parquet</td>
-    <td><a 
href="apidocs/0.9.0/org/apache/crunch/io/parquet/AvroParquetFileSource.html">org.apache.crunch.io.parquet.AvroParquetFileSource</a></td>
+    <td><a 
href="apidocs/0.10.0/org/apache/crunch/io/parquet/AvroParquetFileSource.html">org.apache.crunch.io.parquet.AvroParquetFileSource</a></td>
     <td>PCollection&lt;V&gt;</td>
     <td>N/A</td>
     <td>Reads Avro records from a parquet-formatted file; expects an Avro 
PType.</td>
@@ -826,7 +826,7 @@ defined on the <code>Pipeline</code> int
 </pre>
 
 <p>Just as the Source interface has the <code>From</code> class of factory 
methods, Target factory methods are defined in a class named
-<a href="apidocs/0.9.0/org/apache/crunch/io/To.html">To</a> to enable literate 
programming:</p>
+<a href="apidocs/0.10.0/org/apache/crunch/io/To.html">To</a> to enable 
literate programming:</p>
 <pre>
   lines.write(To.textFile("/user/crunch/textout"));
 </pre>
@@ -856,25 +856,25 @@ parameters that this Target needs:</p>
   </tr>
   <tr>
     <td>Text</td>
-    <td><a 
href="apidocs/0.9.0/org/apache/crunch/io/text/TextFileTarget.html">org.apache.crunch.io.text.TextFileTarget</a></td>
+    <td><a 
href="apidocs/0.10.0/org/apache/crunch/io/text/TextFileTarget.html">org.apache.crunch.io.text.TextFileTarget</a></td>
     <td>textFile</td>
     <td>Will write out the string version of whatever it's given, which should 
be text. See also: Pipeline.writeTextFile.</td>
   </tr>
   <tr>
     <td>Sequence</td>
-    <td><a 
href="apidocs/0.9.0/org/apache/crunch/io/seq/SeqFileTarget.html">org.apache.crunch.io.seq.SeqFileTarget</a></td>
+    <td><a 
href="apidocs/0.10.0/org/apache/crunch/io/seq/SeqFileTarget.html">org.apache.crunch.io.seq.SeqFileTarget</a></td>
     <td>sequenceFile</td>
     <td>Works on both PCollection and PTable.</td>
   </tr>
   <tr>
     <td>Avro</td>
-    <td><a 
href="apidocs/0.9.0/org/apache/crunch/io/avro/AvroFileTarget.html">org.apache.crunch.io.avro.AvroFileTarget</a></td>
+    <td><a 
href="apidocs/0.10.0/org/apache/crunch/io/avro/AvroFileTarget.html">org.apache.crunch.io.avro.AvroFileTarget</a></td>
     <td>avroFile</td>
     <td>Treats PTables as PCollections of Pairs.</td>
   </tr>
   <tr>
     <td>Parquet</td>
-    <td><a 
href="apidocs/0.9.0/org/apache/crunch/io/parquet/AvroParquetFileTarget.html">org.apache.crunch.io.parquet.AvroParquetFileTarget</a></td>
+    <td><a 
href="apidocs/0.10.0/org/apache/crunch/io/parquet/AvroParquetFileTarget.html">org.apache.crunch.io.parquet.AvroParquetFileTarget</a></td>
     <td>N/A</td>
     <td>Writes Avro records to parquet-formatted files; expects an Avro 
PType.</td>
   </tr>
@@ -885,13 +885,13 @@ parameters that this Target needs:</p>
 <p>The <code>SourceTarget&lt;T&gt;</code> interface extends both the 
<code>Source&lt;T&gt;</code> and <code>Target</code> interfaces and allows a 
Path to act as both a
 Target for some PCollections as well as a Source for others. SourceTargets are 
convenient for any intermediate outputs within
 your pipeline. Just as we have the factory methods in the From and To classes 
for Sources and Targets, factory methods for
-SourceTargets are declared in the <a 
href="apidocs/0.9.0/org/apache/crunch/io/At.html">At</a> class.</p>
+SourceTargets are declared in the <a 
href="apidocs/0.10.0/org/apache/crunch/io/At.html">At</a> class.</p>
 <p>In many pipeline applications, we want to control how any existing files in 
our target paths are handled by Crunch. For example,
 we might want the pipeline to fail quickly if an output path already exists, 
or we might want to delete the existing files
 and overwrite them with our new outputs. We might also want to use an output 
path as a <em>checkpoint</em> for our data pipeline.
 Checkpoints allow us to specify that a Path should be used as the starting 
location for our pipeline execution if the data
 it contains is newer than the data in the paths associated with any upstream 
inputs to that output location.</p>
-<p>Crunch supports these different output options via the <a 
href="apidocs/0.9.0/org/apache/crunch/Target.WriteMode.html">WriteMode</a> enum,
+<p>Crunch supports these different output options via the <a 
href="apidocs/0.10.0/org/apache/crunch/Target.WriteMode.html">WriteMode</a> 
enum,
 which can be passed along with a Target to the <code>write</code> method on 
either PCollection or Pipeline. Here are the supported
 WriteModes for Crunch:</p>
 <pre>
@@ -928,19 +928,19 @@ the Iterable returned by <code>Iterable&
 one of the <code>run</code> methods on the Pipeline interface that are used to 
manage overall pipeline execution. This means that you can instruct
 Crunch to materialize multiple PCollections and have them all created within a 
single Pipeline run.</p>
 <p>If you ask Crunch to materialize a PCollection that is returned from 
Pipeline's <code>PCollection&lt;T&gt; read(Source&lt;T&gt; source)</code> 
method, then no
-MapReduce job will be executed if the given Source implements the <a 
href="apidocs/0.9.0/org/apache/crunch/io/ReadableSource.html">ReadableSource</a>
+MapReduce job will be executed if the given Source implements the <a 
href="apidocs/0.10.0/org/apache/crunch/io/ReadableSource.html">ReadableSource</a>
 interface. If the Source is not readable, then a map-only job will be executed 
to map the data to a format that Crunch knows how to
 read from disk.</p>
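 <p>A sketch of materializing a PCollection on the client (the input path is a
 placeholder):</p>
 <pre>
   PCollection&lt;String&gt; lines = pipeline.read(From.textFile("/path/to/input"));
   // triggers a pipeline run (or a direct read, for a ReadableSource)
   for (String line : lines.materialize()) {
     // client-side processing
   }
 </pre>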
 <p>Sometimes, the output of a Crunch pipeline will be a single value, such as 
the number of elements in a PCollection. In other instances,
 you may want to perform some additional client-side computations on the 
materialized contents of a PCollection in a way that is
-transparent to users of your libraries. For these situations, Crunch defines a 
<a href="apidocs/0.9.0/org/apache/crunch/PObject.html">PObject<V></a>
+transparent to users of your libraries. For these situations, Crunch defines a 
<a href="apidocs/0.10.0/org/apache/crunch/PObject.html">PObject<V></a>
 interface that has an associated <code>V getValue()</code> method. 
PCollection's <code>PObject&lt;Long&gt; length()</code> method returns a 
reference to the number
 of elements contained in that PCollection, but the pipeline tasks required to 
compute this value will not run until the <code>Long getValue()</code>
 method of the returned PObject is called.</p>
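 <p>For example (a sketch):</p>
 <pre>
   PCollection&lt;String&gt; lines = ...;
   PObject&lt;Long&gt; length = lines.length();
   // the pipeline work needed to compute the count runs here
   Long count = length.getValue();
 </pre>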
 <p><a name="patterns"></a></p>
 <h2 id="data-processing-patterns-in-crunch">Data Processing Patterns in 
Crunch</h2>
 <p>This section describes the various data processing patterns implemented in 
Crunch's library APIs,
-which are in the <a 
href="apidocs/0.9.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
+which are in the <a 
href="apidocs/0.10.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
 package.</p>
 <p><a name="gbk"></a></p>
 <h3 id="groupbykey">groupByKey</h3>
@@ -955,7 +955,7 @@ explicitly provided by the developer bas
 <li><code>groupByKey(GroupingOptions options)</code>: Complex shuffle 
operations that require custom partitions
 and comparators.</li>
 </ol>
-<p>The <a 
href="apidocs/0.9.0/org/apache/crunch/GroupingOptions.html">GroupingOptions</a> 
class allows developers
+<p>The <a 
href="apidocs/0.10.0/org/apache/crunch/GroupingOptions.html">GroupingOptions</a>
 class allows developers
 to exercise precise control over how data is partitioned, sorted, and grouped 
by the underlying
 execution engine. Crunch was originally developed on top of MapReduce, and so 
the GroupingOptions APIs
 expect instances of Hadoop's <a 
href="http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Partitioner.html";>Partitioner</a>
@@ -963,7 +963,7 @@ and <a href="http://hadoop.apache.org/do
 classes in order to support partitions and sorts. That said, Crunch has 
adapters in place so that these
 same classes may also be used with other execution engines, like Apache Spark, 
without a rewrite.</p>
 <p>The GroupingOptions class is immutable; to create a new one, take advantage 
of the
-<a 
href="apidocs/0.9.0/org/apache/crunch/GroupingOptions.Builder.html">GroupingOptions.Builder</a>
 implementation.</p>
+<a 
href="apidocs/0.10.0/org/apache/crunch/GroupingOptions.Builder.html">GroupingOptions.Builder</a>
 implementation.</p>
 <pre>
   GroupingOptions opts = GroupingOptions.builder()
       .groupingComparatorClass(MyGroupingComparator.class)
@@ -985,10 +985,10 @@ pipeline.</p>
 <p>Calling one of the groupByKey methods on PTable returns an instance of the 
PGroupedTable interface.
 PGroupedTable provides a <code>combineValues</code> that can be used to signal 
to the planner that we want to perform
 associative aggregations on our data both before and after the shuffle.</p>
-<p>There are two ways to use combineValues: you can create an extension of the 
<a href="apidocs/0.9.0/org/apache/crunch/CombineFn.html">CombineFn</a>
-abstract base class, or you can use an instance of the <a 
href="apidocs/0.9.0/org/apache/crunch/Aggregator.html">Aggregator<V></a>
+<p>There are two ways to use combineValues: you can create an extension of the 
<a href="apidocs/0.10.0/org/apache/crunch/CombineFn.html">CombineFn</a>
+abstract base class, or you can use an instance of the <a 
href="apidocs/0.10.0/org/apache/crunch/Aggregator.html">Aggregator<V></a>
 interface. Of the two, an Aggregator is probably the way you want to go; 
Crunch provides a number of
-<a href="0.9.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>, and 
they are a bit easier to write and compose together.
+<a href="0.10.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>, and 
they are a bit easier to write and compose together.
 Let's walk through a few example aggregations:</p>
 <pre>
   PTable&lt;String, Double&gt; data = ...;
@@ -1029,7 +1029,7 @@ the average of a set of values:</p>
 <h3 id="simple-aggregations">Simple Aggregations</h3>
 <p>Many of the most common aggregation patterns in Crunch are provided as 
methods on the PCollection
 interface, including <code>count</code>, <code>max</code>, <code>min</code>, 
and <code>length</code>. The implementations of these methods,
-however, are in the <a 
href="apidocs/0.9.0/org/apache/crunch/lib/Aggregate.html">Aggregate</a> library 
class.
+however, are in the <a 
href="apidocs/0.10.0/org/apache/crunch/lib/Aggregate.html">Aggregate</a> 
library class.
 The methods in the Aggregate class expose some additional options that you can 
use for performing
 aggregations, such as controlling the level of parallelism for count 
operations:</p>
 <pre>
@@ -1050,9 +1050,9 @@ most frequently occuring elements, you w
 <p><a name="joins"></a></p>
 <h3 id="joining-data">Joining Data</h3>
 <p>Joins in Crunch are based on equal-valued keys in different PTables. Joins 
have also evolved
-a great deal in Crunch over the lifetime of the project. The <a 
href="apidocs/0.9.0/org/apache/crunch/lib/Join.html">Join</a>
+a great deal in Crunch over the lifetime of the project. The <a 
href="apidocs/0.10.0/org/apache/crunch/lib/Join.html">Join</a>
 API provides simple methods for performing equijoins, left joins, right joins, 
and full joins, but modern
-Crunch joins are usually performed using an explicit implementation of the <a 
href="apidocs/0.9.0/org/apache/crunch/lib/join/JoinStrategy.html">JoinStrategy</a>
+Crunch joins are usually performed using an explicit implementation of the <a 
href="apidocs/0.10.0/org/apache/crunch/lib/join/JoinStrategy.html">JoinStrategy</a>
 interface, which has support for the same rich set of joins that you can use 
in tools like Apache Hive and
 Apache Pig.</p>
 <p>All of the algorithms discussed below implement the JoinStrategy interface, 
which defines a single join method:</p>
@@ -1063,36 +1063,45 @@ Apache Pig.</p>
   PTable&lt;K, Pair&lt;V1, V2&gt;&gt; joined = strategy.join(one, two, 
JoinType);
 </pre>
 
-<p>The <a 
href="apidocs/0.9.0/org/apache/crunch/lib/join/JoinType.html">JoinType</a> enum 
determines which
+<p>The <a 
href="apidocs/0.10.0/org/apache/crunch/lib/join/JoinType.html">JoinType</a> 
enum determines which
 kind of join is applied: inner, outer, left, right, or full. In general, the 
smaller of the two
-inputs should be the left-most argument to the join method. The only exception 
to this (for unfortunate
-historical reasons that the Crunch developers deeply apologize for) is for 
mapside-joins, where the
-left-most argument should be the <em>larger</em> input.</p>
+inputs should be the left-most argument to the join method.</p>
+<p>Note that the values of the PTables you join should be non-null. The join
+algorithms in Crunch use null as a placeholder to represent that there are no 
values for
+a given key in a PCollection, so joining PTables that contain null values may 
have
+surprising results. Using a non-null dummy value in your PCollections is a 
good idea in
+general.</p>
 <p><a name="reducejoin"></a></p>
 <h4 id="reduce-side-joins">Reduce-side Joins</h4>
-<p>Reduce-side joins are handled by the <a 
href="apidocs/0.9.0/org/apache/crunch/lib/join/DefaultJoinStrategy.html">DefaultJoinStrategy</a>.
+<p>Reduce-side joins are handled by the <a 
href="apidocs/0.10.0/org/apache/crunch/lib/join/DefaultJoinStrategy.html">DefaultJoinStrategy</a>.
 Reduce-side joins are the simplest and most robust kind of joins in Hadoop; 
the keys from the two inputs are
 shuffled together to the reducers, where the values from the smaller of the 
two collections are collected and then
 streamed over the values from the larger of the two collections. You can 
control the number of reducers that is used
 to perform the join by passing an integer argument to the DefaultJoinStrategy 
constructor.</p>
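 <p>A sketch (per the guidance above, the smaller input is the left-most
 argument):</p>
 <pre>
   JoinStrategy&lt;String, Long, Long&gt; strategy = new DefaultJoinStrategy&lt;String, Long, Long&gt;();
   PTable&lt;String, Pair&lt;Long, Long&gt;&gt; joined = strategy.join(smaller, larger, JoinType.INNER_JOIN);
 </pre>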
 <p><a name="mapjoin"></a></p>
 <h4 id="map-side-joins">Map-side Joins</h4>
-<p>Map-side joins are handled by the <a 
href="apidocs/0.9.0/org/apache/crunch/lib/join/MapsideJoinStrategy.html">MapsideJoinStrategy</a>.
+<p>Map-side joins are handled by the <a 
href="apidocs/0.10.0/org/apache/crunch/lib/join/MapsideJoinStrategy.html">MapsideJoinStrategy</a>.
 Map-side joins require that the smaller of the two input tables is loaded into 
memory on the tasks on the cluster, so
 there is a requirement that at least one of the tables be relatively small so 
that it can comfortably fit into memory within
-each task. <em>Remember, the MapsideJoinStrategy is the only JoinStrategy 
implementation where the left-most argument should
-be larger than the right-most one.</em></p>
+each task.</p>
+<p>For a long time, the MapsideJoinStrategy differed from the rest of the 
JoinStrategy
+implementations in that the left-most argument was intended to be larger than 
the right-side
+one, since the right-side PTable was loaded into memory. Since Crunch 
0.10.0/0.8.3, we
+have deprecated the old MapsideJoinStrategy constructor which had the sizes 
reversed and
+recommend that you use the <code>MapsideJoinStrategy.create()</code> factory 
method, which returns an
+implementation of the MapsideJoinStrategy in which the left-side PTable is 
loaded into
+memory instead of the right-side PTable.</p>
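 <p>A sketch using the factory method (so the smaller, in-memory table is the
 left-most argument):</p>
 <pre>
   JoinStrategy&lt;String, Long, Long&gt; mapside = MapsideJoinStrategy.&lt;String, Long, Long&gt;create();
   PTable&lt;String, Pair&lt;Long, Long&gt;&gt; joined = mapside.join(smallTable, largeTable, JoinType.INNER_JOIN);
 </pre>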
 <p><a name="shardedjoin"></a></p>
 <h4 id="sharded-joins">Sharded Joins</h4>
 <p>Many distributed joins have skewed data that can cause regular reduce-side 
joins to fail due to out-of-memory issues on
 the partitions that happen to contain the keys with highest cardinality. To 
handle these skew issues, Crunch has the
-<a 
href="apidocs/0.9.0/org/apache/crunch/lib/join/ShardedJoinStrategy.html">ShardedJoinStrategy</a>
 that allows developers to shard
+<a 
href="apidocs/0.10.0/org/apache/crunch/lib/join/ShardedJoinStrategy.html">ShardedJoinStrategy</a>
 that allows developers to shard
 each key to multiple reducers, which prevents a few reducers from getting 
overloaded with the values from the skewed keys
 in exchange for sending more data over the wire. For problems with significant 
skew issues, the ShardedJoinStrategy can
 significantly improve performance.</p>
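 <p>A sketch (assuming the integer constructor argument controls how many shards
 each key is split across):</p>
 <pre>
   JoinStrategy&lt;String, Long, Long&gt; sharded = new ShardedJoinStrategy&lt;String, Long, Long&gt;(4);
   PTable&lt;String, Pair&lt;Long, Long&gt;&gt; joined = sharded.join(left, right, JoinType.INNER_JOIN);
 </pre>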
 <p><a name="bloomjoin"></a></p>
 <h4 id="bloom-filter-joins">Bloom Filter Joins</h4>
-<p>Last but not least, the <a 
href="apidocs/0.9.0/org/apache/crunch/lib/join/BloomFilterJoinStrategy.html">BloomFilterJoinStrategy</a>
 builds
+<p>Last but not least, the <a 
href="apidocs/0.10.0/org/apache/crunch/lib/join/BloomFilterJoinStrategy.html">BloomFilterJoinStrategy</a>
 builds
 a <a href="http://en.wikipedia.org/wiki/Bloom_filter";>bloom filter</a> on the 
left-hand side table that is used to filter the contents
 of the right-hand side table to eliminate entries from the (larger) right-hand 
side table that have no hope of being joined
 to values in the left-hand side table. This is useful in situations in which 
the left-hand side table is too large to fit
@@ -1104,7 +1113,7 @@ vast majority of the keys in the right-h
 For example, we might want to join two datasets
 together and only emit a record if each of the sets had at least two distinct 
values associated
 with each key. For arbitrary complex join logic, we can always fall back to the
-<a href="apidocs/0.9.0/org/apache/crunch/lib/Cogroup.html">Cogroup</a> API, 
which takes in an arbitrary number
+<a href="apidocs/0.10.0/org/apache/crunch/lib/Cogroup.html">Cogroup</a> API, 
which takes in an arbitrary number
 of PTable instances that all have the same key type and combines them together 
into a single
 PTable whose values are made up of Collections of the values from each of the 
input PTables.</p>
 <pre>
@@ -1130,7 +1139,7 @@ Crunch APIs have a number of utilities f
 more advanced patterns like secondary sorts.</p>
 <p><a name="stdsort"></a></p>
 <h4 id="standard-and-reverse-sorting">Standard and Reverse Sorting</h4>
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/lib/Sort.html">Sort</a> API 
methods contain utility functions
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sort.html">Sort</a> API 
methods contain utility functions
 for sorting the contents of PCollections and PTables whose contents implement 
the <code>Comparable</code>
 interface. By default, MapReduce does not perform total sorts on its keys 
during a shuffle; instead
 a sort is done locally on each of the partitions of the data that are sent to 
each reducer. Doing
@@ -1151,7 +1160,7 @@ total order partitioner and sorting cont
 
 <p>For more complex PCollections or PTables that are made up of Tuples (Pairs, 
Tuple3, etc.), we can
 specify which columns of the Tuple should be used for sorting the contents, 
and in which order, using
-the <a 
href="apidocs/0.9.0/org/apache/crunch/lib/Sort.ColumnOrder.html">ColumnOrder</a>
 class:</p>
+the <a 
href="apidocs/0.10.0/org/apache/crunch/lib/Sort.ColumnOrder.html">ColumnOrder</a>
 class:</p>
 <pre>
   PTable&lt;String, Long&gt; table = ...;
   // Sorted by value, instead of key -- remember, a PTable is a PCollection of 
Pairs.
@@ -1162,7 +1171,7 @@ the <a href="apidocs/0.9.0/org/apache/cr
 <h4 id="secondary-sorts">Secondary Sorts</h4>
 <p>Another pattern that occurs frequently in distributed processing is 
<em>secondary sorts</em>, where we
 want to group a set of records by one key and sort the records within each 
group by a second key.
-The <a 
href="apidocs/0.9.0/org/apache/crunch/lib/SecondarySort.html">SecondarySort</a> 
API provides a set
+The <a 
href="apidocs/0.10.0/org/apache/crunch/lib/SecondarySort.html">SecondarySort</a>
 API provides a set
 of <code>sortAndApply</code> methods that can be used on input PTables of the 
form <code>PTable&lt;K, Pair&lt;K2, V&gt;&gt;</code>,
 where <code>K</code> is the primary grouping key and <code>K2</code> is the 
secondary grouping key. The <code>sortAndApply</code>
 method will perform the grouping and sorting and will then apply a given DoFn 
to process the
@@ -1177,7 +1186,7 @@ techniques throughout its library APIs.<
 one of the datasets to be small enough to fit into memory, and then do a pass 
over the larger data
 set where we emit an element of the smaller data set along with each element 
from the larger set.</p>
 <p>When this pattern isn't possible but we still need to take the cartesian 
product, we have some options,
-but they're fairly expensive. Crunch's <a 
href="apidocs/0.9.0/org/apache/crunch/lib/Cartesian.html">Cartesian</a> API
+but they're fairly expensive. Crunch's <a 
href="apidocs/0.10.0/org/apache/crunch/lib/Cartesian.html">Cartesian</a> API
 provides methods for a reduce-side full cross product between two PCollections 
(or PTables). Note that
 this is a pretty expensive operation, and you should go out of your way to 
avoid these kinds of processing
 steps in your pipelines.</p>
@@ -1185,7 +1194,7 @@ steps in your pipelines.</p>
 <h4 id="coalescing">Coalescing</h4>
 <p>Many MapReduce jobs have the potential to generate a large number of small 
files that could be used more
 effectively by clients if they were all merged together into a small number of 
large files. The
-<a href="apidocs/0.9.0/org/apache/crunch/lib/Shard.html">Shard</a> API 
provides a single method, <code>shard</code>, that allows
+<a href="apidocs/0.10.0/org/apache/crunch/lib/Shard.html">Shard</a> API 
provides a single method, <code>shard</code>, that allows
 you to coalesce a given PCollection into a fixed number of partitions:</p>
 <pre>
   PCollection&lt;Long&gt; data = ...;
@@ -1196,7 +1205,7 @@ you to coalesce a given PCollection into
 partitions. This is often a useful step at the end of a long pipeline run.</p>
 <p><a name="distinct"></a></p>
 <h4 id="distinct">Distinct</h4>
-<p>Crunch's <a 
href="apidocs/0.9.0/org/apache/crunch/lib/Distinct.html">Distinct</a> API has a 
method, <code>distinct</code>, that
+<p>Crunch's <a 
href="apidocs/0.10.0/org/apache/crunch/lib/Distinct.html">Distinct</a> API has 
a method, <code>distinct</code>, that
 returns one copy of each unique element in a given PCollection:</p>
 <pre>
   PCollection&lt;Long&gt; data = ...;
@@ -1218,7 +1227,7 @@ value for your own pipelines. The optima
 thus the amount of memory they consume) and the number of unique elements in 
the data.</p>
 <p><a name="sampling"></a></p>
 <h4 id="sampling">Sampling</h4>
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/lib/Sample.html">Sample</a> 
API provides methods for two sorts of PCollection
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Sample.html">Sample</a> 
API provides methods for two sorts of PCollection
 sampling: random and reservoir.</p>
 <p>Random sampling is where you include each record in the sample with a fixed 
probability, and is probably what you're
 used to when you think of sampling from a collection:</p>
@@ -1244,13 +1253,13 @@ random number generators. Note that all 
 only require a single pass over the data.</p>
 <p><a name="sets"></a></p>
 <h4 id="set-operations">Set Operations</h4>
-<p>The <a href="apidocs/0.9.0/org/apache/crunch/lib/Set.html">Set</a> API 
methods complement Crunch's built-in <code>union</code> methods and
+<p>The <a href="apidocs/0.10.0/org/apache/crunch/lib/Set.html">Set</a> API 
methods complement Crunch's built-in <code>union</code> methods and
 provide support for finding the intersection, the difference, or the <a 
href="http://en.wikipedia.org/wiki/Comm";>comm</a> of two PCollections.</p>
 <p><a name="splits"></a></p>
 <h4 id="splits">Splits</h4>
 <p>Sometimes, you want to write two different outputs from the same DoFn into 
different PCollections. An example of this would
 be a pipeline in which you wanted to write good records to one file and bad or 
corrupted records to a different file for
-further examination. The <a 
href="apidocs/0.9.0/org/apache/crunch/lib/Channels.html">Channels</a> class 
provides a method that allows
+further examination. The <a 
href="apidocs/0.10.0/org/apache/crunch/lib/Channels.html">Channels</a> class 
provides a method that allows
 you to split an input PCollection of Pairs into a Pair of PCollections:</p>
 <pre>
   PCollection&lt;Pair&lt;L, R&gt;&gt; in = ...;
@@ -1320,31 +1329,31 @@ the maximum value encountered would be i
 flexible schemas for PCollections and PTables, you can write pipelines that 
operate directly on HBase API classes like
 <code>Put</code>, <code>KeyValue</code>, and <code>Result</code>.</p>
 <p>Be sure that the version of Crunch that you're using is compatible with the 
version of HBase that you are running. The 0.8.x
-Crunch versions and earlier ones are developed against HBase 0.94.x, while 
version 0.9.0 and after are developed against
+Crunch versions and earlier ones are developed against HBase 0.94.x, while 
version 0.10.0 and after are developed against
 HBase 0.96. There were a small number of backwards-incompatible changes made 
between HBase 0.94 and 0.96 that are reflected
 in the Crunch APIs for working with HBase. The most important of these is that 
in HBase 0.96, HBase's <code>Put</code>, <code>KeyValue</code>, and 
<code>Result</code>
-classes no longer implement the Writable interface. To support working with 
these types in Crunch 0.9.0, we added the
-<a 
href="apidocs/0.9.0/org/apache/crunch/io/hbase/HBaseTypes.html">HBaseTypes</a> 
class that has factory methods for creating PTypes that serialize the HBase 
client classes to bytes so
+classes no longer implement the Writable interface. To support working with 
these types in Crunch 0.10.0, we added the
+<a 
href="apidocs/0.10.0/org/apache/crunch/io/hbase/HBaseTypes.html">HBaseTypes</a> 
class that has factory methods for creating PTypes that serialize the HBase 
client classes to bytes so
 that they can still be used as part of MapReduce pipelines.</p>
-<p>Crunch supports working with HBase data in two ways. The <a 
href="apidocs/0.9.0/org/apache/crunch/io/hbase/HBaseSourceTarget.html">HBaseSourceTarget</a>
 and <a 
href="apidocs/0.9.0/org/apache/crunch/io/hbase/HBaseTarget.html">HBaseTarget</a>
 classes support reading and
-writing data to HBase tables directly. The <a 
href="apidocs/0.9.0/org/apache/crunch/io/hbase/HFileSource.html">HFileSource</a>
 and <a 
href="apidocs/0.9.0/org/apache/crunch/io/hbase/HFileTarget.html">HFileTarget</a>
 classes support reading and writing data
+<p>Crunch supports working with HBase data in two ways. The <a 
href="apidocs/0.10.0/org/apache/crunch/io/hbase/HBaseSourceTarget.html">HBaseSourceTarget</a>
 and <a 
href="apidocs/0.10.0/org/apache/crunch/io/hbase/HBaseTarget.html">HBaseTarget</a>
 classes support reading and
+writing data to HBase tables directly. The <a 
href="apidocs/0.10.0/org/apache/crunch/io/hbase/HFileSource.html">HFileSource</a>
 and <a 
href="apidocs/0.10.0/org/apache/crunch/io/hbase/HFileTarget.html">HFileTarget</a>
 classes support reading and writing data
 to hfiles, which are the underlying file format for HBase. HFileSource and 
HFileTarget can be used to read and write data to
 hfiles directly, which is much faster than going through the HBase APIs and 
can be used to perform efficient bulk loading of data
-into HBase tables. See the utility methods in the <a 
href="apidocs/0.9.0/org/apache/crunch/io/hbase/HFileUtils.html">HFileUtils</a> 
class for
+into HBase tables. See the utility methods in the <a 
href="apidocs/0.10.0/org/apache/crunch/io/hbase/HFileUtils.html">HFileUtils</a> 
class for
 more details on how to work with PCollections against hfiles.</p>
 <p><a name="exec"></a></p>
 <h2 id="managing-pipeline-execution">Managing Pipeline Execution</h2>
 <p>Crunch uses a lazy execution model. No jobs are run or outputs created 
until the user explicitly invokes one of the methods on the
 Pipeline interface that controls job planning and execution. The simplest of 
these methods is the <code>PipelineResult run()</code> method,
 which analyzes the current graph of PCollections and Target outputs and comes 
up with a plan to ensure that each of the outputs is
-created and then executes it, returning only when the jobs are completed. The 
<a href="apidocs/0.9.0/org/apache/crunch/PipelineResult.html">PipelineResult</a>
+created and then executes it, returning only when the jobs are completed. The 
<a 
href="apidocs/0.10.0/org/apache/crunch/PipelineResult.html">PipelineResult</a>
 returned by the <code>run</code> method contains information about what was 
run, including the number of jobs that were executed during the
-pipeline run and the values of the Hadoop Counters for each of those stages 
via the <a 
href="apidocs/0.9.0/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
 component classes.</p>
+pipeline run and the values of the Hadoop Counters for each of those stages 
via the <a 
href="apidocs/0.10.0/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
 component classes.</p>
 <p>The last method that should be called in <em>any</em> Crunch pipeline run 
is the Pipeline interface's <code>PipelineResult done()</code> method. The done 
method will
 ensure that any remaining outputs that have not yet been created are executed 
via the <code>run</code>, and it will clean up the temporary directories that
 Crunch creates during runs to hold serialized job information and intermediate 
outputs.</p>
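 <p>A typical skeleton (the class name and configuration accessor are
 placeholders):</p>
 <pre>
   Pipeline pipeline = new MRPipeline(MyApp.class, getConf());
   // ... define reads, transforms, and writes ...
   PipelineResult result = pipeline.done();
   if (!result.succeeded()) {
     // handle the failed run
   }
 </pre>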
 <p>Crunch also allows developers to exercise finer-grained control over 
pipeline execution via Pipeline's <code>PipelineExecution runAsync()</code> 
method.
-The <code>runAsync</code> method is a non-blocking version of the 
<code>run</code> method that returns a <a 
href="apidocs/0.9.0/org/apache/crunch/PipelineExecution.html">PipelineExecution</a>
 instance that can be used to monitor the currently running Crunch pipeline. 
The PipelineExecution object is also useful for debugging
+The <code>runAsync</code> method is a non-blocking version of the 
<code>run</code> method that returns a <a 
href="apidocs/0.10.0/org/apache/crunch/PipelineExecution.html">PipelineExecution</a>
 instance that can be used to monitor the currently running Crunch pipeline. 
The PipelineExecution object is also useful for debugging
 Crunch pipelines by visualizing the Crunch execution plan in DOT format via 
its <code>String getPlanDotFile()</code> method. PipelineExecution implements
 Guava's <a 
href="https://code.google.com/p/guava-libraries/wiki/ListenableFutureExplained";>ListenableFuture</a>,
 so you can attach handlers that will be
 called when your pipeline finishes executing.</p>
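 <p>A sketch of asynchronous execution (waitUntilDone blocks until the run
 completes):</p>
 <pre>
   PipelineExecution exec = pipeline.runAsync();
   System.out.println(exec.getPlanDotFile()); // visualize the plan in DOT format
   exec.waitUntilDone();
 </pre>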
@@ -1360,7 +1369,7 @@ execution pipelines in a way that is exp
 the different execution engines.</p>
 <p><a name="mrpipeline"></a></p>
 <h3 id="mrpipeline">MRPipeline</h3>
-<p>The <a 
href="apidocs/0.9.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a> 
is the oldest implementation of the Pipeline interface and
+<p>The <a 
href="apidocs/0.10.0/org/apache/crunch/impl/mr/MRPipeline.html">MRPipeline</a> 
is the oldest implementation of the Pipeline interface and
 compiles and executes the DAG of PCollections into a series of MapReduce jobs. 
MRPipeline has three constructors that are commonly
 used:</p>
 <ol>
@@ -1420,7 +1429,7 @@ aware of:</p>
 
 <p><a name="sparkpipeline"></a></p>
 <h3 id="sparkpipeline">SparkPipeline</h3>
-<p>The <code>SparkPipeline</code> is the newest implementation of the Pipeline 
interface, and was added in Crunch 0.9.0. It has two default constructors:</p>
+<p>The <code>SparkPipeline</code> is the newest implementation of the Pipeline 
interface, and was added in Crunch 0.10.0. It has two default constructors:</p>
 <ol>
 <li><code>SparkPipeline(String sparkConnection, String appName)</code>, which 
takes a Spark connection string of the form 
<code>local[numThreads]</code> for
 local mode or <code>master:port</code> for a Spark cluster. This constructor 
will create its own <code>JavaSparkContext</code> instance to control the Spark 
pipeline
@@ -1446,7 +1455,7 @@ be a little rough around the edges and m
 actively working to ensure complete compatibility between the two 
implementations.</p>
 <p><a name="mempipeline"></a></p>
 <h3 id="mempipeline">MemPipeline</h3>
-<p>The <a 
href="apidocs/0.9.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a>
 implementation of Pipeline has a few interesting
+<p>The <a 
href="apidocs/0.10.0/org/apache/crunch/impl/mem/MemPipeline.html">MemPipeline</a>
 implementation of Pipeline has a few interesting
 properties. First, unlike MRPipeline, MemPipeline is a singleton; you don't 
create a MemPipeline, you just get a reference to it
 via the static <code>MemPipeline.getInstance()</code> method. Second, all of 
the operations in the MemPipeline are executed completely in-memory;
 there is no serialization of data to disk by default, and PType usage is 
fairly minimal. This has both benefits and drawbacks; on
@@ -1483,9 +1492,9 @@ without writing them out to disk.</p>
 interface has several tools to help developers create effective unit tests, 
which will be detailed in this section.</p>
 <h3 id="unit-testing-dofns">Unit Testing DoFns</h3>
 <p>Many of the DoFn implementations, such as <code>MapFn</code> and 
<code>FilterFn</code>, are very easy to test, since they accept a single input
-and return a single output. For general purpose DoFns, we need an instance of 
the <a href="apidocs/0.9.0/org/apache/crunch/Emitter.html">Emitter</a>
+and return a single output. For general purpose DoFns, we need an instance of 
the <a href="apidocs/0.10.0/org/apache/crunch/Emitter.html">Emitter</a>
 interface that we can pass to the DoFn's <code>process</code> method and then 
read in the values that are written by the function. Support
-for this pattern is provided by the <a 
href="apidocs/0.9.0/org/apache/crunch/impl/mem/emit/InMemoryEmitter.html">InMemoryEmitter</a>
 class, which
+for this pattern is provided by the <a 
href="apidocs/0.10.0/org/apache/crunch/impl/mem/emit/InMemoryEmitter.html">InMemoryEmitter</a>
 class, which
 has a <code>List&lt;T&gt; getOutput()</code> method that can be used to read 
the values that were passed to the Emitter instance by a DoFn instance:</p>
 <div class="codehilite"><pre><span class="p">@</span><span 
class="n">Test</span>
 <span class="n">public</span> <span class="n">void</span> <span 
class="n">testToUpperCaseFn</span><span class="p">()</span> <span 
class="p">{</span>

