Author: buildbot
Date: Mon Nov 25 05:35:13 2013
New Revision: 887983
Log:
Staging update by buildbot for crunch
Modified:
websites/staging/crunch/trunk/content/ (props changed)
websites/staging/crunch/trunk/content/intro.html
Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Nov 25 05:35:13 2013
@@ -1 +1 @@
-1545153
+1545156
Modified: websites/staging/crunch/trunk/content/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/intro.html (original)
+++ websites/staging/crunch/trunk/content/intro.html Mon Nov 25 05:35:13 2013
@@ -283,25 +283,25 @@ Collections Classes like <code>java.util
pipeline into the client and make decisions based on that data allows us to
create sophisticated analytical
applications that can modify their downstream processing based on the results
of upstream computations.</p>
<h3 id="data-model-and-operators">Data Model and Operators</h3>
-<p>The Java API is centered around three interfaces that represent distributed
datasets: <a
href="apidocs/current/org/apache/crunch/PCollection.html">PCollection<T></a>,
-<a
href="http://crunch.apache.org/apidocs/current/org/apache/crunch/PTable.html">PTable<K,
V></a>, and <a
href="apidocs/current/org/apache/crunch/PGroupedTable.html">PGroupedTable<K,
V></a>.</p>
+<p>The Java API is centered around three interfaces that represent distributed
datasets: <a
href="apidocs/0.8.0/org/apache/crunch/PCollection.html">PCollection<T></a>,
+<a
href="http://crunch.apache.org/apidocs/0.8.0/org/apache/crunch/PTable.html">PTable<K,
V></a>, and <a
href="apidocs/0.8.0/org/apache/crunch/PGroupedTable.html">PGroupedTable<K,
V></a>.</p>
<p>A <code>PCollection<T></code> represents a distributed, unordered
collection of elements of type T. For example, we represent a text file as a
-<code>PCollection<String></code> object.
<code>PCollection<T></code> provides a method, <code>parallelDo</code>,
that applies a <a href="apidocs/current/org/apache/crunch/DoFn.html">DoFn<T,
U></a>
+<code>PCollection<String></code> object.
<code>PCollection<T></code> provides a method, <code>parallelDo</code>,
that applies a <a href="apidocs/0.8.0/org/apache/crunch/DoFn.html">DoFn<T,
U></a>
to each element in the <code>PCollection<T></code> in parallel, and
returns a new <code>PCollection<U></code> as its result.</p>
<p>A <code>PTable<K, V></code> is a sub-interface of
<code>PCollection<Pair<K, V>></code> that represents a distributed,
unordered multimap of its key type K to its value type V.
In addition to the parallelDo operation, PTable provides a
<code>groupByKey</code> operation that aggregates all of the values in the
PTable that
have the same key into a single record. It is the groupByKey operation that
triggers the sort phase of a MapReduce job. Developers can exercise
fine-grained control over the number of reducers and the partitioning,
grouping, and sorting strategies used during the shuffle by providing an
instance
-of the <a
href="apidocs/current/org/apache/crunch/GroupingOptions.html">GroupingOptions</a>
class to the <code>groupByKey</code> function.</p>
+of the <a
href="apidocs/0.8.0/org/apache/crunch/GroupingOptions.html">GroupingOptions</a>
class to the <code>groupByKey</code> function.</p>
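As a rough illustration of the grouping described above, the following plain-Java sketch collects every value that shares a key into a single record, ordered by key. It deliberately does not use the Crunch classes: `GroupByKeySketch` and its `groupByKey` helper are hypothetical names for an in-memory stand-in, and a real pipeline would perform this step during the MapReduce shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class GroupByKeySketch {
    // In-memory stand-in for a grouping step: one sorted record per key,
    // holding all of the values that were paired with that key.
    static <K extends Comparable<K>, V> SortedMap<K, List<V>> groupByKey(
            List<Map.Entry<K, V>> pairs) {
        SortedMap<K, List<V>> grouped = new TreeMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // A multimap: the key "b" appears twice.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("b", 1));
        pairs.add(new SimpleEntry<>("a", 2));
        pairs.add(new SimpleEntry<>("b", 3));

        System.out.println(groupByKey(pairs)); // prints {a=[2], b=[1, 3]}
    }
}
```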
<p>The result of a groupByKey operation is a <code>PGroupedTable<K,
V></code> object, which is a distributed, sorted map of keys of type K to an
Iterable<V> that may
be iterated over exactly once. In addition to <code>parallelDo</code>
processing via DoFns, PGroupedTable provides a <code>combineValues</code>
operation that allows a
-commutative and associative <a
href="apidocs/current/org/apache/crunch/Aggregator.html">Aggregator<V></a> to
be applied to the values of the PGroupedTable
+commutative and associative <a
href="apidocs/0.8.0/org/apache/crunch/Aggregator.html">Aggregator<V></a> to be
applied to the values of the PGroupedTable
instance on both the map and reduce sides of the shuffle. A number of common
<code>Aggregator<V></code> implementations are provided in the
-<a
href="apidocs/current/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
+<a href="apidocs/0.8.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
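To see why the operation has to be commutative and associative, consider this minimal plain-Java sketch. It assumes a sum over longs and models the map side as one partial result per partition; `CombinerSketch`, `combine`, and `combineInTwoStages` are hypothetical names and are not part of the Crunch API. Because addition is associative and commutative, combining the partials gives the same answer as a single pass over all of the values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

class CombinerSketch {
    // Fold a list of values with the given operator (a stand-in for an Aggregator).
    static long combine(List<Long> values, BinaryOperator<Long> op, long identity) {
        long acc = identity;
        for (long v : values) {
            acc = op.apply(acc, v);
        }
        return acc;
    }

    // "Map side": one partial result per partition.
    // "Reduce side": the partial results are combined again with the same operator.
    static long combineInTwoStages(List<List<Long>> partitions,
                                   BinaryOperator<Long> op, long identity) {
        List<Long> partials = new ArrayList<>();
        for (List<Long> partition : partitions) {
            partials.add(combine(partition, op, identity));
        }
        return combine(partials, op, identity);
    }

    public static void main(String[] args) {
        List<List<Long>> partitions = Arrays.asList(
                Arrays.asList(1L, 2L, 3L),
                Arrays.asList(4L, 5L),
                Arrays.asList(6L));
        List<Long> all = Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L);

        long twoStage = combineInTwoStages(partitions, Long::sum, 0L);
        long singlePass = combine(all, Long::sum, 0L);
        System.out.println(twoStage + " " + singlePass); // prints 21 21
    }
}
```

An operation such as subtraction would not survive this regrouping, which is why it cannot be applied on both sides of the shuffle.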
<p>Finally, PCollection, PTable, and PGroupedTable all support a
<code>union</code> operation, which takes a series of distinct PCollections
that all have
the same data type and treats them as a single virtual PCollection.</p>
<p>All of the other data transformation operations supported by the Crunch
APIs (aggregations, joins, sorts, secondary sorts, and cogrouping) are
implemented
-in terms of these four primitives. The patterns themselves are defined in the
<a
href="apidocs/current/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
+in terms of these four primitives. The patterns themselves are defined in the
<a
href="apidocs/0.8.0/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
package and its children, and a few of the most common patterns have
convenience functions defined on the PCollection and PTable interfaces.</p>
<h3 id="writing-dofns">Writing DoFns</h3>
<p>DoFns represent the logical computations of your Crunch pipelines. They are
designed to be easy to write, easy to test, and easy to deploy
@@ -343,7 +343,7 @@ framework won't kill it,</li>
</ul>
<p>Crunch provides a number of helper methods for working with <a
href="http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html">Hadoop
Counters</a>, all named <code>increment</code>. Counters are an incredibly
useful way of keeping track of the state of long running data pipelines and
detecting any exceptional conditions that
occur during processing, and they are supported in both the MapReduce-based
and in-memory Crunch pipeline contexts. You can retrieve the value of the
Counters
-in your client code at the end of a MapReduce pipeline by getting them from
the <a
href="apidocs/current/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
+in your client code at the end of a MapReduce pipeline by getting them from
the <a
href="apidocs/0.8.0/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
objects returned by Crunch at the end of a run.</p>
<p>(Note that there was a change in the Counters API from Hadoop 1.0 to Hadoop
2.0, and thus we do not recommend that you work with the
Counter classes directly in your Crunch pipelines (the two
<code>getCounter</code> methods that were defined in DoFn are both deprecated)
so that you will not be
@@ -362,16 +362,16 @@ will require extra memory settings to ru
memory setting for the DoFn's needs before the job was launched on the
cluster.</p>
<h4 id="common-dofn-patterns">Common DoFn Patterns</h4>
<p>The Crunch APIs contain a number of useful subclasses of DoFn that handle
common data processing scenarios and are easier
-to write and test. The top-level <a
href="apidocs/current/org/apache/crunch/package-summary.html">org.apache.crunch</a>
package contains three
+to write and test. The top-level <a
href="apidocs/0.8.0/org/apache/crunch/package-summary.html">org.apache.crunch</a>
package contains three
of the most important specializations, which we will discuss now. Each of
these specialized DoFn implementations has associated methods
on the PCollection, PTable, and PGroupedTable interfaces to support common
data processing steps.</p>
-<p>The simplest extension is the <a
href="apidocs/current/org/apache/crunch/FilterFn.html">FilterFn<T></a> class,
which defines a single abstract method, <code>boolean accept(T input)</code>.
+<p>The simplest extension is the <a
href="apidocs/0.8.0/org/apache/crunch/FilterFn.html">FilterFn<T></a> class,
which defines a single abstract method, <code>boolean accept(T input)</code>.
The FilterFn can be applied to a <code>PCollection<T></code> by calling
the <code>filter(FilterFn<T> fn)</code> method, and will return a new
<code>PCollection<T></code> that only contains
the elements of the input PCollection for which the accept method returned
true. Note that the filter function does not include a PType argument in its
signature, because there is no change in the data type of the PCollection when
the FilterFn is applied. It is possible to compose new FilterFn
instances by combining multiple FilterFns together using the <code>and</code>,
<code>or</code>, and <code>not</code> factory methods defined in the
-<a href="apidocs/current/org/apache/crunch/fn/FilterFns.html">FilterFns</a>
helper class.</p>
-<p>The second extension is the <a
href="apidocs/current/org/apache/crunch/MapFn.html">MapFn<S, T></a> class,
which defines a single abstract method, <code>T map(S input)</code>.
+<a href="apidocs/0.8.0/org/apache/crunch/fn/FilterFns.html">FilterFns</a>
helper class.</p>
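The composition idea can be sketched in plain Java without the Crunch dependency. Everything below is a hypothetical stand-in: `SimpleFilter` mimics only the single abstract `accept` method described above, and the `and`, `not`, and `filter` helpers are illustrative rather than the actual FilterFns or PCollection implementations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FilterSketch {
    // Stand-in for a filter function: a single abstract accept method.
    abstract static class SimpleFilter<T> {
        abstract boolean accept(T input);
    }

    // Accepts an element only if both filters accept it.
    static <T> SimpleFilter<T> and(final SimpleFilter<T> a, final SimpleFilter<T> b) {
        return new SimpleFilter<T>() {
            @Override
            boolean accept(T input) {
                return a.accept(input) && b.accept(input);
            }
        };
    }

    // Inverts the decision of the wrapped filter.
    static <T> SimpleFilter<T> not(final SimpleFilter<T> a) {
        return new SimpleFilter<T>() {
            @Override
            boolean accept(T input) {
                return !a.accept(input);
            }
        };
    }

    // Keeps only the accepted elements; the element type does not change.
    static <T> List<T> filter(List<T> input, SimpleFilter<T> fn) {
        List<T> kept = new ArrayList<>();
        for (T element : input) {
            if (fn.accept(element)) {
                kept.add(element);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        SimpleFilter<String> nonEmpty = new SimpleFilter<String>() {
            @Override
            boolean accept(String input) {
                return !input.isEmpty();
            }
        };
        SimpleFilter<String> longWord = new SimpleFilter<String>() {
            @Override
            boolean accept(String input) {
                return input.length() > 4;
            }
        };

        List<String> words = Arrays.asList("apple", "", "kiwi", "banana");
        System.out.println(filter(words, and(nonEmpty, not(longWord)))); // prints [kiwi]
    }
}
```

Because each filter is a small class with one method, it can be unit tested on its own before it is composed with others.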
+<p>The second extension is the <a
href="apidocs/0.8.0/org/apache/crunch/MapFn.html">MapFn<S, T></a> class, which
defines a single abstract method, <code>T map(S input)</code>.
For simple transform tasks in which every input record will have exactly one
output, it's easy to test a MapFn by verifying that a given input returns a
given output.</p>
<p>MapFns are also used in specialized methods on the PCollection and PTable
interfaces. <code>PCollection<V></code> defines the method
@@ -380,22 +380,22 @@ function that extracts the key (of type
the key be given and constructs a <code>PTableType<K, V></code> from the
given key type and the PCollection's existing value type. <code>PTable<K,
V></code>, in turn,
has methods <code>PTable<K1, V> mapKeys(MapFn<K, K1> mapFn)</code>
and <code>PTable<K, V2> mapValues(MapFn<V, V2>)</code> that handle
the common case of converting
just one of the paired values in a PTable instance from one type to another
while leaving the other type the same.</p>
-<p>The final top-level extension to DoFn is the <a
href="apidocs/current/org/apache/crunch/CombineFn.html">CombineFn<K, V></a>
class, which is used in conjunction with
+<p>The final top-level extension to DoFn is the <a
href="apidocs/0.8.0/org/apache/crunch/CombineFn.html">CombineFn<K, V></a>
class, which is used in conjunction with
the <code>combineValues</code> method defined on the PGroupedTable interface.
CombineFns are used to represent the associative operations that can be applied
using
the MapReduce Combiner concept in order to reduce the amount of data that is
shipped over the network during a shuffle.</p>
<p>The CombineFn extension is different from the FilterFn and MapFn classes in
that it does not define an abstract method for handling data
beyond the default <code>process</code> method that any other DoFn would use;
rather, extending the CombineFn class signals to the Crunch planner that the
logic
contained in this class satisfies the conditions required for use with the
MapReduce combiner.</p>
-<p>Crunch supports many types of these associative patterns, such as sums,
counts, and set unions, via the <a
href="apidocs/current/org/apache/crunch/Aggregator.html">Aggregator<V></a>
+<p>Crunch supports many types of these associative patterns, such as sums,
counts, and set unions, via the <a
href="apidocs/0.8.0/org/apache/crunch/Aggregator.html">Aggregator<V></a>
interface, which is defined right alongside the CombineFn class in the
top-level <code>org.apache.crunch</code> package. There are a number of
implementations of the Aggregator
-interface defined via static factory methods in the <a
href="apidocs/current/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
+interface defined via static factory methods in the <a
href="apidocs/0.8.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
<h3 id="serializing-data-with-ptypes">Serializing Data with PTypes</h3>
<p>Why PTypes Are Necessary, the two type families, the core methods and
tuples.</p>
<h4 id="extending-ptypes">Extending PTypes</h4>
<p>The simplest way to create a new <code>PType<T></code> for a data
object is to create a <em>derived</em> PType from one of the built-in PTypes
for the Avro
and Writable type families. If we have a base <code>PType<S></code>, we
can create a derived <code>PType<T></code> by implementing an input
<code>MapFn<S, T></code> and an
output <code>MapFn<T, S></code> and then calling
<code>PTypeFamily.derived(Class<T>, MapFn<S, T> in, MapFn<T,
S> out, PType<S> base)</code>, which will return
-a new <code>PType<T></code>. There are examples of derived PTypes in the
<a href="apidocs/current/org/apache/crunch/types/PTypes.html">PTypes</a> class,
including
+a new <code>PType<T></code>. There are examples of derived PTypes in the
<a href="apidocs/0.8.0/org/apache/crunch/types/PTypes.html">PTypes</a> class,
including
serialization support for protocol buffers, Thrift records, Java Enums,
BigInteger, and UUIDs.</p>
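The pairing of an input function and an output function can be illustrated with a small plain-Java sketch. It assumes a base representation of String and a derived type of `java.util.UUID`; the `Derived` class and its `read` and `write` methods are hypothetical stand-ins built on `java.util.function.Function`, not the Crunch PType or MapFn classes.

```java
import java.util.UUID;
import java.util.function.Function;

class DerivedTypeSketch {
    // Stand-in for a derived type: an input mapping from the base
    // representation S to T, and an output mapping from T back to S.
    static final class Derived<S, T> {
        private final Function<S, T> in;
        private final Function<T, S> out;

        Derived(Function<S, T> in, Function<T, S> out) {
            this.in = in;
            this.out = out;
        }

        // Convert a stored base value into the derived type.
        T read(S stored) {
            return in.apply(stored);
        }

        // Convert a derived value into the base representation.
        S write(T value) {
            return out.apply(value);
        }
    }

    public static void main(String[] args) {
        // Base type String, derived type UUID.
        Derived<String, UUID> uuids = new Derived<>(UUID::fromString, UUID::toString);

        UUID id = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        String stored = uuids.write(id);
        UUID restored = uuids.read(stored);

        System.out.println(restored.equals(id)); // prints true
    }
}
```

The two functions should be inverses of each other, so that a value written through the output function and read back through the input function is unchanged.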
<h3 id="reading-and-writing-data-sources-targets-and-sourcetargets">Reading
and Writing Data: Sources, Targets, and SourceTargets</h3>
<p>MapReduce developers are familiar with the <code>InputFormat<K,
V></code> and <code>OutputFormat<K, V></code> classes for reading and
writing data during
@@ -416,8 +416,8 @@ declares a <code>Iterable<T> read(
or into a DoFn implementation that can use the data read from the source to
perform additional transforms on the main input data that is
processed using the DoFn's <code>process</code> method (this is how Crunch
supports map-side join operations).</p>
<p>Support for the most common Source, Target, and SourceTarget
implementations is provided by the factory functions declared in the
-<a href="apidocs/current/org/apache/crunch/io/From.html">From</a> (Sources),
<a href="apidocs/current/org/apache/crunch/io/To.html">To</a> (Targets), and
-<a href="apidocs/current/org/apache/crunch/io/At.html">At</a> (SourceTargets)
classes in the <a
href="apidocs/current/org/apache/crunch/io/package-summary.html">org.apache.crunch.io</a>
+<a href="apidocs/0.8.0/org/apache/crunch/io/From.html">From</a> (Sources), <a
href="apidocs/0.8.0/org/apache/crunch/io/To.html">To</a> (Targets), and
+<a href="apidocs/0.8.0/org/apache/crunch/io/At.html">At</a> (SourceTargets)
classes in the <a
href="apidocs/0.8.0/org/apache/crunch/io/package-summary.html">org.apache.crunch.io</a>
package.</p>
<h3 id="pipeline-building-and-execution">Pipeline Building and Execution</h3>
<h4 id="creating-a-new-crunch-pipeline">Creating A New Crunch Pipeline</h4>