Author: buildbot
Date: Mon Nov 25 05:21:07 2013
New Revision: 887976
Log:
Staging update by buildbot for crunch
Modified:
websites/staging/crunch/trunk/content/ (props changed)
websites/staging/crunch/trunk/content/intro.html
Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Mon Nov 25 05:21:07 2013
@@ -1 +1 @@
-1544354
+1545153
Modified: websites/staging/crunch/trunk/content/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/intro.html (original)
+++ websites/staging/crunch/trunk/content/intro.html Mon Nov 25 05:21:07 2013
@@ -250,7 +250,7 @@ return a type that has an associated obj
supports two serialization frameworks, called <em>type families</em>: one
based on Hadoop's <code>Writable</code> interface, and another based on
<code>Apache Avro</code>.
You can read more about how to work with Crunch's serialization libraries
here. TODO</p>
<p>Because all of the core logic in our application is exposed via a single
static method that operates on Crunch interfaces, we can use Crunch's
-in-memory API to test our business logic using a unit testing framework like
JUnit. Let's look at an exampel unit test for the word count
+in-memory API to test our business logic using a unit testing framework like
JUnit. Let's look at an example unit test for the word count
application:</p>
<div class="codehilite"><pre><span class="n">package</span> <span
class="n">org</span><span class="p">.</span><span class="n">myorg</span><span
class="p">;</span>
@@ -283,51 +283,55 @@ Collections Classes like <code>java.util
pipeline into the client and make decisions based on that data allows us to
create sophisticated analytical
applications that can modify their downstream processing based on the results
of upstream computations.</p>
<h3 id="data-model-and-operators">Data Model and Operators</h3>
-<p>The Java API is centered around three interfaces that represent distributed
datasets: <code>PCollection<T></code>, <code>PTable<K, V></code>,
and <code>PGroupedTable<K, V></code>.</p>
+<p>The Java API is centered around three interfaces that represent distributed
datasets: <a
href="apidocs/current/org/apache/crunch/PCollection.html">PCollection<T></a>,
+<a
href="http://crunch.apache.org/apidocs/current/org/apache/crunch/PTable.html">PTable<K,
V></a>, and <a
href="apidocs/current/org/apache/crunch/PGroupedTable.html">PGroupedTable<K,
V></a>.</p>
<p>A <code>PCollection<T></code> represents a distributed, unordered
collection of elements of type T. For example, we represent a text file as a
-<code>PCollection<String></code> object. PCollection provides a method,
<code>parallelDo</code>, that applies a <code>DoFn</code> to each element in a
PCollection in parallel,
-and returns a new PCollection as its result. </p>
-<p>A <code>PTable<K, V></code> is a sub-interface of PCollection that
represents a distributed, unordered multimap of its key type K to its value
type V.
+<code>PCollection<String></code> object.
<code>PCollection<T></code> provides a method, <code>parallelDo</code>,
that applies a <a href="apidocs/current/org/apache/crunch/DoFn.html">DoFn<T,
U></a>
+to each element in the <code>PCollection<T></code> in parallel, and
returns a new <code>PCollection<U></code> as its result.</p>
+<p>A <code>PTable<K, V></code> is a sub-interface of
<code>PCollection<Pair<K, V>></code> that represents a distributed,
unordered multimap of its key type K to its value type V.
In addition to the parallelDo operation, PTable provides a
<code>groupByKey</code> operation that aggregates all of the values in the
PTable that
-have the same key into a single record. It is the groupByKey operation that
triggers the sort phase of a MapReduce job.</p>
-<p>The result of a groupByKey operation is a <code>PGroupedTable<K,
V></code> object, which is a distributed, sorted map of keys of type K to an
Iterable
-collection of values of type V. In addition to parallelDo, the PGroupedTable
provides a <code>combineValues</code> operation, which allows for
-a commutative and associative aggregation operator to be applied to the values
of the PGroupedTable instance on both the map side and the
-reduce side of a MapReduce job.</p>
+have the same key into a single record. It is the groupByKey operation that
triggers the sort phase of a MapReduce job. Developers can exercise
+fine-grained control over the number of reducers and the partitioning,
grouping, and sorting strategies used during the shuffle by providing an
instance
+of the <a
href="apidocs/current/org/apache/crunch/GroupingOptions.html">GroupingOptions</a>
class to the <code>groupByKey</code> function.</p>
+<p>The result of a groupByKey operation is a <code>PGroupedTable<K,
V></code> object, which is a distributed, sorted map of keys of type K to an
Iterable<V> that may
+be iterated over exactly once. In addition to <code>parallelDo</code>
processing via DoFns, PGroupedTable provides a <code>combineValues</code>
operation that allows a
+commutative and associative <a
href="apidocs/current/org/apache/crunch/Aggregator.html">Aggregator<V></a> to
be applied to the values of the PGroupedTable
+instance on both the map and reduce sides of the shuffle. A number of common
<code>Aggregator<V></code> implementations are provided in the
+<a
href="apidocs/current/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
<p>Finally, PCollection, PTable, and PGroupedTable all support a
<code>union</code> operation, which takes a series of distinct PCollections
that all have
the same data type and treats them as a single virtual PCollection.</p>
-<p>All of the other MapReduce patterns supported by the Crunch APIs
(aggregations, joins, sorts, secondary sorts, and cogrouping) are all
implemented
-in terms of these four primitives. The patterns themselves are defined in the
<code>org.apache.crunch.lib</code> package and its children, and a few of
-the most common patterns have convenience functions defined on the PCollection
and PTable interfaces. We will do a more detailed review of these
-patterns later in this document, but here are a few examples to get you
started: TODO</p>
+<p>All of the other data transformation operations supported by the Crunch
APIs (aggregations, joins, sorts, secondary sorts, and cogrouping) are
implemented
+in terms of these four primitives. The patterns themselves are defined in the
<a
href="apidocs/current/org/apache/crunch/lib/package-summary.html">org.apache.crunch.lib</a>
+package and its children, and a few of the most common patterns have
convenience functions defined on the PCollection and PTable interfaces.</p>
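+<p>To give a feel for how these primitives compose, here is a minimal word-count-style sketch. It is only an
+illustration: the input and output paths and the <code>MyApp</code> class name are placeholders, and it assumes
+the Writable type family.</p>
+<div class="codehilite"><pre>
+// Sketch only: MyApp and the file paths are placeholders.
+import org.apache.crunch.*;
+import org.apache.crunch.fn.Aggregators;
+import org.apache.crunch.impl.mr.MRPipeline;
+import org.apache.crunch.types.writable.Writables;
+
+public class MyApp {
+  public static void main(String[] args) {
+    Pipeline pipeline = new MRPipeline(MyApp.class);
+    PCollection<String> lines = pipeline.readTextFile("/path/to/input");
+
+    // parallelDo applies a DoFn to every element and yields a new collection;
+    // here the DoFn emits a (word, 1) pair for each word, so the result is a PTable.
+    PTable<String, Long> ones = lines.parallelDo(new DoFn<String, Pair<String, Long>>() {
+      public void process(String line, Emitter<Pair<String, Long>> emitter) {
+        for (String word : line.split("\\s+")) {
+          emitter.emit(Pair.of(word, 1L));
+        }
+      }
+    }, Writables.tableOf(Writables.strings(), Writables.longs()));
+
+    // groupByKey triggers the shuffle; combineValues aggregates on both sides of it.
+    PTable<String, Long> counts = ones.groupByKey().combineValues(Aggregators.SUM_LONGS());
+
+    pipeline.writeTextFile(counts, "/path/to/output");
+    pipeline.done();
+  }
+}
+</pre></div>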
<h3 id="writing-dofns">Writing DoFns</h3>
<p>DoFns represent the logical computations of your Crunch pipelines. They are
designed to be easy to write, easy to test, and easy to deploy
within the context of a MapReduce job. Much of your work with the Crunch APIs
will be writing DoFns, and so having a good understanding of
how to use them effectively is critical to crafting elegant and efficient
pipelines.</p>
<h4 id="dofn-extends-serializable">DoFn extends Serializable</h4>
<p>The most important thing to remember about DoFns is that they all implement
the <code>java.io.Serializable</code> interface, which means that all of the
-state information associated with a DoFn must also be serializable. There is
an excellent overview of Java serializability here that is worth
-reviewing if you aren't familiar with Java's serializability model. TODO</p>
-<p>If your DoFn needs to work with a class that does not implement
Serializable and cannot be modified (e.g., because it is defined in a
third-party
+state information associated with a DoFn must also be serializable. There is
an <a
href="http://docs.oracle.com/javase/tutorial/jndi/objects/serial.html">excellent
overview of Java serializability</a> that is worth reviewing if you aren't
familiar with it already.</p>
+<p>If your DoFn needs to work with a class that does not implement
Serializable and cannot be modified (for example, because it is defined in a
third-party
library), you should use the <code>transient</code> keyword on that member
variable so that serializing the DoFn won't fail if that object happens to be
defined. You can create an instance of the object during runtime using the
<code>initialize</code> method described in the following section.</p>
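+<p>As a rough sketch of this pattern (the <code>ThirdPartyParser</code> class below is just a stand-in for any
+non-serializable library type):</p>
+<div class="codehilite"><pre>
+import org.apache.crunch.DoFn;
+import org.apache.crunch.Emitter;
+
+public class ParsingFn extends DoFn<String, String> {
+  // Marked transient so that serializing the DoFn does not try to serialize it.
+  private transient ThirdPartyParser parser;
+
+  @Override
+  public void initialize() {
+    // Re-create the non-serializable object on the cluster at runtime.
+    parser = new ThirdPartyParser();
+  }
+
+  @Override
+  public void process(String input, Emitter<String> emitter) {
+    emitter.emit(parser.parse(input));
+  }
+}
+</pre></div>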
<h4 id="runtime-processing-steps">Runtime Processing Steps</h4>
<p>After the Crunch runtime loads the serialized DoFns into its map and reduce
tasks, the DoFns are executed on the input data via the following
sequence:</p>
-<h1
id="first-the-dofn-is-given-access-to-the-taskinputoutputcontext-implementation-for-the-current-task-this-allows-the-dofn-to-access-any">First,
the DoFn is given access to the <code>TaskInputOutputContext</code>
implementation for the current task. This allows the DoFn to access any</h1>
-<p>necessary configuration and runtime information needed before or during
processing.</p>
-<h1
id="next-the-dofns-initialize-method-is-called-the-initialize-method-is-similar-to-the-setup-method-used-in-the-mapper-and-reducer-classes">Next,
the DoFn's <code>initialize</code> method is called. The initialize method is
similar to the <code>setup</code> method used in the Mapper and Reducer
classes;</h1>
-<p>it is called before processing begins in order to enable any necessary
initialization or configuration of the DoFn to be performed. For example,
-if we were making use of a non-serializable third-party library, we would
create an instance of it here.</p>
-<h1
id="at-this-point-data-processing-begins-the-map-or-reduce-task-will-begin-passing-records-in-to-the-dofns-process-method-and-capturing-the">At
this point, data processing begins. The map or reduce task will begin passing
records in to the DoFn's <code>process</code> method, and capturing the</h1>
-<p>output of the process method into an <code>Emitter<T></code> that can
either pass the data along to another DoFn for processing or serialize it as
the output
-of the current processing stage.</p>
-<h1
id="finally-after-all-of-the-records-have-been-processed-the-void-cleanupemittert-emitter-method-is-called-on-each-dofn-the-cleanup-method">Finally,
after all of the records have been processed, the <code>void
cleanup(Emitter<T> emitter)</code> method is called on each DoFn. The
cleanup method</h1>
-<p>has a dual purpose: it can be used to emit any state information that the
DoFn wants to pass along to the next stage (for example, cleanup could
+<ol>
+<li>First, the DoFn is given access to the <code>TaskInputOutputContext</code>
implementation for the current task. This allows the DoFn to access any
+necessary configuration and runtime information needed before or during
processing.</li>
+<li>Next, the DoFn's <code>initialize</code> method is called. The initialize
method is similar to the <code>setup</code> method used in the Mapper and
Reducer classes;
+it is called before processing begins in order to enable any necessary
initialization or configuration of the DoFn to be performed. For example,
+if we were making use of a non-serializable third-party library, we would
create an instance of it here.</li>
+<li>At this point, data processing begins. The map or reduce task will begin
passing records into the DoFn's <code>process</code> method, and capturing the
+output of the process method into an <code>Emitter<T></code> that can
either pass the data along to another DoFn for processing or serialize it as
the output
+of the current processing stage.</li>
+<li>Finally, after all of the records have been processed, the <code>void
cleanup(Emitter<T> emitter)</code> method is called on each DoFn. The
cleanup method
+has a dual purpose: it can be used to emit any state information that the DoFn
wants to pass along to the next stage (for example, cleanup could
be used to emit the sum of a list of numbers that was passed in to the DoFn's
process method), as well as to release any resources or perform any
-other cleanup task that is appropriate once the job has finished executing.</p>
+other cleanup task that is appropriate once the job has finished executing (see the sketch that follows this
+list).</li>
+</ol>
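+<p>As an illustration of this sequence, here is a small sketch of a DoFn that accumulates a running total in its
+process method and only emits it from cleanup, along the lines of the example mentioned in the last step:</p>
+<div class="codehilite"><pre>
+import org.apache.crunch.DoFn;
+import org.apache.crunch.Emitter;
+
+public class TotalFn extends DoFn<Long, Long> {
+  private long sum;
+
+  @Override
+  public void initialize() {
+    sum = 0L;  // set up state before any records are processed
+  }
+
+  @Override
+  public void process(Long input, Emitter<Long> emitter) {
+    sum += input;  // accumulate; nothing is emitted per record
+  }
+
+  @Override
+  public void cleanup(Emitter<Long> emitter) {
+    emitter.emit(sum);  // pass the accumulated state along to the next stage
+  }
+}
+</pre></div>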
<h4 id="accessing-runtime-mapreduce-apis">Accessing Runtime MapReduce APIs</h4>
-<p>DoFns provide direct access to the <code>TaskInputOutputContext</code>
object that is used within a given Map or Reduce task via the protected
<code>getContext</code>
+<p>DoFns provide direct access to the <code>TaskInputOutputContext</code>
object that is used within a given Map or Reduce task via the
<code>getContext</code>
method. There are also a number of helper methods for working with the objects
associated with the TaskInputOutputContext, including the following (a brief example of their use appears after
this list):</p>
<ul>
<li><code>getConfiguration()</code> for accessing the
<code>Configuration</code> object that contains much of the detail about system
and user-specific parameters for a
@@ -337,57 +341,86 @@ framework won't kill it,</li>
<li><code>setStatus(String status)</code> and <code>getStatus</code> for
setting task status information, and</li>
<li><code>getTaskAttemptID()</code> for accessing the current
<code>TaskAttemptID</code> information.</li>
</ul>
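+<p>For example, a DoFn might use these helpers along the following lines; this is only a sketch, and the
+"my.app.threshold" configuration key is a made-up example parameter:</p>
+<div class="codehilite"><pre>
+import org.apache.crunch.DoFn;
+import org.apache.crunch.Emitter;
+
+public class ThresholdFn extends DoFn<Long, Long> {
+  private long threshold;
+
+  @Override
+  public void initialize() {
+    // "my.app.threshold" is a hypothetical user-defined parameter.
+    threshold = getConfiguration().getLong("my.app.threshold", 100L);
+  }
+
+  @Override
+  public void process(Long input, Emitter<Long> emitter) {
+    progress();  // let the framework know the task is still making progress
+    if (input > threshold) {
+      setStatus("Saw a value above the threshold");
+      emitter.emit(input);
+    }
+  }
+}
+</pre></div>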
-<p>Crunch provides a number of helper methods, all named
<code>increment</code> and having various signatures, for working with Hadoop
Counters.
-There was a change in the Counters API from Hadoop 1.0 to Hadoop 2.0, and thus
we do not recommend that you work with the <code>Counter</code> classes
-directly in your Crunch pipelines (the two <code>getCounter</code> methods
that were defined in DoFn are both deprecated) so that you will not be
-required to recompile your job jars when you move from a Hadoop 1.x cluster to
a Hadoop 2.x cluster.</p>
+<p>Crunch provides a number of helper methods for working with <a
href="http://codingwiththomas.blogspot.com/2011/04/controlling-hadoop-job-recursion.html">Hadoop
Counters</a>, all named <code>increment</code>. Counters are an incredibly
useful way of keeping track of the state of long-running data pipelines and
detecting any exceptional conditions that
+occur during processing, and they are supported in both the MapReduce-based
and in-memory Crunch pipeline contexts. You can retrieve the values of the
Counters
+in your client code at the end of a MapReduce pipeline by getting them from
the <a
href="apidocs/current/org/apache/crunch/PipelineResult.StageResult.html">StageResult</a>
+objects returned by Crunch at the end of a run.</p>
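+<p>As a rough sketch (the counter group and name used here are arbitrary, and the exact
+<code>StageResult</code> accessor for reading a counter value is an assumption worth checking against the API
+docs):</p>
+<div class="codehilite"><pre>
+// Inside a DoFn: count the records that are dropped. Group/name strings are arbitrary.
+public void process(String input, Emitter<String> emitter) {
+  if (input.isEmpty()) {
+    increment("MyApp", "EMPTY_RECORDS");
+    return;
+  }
+  emitter.emit(input);
+}
+
+// In the client, after the pipeline has run (accessor name assumed):
+PipelineResult result = pipeline.done();
+for (PipelineResult.StageResult stage : result.getStageResults()) {
+  System.out.println("Empty records: " + stage.getCounterValue("MyApp", "EMPTY_RECORDS"));
+}
+</pre></div>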
+<p>(Note that there was a change in the Counters API from Hadoop 1.0 to Hadoop 2.0. To avoid having to recompile
+your job jars when you move from a Hadoop 1.0 cluster to a Hadoop 2.0 cluster, we recommend that you do not work
+with the Counter classes directly in your Crunch pipelines; the two <code>getCounter</code> methods that were
+defined in DoFn are both deprecated.)</p>
<h4
id="configuring-the-crunch-planner-and-mapreduce-jobs-with-dofns">Configuring
the Crunch Planner and MapReduce Jobs with DoFns</h4>
<p>Although most of the DoFn methods are focused on runtime execution, there
are a handful of methods that are used during the planning phase
before a pipeline is converted into MapReduce jobs. The first of these
functions is <code>float scaleFactor()</code>, which should return a floating
point
value greater than 0.0f. You can override the scaleFactor method in your
custom DoFns in order to provide a hint to the Crunch planner about
-how much larger (or smaller) an input data set will become after passing
through the process method. If the groupByKey method is called without
+how much larger (or smaller) an input data set will become after passing
through the process method. If the <code>groupByKey</code> method is called
without
an explicit number of reducers provided, the planner will try to guess how
many reduce tasks should be used for the job based on the size of
-the input data, which is determined in part by using the scaleFactor
results.</p>
+the input data, which is determined in part by using the result of calling the
<code>scaleFactor</code> method on the DoFns in the processing path.</p>
<p>Sometimes, you may know that one of your DoFns has some unusual parameter
settings that need to be specified on any job that includes that
DoFn as part of its processing. A DoFn can modify the Hadoop Configuration
object that is associated with the MapReduce job it is assigned to
on the client before processing begins by overriding the <code>void
configure(Configuration conf)</code> method. For example, you might know that
the DoFn
will require extra memory settings to run, and so you could make sure that the
value of the <code>mapred.child.java.opts</code> argument had a large enough
memory setting for the DoFn's needs before the job was launched on the
cluster.</p>
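+<p>A sketch of a DoFn that overrides both of these planning-time hooks might look like the following; the memory
+setting is just an illustrative value:</p>
+<div class="codehilite"><pre>
+import org.apache.crunch.DoFn;
+import org.apache.crunch.Emitter;
+import org.apache.hadoop.conf.Configuration;
+
+public class ExpandingFn extends DoFn<String, String> {
+  @Override
+  public float scaleFactor() {
+    return 3.0f;  // hint to the planner: output is roughly three times the input size
+  }
+
+  @Override
+  public void configure(Configuration conf) {
+    conf.set("mapred.child.java.opts", "-Xmx2g");  // illustrative value only
+  }
+
+  @Override
+  public void process(String input, Emitter<String> emitter) {
+    for (String piece : input.split(",")) {
+      emitter.emit(piece);
+    }
+  }
+}
+</pre></div>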
-<h4 id="dofn-extensions-and-helper-classes">DoFn Extensions and Helper
Classes</h4>
+<h4 id="common-dofn-patterns">Common DoFn Patterns</h4>
<p>The Crunch APIs contain a number of useful subclasses of DoFn that handle
common data processing scenarios and are easier
-to write and test. The top-level <code>org.apache.crunch</code> package
contains three of the most important specializations, which we will
-discuss now. Each of these specialized DoFn implementations has associated
methods on the PCollection, PTable, and PGroupedTable
-interfaces to support these common data processing tasks.</p>
-<p>The simplest extension is the <code>FilterFn<T></code> class, which
defines a single abstract method, <code>boolean accept(T input)</code>. The
FilterFn can be applied
-to a <code>PCollection<T></code> by calling the
<code>filter(FilterFn<T> fn)</code> method, and will return a new
<code>PCollection<T></code> that only contains the elements
-of the input PCollection for which the accept method returned true. Note that
the filter function does not include a PType argument in its
+to write and test. The top-level <a
href="apidocs/current/org/apache/crunch/package-summary.html">org.apache.crunch</a>
package contains three
+of the most important specializations, which we will discuss now. Each of
these specialized DoFn implementations has associated methods
+on the PCollection, PTable, and PGroupedTable interfaces to support common
data processing steps.</p>
+<p>The simplest extension is the <a
href="apidocs/current/org/apache/crunch/FilterFn.html">FilterFn<T></a> class,
which defines a single abstract method, <code>boolean accept(T input)</code>.
+The FilterFn can be applied to a <code>PCollection<T></code> by calling
the <code>filter(FilterFn<T> fn)</code> method, and will return a new
<code>PCollection<T></code> that only contains
+the elements of the input PCollection for which the accept method returned
true. Note that the filter function does not include a PType argument in its
signature, because there is no change in the data type of the PCollection when
the FilterFn is applied. It is possible to compose new FilterFn
-instances by combining multiple FilterFns together using the <code>and</code>,
<code>or</code>, and <code>not</code> factory methods defined in the FilterFns
helper class.</p>
-<p>The second extension is the <code>MapFn<S, T></code> class, which
defines a single abstract method, <code>T map(S input)</code>. For simple
transform tasks in which
-every input record will have exactly one output, it's easy to test a MapFn by
verifying that a given input returns a given output. MapFns are
-also used by Crunch's data serialization libraries to map between serialized
data types (such as Writables or Avro records) and POJOs.</p>
+instances by combining multiple FilterFns together using the <code>and</code>,
<code>or</code>, and <code>not</code> factory methods defined in the
+<a href="apidocs/current/org/apache/crunch/fn/FilterFns.html">FilterFns</a>
helper class.</p>
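+<p>For instance, a brief sketch of a FilterFn and a composition built with the FilterFns helpers; the
+<code>lines</code> collection is assumed to be a <code>PCollection<String></code>, and the
+<code>NotCommentFn</code> class is another hypothetical FilterFn:</p>
+<div class="codehilite"><pre>
+import org.apache.crunch.FilterFn;
+import org.apache.crunch.PCollection;
+import org.apache.crunch.fn.FilterFns;
+
+public class NonEmptyFn extends FilterFn<String> {
+  @Override
+  public boolean accept(String input) {
+    return input != null && !input.isEmpty();
+  }
+}
+
+// Usage: keep lines that are non-empty and are not comments.
+PCollection<String> kept = lines.filter(FilterFns.and(new NonEmptyFn(), new NotCommentFn()));
+</pre></div>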
+<p>The second extension is the <a
href="apidocs/current/org/apache/crunch/MapFn.html">MapFn<S, T></a> class,
which defines a single abstract method, <code>T map(S input)</code>.
+For simple transform tasks in which every input record will have exactly one output, it's easy to test a MapFn by
+verifying that a given input returns a given output.</p>
<p>MapFns are also used in specialized methods on the PCollection and PTable
interfaces. <code>PCollection<V></code> defines the method
<code>PTable<K,V> by(MapFn<V, K> mapFn, PType<K>
keyType)</code> that can be used to create a PTable from a PCollection by
writing a
function that extracts the key (of type K) from the value (of type V)
contained in the PCollection. The by function only requires that the PType of
the key be given and constructs a <code>PTableType<K, V></code> from the
given key type and the PCollection's existing value type. <code>PTable<K,
V></code>, in turn,
has methods <code>PTable<K1, V> mapKeys(MapFn<K, K1> mapFn)</code>
and <code>PTable<K, V2> mapValues(MapFn<V, V2>)</code> that handle
the common case of converting
just one of the paired values in a PTable instance from one type to another
while leaving the other type the same.</p>
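+<p>As a brief sketch of these methods (assuming a <code>PCollection<String></code> of lines, the Writable type
+family, and that the PType of the new value type is also passed to <code>mapValues</code>):</p>
+<div class="codehilite"><pre>
+// Key each line by its first token, then upper-case the values.
+PTable<String, String> keyed = lines.by(new MapFn<String, String>() {
+  @Override
+  public String map(String line) {
+    return line.split("\\s+")[0];  // extract the key from the value
+  }
+}, Writables.strings());
+
+PTable<String, String> upper = keyed.mapValues(new MapFn<String, String>() {
+  @Override
+  public String map(String value) {
+    return value.toUpperCase();
+  }
+}, Writables.strings());
+</pre></div>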
-<p>The final top-level extension to DoFn is the <code>CombineFn<K,
V></code> class, which is used in conjunction with the
<code>combineValues</code> method defined on the
-PGroupedTable interface. CombineFns are used to represent the associative
operations that can be applied using the MapReduce Combiner concept in
-order to reduce the amount of data that is shipped over the network during the
shuffle. The CombineFn extension is different from the FilterFn and
-MapFn classes in that it does not define an abstract method for handling data
besides the default <code>process</code> method that any other DoFn would use;
-rather, extending the CombineFn class signals to the Crunch planner that the
logic contained in this class satisfies the conditions required for use
-with the MapReduce combiner. Crunch supports many types of these associative
patterns, such as sums, counts, and set unions, via the
<code>Aggregator<V></code> interface,
-which is defined right alongside the CombineFn class in the top-level
<code>org.apache.crunch</code> package. There are a number of implementations
of the Aggregator
-interface defined via static factory methods in the
<code>org.apache.crunch.fn.Aggregators</code> class.</p>
+<p>The final top-level extension to DoFn is the <a
href="apidocs/current/org/apache/crunch/CombineFn.html">CombineFn<K, V></a>
class, which is used in conjunction with
+the <code>combineValues</code> method defined on the PGroupedTable interface.
CombineFns are used to represent the associative operations that can be applied
using
the MapReduce Combiner concept in order to reduce the amount of data that is
shipped over the network during a shuffle.</p>
+<p>The CombineFn extension is different from the FilterFn and MapFn classes in
that it does not define an abstract method for handling data
+beyond the default <code>process</code> method that any other DoFn would use;
rather, extending the CombineFn class signals to the Crunch planner that the
logic
+contained in this class satisfies the conditions required for use with the
MapReduce combiner.</p>
+<p>Crunch supports many types of these associative patterns, such as sums,
counts, and set unions, via the <a
href="apidocs/current/org/apache/crunch/Aggregator.html">Aggregator<V></a>
+interface, which is defined right alongside the CombineFn class in the
top-level <code>org.apache.crunch</code> package. There are a number of
implementations of the Aggregator
+interface defined via static factory methods in the <a
href="apidocs/current/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
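+<p>For example, a rough sketch of per-key aggregation with the built-in Aggregators, where <code>scores</code> is
+a hypothetical <code>PTable<String, Long></code> of user IDs to point values:</p>
+<div class="codehilite"><pre>
+// Total and maximum score per user, combined on both the map and reduce sides.
+PTable<String, Long> totals = scores.groupByKey().combineValues(Aggregators.SUM_LONGS());
+PTable<String, Long> maxima = scores.groupByKey().combineValues(Aggregators.MAX_LONGS());
+</pre></div>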
<h3 id="serializing-data-with-ptypes">Serializing Data with PTypes</h3>
<p>Why PTypes Are Necessary, the two type families, the core methods and
tuples.</p>
<h4 id="extending-ptypes">Extending PTypes</h4>
-<h3 id="reading-data-sources">Reading Data: Sources</h3>
-<h3 id="writing-data-targets">Writing Data: Targets</h3>
+<p>The simplest way to create a new <code>PType<T></code> for a data
object is to create a <em>derived</em> PType from one of the built-in PTypes
for the Avro
+and Writable type families. If we have a base <code>PType<S></code>, we
can create a derived <code>PType<T></code> by implementing an input
<code>MapFn<S, T></code> and an
+output <code>MapFn<T, S></code> and then calling
<code>PTypeFamily.derived(Class<T>, MapFn<S, T> in, MapFn<T,
S> out, PType<S> base)</code>, which will return
+a new <code>PType<T></code>. There are examples of derived PTypes in the
<a href="apidocs/current/org/apache/crunch/types/PTypes.html">PTypes</a> class,
including
+serialization support for protocol buffers, Thrift records, Java Enums,
BigInteger, and UUIDs.</p>
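+<p>A sketch of what this might look like for a hypothetical <code>Money</code> class that can be round-tripped
+through a String, here using the derived helper on the Avros class (the Money class and its parse method are
+placeholders):</p>
+<div class="codehilite"><pre>
+PType<Money> moneyType = Avros.derived(Money.class,
+    new MapFn<String, Money>() {
+      @Override
+      public Money map(String input) { return Money.parse(input); }  // input MapFn<S, T>
+    },
+    new MapFn<Money, String>() {
+      @Override
+      public String map(Money money) { return money.toString(); }    // output MapFn<T, S>
+    },
+    Avros.strings());                                                 // base PType<S>
+</pre></div>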
+<h3 id="reading-and-writing-data-sources-targets-and-sourcetargets">Reading
and Writing Data: Sources, Targets, and SourceTargets</h3>
+<p>MapReduce developers are familiar with the <code>InputFormat<K,
V></code> and <code>OutputFormat<K, V></code> classes for reading and
writing data during
+MapReduce processing. Crunch has the analogous concepts of a
<code>Source<T></code> for reading data and a <code>Target</code> for
writing data. For data
+sources that may be treated as both the output of one pipeline phase and the
input to another, Crunch has a <code>SourceTarget<T></code> interface
+that combines the functionality of both <code>Source<T></code> and
<code>Target</code>.</p>
+<p>Sources and Targets provide several useful extensions to the functionality
provided by InputFormat and OutputFormat. First, a Source can
+encapsulate an InputFormat as well as any special Configuration settings that
are needed by that InputFormat. For example, the
+<code>AvroInputFormat</code> needs to know the Avro schema of the input Avro
file and expects to find that schema associated with the "avro.schema" key
+in the <code>Configuration</code> object for a pipeline. But if you need to
read multiple Avro files, each with its own schema, during a single MapReduce
+job, you need a way of ensuring that the different schemas for each file do
not all overwrite the "avro.schema" key in the shared
+<code>Configuration</code> object. Crunch's <code>Source<T></code>
allows you to specify a set of key-value entries that need to be set in the
<code>Configuration</code>
+before a particular input is read in a way that prevents them from conflicting
with each other, while the Target interface provides the same
+functionality for OutputFormats.</p>
+<p>The <code>Source<T></code> interface has two useful extensions. The
first is <code>TableSource<K, V></code>, which extends
<code>Source<Pair<K, V>></code> and can be
+used to read in a <code>PTable<K, V></code> instance instead of a
<code>PCollection<Pair<K, V>></code> instance. The second extension
is <code>ReadableSource<T></code>, which
+declares an <code>Iterable<T> read(Configuration conf)</code> method that
allows the contents of the Source to be read directly, either into the client
+or into a DoFn implementation that can use the data read from the source to
perform additional transforms on the main input data that is
processed using the DoFn's <code>process</code> method (this is how Crunch
supports map-side join operations).</p>
+<p>Support for the most common Source, Target, and SourceTarget
implementations is provided by the factory functions declared in the
+<a href="apidocs/current/org/apache/crunch/io/From.html">From</a> (Sources),
<a href="apidocs/current/org/apache/crunch/io/To.html">To</a> (Targets), and
+<a href="apidocs/current/org/apache/crunch/io/At.html">At</a> (SourceTargets)
classes in the <a
href="apidocs/current/org/apache/crunch/io/package-summary.html">org.apache.crunch.io</a>
+package.</p>
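+<p>For example, a short sketch of reading and writing with these factories, where the paths are placeholders and
+<code>pipeline</code> is an existing Pipeline instance:</p>
+<div class="codehilite"><pre>
+// From, To, and At are in the org.apache.crunch.io package.
+PCollection<String> lines = pipeline.read(From.textFile("/data/input"));
+pipeline.write(lines, To.textFile("/data/output"));
+
+// A SourceTarget can act as the output of one phase and the input of another.
+SourceTarget<String> intermediate = At.textFile("/data/intermediate");
+pipeline.write(lines, intermediate);
+PCollection<String> again = pipeline.read(intermediate);
+</pre></div>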
<h3 id="pipeline-building-and-execution">Pipeline Building and Execution</h3>
<h4 id="creating-a-new-crunch-pipeline">Creating A New Crunch Pipeline</h4>
-<p>Section here on Configuration of pipelines.</p>
<h4 id="managing-pipeline-execution-and-cleanup">Managing Pipeline Execution
and Cleanup</h4>
<h2 id="more-information">More Information</h2>
<p><a href="pipelines.html">Writing Your Own Pipelines</a></p>