Author: buildbot
Date: Tue Nov 26 01:28:28 2013
New Revision: 888090
Log:
Staging update by buildbot for crunch
Modified:
websites/staging/crunch/trunk/content/ (props changed)
websites/staging/crunch/trunk/content/intro.html
Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Tue Nov 26 01:28:28 2013
@@ -1 +1 @@
-1545156
+1545498
Modified: websites/staging/crunch/trunk/content/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/intro.html (original)
+++ websites/staging/crunch/trunk/content/intro.html Tue Nov 26 01:28:28 2013
@@ -247,8 +247,8 @@ we cannot know at runtime what type of d
us with an object that contains this information: in our example word count
application, the object that tells us that we are working with strings is
returned by the <code>Writables.strings()</code> static method that is the
third argument to the <code>parallelDo</code> function in
<code>countWords</code>. Every <code>DoFn</code> instance must
return a type that has an associated object, called a
<code>PType<T></code>, that contains instructions for how to serialize
the data returned by that <code>DoFn</code>. By default, Crunch
-supports two serialization frameworks, called <em>type families</em>: one
based on Hadoop's <code>Writable</code> interface, and another based on
<code>Apache Avro</code>.
-You can read more about how to work with Crunch's serialization libraries
here. TODO</p>
+supports two serialization frameworks, called <em>type families</em>: one
based on Hadoop's <code>Writable</code> interface, and another based on
<code>Apache Avro</code>. Details
+on the type families are contained in the section on "Serializing Data with
PTypes" in this document.</p>
<p>Because all of the core logic in our application is exposed via a single
static method that operates on Crunch interfaces, we can use Crunch's
in-memory API to test our business logic using a unit testing framework like
JUnit. Let's look at an example unit test for the word count
application:</p>
@@ -390,7 +390,30 @@ contained in this class satisfies the co
interface, which is defined right alongside the CombineFn class in the
top-level <code>org.apache.crunch</code> package. There are a number of
implementations of the Aggregator
interface defined via static factory methods in the <a
href="apidocs/0.8.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
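+<p>As a rough sketch of how those factory methods are typically used (the
variable names here are illustrative), the combine step of a word count could
be written with the built-in sum aggregator:</p>
+<pre><code>// Sketch: combine per-word counts of 1 into totals using a built-in Aggregator
+// (Aggregators lives in org.apache.crunch.fn).
+PTable<String, Long> counts = wordOnes   // a PTable<String, Long> of (word, 1L) pairs
+    .groupByKey()
+    .combineValues(Aggregators.SUM_LONGS());
+</code></pre>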
<h3 id="serializing-data-with-ptypes">Serializing Data with PTypes</h3>
-<p>Why PTypes Are Necessary, the two type families, the core methods and
tuples.</p>
+<p>Every <code>PCollection<T></code> has an associated
<code>PType<T></code> that encapsulates the information on how to
serialize and deserialize the contents of that
+PCollection. PTypes are necessary because of <a
href="http://docs.oracle.com/javase/tutorial/java/generics/erasure.html">type
erasure</a>; at runtime, when
+the Crunch planner is mapping from PCollections to a series of MapReduce jobs,
the type of a PCollection (that is, the <code>T</code> in
<code>PCollection<T></code>)
+is no longer available to us, and must be provided by the associated PType
instance.</p>
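+<p>A minimal sketch of this point, using the in-memory pipeline (assuming
<code>MemPipeline.typedCollectionOf</code> is available here): the
<code>String</code> type parameter is erased at runtime, but the PType still
carries it:</p>
+<pre><code>// Sketch: the PType, not the erased generic parameter, holds the type information.
+PCollection<String> words = MemPipeline.typedCollectionOf(
+    Writables.strings(), "hello", "world");
+PType<String> type = words.getPType(); // recoverable at runtime via the PType
+</code></pre>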
+<p>Crunch supports two independent <em>type families</em>, which each
implement the <a
href="apidocs/0.8.0/org/apache/crunch/types/PTypeFamily.html">PTypeFamily</a>
interface:
+one based on Hadoop's <a
href="apidocs/0.8.0/org/apache/crunch/types/writable/WritableTypeFamily.html">Writable
interface</a> and another based on
+<a
href="apidocs/0.8.0/org/apache/crunch/types/avro/AvroTypeFamily.html">Apache
Avro</a>. There are also classes that contain static factory methods for
+each PTypeFamily to allow for easy import and usage: one for <a
href="apidocs/0.8.0/org/apache/crunch/types/writable/Writables.html">Writables</a>
and one for
+<a href="apidocs/0.8.0/org/apache/crunch/types/avro/Avros.html">Avros</a>.</p>
+<p>The two different type families exist for historical reasons: Writables
have long been the standard form for representing serializable data in Hadoop,
+but the Avro-based serialization scheme is very compact and fast, and allows
complex record schemas to evolve over time. It's fine (and even encouraged)
+to mix and match PCollections that use different PTypes in the same Crunch
pipeline (e.g., you could
+read in Writable data, do a shuffle using Avro, and then write the output data
as Writables), but each PCollection's PType must belong to a single
+type family; for example, you cannot have a PTable whose key is serialized as
a Writable and whose value is serialized as an Avro record.</p>
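+<p>A rough sketch of that kind of mixing (the input path and
<code>extractPairsFn</code>, a <code>DoFn<String, Pair<String, Long>></code>,
are hypothetical):</p>
+<pre><code>// Sketch: lines read as text, then shuffled using Avro-backed PTypes.
+PCollection<String> lines = pipeline.readTextFile("/path/to/input");
+PGroupedTable<String, Long> shuffled = lines
+    .parallelDo(extractPairsFn, Avros.tableOf(Avros.strings(), Avros.longs()))
+    .groupByKey(); // this shuffle serializes its data with the Avro type family
+</code></pre>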
+<h4 id="core-ptypes">Core PTypes</h4>
+<p>Both type families support a common set of primitive types (strings, longs,
ints, floats, doubles, booleans, and bytes) as well as more complex
PTypes that can be constructed out of other PTypes (a short usage sketch
follows the list):</p>
+<ol>
+<li>Tuples of other PTypes (<code>pairs</code>, <code>trips</code>,
<code>quads</code>, and <code>tuples</code> for arbitrary N),</li>
+<li>Collections of other PTypes (<code>collections</code> to create a
<code>Collection<T></code> and <code>maps</code> to return a
<code>Map<String, T></code>),</li>
+<li>and <code>tableOf</code> to construct a <code>PTableType<K,
V></code>, the PType used to distinguish a <code>PTable<K, V></code>
from a <code>PCollection<Pair<K, V>></code>.</li>
+</ol>
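+<p>A few of these constructors in use (a sketch; both type families expose the
same set of methods):</p>
+<pre><code>// Composite PTypes built up from the primitive ones:
+PType<Pair<String, Long>> pairType = Writables.pairs(Writables.strings(), Writables.longs());
+PType<Collection<Double>> collectionType = Writables.collections(Writables.doubles());
+PTableType<String, Long> tableType = Writables.tableOf(Writables.strings(), Writables.longs());
+</code></pre>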
+<p>Both of the type families have additional methods for working with records
that are specific to each serialization format (for example, the
+AvroTypeFamily contains methods to support Generic and Specific records as
well as Avro's reflection-based serialization).</p>
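+<p>For the Avro family, for instance, those record-oriented methods might be
used like this (a sketch; <code>MyRecord</code> and <code>MyPojo</code> are
hypothetical classes):</p>
+<pre><code>// Sketch of Avro record PTypes (the record classes are hypothetical):
+PType<MyRecord> specificType = Avros.records(MyRecord.class); // Avro specific record
+PType<MyPojo> reflectType = Avros.reflects(MyPojo.class);     // Avro reflection-based record
+</code></pre>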
<h4 id="extending-ptypes">Extending PTypes</h4>
<p>The simplest way to create a new <code>PType<T></code> for a data
object is to create a <em>derived</em> PType from one of the built-in PTypes
for the Avro
and Writable type families. If we have a base <code>PType<S></code>, we
can create a derived <code>PType<T></code> by implementing an input
<code>MapFn<S, T></code> and an