Author: buildbot
Date: Tue Nov 26 01:28:28 2013
New Revision: 888090
Log:
Staging update by buildbot for crunch
Modified:
websites/staging/crunch/trunk/content/ (props changed)
websites/staging/crunch/trunk/content/intro.html
Propchange: websites/staging/crunch/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Tue Nov 26 01:28:28 2013
@@ -1 +1 @@
-1545156
+1545498
Modified: websites/staging/crunch/trunk/content/intro.html
==============================================================================
--- websites/staging/crunch/trunk/content/intro.html (original)
+++ websites/staging/crunch/trunk/content/intro.html Tue Nov 26 01:28:28 2013
@@ -247,8 +247,8 @@ we cannot know at runtime what type of d
us with an object that contains this information: in our example word count
application, the object that tells us that we are working with strings is
returned by the <code>Writables.strings()</code> static method that is the
third argument to the <code>parallelDo</code> function in
<code>countWords</code>. Every <code>DoFn</code> instance must
return a type that has an associated object, called a
<code>PType<T></code>, that contains instructions for how to serialize
the data returned by that <code>DoFn</code>. By default, Crunch
-supports two serialization frameworks, called <em>type families</em>: one
based on Hadoop's <code>Writable</code> interface, and another based on
<code>Apache Avro</code>.
-You can read more about how to work with Crunch's serialization libraries
here. TODO</p>
+supports two serialization frameworks, called <em>type families</em>: one
based on Hadoop's <code>Writable</code> interface, and another based on
<code>Apache Avro</code>. Details
+on the type families are contained in the section on "Serializing Data with
PTypes" in this document.</p>
<p>Because all of the core logic in our application is exposed via a single
static method that operates on Crunch interfaces, we can use Crunch's
in-memory API to test our business logic using a unit testing framework like
JUnit. Let's look at an example unit test for the word count
application:</p>
@@ -390,7 +390,30 @@ contained in this class satisfies the co
interface, which is defined right alongside the CombineFn class in the
top-level <code>org.apache.crunch</code> package. There are a number of
implementations of the Aggregator
interface defined via static factory methods in the <a
href="apidocs/0.8.0/org/apache/crunch/fn/Aggregators.html">Aggregators</a>
class.</p>
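+<p>As a rough sketch of how those factory methods are typically used (the
variable names here are illustrative), the combine step of a word count could
be written with the built-in sum aggregator:</p>
+<pre><code>// Sketch: combine per-word counts of 1 into totals using a built-in Aggregator
+// (Aggregators lives in org.apache.crunch.fn).
+PTable<String, Long> counts = wordOnes   // a PTable<String, Long> of (word, 1L) pairs
+    .groupByKey()
+    .combineValues(Aggregators.SUM_LONGS());
+</code></pre>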
<h3 id="serializing-data-with-ptypes">Serializing Data with PTypes</h3>
-<p>Why PTypes Are Necessary, the two type families, the core methods and
tuples.</p>
+<p>Every <code>PCollection<T></code> has an associated
<code>PType<T></code> that encapsulates the information on how to
serialize and deserialize the contents of that
+PCollection. PTypes are necessary because of <a
href="http://docs.oracle.com/javase/tutorial/java/generics/erasure.html">type
erasure</a>; at runtime, when
+the Crunch planner is mapping from PCollections to a series of MapReduce jobs,
the type of a PCollection (that is, the <code>T</code> in
<code>PCollection<T></code>)
+is no longer available to us, and must be provided by the associated PType
instance.</p>
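+<p>A minimal sketch of this point, using the in-memory pipeline (assuming
<code>MemPipeline.typedCollectionOf</code> is available here): the
<code>String</code> type parameter is erased at runtime, but the PType still
carries it:</p>
+<pre><code>// Sketch: the PType, not the erased generic parameter, holds the type information.
+PCollection<String> words = MemPipeline.typedCollectionOf(
+    Writables.strings(), "hello", "world");
+PType<String> type = words.getPType(); // recoverable at runtime via the PType
+</code></pre>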
+<p>Crunch supports two independent <em>type families</em>, which each
implement the <a
href="apidocs/0.8.0/org/apache/crunch/types/PTypeFamily.html">PTypeFamily</a>
interface:
+one based on Hadoop's <a
href="apidocs/0.8.0/org/apache/crunch/types/writable/WritableTypeFamily.html">Writable
interface</a> and another based on
+<a
href="apidocs/0.8.0/org/apache/crunch/types/avro/AvroTypeFamily.html">Apache
Avro</a>. There are also classes that contain static factory methods for
+each PTypeFamily to allow for easy import and usage: one for <a
href="apidocs/0.8.0/org/apache/crunch/types/writable/Writables.html">Writables</a>
and one for
+<a href="apidocs/0.8.0/org/apache/crunch/types/avro/Avros.html">Avros</a>.</p>
+<p>The two different type families exist for historical reasons: Writables
have long been the standard form for representing serializable data in Hadoop,
+but the Avro-based serialization scheme is very compact and fast, and allows
complex record schemas to evolve over time. It's fine (and even encouraged)
+to mix and match PCollections that use different PTypes in the same Crunch
pipeline (e.g., you could
+read in Writable data, do a shuffle using Avro, and then write the output data
as Writables), but each PCollection's PType must belong to a single
+type family; for example, you cannot have a PTable whose key is serialized as
a Writable and whose value is serialized as an Avro record.</p>
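+<p>A rough sketch of that kind of mixing (the input path and
<code>extractPairsFn</code>, a <code>DoFn<String, Pair<String, Long>></code>,
are hypothetical):</p>
+<pre><code>// Sketch: lines read as text, then shuffled using Avro-backed PTypes.
+PCollection<String> lines = pipeline.readTextFile("/path/to/input");
+PGroupedTable<String, Long> shuffled = lines
+    .parallelDo(extractPairsFn, Avros.tableOf(Avros.strings(), Avros.longs()))
+    .groupByKey(); // this shuffle serializes its data with the Avro type family
+</code></pre>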
+<h4 id="core-ptypes">Core PTypes</h4>
+<p>Both type families support a common set of primitive types (strings, longs,
ints, floats, doubles, booleans, and bytes) as well as more complex
PTypes that can be constructed out of other PTypes (a short usage sketch
follows the list):</p>
+<ol>
+<li>Tuples of other PTypes (<code>pairs</code>, <code>trips</code>,
<code>quads</code>, and <code>tuples</code> for arbitrary N),</li>
+<li>Collections of other PTypes (<code>collections</code> to create a
<code>Collection<T></code> and <code>maps</code> to return a
<code>Map<String, T></code>),</li>
+<li>and <code>tableOf</code> to construct a <code>PTableType<K,
V></code>, the PType used to distinguish a <code>PTable<K, V></code>
from a <code>PCollection<Pair<K, V>></code>.</li>
+</ol>
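+<p>A few of these constructors in use (a sketch; both type families expose the
same set of methods):</p>
+<pre><code>// Composite PTypes built up from the primitive ones:
+PType<Pair<String, Long>> pairType = Writables.pairs(Writables.strings(), Writables.longs());
+PType<Collection<Double>> collectionType = Writables.collections(Writables.doubles());
+PTableType<String, Long> tableType = Writables.tableOf(Writables.strings(), Writables.longs());
+</code></pre>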
+<p>Both of the type families have additional methods for working with records
that are specific to each serialization format (for example, the
+AvroTypeFamily contains methods to support Generic and Specific records as
well as Avro's reflection-based serialization).</p>
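+<p>For the Avro family, for instance, those record-oriented methods might be
used like this (a sketch; <code>MyRecord</code> and <code>MyPojo</code> are
hypothetical classes):</p>
+<pre><code>// Sketch of Avro record PTypes (the record classes are hypothetical):
+PType<MyRecord> specificType = Avros.records(MyRecord.class); // Avro specific record
+PType<MyPojo> reflectType = Avros.reflects(MyPojo.class);     // Avro reflection-based record
+</code></pre>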
<h4 id="extending-ptypes">Extending PTypes</h4>
<p>The simplest way to create a new <code>PType<T></code> for a data
object is to create a <em>derived</em> PType from one of the built-in PTypes
for the Avro
and Writable type families. If we have a base <code>PType<S></code>, we
can create a derived <code>PType<T></code> by implementing an input
<code>MapFn<S, T></code> and an