intro.mdtext

jwills Mon, 25 Nov 2013 17:29:33 -0800

Author: jwills
Date: Tue Nov 26 01:28:17 2013
New Revision: 1545498

URL: http://svn.apache.org/r1545498
Log:
Add details on PType serialization


Modified:
    crunch/site/trunk/content/intro.mdtext

Modified: crunch/site/trunk/content/intro.mdtext
URL: 
http://svn.apache.org/viewvc/crunch/site/trunk/content/intro.mdtext?rev=1545498&r1=1545497&r2=1545498&view=diff
==============================================================================
--- crunch/site/trunk/content/intro.mdtext (original)
+++ crunch/site/trunk/content/intro.mdtext Tue Nov 26 01:28:17 2013
@@ -135,8 +135,8 @@ we cannot know at runtime what type of d
 us with an object that contains this information: in our example word count 
application, the object that tells us that we are working with strings is
 returned by the `Writables.strings()` static method that is the third argument 
to the `parallelDo` function in `countWords`. Every `DoFn` instance must
 return a type that has an associated object, called a `PType<T>`, that 
contains instructions for how to serialize the data returned by that `DoFn`. By 
default, Crunch
-supports two serialization frameworks, called _type families_: one based on 
Hadoop's `Writable` interface, and another based on `Apache Avro`.
-You can read more about how to work with Crunch's serialization libraries 
here. TODO
+supports two serialization frameworks, called _type families_: one based on 
Hadoop's `Writable` interface, and another based on `Apache Avro`. Details
+on the type families are contained in the section on "Serializing Data with 
PTypes" in this document.
 
 Because all of the core logic in our application is exposed via a single 
static method that operates on Crunch interfaces, we can use Crunch's
 in-memory API to test our business logic using a unit testing framework like 
JUnit. Let's look at an example unit test for the word count
@@ -307,7 +307,34 @@ interface defined via static factory met
 
 ### Serializing Data with PTypes
 
-Why PTypes Are Necessary, the two type families, the core methods and tuples.
+Every `PCollection<T>` has an associated `PType<T>` that encapsulates the 
information on how to serialize and deserialize the contents of that
+PCollection. PTypes are necessary because of [type 
erasure](http://docs.oracle.com/javase/tutorial/java/generics/erasure.html); at 
runtime, when
+the Crunch planner is mapping from PCollections to a series of MapReduce jobs, 
the type of a PCollection (that is, the `T` in `PCollection<T>`)
+is no longer available to us, and must be provided by the associated PType 
instance.
+
+Crunch supports two independent _type families_, each of which implements the 
[PTypeFamily](apidocs/0.8.0/org/apache/crunch/types/PTypeFamily.html) interface:
+one for Hadoop's [Writable 
interface](apidocs/0.8.0/org/apache/crunch/types/writable/WritableTypeFamily.html)
 and another based on
+[Apache Avro](apidocs/0.8.0/org/apache/crunch/types/avro/AvroTypeFamily.html). 
There are also classes that contain static factory methods for
+each PTypeFamily to allow for easy import and usage: one for 
[Writables](apidocs/0.8.0/org/apache/crunch/types/writable/Writables.html) and 
one for
+[Avros](apidocs/0.8.0/org/apache/crunch/types/avro/Avros.html).
+
+The two different type families exist for historical reasons: Writables have 
long been the standard form for representing serializable data in Hadoop,
+but the Avro-based serialization scheme is very compact and fast, and allows for 
complex record schemas to evolve over time. It's fine (and even encouraged)
+to mix and match PCollections that use different PTypes in the same Crunch 
pipeline (e.g., you could
+read in Writable data, do a shuffle using Avro, and then write the output data 
as Writables), but each PCollection's PType must belong to a single
+type family; for example, you cannot have a PTable whose key is serialized as 
a Writable and whose value is serialized as an Avro record.
+
+#### Core PTypes
+
+Both type families support a common set of primitive types (strings, longs, 
ints, floats, doubles, booleans, and bytes) as well as more complex
+PTypes that can be constructed out of other PTypes:
+
+1. Tuples of other PTypes (`pairs`, `trips`, `quads`, and `tuples` for 
arbitrary N),
+2. Collections of other PTypes (`collections` to create a `Collection<T>` and 
`maps` to return a `Map<String, T>`),
+3. and `tableOf` to construct a `PTableType<K, V>`, the PType used to 
distinguish a `PTable<K, V>` from a `PCollection<Pair<K, V>>`.
+
+Both of the type families have additional methods for working with records 
that are specific to each serialization format (for example, the
+AvroTypeFamily contains methods to support Generic and Specific records as 
well as Avro's reflection-based serialization).
 
 #### Extending PTypes
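
Note on the type-erasure rationale in the added text: the sketch below uses only the JDK to show why a separate object has to carry the element type at runtime. It does not use Crunch itself. `ErasureDemo`, `SimpleType`, `STRINGS`, and `roundTrip` are hypothetical names made up for illustration; `SimpleType<T>` only plays a role loosely similar to a `PType<T>` and is not Crunch's actual interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class ErasureDemo {

    /**
     * Hypothetical stand-in for a PType<T> (not Crunch's API): an explicit
     * object that carries the type information erasure discards, plus
     * instructions for serializing values of that type.
     */
    interface SimpleType<T> {
        Class<T> typeClass();
        byte[] serialize(T value);
        T deserialize(byte[] bytes);
    }

    /** Descriptor for strings, encoded as UTF-8 bytes. */
    static final SimpleType<String> STRINGS = new SimpleType<String>() {
        @Override public Class<String> typeClass() { return String.class; }
        @Override public byte[] serialize(String value) {
            return value.getBytes(StandardCharsets.UTF_8);
        }
        @Override public String deserialize(byte[] bytes) {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    };

    /** Both lists share one runtime class: the element type has been erased. */
    static boolean sameRuntimeClass() {
        List<String> words = new ArrayList<>();
        List<Long> counts = new ArrayList<>();
        return words.getClass() == counts.getClass();
    }

    /** Serializes and deserializes a value using only the descriptor. */
    static <T> T roundTrip(SimpleType<T> type, T value) {
        return type.deserialize(type.serialize(value));
    }

    public static void main(String[] args) {
        // The generic parameter cannot be recovered from the collection itself.
        System.out.println("same runtime class: " + sameRuntimeClass());
        // The descriptor object still knows the element type and its encoding.
        System.out.println("descriptor type: " + STRINGS.typeClass().getName());
        System.out.println("round trip: " + roundTrip(STRINGS, "hello"));
    }
}
```

In Crunch, the object passed as the last argument to `parallelDo` (for example, the value returned by `Writables.strings()`) fills this role for each `PCollection<T>`.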

svn commit: r1545498 - /crunch/site/trunk/content/intro.mdtext
