Author: jwills
Date: Fri Dec 14 01:27:33 2012
New Revision: 1421632
URL: http://svn.apache.org/viewvc?rev=1421632&view=rev
Log:
Trademarkification of the website
Modified:
incubator/crunch/site/trunk/content/crunch/download.mdtext
incubator/crunch/site/trunk/content/crunch/future-work.mdtext
incubator/crunch/site/trunk/content/crunch/getting-started.mdtext
incubator/crunch/site/trunk/content/crunch/index.mdtext
incubator/crunch/site/trunk/content/crunch/intro.mdtext
incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext
incubator/crunch/site/trunk/content/crunch/pipelines.mdtext
incubator/crunch/site/trunk/content/crunch/scrunch.mdtext
incubator/crunch/site/trunk/content/crunch/source-repository.mdtext
Modified: incubator/crunch/site/trunk/content/crunch/download.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/download.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/download.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/download.mdtext Fri Dec 14 01:27:33 2012
@@ -16,7 +16,7 @@ Notice: Licensed to the Apache Softwar
specific language governing permissions and limitations
under the License.
-Apache Crunch is distributed under the [Apache License 2.0][license].
+The Apache Crunch (incubating) libraries are distributed under the [Apache License 2.0][license].
The link in the Download column takes you to a list of mirrors based on
your location. Checksum and signature are located on Apache's main
Modified: incubator/crunch/site/trunk/content/crunch/future-work.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/future-work.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/future-work.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/future-work.mdtext Fri Dec 14 01:27:33 2012
@@ -16,11 +16,9 @@ Notice: Licensed to the Apache Softwar
specific language governing permissions and limitations
under the License.
-This section contains an almost certainly incomplete list of known limitations of Crunch and plans for future work.
+This section contains an almost certainly incomplete list of known limitations and plans for future work.
-* We would like to have easy support for reading and writing data from/to HCatalog.
-* The decision of how to split up processing tasks between dependent MapReduce jobs is very naiive right now- we simply
-delegate all of the work to the reduce stage of the predecessor job. We should take advantage of information about the
-expected size of different PCollections to optimize this processing.
-* The Crunch optimizer does not yet merge different groupByKey operations that run over the same input data into a single
+* We would like to have easy support for reading and writing data from/to the Hive metastore via the HCatalog
+APIs.
+* The optimizer does not yet merge different groupByKey operations that run over the same input data into a single
MapReduce job. Implementing this optimization will provide a major performance benefit for a number of problems.
Modified: incubator/crunch/site/trunk/content/crunch/getting-started.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/getting-started.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/getting-started.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/getting-started.mdtext Fri Dec 14 01:27:33 2012
@@ -16,13 +16,13 @@ Notice: Licensed to the Apache Softwar
specific language governing permissions and limitations
under the License.
-Crunch is developed against Apache Hadoop version 1.0.3 and is also tested against
-Apache Hadoop 2.0.0-alpha. Crunch should work with any version of Hadoop
-after 1.0.3 or 2.0.0-alpha, and is also known to work with distributions from
-vendors like Cloudera, Hortonworks, and IBM. Crunch is _not_ compatible with
-versions of Hadoop prior to 1.0.x or 2.0.x, such as Apache Hadoop 0.20.x.
+The Apache Crunch (incubating) library is developed against version 1.0.3 of the Apache Hadoop library,
+and is also tested against version 2.0.0-alpha. The library should also work with any version
+after 1.0.3 or 2.0.0-alpha, and is also known to work with distributions from vendors like Cloudera,
+Hortonworks, and IBM. The library is _not_ compatible with versions of Hadoop prior to 1.0.x or 2.0.x,
+such as version 0.20.x.
-The easiest way to get started with Crunch is to use its Maven archetype
+The easiest way to get started with the library is to use the Maven archetype
to generate a simple project. The archetype is available from Maven Central;
just enter the following command, answer a few questions, and you're ready to
go:
@@ -30,7 +30,7 @@ go:
<pre>
$ <strong>mvn archetype:generate -Dfilter=org.apache.crunch:crunch-archetype</strong>
[...]
-1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job for Apache Crunch.)
+1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job with the core library.)
Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : <strong>1</strong>
Define value for property 'groupId': : <strong>com.example</strong>
Define value for property 'artifactId': : <strong>crunch-demo</strong>
@@ -72,7 +72,7 @@ $ <strong>tree</strong>
`-- TokenizerTest.java
</pre>
-The `WordCount.java` file contains the main class that defines a Crunch-based
+The `WordCount.java` file contains the main class that defines a pipeline
application which is referenced from `pom.xml`.
Build the code:
@@ -92,9 +92,9 @@ $ <strong>hadoop jar target/hadoop-job-d
</pre>
The `<in>` parameter references a text file or a directory containing text
-files, while `<out>` is a directory where Crunch writes the final results to.
+files, while `<out>` is a directory where the pipeline writes the final results to.
-Crunch also lets you run applications from within an IDE, either as standalone
+The library also supports running applications from within an IDE, either as standalone
Java applications or from unit tests. All required dependencies are on Maven's
classpath so you can run the `WordCount` class directly without any additional
setup.
Modified: incubator/crunch/site/trunk/content/crunch/index.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/index.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/index.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/index.mdtext Fri Dec 14 01:27:33 2012
@@ -1,4 +1,4 @@
-Title: Apache Crunch
+Title: Apache Crunch ™
Subtitle: Simple and Efficient MapReduce Pipelines
Notice: Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -19,22 +19,25 @@ Notice: Licensed to the Apache Softwar
---
-> *Apache Crunch (incubating)* is a Java library for writing, testing, and
-> running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make
+> The *Apache Crunch (incubating)* Java library provides a framework for writing, testing, and
+> running MapReduce pipelines, and is based on Google's FlumeJava library. Its goal is to make
> pipelines that are composed of many user-defined functions simple to write,
> easy to test, and efficient to run.
---
-Running on top of [Hadoop MapReduce](http://hadoop.apache.org/mapreduce/), Apache
-Crunch provides a simple Java API for tasks like joining and data aggregation
-that are tedious to implement on plain MapReduce. For Scala users, there is also
-Scrunch, an idiomatic Scala API to Crunch.
+Running on top of [Hadoop MapReduce](http://hadoop.apache.org/mapreduce/), the Apache
+Crunch library is a simple Java API for tasks like joining and data aggregation
+that are tedious to implement on plain MapReduce. The APIs are especially useful when
+processing data that does not fit naturally into relational model, such as time series,
+serialized object formats like protocol buffers or Avro records, and HBase rows and columns.
+For Scala users, there is the Scrunch API, which is built on top of the Java APIs and
+includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
## Documentation
- * [Introduction to Apache Crunch](intro.html)
- * [Introduction to Scrunch](scrunch.html)
+ * [Introduction to the Apache Crunch API](intro.html)
+ * [Introduction to the Scrunch API](scrunch.html)
* [Current Limitations and Future Work](future-work.html)
## Disclaimer
Modified: incubator/crunch/site/trunk/content/crunch/intro.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/intro.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/intro.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/intro.mdtext Fri Dec 14 01:27:33 2012
@@ -18,12 +18,15 @@ Notice: Licensed to the Apache Softwar
## Build and Installation
-To use Crunch you first have to build the source code using Maven and install
+You can download the most recently released libraries from the [Download](download.html) page or from the Maven
+Central Repository.
+
+If you prefer, you can also build the libraries from the source code using Maven and install
it in your local repository:
mvn clean install
-This also runs the integration test suite which will take a while. Afterwards
+This also runs the integration test suite which will take a while to complete. Afterwards
you can run the bundled example applications such as WordCount:
hadoop jar crunch-examples/target/crunch-examples-*-job.jar org.apache.crunch.examples.WordCount <inputfile> <outputdir>
@@ -36,9 +39,9 @@ crunch-examples/src/main/resources/acces
### Data Model and Operators
-Crunch is centered around three interfaces that represent distributed datasets: `PCollection<T>`, `PTable<K, V>`, and `PGroupedTable<K, V>`.
+The Java API is centered around three interfaces that represent distributed datasets: `PCollection<T>`, `PTable<K, V>`, and `PGroupedTable<K, V>`.
-A `PCollection<T>` represents a distributed, unordered collection of elements of type T. For example, we represent a text file in Crunch as a
+A `PCollection<T>` represents a distributed, unordered collection of elements of type T. For example, we represent a text file as a
`PCollection<String>` object. PCollection provides a method, `parallelDo`,
that applies a function to each element in a PCollection in parallel,
and returns a new PCollection as its result.
@@ -57,13 +60,13 @@ joins.
### Pipeline Building and Execution
-Every Crunch pipeline starts with a `Pipeline` object that is used to coordinate building the pipeline and executing the underlying MapReduce
-jobs. For efficiency, Crunch uses lazy evaluation, so it will only construct MapReduce jobs from the different stages of the pipelines when
+Every pipeline starts with a `Pipeline` object that is used to coordinate building the pipeline and executing the underlying MapReduce
+jobs. For efficiency, the library uses lazy evaluation, so it will only construct MapReduce jobs from the different stages of the pipelines when
the Pipeline object's `run` or `done` methods are called.
## A Detailed Example
-Here is the classic WordCount application using Crunch:
+Here is the classic WordCount application using the APIs:
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
@@ -104,7 +107,7 @@ that is used to tell Hadoop where to fin
We now need to tell the Pipeline about the inputs it will be consuming. The Pipeline interface
defines a `readTextFile` method that takes in a String and returns a PCollection of Strings.
-In addition to text files, Crunch supports reading data from SequenceFiles and Avro container files,
+In addition to text files, the library supports reading data from SequenceFiles and Avro container files,
via the `SequenceFileSource` and `AvroFileSource` classes defined in the org.apache.crunch.io package.
Note that each PCollection is a _reference_ to a source of data- no data is actually loaded into a
@@ -112,7 +115,7 @@ PCollection on the client machine.
### Step 2: Splitting the lines of text into words
-Crunch defines a small set of primitive operations that can be composed in order to build complex data
+The library defines a small set of primitive operations that can be composed in order to build complex data
pipelines. The first of these primitives is the `parallelDo` function, which applies a function (defined
by a subclass of `DoFn`) to every record in a PCollection, and returns a new PCollection that contains
the results.
@@ -128,8 +131,8 @@ may have any number of output values wri
words, using a blank space as a separator, and emits the words from the split to the output PCollection.
The last argument to parallelDo is an instance of the `PType` interface, which specifies how the data
-in the output PCollection is serialized. While Crunch takes advantage of Java Generics to provide
-compile-time type safety, the generic type information is not available at runtime. Crunch needs to know
+in the output PCollection is serialized. While the API takes advantage of Java Generics to provide
+compile-time type safety, the generic type information is not available at runtime. The job planner needs to know
how to map the records stored in each PCollection into a Hadoop-supported serialization format in order
to read and write data to disk. Two serialization implementations are supported in crunch via the
`PTypeFamily` interface: a Writable-based system that is defined in the org.apache.crunch.types.writable
@@ -139,7 +142,7 @@ as well as utility methods for creating
### Step 3: Counting the words
-Out of Crunch's simple primitive operations, we can build arbitrarily complex chains of operations in order
+Out of the simple primitive operations, we can build arbitrarily complex chains of operations in order
to perform higher-level operations, like aggregations and joins, that can work on any type of input data.
Let's look at the implementation of the `Aggregate.count` function:
@@ -187,15 +190,15 @@ and the number one by extending the `Map
PTable instance, with the key being the PType of the PCollection and the value being the Long
implementation for this PTypeFamily.
-The next line features the second of Crunch's four operations, `groupByKey`. The groupByKey
+The next line features the second of the four primary operations, `groupByKey`. The groupByKey
operation may only be applied to a PTable, and returns an instance of the `PGroupedTable`
interface, which references the grouping of all of the values in the PTable that have the same key.
-The groupByKey operation is what triggers the reduce phase of a MapReduce within Crunch.
+The groupByKey operation is what triggers the reduce phase of a MapReduce.
-The last line in the function returns the output of the third of Crunch's four operations,
+The last line in the function returns the output of the third of the four primary operations,
`combineValues`. The combineValues operator takes a `CombineFn` as an argument, which is a
specialized subclass of DoFn that operates on an implementation of Java's Iterable interface. The
-use of combineValues (as opposed to parallelDo) signals to Crunch that the CombineFn may be used to
+use of combineValues (as opposed to parallelDo) signals to the planner that the CombineFn may be used to
aggregate values for the same key on the map side of a MapReduce job as well as the reduce side.
### Step 4: Writing the output and running the pipeline
Modified: incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext Fri Dec 14 01:27:33 2012
@@ -21,7 +21,7 @@ Notice: Licensed to the Apache Softwar
so we use plain HTML tables.
-->
-There are several mailing lists for Apache Crunch. To subscribe or unsubscribe
+There are several mailing lists for the Apache Crunch project. To subscribe or unsubscribe
to a list send mail to the respective administrative address given below. You
will then receive a confirmation mail with further instructions.
Modified: incubator/crunch/site/trunk/content/crunch/pipelines.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/pipelines.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/pipelines.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/pipelines.mdtext Fri Dec 14 01:27:33 2012
@@ -16,7 +16,7 @@ Notice: Licensed to the Apache Softwar
specific language governing permissions and limitations
under the License.
-This section discusses the different steps of creating your own Crunch pipelines in more detail.
+This section discusses the different steps of creating your own pipelines in more detail.
## Writing a DoFn
@@ -25,7 +25,7 @@ don't need them while still keeping them
### Serialization
-First, all DoFn instances are required to be `java.io.Serializable`. This is a key aspect of Crunch's design:
+First, all DoFn instances are required to be `java.io.Serializable`. This is a key aspect of the library's design:
once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce job, all of the state
of that DoFn is serialized so that it may be distributed to all of the nodes in the Hadoop cluster that
will be running that task. There are two important implications of this for developers:
@@ -53,14 +53,14 @@ are associated with a MapReduce stage, s
### Performing Cogroups and Joins
-In Crunch, cogroups and joins are performed on PTable instances that have the same key type. This section walks through
-the basic flow of a cogroup operation, explaining how this higher-level operation is composed of Crunch's four primitives.
-In general, these common operations are provided as part of the core Crunch library or in extensions, you do not need
+Cogroups and joins are performed on PTable instances that have the same key type. This section walks through
+the basic flow of a cogroup operation, explaining how this higher-level operation is composed of the four primitive operations.
+In general, these common operations are provided as part of the core library or in extensions, you do not need
to write them yourself. But it can be useful to understand how they work under the covers.
Assume we have a `PTable<K, U>` named "a" and a different `PTable<K, V>` named "b" that we would like to combine into a
single `PTable<K, Pair<Collection<U>, Collection<V>>>`. First, we need to apply parallelDo operations to a and b that
-convert them into the same Crunch type, `PTable<K, Pair<U, V>>`:
+convert them into the same PType, `PTable<K, Pair<U, V>>`:
// Perform the "tagging" operation as a parallelDo on PTable a
PTable<K, Pair<U, V>> aPrime = a.parallelDo("taga", new MapFn<Pair<K, U>, Pair<K, Pair<U, V>>>() {
Modified: incubator/crunch/site/trunk/content/crunch/scrunch.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/scrunch.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/scrunch.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/scrunch.mdtext Fri Dec 14 01:27:33 2012
@@ -1,5 +1,5 @@
Title: Scrunch
-Subtitle: A Scala Wrapper for Apache Crunch
+Subtitle: A Scala Wrapper for the Apache Crunch (incubating) Java API
Notice: Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
@@ -19,16 +19,16 @@ Notice: Licensed to the Apache Softwar
## Introduction
-Scrunch is an experimental Scala wrapper for Crunch, based on the same ideas as the
-[Cascade](http://days2011.scala-lang.org/node/138/282) project at Google, which created
-a Scala wrapper for FlumeJava.
+Scrunch is an experimental Scala wrapper for the Apache Crunch (incubating) Java API, based on the same ideas as the
+[Cascade](http://days2011.scala-lang.org/node/138/282) project at Google, which created a Scala wrapper for
+FlumeJava.
## Why Scala?
-In many ways, Scala is the perfect language for writing Crunch pipelines. Scala supports
+In many ways, Scala is the perfect language for writing MapReduce pipelines. Scala supports
a mixture of functional and object-oriented programming styles and has powerful type-inference
capabilities, allowing us to create complex pipelines using very few keystrokes. Here is
-the Scrunch analogue of the classic WordCount problem:
+an implementation of the classic WordCount problem using the Scrunch API:
import org.apache.crunch.io.{From => from}
import org.apache.crunch.scrunch._
@@ -46,7 +46,7 @@ the Scrunch analogue of the classic Word
}
The Scala compiler can infer the return type of the flatMap function as an Array[String], and
-the Scrunch wrapper uses the type inference mechanism to figure out how to serialize the
+the Scrunch wrapper code uses the type inference mechanism to figure out how to serialize the
data between the Map and Reduce stages. Here's a slightly more complex example, in which we
get the word counts for two different files and compute the deltas of how often different
words occur, and then only returns the words where the first file had more occurrences then
@@ -60,14 +60,10 @@ the second:
}
}
-Note that all of the functions are using Scala Tuples, not Crunch Tuples. Under the covers,
-Scrunch uses Scala's implicit type conversion mechanism to transparently convert data from the
-Crunch format to the Scala format and back again.
-
## Materializing Job Outputs
-Scrunch also incorporates Crunch's materialize functionality, which allows us to easily read
-the output of a Crunch pipeline into the client:
+The Scrunch API also incorporates the Java library's `materialize` functionality, which allows us to easily read
+the output of a MapReduce pipeline into the client:
class WordCountExample {
def hasHamlet = wordGt("shakespeare.txt", "maugham.txt").materialize.exists(_ == "hamlet")
@@ -75,13 +71,8 @@ the output of a Crunch pipeline into the
## Notes and Thanks
-Scrunch is alpha-quality code, written by someone who was learning Scala on the fly. There will be bugs,
-rough edges, and non-idiomatic Scala usage all over the place. This will improve with time, and we welcome
-contributions from Scala experts who are interested in helping us make Scrunch into a first-class project.
-
Scrunch emerged out of conversations with [Dmitriy Ryaboy](http://twitter.com/#!/squarecog),
[Oscar Boykin](http://twitter.com/#!/posco), and [Avi Bryant](http://twitter.com/#!/avibryant) from Twitter.
Many thanks to them for their feedback, guidance, and encouragement. We are also grateful to
[Matei Zaharia](http://twitter.com/#!/matei_zaharia), whose [Spark Project](http://www.spark-project.org/)
-inspired much of our implementation and was kind enough to loan us the ClosureCleaner implementation
-Spark developed for use in Scrunch.
+inspired much of the original Scrunch API implementation.
Modified: incubator/crunch/site/trunk/content/crunch/source-repository.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/source-repository.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/source-repository.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/source-repository.mdtext Fri Dec 14 01:27:33 2012
@@ -16,7 +16,7 @@ Notice: Licensed to the Apache Softwar
specific language governing permissions and limitations
under the License.
-Apache Crunch uses [Git](http://git-scm.com/) for version control. Run the
+The Apache Crunch (incubating) Project uses [Git](http://git-scm.com/) for version control. Run the
following command to clone the repository:
git clone https://git-wip-us.apache.org/repos/asf/incubator-crunch.git