crunch: download.mdtext future-work.mdtext getting-started.mdtext index.mdtext intro.mdtext mailing-lists.mdtext pipelines.mdtext scrunch.mdtext source-repository.mdtext

jwills Thu, 13 Dec 2012 17:28:16 -0800

Author: jwills
Date: Fri Dec 14 01:27:33 2012
New Revision: 1421632

URL: http://svn.apache.org/viewvc?rev=1421632&view=rev
Log:
Trademarkification of the website


Modified:
    incubator/crunch/site/trunk/content/crunch/download.mdtext
    incubator/crunch/site/trunk/content/crunch/future-work.mdtext
    incubator/crunch/site/trunk/content/crunch/getting-started.mdtext
    incubator/crunch/site/trunk/content/crunch/index.mdtext
    incubator/crunch/site/trunk/content/crunch/intro.mdtext
    incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext
    incubator/crunch/site/trunk/content/crunch/pipelines.mdtext
    incubator/crunch/site/trunk/content/crunch/scrunch.mdtext
    incubator/crunch/site/trunk/content/crunch/source-repository.mdtext

Modified: incubator/crunch/site/trunk/content/crunch/download.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/download.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/download.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/download.mdtext Fri Dec 14 01:27:33 2012
@@ -16,7 +16,7 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-Apache Crunch is distributed under the [Apache License 2.0][license].
+The Apache Crunch (incubating) libraries are distributed under the [Apache License 2.0][license].
 
 The link in the Download column takes you to a list of mirrors based on
 your location. Checksum and signature are located on Apache's main

Modified: incubator/crunch/site/trunk/content/crunch/future-work.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/future-work.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/future-work.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/future-work.mdtext Fri Dec 14 01:27:33 2012
@@ -16,11 +16,9 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-This section contains an almost certainly incomplete list of known limitations of Crunch and plans for future work.
+This section contains an almost certainly incomplete list of known limitations and plans for future work.
 
-* We would like to have easy support for reading and writing data from/to HCatalog.
-* The decision of how to split up processing tasks between dependent MapReduce jobs is very naiive right now- we simply
-delegate all of the work to the reduce stage of the predecessor job. We should take advantage of information about the
-expected size of different PCollections to optimize this processing.
-* The Crunch optimizer does not yet merge different groupByKey operations that run over the same input data into a single
+* We would like to have easy support for reading and writing data from/to the Hive metastore via the HCatalog
+APIs.
+* The optimizer does not yet merge different groupByKey operations that run over the same input data into a single
 MapReduce job. Implementing this optimization will provide a major performance benefit for a number of problems.

Modified: incubator/crunch/site/trunk/content/crunch/getting-started.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/getting-started.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/getting-started.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/getting-started.mdtext Fri Dec 14 01:27:33 2012
@@ -16,13 +16,13 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-Crunch is developed against Apache Hadoop version 1.0.3 and is also tested against
-Apache Hadoop 2.0.0-alpha. Crunch should work with any version of Hadoop
-after 1.0.3 or 2.0.0-alpha, and is also known to work with distributions from
-vendors like Cloudera, Hortonworks, and IBM. Crunch is _not_ compatible with
-versions of Hadoop prior to 1.0.x or 2.0.x, such as Apache Hadoop 0.20.x.
+The Apache Crunch (incubating) library is developed against version 1.0.3 of the Apache Hadoop library,
+and is also tested against version 2.0.0-alpha. The library should also work with any version
+after 1.0.3 or 2.0.0-alpha, and is also known to work with distributions from vendors like Cloudera,
+Hortonworks, and IBM. The library is _not_ compatible with versions of Hadoop prior to 1.0.x or 2.0.x,
+such as version 0.20.x.
 
-The easiest way to get started with Crunch is to use its Maven archetype
+The easiest way to get started with the library is to use the Maven archetype
 to generate a simple project. The archetype is available from Maven Central;
 just enter the following command, answer a few questions, and you're ready to
 go:
@@ -30,7 +30,7 @@ go:
 <pre>
 $ <strong>mvn archetype:generate -Dfilter=org.apache.crunch:crunch-archetype</strong>
 [...]
-1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job for Apache Crunch.)
+1: remote -> org.apache.crunch:crunch-archetype (Create a basic, self-contained job with the core library.)
 Choose a number or apply filter (format: [groupId:]artifactId, case sensitive contains): : <strong>1</strong>
 Define value for property 'groupId': : <strong>com.example</strong>
 Define value for property 'artifactId': : <strong>crunch-demo</strong>
@@ -72,7 +72,7 @@ $ <strong>tree</strong>
                     `-- TokenizerTest.java
 </pre>
  
-The `WordCount.java` file contains the main class that defines a Crunch-based
+The `WordCount.java` file contains the main class that defines a pipeline
 application which is referenced from `pom.xml`.
 
 Build the code:
@@ -92,9 +92,9 @@ $ <strong>hadoop jar target/hadoop-job-d
 </pre>
 
 The `<in>` parameter references a text file or a directory containing text
-files, while `<out>` is a directory where Crunch writes the final results to.
+files, while `<out>` is a directory where the pipeline writes the final results to.
 
-Crunch also lets you run applications from within an IDE, either as standalone
+The library also supports running applications from within an IDE, either as standalone
 Java applications or from unit tests. All required dependencies are on Maven's
 classpath so you can run the `WordCount` class directly without any additional
 setup.

Modified: incubator/crunch/site/trunk/content/crunch/index.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/index.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/index.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/index.mdtext Fri Dec 14 01:27:33 2012
@@ -1,4 +1,4 @@
-Title:    Apache Crunch
+Title:    Apache Crunch &trade;
 Subtitle: Simple and Efficient MapReduce Pipelines
 Notice:   Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
@@ -19,22 +19,25 @@ Notice:   Licensed to the Apache Softwar
 
 ---
 
-> *Apache Crunch (incubating)* is a Java library for writing, testing, and
-> running MapReduce pipelines, based on Google's FlumeJava. Its goal is to make
+> The *Apache Crunch (incubating)* Java library provides a framework for writing, testing, and
+> running MapReduce pipelines, and is based on Google's FlumeJava library. Its goal is to make
 > pipelines that are composed of many user-defined functions simple to write,
 > easy to test, and efficient to run.
 
 ---
 
-Running on top of [Hadoop MapReduce](http://hadoop.apache.org/mapreduce/), Apache
-Crunch provides a simple Java API for tasks like joining and data aggregation
-that are tedious to implement on plain MapReduce. For Scala users, there is also
-Scrunch, an idiomatic Scala API to Crunch.
+Running on top of [Hadoop MapReduce](http://hadoop.apache.org/mapreduce/), the Apache
+Crunch library is a simple Java API for tasks like joining and data aggregation
+that are tedious to implement on plain MapReduce. The APIs are especially useful when
+processing data that does not fit naturally into relational model, such as time series,
+serialized object formats like protocol buffers or Avro records, and HBase rows and columns.
+For Scala users, there is the Scrunch API, which is built on top of the Java APIs and
+includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
 
 ## Documentation
 
-  * [Introduction to Apache Crunch](intro.html)
-  * [Introduction to Scrunch](scrunch.html)
+  * [Introduction to the Apache Crunch API](intro.html)
+  * [Introduction to the Scrunch API](scrunch.html)
   * [Current Limitations and Future Work](future-work.html)
 
 ## Disclaimer

Modified: incubator/crunch/site/trunk/content/crunch/intro.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/intro.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/intro.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/intro.mdtext Fri Dec 14 01:27:33 2012
@@ -18,12 +18,15 @@ Notice:   Licensed to the Apache Softwar
 
 ## Build and Installation
 
-To use Crunch you first have to build the source code using Maven and install
+You can download the most recently released libraries from the [Download](download.html) page or from the Maven
+Central Repository.
+
+If you prefer, you can also build the libraries from the source code using Maven and install
 it in your local repository:
 
     mvn clean install
 
-This also runs the integration test suite which will take a while. Afterwards
+This also runs the integration test suite which will take a while to complete. Afterwards
 you can run the bundled example applications such as WordCount:
 
     hadoop jar crunch-examples/target/crunch-examples-*-job.jar org.apache.crunch.examples.WordCount <inputfile> <outputdir>
@@ -36,9 +39,9 @@ crunch-examples/src/main/resources/acces
 
 ### Data Model and Operators
 
-Crunch is centered around three interfaces that represent distributed datasets: `PCollection<T>`, `PTable<K, V>`, and `PGroupedTable<K, V>`.
+The Java API is centered around three interfaces that represent distributed datasets: `PCollection<T>`, `PTable<K, V>`, and `PGroupedTable<K, V>`.
 
-A `PCollection<T>` represents a distributed, unordered collection of elements of type T. For example, we represent a text file in Crunch as a
+A `PCollection<T>` represents a distributed, unordered collection of elements of type T. For example, we represent a text file as a
 `PCollection<String>` object. PCollection provides a method, `parallelDo`, that applies a function to each element in a PCollection in parallel,
 and returns a new PCollection as its result.
 
@@ -57,13 +60,13 @@ joins.
 
 ### Pipeline Building and Execution
 
-Every Crunch pipeline starts with a `Pipeline` object that is used to coordinate building the pipeline and executing the underlying MapReduce
-jobs. For efficiency, Crunch uses lazy evaluation, so it will only construct MapReduce jobs from the different stages of the pipelines when
+Every pipeline starts with a `Pipeline` object that is used to coordinate building the pipeline and executing the underlying MapReduce
+jobs. For efficiency, the library uses lazy evaluation, so it will only construct MapReduce jobs from the different stages of the pipelines when
 the Pipeline object's `run` or `done` methods are called.
 
 ## A Detailed Example
 
-Here is the classic WordCount application using Crunch:
+Here is the classic WordCount application using the APIs:
 
     import org.apache.crunch.DoFn;
     import org.apache.crunch.Emitter;
@@ -104,7 +107,7 @@ that is used to tell Hadoop where to fin
 
 We now need to tell the Pipeline about the inputs it will be consuming. The Pipeline interface
 defines a `readTextFile` method that takes in a String and returns a PCollection of Strings.
-In addition to text files, Crunch supports reading data from SequenceFiles and Avro container files,
+In addition to text files, the library supports reading data from SequenceFiles and Avro container files,
 via the `SequenceFileSource` and `AvroFileSource` classes defined in the org.apache.crunch.io package.
 
 Note that each PCollection is a _reference_ to a source of data- no data is actually loaded into a
@@ -112,7 +115,7 @@ PCollection on the client machine.
 
 ### Step 2: Splitting the lines of text into words
 
-Crunch defines a small set of primitive operations that can be composed in order to build complex data
+The library defines a small set of primitive operations that can be composed in order to build complex data
 pipelines. The first of these primitives is the `parallelDo` function, which applies a function (defined
 by a subclass of `DoFn`) to every record in a PCollection, and returns a new PCollection that contains
 the results.
@@ -128,8 +131,8 @@ may have any number of output values wri
 words, using a blank space as a separator, and emits the words from the split to the output PCollection.
 
 The last argument to parallelDo is an instance of the `PType` interface, which specifies how the data
-in the output PCollection is serialized. While Crunch takes advantage of Java Generics to provide
-compile-time type safety, the generic type information is not available at runtime. Crunch needs to know
+in the output PCollection is serialized. While the API takes advantage of Java Generics to provide
+compile-time type safety, the generic type information is not available at runtime. The job planner needs to know
 how to map the records stored in each PCollection into a Hadoop-supported serialization format in order
 to read and write data to disk. Two serialization implementations are supported in crunch via the
 `PTypeFamily` interface: a Writable-based system that is defined in the org.apache.crunch.types.writable
@@ -139,7 +142,7 @@ as well as utility methods for creating 
 
 ### Step 3: Counting the words
 
-Out of Crunch's simple primitive operations, we can build arbitrarily complex chains of operations in order
+Out of the simple primitive operations, we can build arbitrarily complex chains of operations in order
 to perform higher-level operations, like aggregations and joins, that can work on any type of input data.
 Let's look at the implementation of the `Aggregate.count` function:
 
@@ -187,15 +190,15 @@ and the number one by extending the `Map
 PTable instance, with the key being the PType of the PCollection and the value being the Long
 implementation for this PTypeFamily.
 
-The next line features the second of Crunch's four operations, `groupByKey`. The groupByKey
+The next line features the second of the four primary operations, `groupByKey`. The groupByKey
 operation may only be applied to a PTable, and returns an instance of the `PGroupedTable`
 interface, which references the grouping of all of the values in the PTable that have the same key.
-The groupByKey operation is what triggers the reduce phase of a MapReduce within Crunch.
+The groupByKey operation is what triggers the reduce phase of a MapReduce.
 
-The last line in the function returns the output of the third of Crunch's four operations,
+The last line in the function returns the output of the third of the four primary operations,
 `combineValues`. The combineValues operator takes a `CombineFn` as an argument, which is a
 specialized subclass of DoFn that operates on an implementation of Java's Iterable interface. The
-use of combineValues (as opposed to parallelDo) signals to Crunch that the CombineFn may be used to
+use of combineValues (as opposed to parallelDo) signals to the planner that the CombineFn may be used to
 aggregate values for the same key on the map side of a MapReduce job as well as the reduce side.
 
 ### Step 4: Writing the output and running the pipeline

Modified: incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/mailing-lists.mdtext Fri Dec 14 01:27:33 2012
@@ -21,7 +21,7 @@ Notice:   Licensed to the Apache Softwar
   so we use plain HTML tables.
 -->
 
-There are several mailing lists for Apache Crunch. To subscribe or unsubscribe
+There are several mailing lists for the Apache Crunch project. To subscribe or unsubscribe
 to a list send mail to the respective administrative address given below. You
 will then receive a confirmation mail with further instructions.
 

Modified: incubator/crunch/site/trunk/content/crunch/pipelines.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/pipelines.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/pipelines.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/pipelines.mdtext Fri Dec 14 01:27:33 2012
@@ -16,7 +16,7 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-This section discusses the different steps of creating your own Crunch pipelines in more detail.
+This section discusses the different steps of creating your own pipelines in more detail.
 
 ## Writing a DoFn
 
@@ -25,7 +25,7 @@ don't need them while still keeping them
 
 ### Serialization
 
-First, all DoFn instances are required to be `java.io.Serializable`. This is a key aspect of Crunch's design:
+First, all DoFn instances are required to be `java.io.Serializable`. This is a key aspect of the library's design:
 once a particular DoFn is assigned to the Map or Reduce stage of a MapReduce job, all of the state
 of that DoFn is serialized so that it may be distributed to all of the nodes in the Hadoop cluster that
 will be running that task. There are two important implications of this for developers:
@@ -53,14 +53,14 @@ are associated with a MapReduce stage, s
 
 ### Performing Cogroups and Joins
 
-In Crunch, cogroups and joins are performed on PTable instances that have the same key type. This section walks through
-the basic flow of a cogroup operation, explaining how this higher-level operation is composed of Crunch's four primitives.
-In general, these common operations are provided as part of the core Crunch library or in extensions, you do not need
+Cogroups and joins are performed on PTable instances that have the same key type. This section walks through
+the basic flow of a cogroup operation, explaining how this higher-level operation is composed of the four primitive operations.
+In general, these common operations are provided as part of the core library or in extensions, you do not need
 to write them yourself. But it can be useful to understand how they work under the covers.
 
 Assume we have a `PTable<K, U>` named "a" and a different `PTable<K, V>` named "b" that we would like to combine into a
 single `PTable<K, Pair<Collection<U>, Collection<V>>>`. First, we need to apply parallelDo operations to a and b that
-convert them into the same Crunch type, `PTable<K, Pair<U, V>>`:
+convert them into the same PType, `PTable<K, Pair<U, V>>`:
 
     // Perform the "tagging" operation as a parallelDo on PTable a
     PTable<K, Pair<U, V>> aPrime = a.parallelDo("taga", new MapFn<Pair<K, U>, Pair<K, Pair<U, V>>>() {

Modified: incubator/crunch/site/trunk/content/crunch/scrunch.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/scrunch.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/scrunch.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/scrunch.mdtext Fri Dec 14 01:27:33 2012
@@ -1,5 +1,5 @@
 Title:    Scrunch
-Subtitle: A Scala Wrapper for Apache Crunch
+Subtitle: A Scala Wrapper for the Apache Crunch (incubating) Java API
 Notice:   Licensed to the Apache Software Foundation (ASF) under one
           or more contributor license agreements.  See the NOTICE file
           distributed with this work for additional information
@@ -19,16 +19,16 @@ Notice:   Licensed to the Apache Softwar
 
 ## Introduction
 
-Scrunch is an experimental Scala wrapper for Crunch, based on the same ideas as the
-[Cascade](http://days2011.scala-lang.org/node/138/282) project at Google, which created
-a Scala wrapper for FlumeJava.
+Scrunch is an experimental Scala wrapper for the Apache Crunch (incubating) Java API, based on the same ideas as the
+[Cascade](http://days2011.scala-lang.org/node/138/282) project at Google, which created a Scala wrapper for
+FlumeJava.
 
 ## Why Scala?
 
-In many ways, Scala is the perfect language for writing Crunch pipelines. Scala supports
+In many ways, Scala is the perfect language for writing MapReduce pipelines. Scala supports
 a mixture of functional and object-oriented programming styles and has powerful type-inference
 capabilities, allowing us to create complex pipelines using very few keystrokes. Here is
-the Scrunch analogue of the classic WordCount problem:
+an implementation of the classic WordCount problem using the Scrunch API:
 
        import org.apache.crunch.io.{From => from}
        import org.apache.crunch.scrunch._
@@ -46,7 +46,7 @@ the Scrunch analogue of the classic Word
        }
 
 The Scala compiler can infer the return type of the flatMap function as an Array[String], and
-the Scrunch wrapper uses the type inference mechanism to figure out how to serialize the
+the Scrunch wrapper code uses the type inference mechanism to figure out how to serialize the
 data between the Map and Reduce stages. Here's a slightly more complex example, in which we
 get the word counts for two different files and compute the deltas of how often different
 words occur, and then only returns the words where the first file had more occurrences then
@@ -60,14 +60,10 @@ the second:
          }
        }
 
-Note that all of the functions are using Scala Tuples, not Crunch Tuples. Under the covers,
-Scrunch uses Scala's implicit type conversion mechanism to transparently convert data from the
-Crunch format to the Scala format and back again.
-
 ## Materializing Job Outputs
 
-Scrunch also incorporates Crunch's materialize functionality, which allows us to easily read
-the output of a Crunch pipeline into the client:
+The Scrunch API also incorporates the Java library's `materialize` functionality, which allows us to easily read
+the output of a MapReduce pipeline into the client:
 
        class WordCountExample {
          def hasHamlet = wordGt("shakespeare.txt", "maugham.txt").materialize.exists(_ == "hamlet")
@@ -75,13 +71,8 @@ the output of a Crunch pipeline into the
 
 ## Notes and Thanks
 
-Scrunch is alpha-quality code, written by someone who was learning Scala on the fly. There will be bugs,
-rough edges, and non-idiomatic Scala usage all over the place. This will improve with time, and we welcome
-contributions from Scala experts who are interested in helping us make Scrunch into a first-class project.
-
 Scrunch emerged out of conversations with [Dmitriy Ryaboy](http://twitter.com/#!/squarecog),
 [Oscar Boykin](http://twitter.com/#!/posco), and [Avi Bryant](http://twitter.com/#!/avibryant) from Twitter.
 Many thanks to them for their feedback, guidance, and encouragement. We are also grateful to
 [Matei Zaharia](http://twitter.com/#!/matei_zaharia), whose [Spark Project](http://www.spark-project.org/)
-inspired much of our implementation and was kind enough to loan us the ClosureCleaner implementation
-Spark developed for use in Scrunch.
+inspired much of the original Scrunch API implementation.

Modified: incubator/crunch/site/trunk/content/crunch/source-repository.mdtext
URL: http://svn.apache.org/viewvc/incubator/crunch/site/trunk/content/crunch/source-repository.mdtext?rev=1421632&r1=1421631&r2=1421632&view=diff
==============================================================================
--- incubator/crunch/site/trunk/content/crunch/source-repository.mdtext (original)
+++ incubator/crunch/site/trunk/content/crunch/source-repository.mdtext Fri Dec 14 01:27:33 2012
@@ -16,7 +16,7 @@ Notice:   Licensed to the Apache Softwar
           specific language governing permissions and limitations
           under the License.
 
-Apache Crunch uses [Git](http://git-scm.com/) for version control. Run the
+The Apache Crunch (incubating) Project uses [Git](http://git-scm.com/) for version control. Run the
 following command to clone the repository:
 
     git clone https://git-wip-us.apache.org/repos/asf/incubator-crunch.git
