http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/comparisons/spark-streaming.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/comparisons/spark-streaming.md 
b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
new file mode 100644
index 0000000..b8a521f
--- /dev/null
+++ b/docs/learn/documentation/versioned/comparisons/spark-streaming.md
@@ -0,0 +1,105 @@
+---
+layout: page
+title: Spark Streaming
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+*People generally want to know how similar systems compare. We've done our 
best to fairly contrast the feature sets of Samza with other systems. But we 
aren't experts in these frameworks, and we are, of course, totally biased. If 
we have goofed anything, please let us know and we will correct it.*
+
+[Spark 
Streaming](http://spark.apache.org/docs/latest/streaming-programming-guide.html)
 is a stream processing system that uses the core [Apache 
Spark](http://spark.apache.org/) API. Both Samza and Spark Streaming provide 
data consistency, fault tolerance, a programming API, etc. Spark's approach to 
streaming is different from Samza's. Samza processes messages as they are 
received, while Spark Streaming treats streaming as a series of deterministic 
batch operations. Spark Streaming groups the stream into batches of a fixed 
duration (such as 1 second). Each batch is represented as a Resilient 
Distributed Dataset 
([RDD](http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)). A 
never-ending sequence of these RDDs is called a Discretized Stream 
([DStream](http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf)).
+
+### Overview of Spark Streaming
+
+Before going into the comparison, here is a brief overview of how a Spark Streaming application works. If you are already familiar with Spark Streaming, you may skip this part. There are two main parts to a Spark Streaming application: data receiving and data processing.
+
+* Data receiving is accomplished by a [receiver](https://spark.apache.org/docs/latest/streaming-custom-receivers.html), which receives data and stores it in Spark (though not in an RDD at this point).
+* Data processing transfers the data stored in Spark into the DStream. You can then apply the two kinds of [operations](https://spark.apache.org/docs/latest/streaming-programming-guide.html#operations) -- transformations and output operations -- to the DStream. The operations available on a DStream differ slightly from those for a general Spark RDD because of the streaming environment.
+
+Here is an overview of Spark Streaming's [deployment](https://spark.apache.org/docs/latest/cluster-overview.html). Spark has a SparkContext (in Spark Streaming, it’s called a [StreamingContext](https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.streaming.StreamingContext)) object in the driver program. The SparkContext talks to a cluster manager (e.g. YARN, Mesos), which allocates resources (that is, executors) for the Spark application. Executors run tasks sent by the SparkContext ([read more](http://spark.apache.org/docs/latest/cluster-overview.html#components)). In YARN’s context, one executor is equivalent to one container, and tasks are what run inside the containers. The driver program runs on the client machine that submits the job ([client mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn)) or in the application master ([cluster mode](https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn)). Both data receiving and data processing are tasks for executors. One receiver (which receives one input stream) is a long-running task, while processing consists of many short-lived tasks. All of these tasks are sent to the available executors.
+
+### Ordering and Guarantees
+
+Spark Streaming guarantees ordered processing of batches in a DStream. Since messages are processed in batches by side-effect-free operators, the exact ordering of messages is not important in Spark Streaming. Spark Streaming does not guarantee at-least-once or at-most-once messaging semantics, because in some situations it may lose data when the driver program fails (see [fault-tolerance](#fault-tolerance)). In addition, because Spark Streaming requires transformation operations to be deterministic, it is unsuitable for nondeterministic processing, e.g. a randomized machine learning algorithm.
+
+Samza guarantees that messages are processed in the order they appear within a partition of a stream. Samza also allows you to define a deterministic ordering of messages between partitions using a [MessageChooser](../container/streams.html). It provides an at-least-once message delivery guarantee, and it does not require operations to be deterministic.
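+
+As a small, hedged illustration (assuming the task.chooser.class property from Samza's configuration reference; the factory class named here is hypothetical), plugging in a custom MessageChooser is a one-line configuration change:
+
+{% highlight jproperties %}
+# Control the order in which messages from different streams/partitions are
+# processed by naming a MessageChooserFactory implementation.
+# com.example.samza.MyChooserFactory is a hypothetical example class.
+task.chooser.class=com.example.samza.MyChooserFactory
+{% endhighlight %}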
+
+### State Management
+
+Spark Streaming provides a state DStream, which keeps the state for each key, and a transformation operation called [updateStateByKey](https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations) to mutate state. Every time updateStateByKey is applied, you get a new state DStream in which all of the state has been updated by applying the function passed to updateStateByKey. This transformation can serve as a basic key-value store (see the sketch after the following list), though it has a few drawbacks:
+
+* You can only apply DStream operations to your state, because it is essentially a DStream.
+* It does not provide random access to the data by key. If you want to look up the value for a particular key, you need to iterate over the whole DStream.
+* It is inefficient when the state is large, because every time a new batch is processed, Spark Streaming works through the entire state DStream to update the relevant keys and values.
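+
+For concreteness, here is a rough Java sketch of the running-count pattern from the Spark 1.x streaming guide; the Guava Optional type and the exact updateStateByKey signature reflect that era's Java API and may differ in other releases. Note that the update function is applied across the whole state DStream on every batch, which is the source of the inefficiency described above:
+
+{% highlight java %}
+import java.util.List;
+
+import com.google.common.base.Optional;
+import org.apache.spark.api.java.function.Function2;
+import org.apache.spark.streaming.api.java.JavaPairDStream;
+
+public class RunningCounts {
+  // Sum the values that arrived in this batch into the previous state for the key.
+  private static final Function2<List<Integer>, Optional<Integer>, Optional<Integer>> UPDATE =
+    new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
+      public Optional<Integer> call(List<Integer> newValues, Optional<Integer> state) {
+        int sum = state.or(0);
+        for (Integer value : newValues) {
+          sum += value;
+        }
+        return Optional.of(sum);
+      }
+    };
+
+  // Given a DStream of (key, 1) pairs, produce a state DStream of running counts.
+  public static JavaPairDStream<String, Integer> runningCounts(JavaPairDStream<String, Integer> pairs) {
+    return pairs.updateStateByKey(UPDATE);
+  }
+}
+{% endhighlight %}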
+
+Spark Streaming periodically writes the intermediate data of stateful operations (updateStateByKey and window-based operations) to HDFS. In the case of updateStateByKey, the entire state RDD is written to HDFS after every checkpointing interval. As we mention in *[in-memory state with checkpointing](../container/state-management.html#in-memory-state-with-checkpointing)*, writing the entire state to durable storage is very expensive when the state becomes large.
+
+Samza uses an embedded key-value store for [state management](../container/state-management.html#local-state-in-samza). This store is replicated as it is mutated, and supports very high throughput for both reads and writes. It also gives you a lot of flexibility to decide what kind of state you want to maintain. What is more, you can plug in other [storage engines](../container/state-management.html#other-storage-engines), which enables great flexibility in the stream processing algorithms you can use. A good comparison of different approaches to managing state can be found [here](../container/state-management.html#approaches-to-managing-task-state).
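+
+To make this concrete, here is a minimal sketch of a task that keeps a running count per key in its local store. The store name is hypothetical, and the store's factory and serdes are assumed to be configured under stores.* as described on the state management page linked above:
+
+{% highlight java %}
+import org.apache.samza.config.Config;
+import org.apache.samza.storage.kv.KeyValueStore;
+import org.apache.samza.system.IncomingMessageEnvelope;
+import org.apache.samza.task.InitableTask;
+import org.apache.samza.task.MessageCollector;
+import org.apache.samza.task.StreamTask;
+import org.apache.samza.task.TaskContext;
+import org.apache.samza.task.TaskCoordinator;
+
+public class PageViewCounterTask implements StreamTask, InitableTask {
+  private KeyValueStore<String, Integer> store;
+
+  @SuppressWarnings("unchecked")
+  public void init(Config config, TaskContext context) {
+    // "page-view-counts" is a hypothetical store name, declared in the job
+    // config under stores.page-view-counts.* (factory, serdes, changelog).
+    this.store = (KeyValueStore<String, Integer>) context.getStore("page-view-counts");
+  }
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    String memberId = (String) envelope.getKey();
+    Integer count = store.get(memberId);
+    store.put(memberId, count == null ? 1 : count + 1);
+  }
+}
+{% endhighlight %}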
+
+One of the common use cases in state management is a [stream-stream join](../container/state-management.html#stream-stream-join). Though Spark Streaming has a [join](https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations) operation, that operation only joins two batches that fall within the same time interval; it does not handle the situation where events in the two streams arrive in different intervals. Using Spark Streaming's updateStateByKey approach to buffer the unmatched events has its own limitation: if the number of unmatched events is large, the state becomes large, which causes the inefficiency described above. Samza does not have this limitation.
+
+### Partitioning and Parallelism
+
+Spark Streaming's parallelism is achieved by splitting the job into small tasks and sending them to executors. There are two types of [parallelism in Spark Streaming](http://spark.apache.org/docs/latest/streaming-programming-guide.html#level-of-parallelism-in-data-receiving): parallelism in receiving the stream and parallelism in processing the stream. On the receiving side, one input DStream creates one receiver, and one receiver receives one input stream of data and runs as a long-running task. So in order to parallelize receiving, you can split one input stream into multiple input streams based on some criterion (e.g. if you are receiving a Kafka stream with several partitions, you may split the stream by partition). You can then create multiple input DStreams (and therefore multiple receivers) for these streams, and the receivers will run as multiple tasks. Accordingly, you should provide enough resources by increasing the number of cores per executor or bringing up more executors. You can then combine all the input DStreams into one DStream during processing if necessary. On the processing side, since a DStream is a continuous sequence of RDDs, parallelism is simply achieved through normal RDD operations, such as map, reduceByKey and reduceByWindow (see [here](https://spark.apache.org/docs/latest/tuning.html#level-of-parallelism)).
+
+Samza’s parallelism is achieved by splitting processing into independent [tasks](../api/overview.html) that can run in parallel. You can run multiple tasks in one container, or just one task per container; that depends on your workload and latency requirements. For example, if you want to quickly [reprocess a stream](../jobs/reprocessing.html), you may increase the number of containers until there is one task per container. It is important to note that a container uses only [one thread](../container/event-loop.html), which maps to exactly one CPU core. This design simplifies resource management and the isolation between jobs.
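+
+As a small illustration (the property names follow the YARN-based configuration used in this documentation, and the topic name is hypothetical), the number of tasks is determined by the partitions of the input streams, while the number of containers is a single setting:
+
+{% highlight jproperties %}
+# One task is created per partition of the input stream(s).
+task.inputs=kafka.page-views
+
+# Run the job's tasks in 8 YARN containers. With 8 input partitions this
+# gives one task (and one CPU core) per container; with fewer containers,
+# several tasks share a container's single thread.
+yarn.container.count=8
+{% endhighlight %}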
+
+### Buffering &amp; Latency
+
+Spark Streaming is essentially a sequence of small batch jobs. With a fast execution engine, it can achieve latency as low as one second (according to its [paper](http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf)). If processing is slower than receiving, data queues up as DStreams in memory, and the queue keeps growing. In order to run a healthy Spark Streaming application, the system should be [tuned](http://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning) until it processes data as fast as it receives it.
+
+Samza jobs can have latency in the low milliseconds when running with Apache Kafka. Samza takes a different approach to buffering: the buffering mechanism depends on the input and output system. For example, when using [Kafka](http://kafka.apache.org/) as the input and output system, data is actually buffered to disk. This design decision, at the cost of a little latency, allows the buffer to absorb a large backlog of messages when a job has fallen behind in its processing.
+
+### Fault-tolerance
+
+There are two kinds of failure to deal with in both Spark Streaming and Samza: the failure of a worker node (which runs executors) in Spark Streaming, equivalent to a container failure in Samza, and the failure of the driver node (which runs the driver program), equivalent to an application master (AM) failure in Samza.
+
+When a worker node fails in Spark Streaming, it will be restarted by the cluster manager. When a container fails in Samza, the application master works with YARN to start a new container.
+
+When the driver node fails in Spark Streaming, Spark’s [standalone cluster mode](http://spark.apache.org/docs/latest/spark-standalone.html) will restart it automatically, but this is currently not supported on YARN or Mesos, so you need some other mechanism to restart the driver node automatically. Spark Streaming can use its checkpoint in HDFS to recreate the StreamingContext. When the AM fails in Samza, YARN handles restarting the AM, and Samza restarts all of its containers when the AM restarts.
+
+In terms of data loss, there is a difference between Spark Streaming and Samza. If the input is an active streaming system, such as Flume or Kafka, Spark Streaming may lose data if a failure happens after the data has been received but before it has been replicated to other nodes (see also [SPARK-1647](https://issues.apache.org/jira/browse/SPARK-1647)). Samza does not lose data in this situation because it uses [checkpointing](../container/checkpointing.html): it stores the offset of the latest processed message and only commits the checkpoint after the data has been processed, so there is no data-loss situation like the one Spark Streaming has. If a container fails, it resumes reading from the latest checkpoint. When a Samza job recovers from a failure, it is possible that it will process some data more than once. This happens because the job restarts at the last checkpoint, and any messages that had been processed between that checkpoint and the failure are processed again. The amount of reprocessed data can be minimized by setting a small checkpoint interval.
+
+### Deployment &amp; Execution
+
+Spark has a SparkContext object that talks to the cluster manager, which then allocates resources for the application. Currently Spark supports three types of cluster managers: [Spark standalone](http://spark.apache.org/docs/latest/spark-standalone.html), [Apache Mesos](http://mesos.apache.org/) and [Hadoop YARN](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html). Besides these, Spark has a script for launching on [Amazon EC2](http://spark.apache.org/docs/latest/ec2-scripts.html).
+
+Samza currently supports only YARN and local execution.
+
+### Isolation
+
+Spark Streaming and Samza provide the same level of isolation: Spark Streaming depends on its cluster manager (e.g. Mesos or YARN) and Samza depends on YARN to provide processor isolation. Different applications run in different JVMs, and data cannot be shared among applications unless it is written to external storage. Since Samza provides out-of-the-box Kafka integration, it is very easy to reuse the output of other Samza jobs (see [here](../introduction/concepts.html#dataflow-graphs)).
+
+### Language Support
+
+Spark Streaming is written in Java and Scala and provides Scala, Java, and 
Python APIs. Samza is written in Java and Scala and has a Java API.
+
+### Workflow
+
+In Spark Streaming, you build an entire processing graph with a DSL API and 
deploy that entire graph as one unit. The communication between the nodes in 
that graph (in the form of DStreams) is provided by the framework. That is similar to Storm. Samza is totally different -- each job is just a 
message-at-a-time processor, and there is no framework support for topologies. 
Output of a processing task always needs to go back to a message broker (e.g. 
Kafka).
+
+A positive consequence of Samza's design is that a job's output can be 
consumed by multiple unrelated jobs, potentially run by different teams, and 
those jobs are isolated from each other through Kafka's buffering. That is not 
the case with Storm's and Spark Streaming's framework-internal streams.
+
+Although a Storm/Spark Streaming job could in principle write its output to a 
message broker, the framework doesn't really make this easy. It seems that 
Storm/Spark aren't intended to be used in a way where one topology's output is another topology's input. By contrast, in Samza, that mode of usage is standard.
+
+### Maturity
+
+Spark has an active user and developer community, and recently released version 1.0.0. It has a list of companies that use it on its [Powered 
by](https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark) page. 
Since Spark contains Spark Streaming, Spark SQL, MLlib, GraphX and Bagel, it's 
tough to tell what portion of companies on the list are actually using Spark 
Streaming, and not just Spark.
+
+Samza is still young, but has just released version 0.7.0. It has a responsive 
community and is being developed actively. That said, it is built on solid 
systems such as YARN and Kafka. Samza is heavily used at LinkedIn and we hope 
others will find it useful as well.
+
+## [API Overview &raquo;](../api/overview.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/comparisons/storm.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/comparisons/storm.md 
b/docs/learn/documentation/versioned/comparisons/storm.md
new file mode 100644
index 0000000..58cb508
--- /dev/null
+++ b/docs/learn/documentation/versioned/comparisons/storm.md
@@ -0,0 +1,124 @@
+---
+layout: page
+title: Storm
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+*People generally want to know how similar systems compare. We've done our 
best to fairly contrast the feature sets of Samza with other systems. But we 
aren't experts in these frameworks, and we are, of course, totally biased. If 
we have goofed anything, please let us know and we will correct it.*
+
+[Storm](http://storm-project.net/) and Samza are fairly similar. Both systems 
provide many of the same high-level features: a partitioned stream model, a 
distributed execution environment, an API for stream processing, fault 
tolerance, Kafka integration, etc.
+
+Storm and Samza use different words for similar concepts: *spouts* in Storm 
are similar to stream consumers in Samza, *bolts* are similar to tasks, and 
*tuples* are similar to messages in Samza. Storm also has some additional 
building blocks which don't have direct equivalents in Samza.
+
+### Ordering and Guarantees
+
+Storm allows you to choose the level of guarantee with which you want your 
messages to be processed:
+
+* The simplest mode is *at-most-once delivery*, which drops messages if they 
are not processed correctly, or if the machine doing the processing fails. This 
mode requires no special logic, and processes messages in the order they were 
produced by the spout.
+* There is also *at-least-once delivery*, which tracks whether each input 
tuple (and any downstream tuples it generated) was successfully processed 
within a configured timeout, by keeping an in-memory record of all emitted 
tuples. Any tuples that are not fully processed within the timeout are 
re-emitted by the spout. This implies that a bolt may see the same tuple more 
than once, and that messages can be processed out-of-order. This mechanism also 
requires some co-operation from the user code, which must maintain the ancestry 
of records in order to properly acknowledge its input. This is explained in 
depth on [Storm's 
wiki](https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing).
+* Finally, Storm offers *exactly-once semantics* using its 
[Trident](https://github.com/nathanmarz/storm/wiki/Trident-tutorial) 
abstraction. This mode uses the same failure detection mechanism as the 
at-least-once mode. Tuples are actually processed at least once, but Storm's 
state implementation allows duplicates to be detected and ignored. (The 
duplicate detection only applies to state managed by Storm. If your code has 
other side-effects, e.g. sending messages to a service outside of the topology, 
it will not have exactly-once semantics.) In this mode, the spout breaks the 
input stream into batches, and processes batches in strictly sequential order.
+
+Samza also offers guaranteed delivery &mdash; currently only at-least-once 
delivery, but support for exactly-once semantics is planned. Within each stream 
partition, Samza always processes messages in the order they appear in the 
partition, but there is no guarantee of ordering across different input streams 
or partitions. This model allows Samza to offer at-least-once delivery without 
the overhead of ancestry tracking. In Samza, there would be no performance 
advantage to using at-most-once delivery (i.e. dropping messages on failure), 
which is why we don't offer that mode &mdash; message delivery is always 
guaranteed.
+
+Moreover, because Samza never processes messages in a partition out-of-order, 
it is better suited for handling keyed data. For example, if you have a stream 
of database updates &mdash; where later updates may replace earlier updates 
&mdash; then reordering the messages may change the final result. Provided that 
all updates for the same key appear in the same stream partition, Samza is able 
to guarantee a consistent state.
+
+### State Management
+
+Storm's lower-level API of bolts does not offer any help for managing state in 
a stream process. A bolt can maintain in-memory state (which is lost if that 
bolt dies), or it can make calls to a remote database to read and write state. 
However, a topology can usually process messages at a much higher rate than 
calls to a remote database can be made, so making a remote call for each 
message quickly becomes a bottleneck.
+
+As part of its higher-level Trident API, Storm offers automatic [state 
management](https://github.com/nathanmarz/storm/wiki/Trident-state). It keeps 
state in memory, and periodically checkpoints it to a remote database (e.g. 
Cassandra) for durability, so the cost of the remote database call is amortized 
over several processed tuples. By maintaining metadata alongside the state, 
Trident is able to achieve exactly-once processing semantics &mdash; for 
example, if you are counting events, this mechanism allows the counters to be 
correct, even when machines fail and tuples are replayed.
+
+Storm's approach of caching and batching state changes works well if the 
amount of state in each bolt is fairly small &mdash; perhaps less than 100kB. 
That makes it suitable for keeping track of counters, minimum, maximum and 
average values of a metric, and the like. However, if you need to maintain a 
large amount of state, this approach essentially degrades to making a database 
call per processed tuple, with the associated performance cost.
+
+Samza takes a [completely different 
approach](../container/state-management.html) to state management. Rather than 
using a remote database for durable storage, each Samza task includes an 
embedded key-value store, located on the same machine. Reads and writes to this 
store are very fast, even when the contents of the store are larger than the 
available memory. Changes to this key-value store are replicated to other 
machines in the cluster, so that if one machine dies, the state of the tasks it 
was running can be restored on another machine.
+
+By co-locating storage and processing on the same machine, Samza is able to 
achieve very high throughput, even when there is a large amount of state. This 
is necessary if you want to perform stateful operations that are not just 
counters. For example, if you want to perform a window join of multiple 
streams, or join a stream with a database table (replicated to Samza through a 
changelog), or group several related messages into a bigger message, then you 
need to maintain so much state that it is much more efficient to keep the state 
local to the task.
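+
+As a rough sketch of the stream-join case (the stream, topic and store names are hypothetical, and the store is assumed to be configured as a changelog-backed key-value store), a task might buffer one stream in its local store and emit a joined message when the matching event arrives on the other stream:
+
+{% highlight java %}
+import org.apache.samza.config.Config;
+import org.apache.samza.storage.kv.KeyValueStore;
+import org.apache.samza.system.IncomingMessageEnvelope;
+import org.apache.samza.system.OutgoingMessageEnvelope;
+import org.apache.samza.system.SystemStream;
+import org.apache.samza.task.InitableTask;
+import org.apache.samza.task.MessageCollector;
+import org.apache.samza.task.StreamTask;
+import org.apache.samza.task.TaskContext;
+import org.apache.samza.task.TaskCoordinator;
+
+public class AdClickJoinTask implements StreamTask, InitableTask {
+  private static final SystemStream OUTPUT = new SystemStream("kafka", "ad-click-joined");
+  private KeyValueStore<String, String> impressions;
+
+  @SuppressWarnings("unchecked")
+  public void init(Config config, TaskContext context) {
+    this.impressions = (KeyValueStore<String, String>) context.getStore("impressions");
+  }
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    String adId = (String) envelope.getKey();
+    String stream = envelope.getSystemStreamPartition().getStream();
+
+    if ("ad-impressions".equals(stream)) {
+      // Buffer the impression locally until the matching click arrives.
+      impressions.put(adId, (String) envelope.getMessage());
+    } else {
+      // A click: look up the buffered impression and emit the joined result.
+      String impression = impressions.get(adId);
+      if (impression != null) {
+        collector.send(new OutgoingMessageEnvelope(OUTPUT, adId, impression));
+        impressions.delete(adId);
+      }
+    }
+  }
+}
+{% endhighlight %}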
+
+A limitation of Samza's state handling is that it currently does not support 
exactly-once semantics &mdash; only at-least-once is supported right now. But 
we're working on fixing that, so stay tuned for updates.
+
+### Partitioning and Parallelism
+
+Storm's [parallelism 
model](https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology)
 is fairly similar to Samza's. Both frameworks split processing into 
independent *tasks* that can run in parallel. Resource allocation is 
independent of the number of tasks: a small job can keep all tasks in a single 
process on a single machine; a large job can spread the tasks over many 
processes on many machines.
+
+The biggest difference is that Storm uses one thread per task by default, 
whereas Samza uses single-threaded processes (containers). A Samza container 
may contain multiple tasks, but there is only one thread that invokes each of 
the tasks in turn. This means each container is mapped to exactly one CPU core, 
which makes the resource model much simpler and reduces interference from other 
tasks running on the same machine. Storm's multithreaded model can make better use of excess capacity on an idle machine, at the cost of a less predictable resource model.
+
+Storm supports *dynamic rebalancing*, which means adding more threads or 
processes to a topology without restarting the topology or cluster. This is a 
convenient feature, especially during development. We haven't added this to 
Samza: philosophically we feel that this kind of change should go through a 
normal configuration management process (i.e. version control, notification, 
etc.) as it impacts production performance. In other words, the code and 
configuration of the jobs should fully recreate the state of the cluster.
+
+When using a transactional spout with Trident (a requirement for achieving 
exactly-once semantics), parallelism is potentially reduced. Trident relies on 
a global ordering in its input streams &mdash; that is, ordering across all 
partitions of a stream, not just within one partition. This means that the 
topology's input stream has to go through a single spout instance, effectively 
ignoring the partitioning of the input stream. This spout may become a 
bottleneck on high-volume streams. In Samza, all stream processing is parallel 
&mdash; there are no such choke points.
+
+### Deployment &amp; Execution
+
+A Storm cluster is composed of a set of nodes running a *Supervisor* daemon. 
The supervisor daemons talk to a single master node running a daemon called 
*Nimbus*. The Nimbus daemon is responsible for assigning work and managing 
resources in the cluster. See Storm's 
[Tutorial](https://github.com/nathanmarz/storm/wiki/Tutorial) page for details. 
This is quite similar to YARN; though YARN is a bit more fully featured and 
intended to be multi-framework, Nimbus is better integrated with Storm.
+
+Yahoo! has also released [Storm-YARN](https://github.com/yahoo/storm-yarn). As 
described in [this Yahoo! blog 
post](http://developer.yahoo.com/blogs/ydn/storm-yarn-released-open-source-143745133.html),
 Storm-YARN is a wrapper that starts a single Storm cluster (complete with 
Nimbus, and Supervisors) inside a YARN grid.
+
+There are a lot of similarities between Storm's Nimbus and YARN's 
ResourceManager, as well as between Storm's Supervisor and YARN's Node 
Managers. Rather than writing our own resource management framework, or running 
a second one inside of YARN, we decided that Samza should use YARN directly, as 
a first-class citizen in the YARN ecosystem. YARN is stable, well adopted, 
fully-featured, and inter-operable with Hadoop. It also provides a bunch of 
nice features like security (user authentication), cgroup process isolation, 
etc.
+
+The YARN support in Samza is pluggable, so you can swap it for a different 
execution framework if you wish.
+
+### Language Support
+
+Storm is written in Java and Clojure but has good support for non-JVM 
languages. It follows a model similar to MapReduce Streaming: the non-JVM task 
is launched in a separate process, data is sent to its stdin, and output is 
read from its stdout.
+
+Samza is written in Java and Scala. It is built with multi-language support in 
mind, but currently only supports JVM languages.
+
+### Workflow
+
+Storm provides modeling of *topologies* (a processing graph of multiple 
stages) [in code](https://github.com/nathanmarz/storm/wiki/Tutorial). Trident 
provides a further [higher-level 
API](https://github.com/nathanmarz/storm/wiki/Trident-tutorial) on top of this, 
including familiar relational-like operators such as filters, grouping, 
aggregation and joins. This means the entire topology is wired up in one place, 
which has the advantage that it is documented in code, but has the disadvantage 
that the entire topology needs to be developed and deployed as a whole.
+
+In Samza, each job is an independent entity. You can define multiple jobs in a 
single codebase, or you can have separate teams working on different jobs using 
different codebases. Each job is deployed, started and stopped independently. 
Jobs communicate only through named streams, and you can add jobs to the system 
without affecting any other jobs. This makes Samza well suited for handling the 
data flow in a large company.
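+
+For example (the job and topic names here are hypothetical), wiring one team's job to consume another team's output is just a matter of naming the shared stream in the downstream job's configuration:
+
+{% highlight jproperties %}
+# Team A's job writes its results (from its task code) to the Kafka topic
+# "page-view-counts". Team B's job consumes that topic simply by declaring
+# it as an input -- no coordinated deployment of the two jobs is needed.
+job.name=team-b-dashboard-feed
+task.inputs=kafka.page-view-counts
+{% endhighlight %}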
+
+Samza's approach can be emulated in Storm by connecting two separate 
topologies via a broker, such as Kafka. However, Storm's implementation of 
exactly-once semantics only works within a single topology.
+
+### Maturity
+
+We can't speak to Storm's maturity, but it has an [impressive number of 
adopters](https://github.com/nathanmarz/storm/wiki/Powered-By), a strong 
feature set, and seems to be under active development. It integrates well with 
many common messaging systems (RabbitMQ, Kestrel, Kafka, etc).
+
+Samza is pretty immature, though it builds on solid components. YARN is fairly 
new, but is already being run on 3000+ node clusters at Yahoo!, and the project 
is under active development by both [Hortonworks](http://hortonworks.com/) and 
[Cloudera](http://www.cloudera.com/content/cloudera/en/home.html). Kafka has a 
strong [powered by](https://cwiki.apache.org/KAFKA/powered-by.html) page, and 
has seen increased adoption recently. It's also frequently used with Storm. 
Samza is a brand new project that is in use at LinkedIn. Our hope is that 
others will find it useful, and adopt it as well.
+
+### Buffering &amp; Latency
+
+Storm uses [ZeroMQ](http://zeromq.org/) for non-durable communication between 
bolts, which enables extremely low latency transmission of tuples. Samza does 
not have an equivalent mechanism, and always writes task output to a stream.
+
+On the flip side, when a bolt is trying to send messages using ZeroMQ, and the 
consumer can't read them fast enough, the ZeroMQ buffer in the producer's 
process begins to fill up with messages. If this buffer grows too much, the 
topology's processing timeout may be reached, which causes messages to be 
re-emitted at the spout and makes the problem worse by adding even more 
messages to the buffer. In order to prevent such overflow, you can configure a 
maximum number of messages that can be in flight in the topology at any one 
time; when that threshold is reached, the spout blocks until some of the 
messages in flight are fully processed. This mechanism allows back pressure, 
but requires 
[topology.max.spout.pending](http://nathanmarz.github.io/storm/doc/backtype/storm/Config.html#TOPOLOGY_MAX_SPOUT_PENDING)
to be carefully configured. If a single bolt in a topology starts running slowly, the processing in the entire topology grinds to a halt.
+
+The lack of a broker between bolts also adds complexity when trying to deal with fault tolerance and messaging semantics. Storm has a [clever 
mechanism](https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing)
 for detecting tuples that failed to be processed, but Samza doesn't need such 
a mechanism because every input and output stream is fault-tolerant and 
replicated.
+
+Samza takes a different approach to buffering. We buffer to disk at every hop between StreamTasks. This decision, and its trade-offs, are described in 
detail on the [Comparison Introduction](introduction.html) page. This design 
decision makes durability guarantees easy, and has the advantage of allowing 
the buffer to absorb a large backlog of messages if a job has fallen behind in 
its processing. However, it comes at the price of slightly higher latency.
+
+As described in the *workflow* section above, Samza's approach can be emulated 
in Storm, but comes with a loss in functionality.
+
+### Isolation
+
+Storm provides standard UNIX process-level isolation. Your topology can impact 
another topology's performance (or vice-versa) if too much CPU, disk, network, 
or memory is used.
+
+Samza relies on YARN to provide resource-level isolation. Currently, YARN 
provides explicit controls for memory and CPU limits (through 
[cgroups](../yarn/isolation.html)), and both have been used successfully with 
Samza. No isolation for disk or network is provided by YARN at this time.
+
+### Distributed RPC
+
+In Storm, you can write topologies which not only accept a stream of fixed 
events, but also allow clients to run distributed computations on demand. The 
query is sent into the topology as a tuple on a special spout, and when the 
topology has computed the answer, it is returned to the client (who was 
synchronously waiting for the answer). This facility is called [Distributed 
RPC](https://github.com/nathanmarz/storm/wiki/Distributed-RPC) (DRPC).
+
+Samza does not currently have an equivalent API to DRPC, but you can build it 
yourself using Samza's stream processing primitives.
+
+### Data Model
+
+Storm models all messages as *tuples* with a defined data model but pluggable 
serialization.
+
+Samza's serialization and data model are both pluggable. We are not terribly 
opinionated about which approach is best.
+
+## [Spark Streaming &raquo;](spark-streaming.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/container/checkpointing.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/container/checkpointing.md 
b/docs/learn/documentation/versioned/container/checkpointing.md
new file mode 100644
index 0000000..6f8c6d6
--- /dev/null
+++ b/docs/learn/documentation/versioned/container/checkpointing.md
@@ -0,0 +1,124 @@
+---
+layout: page
+title: Checkpointing
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Samza provides fault-tolerant processing of streams: Samza guarantees that 
messages won't be lost, even if your job crashes, if a machine dies, if there 
is a network fault, or something else goes wrong. In order to provide this 
guarantee, Samza expects the [input system](streams.html) to meet the following 
requirements:
+
+* The stream may be sharded into one or more *partitions*. Each partition is 
independent from the others, and is replicated across multiple machines (the 
stream continues to be available, even if a machine fails).
+* Each partition consists of a sequence of messages in a fixed order. Each 
message has an *offset*, which indicates its position in that sequence. 
Messages are always consumed sequentially within each partition.
+* A Samza job can start consuming the sequence of messages from any starting 
offset.
+
+Kafka meets these requirements, but they can also be implemented with other 
message broker systems.
+
+As described in the [section on SamzaContainer](samza-container.html), each 
task instance of your job consumes one partition of an input stream. Each task 
has a *current offset* for each input stream: the offset of the next message to 
be read from that stream partition. Every time a message is read from the 
stream, the current offset moves forwards.
+
+If a Samza container fails, it needs to be restarted (potentially on another 
machine) and resume processing where the failed container left off. In order to 
enable this, a container periodically checkpoints the current offset for each 
task instance.
+
+<img 
src="/img/{{site.version}}/learn/documentation/container/checkpointing.svg" 
alt="Illustration of checkpointing" class="diagram-large">
+
+When a Samza container starts up, it looks for the most recent checkpoint and 
starts consuming messages from the checkpointed offsets. If the previous 
container failed unexpectedly, the most recent checkpoint may be slightly 
behind the current offsets (i.e. the job may have consumed some more messages 
since the last checkpoint was written), but we can't know for sure. In that 
case, the job may process a few messages again.
+
+This guarantee is called *at-least-once processing*: Samza ensures that your 
job doesn't miss any messages, even if containers need to be restarted. 
However, it is possible for your job to see the same message more than once 
when a container is restarted. We are planning to address this in a future 
version of Samza, but for now it is just something to be aware of: for example, 
if you are counting page views, a forcefully killed container could cause 
events to be slightly over-counted. You can reduce duplication by checkpointing 
more frequently, at a slight performance cost.
+
+For checkpoints to be effective, they need to be written somewhere where they 
will survive faults. Samza allows you to write checkpoints to the file system 
(using FileSystemCheckpointManager), but that doesn't help if the machine fails 
and the container needs to be restarted on another machine. The most common 
configuration is to use Kafka for checkpointing. You can enable this with the 
following job configuration:
+
+{% highlight jproperties %}
+# The name of your job determines the name under which checkpoints will be 
stored
+job.name=example-job
+
+# Define a system called "kafka" for consuming and producing to a Kafka cluster
+systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
+
+# Declare that we want our job's checkpoints to be written to Kafka
+task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
+task.checkpoint.system=kafka
+
+# By default, a checkpoint is written every 60 seconds. You can change this if 
you like.
+task.commit.ms=60000
+{% endhighlight %}
+
+In this configuration, Samza writes checkpoints to a separate Kafka topic 
called \_\_samza\_checkpoint\_&lt;job-name&gt;\_&lt;job-id&gt; (in the example 
configuration above, the topic would be called 
\_\_samza\_checkpoint\_example-job\_1). Once per minute, Samza automatically 
sends a message to this topic, in which the current offsets of the input 
streams are encoded. When a Samza container starts up, it looks for the most 
recent offset message in this topic, and loads that checkpoint.
+
+Sometimes it can be useful to use checkpoints only for some input streams, but 
not for others. In this case, you can tell Samza to ignore any checkpointed 
offsets for a particular stream name:
+
+{% highlight jproperties %}
+# Ignore any checkpoints for the topic "my-special-topic"
+systems.kafka.streams.my-special-topic.samza.reset.offset=true
+
+# Always start consuming "my-special-topic" at the oldest available offset
+systems.kafka.streams.my-special-topic.samza.offset.default=oldest
+{% endhighlight %}
+
+The following table explains the meaning of these configuration parameters:
+
+<table class="table table-condensed table-bordered table-striped">
+  <thead>
+    <tr>
+      <th>Parameter name</th>
+      <th>Value</th>
+      <th>Meaning</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td rowspan="2" 
class="nowrap">systems.&lt;system&gt;.<br>streams.&lt;stream&gt;.<br>samza.reset.offset</td>
+      <td>false (default)</td>
+      <td>When container starts up, resume processing from last checkpoint</td>
+    </tr>
+    <tr>
+      <td>true</td>
+      <td>Ignore checkpoint (pretend that no checkpoint is present)</td>
+    </tr>
+    <tr>
+      <td rowspan="2" 
class="nowrap">systems.&lt;system&gt;.<br>streams.&lt;stream&gt;.<br>samza.offset.default</td>
+      <td>upcoming (default)</td>
+      <td>When container starts and there is no checkpoint (or the checkpoint 
is ignored), only process messages that are published after the job is started, 
but no old messages</td>
+    </tr>
+    <tr>
+      <td>oldest</td>
+      <td>When container starts and there is no checkpoint (or the checkpoint 
is ignored), jump back to the oldest available message in the system, and 
consume all messages from that point onwards (most likely this means repeated 
processing of messages already seen previously)</td>
+    </tr>
+  </tbody>
+</table>
+
+Note that the example configuration above causes your tasks to start consuming 
from the oldest offset *every time a container starts up*. This is useful in 
case you have some in-memory state in your tasks that you need to rebuild from 
source data in an input stream. If you are using streams in this way, you may 
also find [bootstrap streams](streams.html) useful.
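+
+For example, a minimal sketch of declaring such a stream as a bootstrap stream (the bootstrap property is described on the streams page linked above; the topic name is hypothetical):
+
+{% highlight jproperties %}
+# Treat "my-special-topic" as a bootstrap stream: on startup, consume it from
+# the oldest available offset and catch up fully before processing other input.
+systems.kafka.streams.my-special-topic.samza.bootstrap=true
+{% endhighlight %}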
+
+### Manipulating Checkpoints Manually
+
+If you want to make a one-off change to a job's consumer offsets, for example 
to force old messages to be [processed again](../jobs/reprocessing.html) with a 
new version of your code, you can use CheckpointTool to inspect and manipulate 
the job's checkpoint. The tool is included in Samza's [source 
repository](/contribute/code.html).
+
+To inspect a job's latest checkpoint, you need to specify your job's config 
file, so that the tool knows which job it is dealing with:
+
+{% highlight bash %}
+samza-example/target/bin/checkpoint-tool.sh \
+  --config-path=file:///path/to/job/config.properties
+{% endhighlight %}
+
+This command prints out the latest checkpoint in a properties file format. You 
can save the output to a file, and edit it as you wish. For example, to jump 
back to the oldest possible point in time, you can set all the offsets to 0. 
Then you can feed that properties file back into checkpoint-tool.sh and save 
the modified checkpoint:
+
+{% highlight bash %}
+samza-example/target/bin/checkpoint-tool.sh \
+  --config-path=file:///path/to/job/config.properties \
+  --new-offsets=file:///path/to/new/offsets.properties
+{% endhighlight %}
+
+Note that Samza only reads checkpoints on container startup. In order for your 
checkpoint change to take effect, you need to first stop the job, then save the 
modified offsets, and then start the job again. If you write a checkpoint while 
the job is running, it will most likely have no effect.
+
+## [State Management &raquo;](state-management.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/container/event-loop.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/container/event-loop.md 
b/docs/learn/documentation/versioned/container/event-loop.md
new file mode 100644
index 0000000..f0f21b0
--- /dev/null
+++ b/docs/learn/documentation/versioned/container/event-loop.md
@@ -0,0 +1,60 @@
+---
+layout: page
+title: Event Loop
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+The event loop is the [container](samza-container.html)'s single thread that 
is in charge of [reading and writing messages](streams.html), [flushing 
metrics](metrics.html), [checkpointing](checkpointing.html), and 
[windowing](windowing.html).
+
+Samza uses a single thread because every container is designed to use a single 
CPU core; to get more parallelism, simply run more containers. This uses a bit 
more memory than multithreaded parallelism, because each JVM has some overhead, 
but it simplifies resource management and improves isolation between jobs. This 
helps Samza jobs run reliably on a multitenant cluster, where many different 
jobs written by different people are running at the same time.
+
+You are strongly discouraged from using threads in your job's code. Samza uses 
multiple threads internally for communicating with input and output streams, 
but all message processing and user code runs on a single-threaded event loop. 
In general, Samza is not thread-safe.
+
+### Event Loop Internals
+
+A container may have multiple 
[SystemConsumers](../api/javadocs/org/apache/samza/system/SystemConsumer.html) 
for consuming messages from different input systems. Each SystemConsumer reads 
messages on its own thread, but writes messages into a shared in-process 
message queue. The container uses this queue to funnel all of the messages into 
the event loop.
+
+The event loop works as follows:
+
+1. Take a message from the incoming message queue;
+2. Give the message to the appropriate [task instance](samza-container.html) 
by calling process() on it;
+3. Call window() on the task instance if it implements 
[WindowableTask](../api/javadocs/org/apache/samza/task/WindowableTask.html), 
and the window time has expired;
+4. Send any output from the process() and window() calls to the appropriate 
[SystemProducers](../api/javadocs/org/apache/samza/system/SystemProducer.html);
+5. Write checkpoints for any tasks whose [commit interval](checkpointing.html) 
has elapsed.
+
+The container does this, in a loop, until it is shut down. Note that although 
there can be multiple task instances within a container (depending on the 
number of input stream partitions), their process() and window() methods are 
all called on the same thread, never concurrently on different threads.
+
+### Lifecycle Listeners
+
+Sometimes, you need to run your own code at specific points in a task's 
lifecycle. For example, you might want to set up some context in the container 
whenever a new message arrives, or perform some operations on startup or 
shutdown.
+
+To receive notifications when such events happen, you can implement the 
[TaskLifecycleListenerFactory](../api/javadocs/org/apache/samza/task/TaskLifecycleListenerFactory.html)
 interface. It returns a 
[TaskLifecycleListener](../api/javadocs/org/apache/samza/task/TaskLifecycleListener.html),
 whose methods are called by Samza at the appropriate times.
+
+You can then tell Samza to use your lifecycle listener with the following 
properties in your job configuration:
+
+{% highlight jproperties %}
+# Define a listener called "my-listener" by giving the factory class name
+task.lifecycle.listener.my-listener.class=com.example.foo.MyListenerFactory
+
+# Enable it in this job (multiple listeners can be separated by commas)
+task.lifecycle.listeners=my-listener
+{% endhighlight %}
+
+The Samza container creates one instance of your 
[TaskLifecycleListener](../api/javadocs/org/apache/samza/task/TaskLifecycleListener.html).
 If the container has multiple task instances (processing different input 
stream partitions), the beforeInit, afterInit, beforeClose and afterClose 
methods are called for each task instance. The 
[TaskContext](../api/javadocs/org/apache/samza/task/TaskContext.html) argument 
of those methods gives you more information about the partitions.
+
+## [JMX &raquo;](jmx.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/container/jmx.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/container/jmx.md 
b/docs/learn/documentation/versioned/container/jmx.md
new file mode 100644
index 0000000..bdd5614
--- /dev/null
+++ b/docs/learn/documentation/versioned/container/jmx.md
@@ -0,0 +1,40 @@
+---
+layout: page
+title: JMX
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Samza's containers and YARN ApplicationMaster enable 
[JMX](http://docs.oracle.com/javase/tutorial/jmx/) by default. JMX can be used 
for managing the JVM; for example, you can connect to it using 
[jconsole](http://docs.oracle.com/javase/7/docs/technotes/guides/management/jconsole.html),
 which is included in the JDK.
+
+You can tell Samza to publish its internal [metrics](metrics.html), and any 
custom metrics you define, as JMX MBeans. To enable this, set the following 
properties in your job configuration:
+
+{% highlight jproperties %}
+# Define a Samza metrics reporter called "jmx", which publishes to JMX
+metrics.reporter.jmx.class=org.apache.samza.metrics.reporter.JmxReporterFactory
+
+# Use it (if you have multiple reporters defined, separate them with commas)
+metrics.reporters=jmx
+{% endhighlight %}
+
+JMX needs to be configured to use a specific port, but in a distributed 
environment, there is no way of knowing in advance which ports are available on 
the machines running your containers. Therefore Samza chooses the JMX port 
randomly. If you need to connect to it, you can find the port by looking in the 
container's logs, which report the JMX server details as follows:
+
+    2014-06-02 21:50:17 JmxServer [INFO] According to 
InetAddress.getLocalHost.getHostName we are samza-grid-1234.example.com
+    2014-06-02 21:50:17 JmxServer [INFO] Started JmxServer registry port=50214 
server port=50215 
url=service:jmx:rmi://localhost:50215/jndi/rmi://localhost:50214/jmxrmi
+    2014-06-02 21:50:17 JmxServer [INFO] If you are tunneling, you might want 
to try JmxServer registry port=50214 server port=50215 
url=service:jmx:rmi://samza-grid-1234.example.com:50215/jndi/rmi://samza-grid-1234.example.com:50214/jmxrmi
+
+## [JobRunner &raquo;](../jobs/job-runner.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/container/metrics.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/container/metrics.md 
b/docs/learn/documentation/versioned/container/metrics.md
new file mode 100644
index 0000000..8ec7740
--- /dev/null
+++ b/docs/learn/documentation/versioned/container/metrics.md
@@ -0,0 +1,102 @@
+---
+layout: page
+title: Metrics
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+When you're running a stream process in production, it's important that you 
have good metrics to track the health of your job. In order to make this easy, 
Samza includes a metrics library. It is used by Samza itself to generate some 
standard metrics such as message throughput, but you can also use it in your 
task code to emit custom metrics.
+
+Metrics can be reported in various ways. You can expose them via 
[JMX](jmx.html), which is useful in development. In production, a common setup 
is for each Samza container to periodically publish its metrics to a "metrics" 
Kafka topic, in which the metrics from all Samza jobs are aggregated. You can 
then consume this stream in another Samza job, and send the metrics to your 
favorite graphing system such as [Graphite](http://graphite.wikidot.com/).
+
+To set up your job to publish metrics to Kafka, you can use the following 
configuration:
+
+{% highlight jproperties %}
+# Define a metrics reporter called "snapshot", which publishes metrics
+# every 60 seconds.
+metrics.reporters=snapshot
+metrics.reporter.snapshot.class=org.apache.samza.metrics.reporter.MetricsSnapshotReporterFactory
+
+# Tell the snapshot reporter to publish to a topic called "metrics"
+# in the "kafka" system.
+metrics.reporter.snapshot.stream=kafka.metrics
+
+# Encode metrics data as JSON.
+serializers.registry.metrics.class=org.apache.samza.serializers.MetricsSnapshotSerdeFactory
+systems.kafka.streams.metrics.samza.msg.serde=metrics
+{% endhighlight %}
+
+With this configuration, the job automatically sends several JSON-encoded 
messages to the "metrics" topic in Kafka every 60 seconds. The messages look 
something like this:
+
+{% highlight json %}
+{
+  "header": {
+    "container-name": "samza-container-0",
+    "host": "samza-grid-1234.example.com",
+    "job-id": "1",
+    "job-name": "my-samza-job",
+    "reset-time": 1401729000347,
+    "samza-version": "0.0.1",
+    "source": "Partition-2",
+    "time": 1401729420566,
+    "version": "0.0.1"
+  },
+  "metrics": {
+    "org.apache.samza.container.TaskInstanceMetrics": {
+      "commit-calls": 7,
+      "commit-skipped": 77948,
+      "kafka-input-topic-offset": "1606",
+      "messages-sent": 985,
+      "process-calls": 1093,
+      "send-calls": 985,
+      "send-skipped": 76970,
+      "window-calls": 0,
+      "window-skipped": 77955
+    }
+  }
+}
+{% endhighlight %}
+
+There is a separate message for each task instance, and the header tells you 
the job name, job ID and partition of the task. The metrics allow you to see 
how many messages have been processed and sent, the current offset in the input 
stream partition, and other details. There are additional messages which give 
you metrics about the JVM (heap size, garbage collection information, threads 
etc.), internal metrics of the Kafka producers and consumers, and more.
+
+It's easy to generate custom metrics in your job, if there's some value you 
want to keep an eye on. You can use Samza's built-in metrics framework, which 
is similar in design to Coda Hale's [metrics](http://metrics.codahale.com/) 
library. 
+
+You can register your custom metrics through a 
[MetricsRegistry](../api/javadocs/org/apache/samza/metrics/MetricsRegistry.html).
 Your stream task needs to implement 
[InitableTask](../api/javadocs/org/apache/samza/task/InitableTask.html), so 
that you can get the metrics registry from the 
[TaskContext](../api/javadocs/org/apache/samza/task/TaskContext.html). This 
simple example shows how to count the number of messages processed by your task:
+
+{% highlight java %}
+public class MyJavaStreamTask implements StreamTask, InitableTask {
+  private Counter messageCount;
+
+  public void init(Config config, TaskContext context) {
+    this.messageCount = context
+      .getMetricsRegistry()
+      .newCounter(getClass().getName(), "message-count");
+  }
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    messageCount.inc();
+  }
+}
+{% endhighlight %}
+
+Samza currently supports two kinds of metrics: 
[counters](../api/javadocs/org/apache/samza/metrics/Counter.html) and 
[gauges](../api/javadocs/org/apache/samza/metrics/Gauge.html). Use a counter 
when you want to track how often something occurs, and a gauge when you want to 
report the level of something, such as the size of a buffer. Each task instance 
(for each input stream partition) gets its own set of metrics.
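+
+For example, a gauge that tracks the size of an in-memory buffer might be registered like this. This is a sketch with hypothetical class and metric names:
+
+{% highlight java %}
+import java.util.ArrayDeque;
+import java.util.Queue;
+
+import org.apache.samza.config.Config;
+import org.apache.samza.metrics.Gauge;
+import org.apache.samza.system.IncomingMessageEnvelope;
+import org.apache.samza.task.*;
+
+public class MyBufferingTask implements StreamTask, InitableTask {
+  private final Queue<Object> buffer = new ArrayDeque<Object>();
+  private Gauge<Integer> bufferSize;
+
+  public void init(Config config, TaskContext context) {
+    // Report the current size of the buffer; 0 is the initial value.
+    this.bufferSize = context
+      .getMetricsRegistry()
+      .newGauge(getClass().getName(), "buffer-size", 0);
+  }
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    buffer.add(envelope.getMessage());
+    bufferSize.set(buffer.size());
+    // (A real task would also drain the buffer, e.g. in a window() call.)
+  }
+}
+{% endhighlight %}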
+
+If you want to report metrics in some other way, e.g. directly to a graphing 
system (without going via Kafka), you can implement a 
[MetricsReporterFactory](../api/javadocs/org/apache/samza/metrics/MetricsReporterFactory.html)
 and reference it in your job configuration.
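+
+As a rough sketch only: the class names below are hypothetical, and the method signatures shown are assumptions, so check the linked javadocs for the exact interfaces in your Samza version.
+
+{% highlight java %}
+import java.util.HashMap;
+import java.util.Map;
+
+import org.apache.samza.config.Config;
+import org.apache.samza.metrics.MetricsReporter;
+import org.apache.samza.metrics.MetricsReporterFactory;
+import org.apache.samza.metrics.ReadableMetricsRegistry;
+
+public class MyGraphiteReporterFactory implements MetricsReporterFactory {
+  // Assumed signature; see the MetricsReporterFactory javadoc.
+  public MetricsReporter getMetricsReporter(String name, String containerName, Config config) {
+    // Read any settings you need (e.g. a Graphite host) from the config here.
+    return new MyGraphiteReporter();
+  }
+
+  /** Hypothetical reporter that pushes metric values straight to a graphing system. */
+  public static class MyGraphiteReporter implements MetricsReporter {
+    private final Map<String, ReadableMetricsRegistry> registries =
+      new HashMap<String, ReadableMetricsRegistry>();
+
+    public void register(String source, ReadableMetricsRegistry registry) {
+      // Samza calls this once per metrics source, e.g. once per task instance.
+      registries.put(source, registry);
+    }
+
+    public void start() {
+      // Start a background thread here that periodically walks the registered
+      // registries and sends their current values to your graphing system.
+    }
+
+    public void stop() {
+      // Flush any buffered values and shut the background thread down.
+    }
+  }
+}
+{% endhighlight %}
+
+You would then list your reporter's name in metrics.reporters and point its metrics.reporter.*.class property at the factory, just as the snapshot reporter is configured above.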
+
+## [Windowing &raquo;](windowing.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/container/samza-container.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/container/samza-container.md 
b/docs/learn/documentation/versioned/container/samza-container.md
new file mode 100644
index 0000000..9f46414
--- /dev/null
+++ b/docs/learn/documentation/versioned/container/samza-container.md
@@ -0,0 +1,105 @@
+---
+layout: page
+title: SamzaContainer
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+The SamzaContainer is responsible for managing the startup, execution, and 
shutdown of one or more [StreamTask](../api/overview.html) instances. Each 
SamzaContainer typically runs as an independent Java virtual machine. A Samza 
job can consist of several SamzaContainers, potentially running on different 
machines.
+
+When a SamzaContainer starts up, it does the following:
+
+1. Get last checkpointed offset for each input stream partition that it 
consumes
+2. Create a "reader" thread for every input stream partition that it consumes
+3. Start metrics reporters to report metrics
+4. Start a checkpoint timer to save your task's input stream offsets every so 
often
+5. Start a window timer to trigger your task's [window 
method](../api/javadocs/org/apache/samza/task/WindowableTask.html), if it is 
defined
+6. Instantiate and initialize your StreamTask once for each input stream 
partition
+7. Start an event loop that takes messages from the input stream reader 
threads, and gives them to your StreamTasks
+8. Notify lifecycle listeners during each one of these steps
+
+Let's start in the middle, with the instantiation of a StreamTask. The 
following sections of the documentation cover the other steps.
+
+### Tasks and Partitions
+
+When the container starts, it creates instances of the [task 
class](../api/overview.html) that you've written. If the task class implements 
the [InitableTask](../api/javadocs/org/apache/samza/task/InitableTask.html) 
interface, the SamzaContainer will also call the init() method.
+
+{% highlight java %}
+/** Implement this if you want a callback when your task starts up. */
+public interface InitableTask {
+  void init(Config config, TaskContext context);
+}
+{% endhighlight %}
+
+By default, how many instances of your task class are created depends on the 
number of partitions in the job's input streams. If your Samza job has ten 
partitions, there will be ten instantiations of your task class: one for each 
partition. The first task instance will receive all messages for partition one, 
the second instance will receive all messages for partition two, and so on.
+
+<img 
src="/img/{{site.version}}/learn/documentation/container/tasks-and-partitions.svg"
 alt="Illustration of tasks consuming partitions" class="diagram-large">
+
+The number of partitions in the input streams is determined by the systems 
from which you are consuming. For example, if your input system is Kafka, you 
can specify the number of partitions when you create a topic from the command 
line or using the num.partitions setting in Kafka's server properties file.
+
+If a Samza job has more than one input stream, the number of task instances 
for the Samza job is the maximum number of partitions across all input streams. 
For example, if a Samza job is reading from PageViewEvent (12 partitions), and 
ServiceMetricEvent (14 partitions), then the Samza job would have 14 task 
instances (numbered 0 through 13). Task instances 12 and 13 only receive events 
from ServiceMetricEvent, because there is no corresponding PageViewEvent 
partition.
+
+With this default approach to assigning input streams to task instances, Samza 
is effectively performing a group-by operation on the input streams with their 
partitions as the key. Other strategies for grouping input stream partitions 
are possible by implementing a new 
[SystemStreamPartitionGrouper](../api/javadocs/org/apache/samza/container/SystemStreamPartitionGrouper.html)
 and factory, and configuring the job to use it via the 
job.systemstreampartition.grouper.factory configuration value.
+
+Samza provides the above-discussed per-partition grouper as well as the 
[GroupBySystemStreamPartitionGrouper](../api/javadocs/org/apache/samza/container/systemstreampartition/groupers/GroupBySystemStreamPartition),
 which provides a separate task class instance for every input stream partition, effectively grouping by each individual system stream partition. This provides 
maximum scalability in terms of how many containers can be used to process 
those input streams and is appropriate for very high volume jobs that need no 
grouping of the input streams.
+
+Considering the above example of a PageViewEvent partitioned 12 ways and a 
ServiceMetricEvent partitioned 14 ways, the GroupBySystemStreamPartitionGrouper 
would create 12 + 14 = 26 task instances, which would then be distributed 
across the number of containers configured, as discussed below.
+
+Note that once a job has been started using a particular 
SystemStreamPartitionGrouper and that job is using state or checkpointing, it 
is not possible to change that grouping in subsequent job starts, as the 
previous checkpoints and state information would likely be incorrect under the 
new grouping approach.
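+
+To illustrate what implementing a custom grouper involves, here is a sketch only. The class name is hypothetical and the method signature is an assumption based on the linked SystemStreamPartitionGrouper javadoc; the grouper maps the full set of input system stream partitions to named tasks, and each resulting task consumes the partitions assigned to it.
+
+{% highlight java %}
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+
+import org.apache.samza.container.SystemStreamPartitionGrouper;
+import org.apache.samza.container.TaskName;
+import org.apache.samza.system.SystemStreamPartition;
+
+/**
+ * Illustration only: groups partitions by stream name, so that all partitions
+ * of one input stream are processed by a single task instance.
+ */
+public class GroupByStreamNameGrouper implements SystemStreamPartitionGrouper {
+  // Assumed signature: map every input partition to a named task.
+  public Map<TaskName, Set<SystemStreamPartition>> group(Set<SystemStreamPartition> ssps) {
+    Map<TaskName, Set<SystemStreamPartition>> groups =
+      new HashMap<TaskName, Set<SystemStreamPartition>>();
+    for (SystemStreamPartition ssp : ssps) {
+      TaskName task = new TaskName("Stream-" + ssp.getStream());
+      if (!groups.containsKey(task)) {
+        groups.put(task, new HashSet<SystemStreamPartition>());
+      }
+      groups.get(task).add(ssp);
+    }
+    return groups;
+  }
+}
+{% endhighlight %}
+
+You would then point the job.systemstreampartition.grouper.factory property at a factory that returns this grouper.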
+
+### Containers and resource allocation
+
+Although the number of task instances is fixed &mdash; determined by the 
number of input partitions &mdash; you can configure how many containers you 
want to use for your job. If you are [using YARN](../jobs/yarn-jobs.html), the 
number of containers determines what CPU and memory resources are allocated to 
your job.
+
+If the data volume on your input streams is small, it might be sufficient to 
use just one SamzaContainer. In that case, Samza still creates one task 
instance per input partition, but all those tasks run within the same 
container. At the other extreme, you can create as many containers as you have 
partitions, and Samza will assign one task instance to each container.
+
+Each SamzaContainer is designed to use one CPU core, so it uses a 
[single-threaded event loop](event-loop.html) for execution. It's not advisable 
to create your own threads within a SamzaContainer. If you need more 
parallelism, please configure your job to use more containers.
+
+Any [state](state-management.html) in your job belongs to a task instance, not 
to a container. This is a key design decision for Samza's scalability: as your 
job's resource requirements grow and shrink, you can simply increase or 
decrease the number of containers, but the number of task instances remains 
unchanged. As you scale up or down, the same state remains attached to each 
task instance. Task instances may be moved from one container to another, and 
any persistent state managed by Samza will be moved with them. This allows the 
job's processing semantics to remain unchanged, even as you change the job's 
parallelism.
+
+### Joining multiple input streams
+
+If your job has multiple input streams, Samza provides a simple but powerful 
mechanism for joining data from different streams: each task instance receives 
messages from one partition of *each* of the input streams. For example, say 
you have two input streams, A and B, each with four partitions. Samza creates 
four task instances to process them, and assigns the partitions as follows:
+
+<table class="table table-condensed table-bordered table-striped">
+  <thead>
+    <tr>
+      <th>Task instance</th>
+      <th>Consumes stream partitions</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>0</td><td>stream A partition 0, stream B partition 0</td>
+    </tr>
+    <tr>
+      <td>1</td><td>stream A partition 1, stream B partition 1</td>
+    </tr>
+    <tr>
+      <td>2</td><td>stream A partition 2, stream B partition 2</td>
+    </tr>
+    <tr>
+      <td>3</td><td>stream A partition 3, stream B partition 3</td>
+    </tr>
+  </tbody>
+</table>
+
+Thus, if you want two events in different streams to be processed by the same 
task instance, you need to ensure they are sent to the same partition number. 
You can achieve this by using the same partitioning key when [sending the 
messages](../api/overview.html). Joining streams is discussed in detail in the 
[state management](state-management.html) section.
+
+There is one caveat in all of this: Samza currently assumes that a stream's 
partition count will never change. Partition splitting or repartitioning is not 
supported. If an input stream has N partitions, it is expected that it has 
always had, and will always have N partitions. If you want to re-partition a 
stream, you can write a job that reads messages from the stream, and writes 
them out to a new stream with the required number of partitions. For example, 
you could read messages from PageViewEvent, and write them to 
PageViewEventRepartition.
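+
+For example, a minimal repartitioning job can simply forward every message to the new topic, keyed by whatever value you want to partition on. The sketch below assumes the new topic has already been created with the desired number of partitions, that the job's input is PageViewEvent, and that the existing message key is a suitable partitioning key.
+
+{% highlight java %}
+import org.apache.samza.system.IncomingMessageEnvelope;
+import org.apache.samza.system.OutgoingMessageEnvelope;
+import org.apache.samza.system.SystemStream;
+import org.apache.samza.task.MessageCollector;
+import org.apache.samza.task.StreamTask;
+import org.apache.samza.task.TaskCoordinator;
+
+/** Sketch of a job that copies PageViewEvent into a topic with a different partition count. */
+public class PageViewRepartitionTask implements StreamTask {
+  private static final SystemStream OUTPUT =
+    new SystemStream("kafka", "PageViewEventRepartition");
+
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    // Forward the message unchanged; the key determines which of the new
+    // topic's partitions each message is routed to.
+    collector.send(new OutgoingMessageEnvelope(OUTPUT, envelope.getKey(), envelope.getMessage()));
+  }
+}
+{% endhighlight %}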
+
+## [Streams &raquo;](streams.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/1e2cfe22/docs/learn/documentation/versioned/container/serialization.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/versioned/container/serialization.md 
b/docs/learn/documentation/versioned/container/serialization.md
new file mode 100644
index 0000000..ff7d8b9
--- /dev/null
+++ b/docs/learn/documentation/versioned/container/serialization.md
@@ -0,0 +1,64 @@
+---
+layout: page
+title: Serialization
+---
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+Every message that is read from or written to a [stream](streams.html) or a 
[persistent state store](state-management.html) needs to eventually be 
serialized to bytes (which are sent over the network or written to disk). There 
are various places where that serialization and deserialization can happen:
+
+1. In the client library: for example, the library for publishing to Kafka and 
consuming from Kafka supports pluggable serialization.
+2. In the task implementation: your [process method](../api/overview.html) can 
use raw byte arrays as inputs and outputs, and do any parsing and serialization 
itself.
+3. Between the two: Samza provides a layer of serializers and deserializers, 
or *serdes* for short.
+
+You can use whatever makes sense for your job; Samza doesn't impose any 
particular data model or serialization scheme on you. However, the cleanest 
solution is usually to use Samza's serde layer. The following configuration 
example shows how to use it.
+
+{% highlight jproperties %}
+# Define a system called "kafka"
+systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
+
+# The job is going to consume a topic called "PageViewEvent" from the "kafka" 
system
+task.inputs=kafka.PageViewEvent
+
+# Define a serde called "json" which parses/serializes JSON objects
+serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
+
+# Define a serde called "integer" which encodes an integer as 4 binary bytes 
(big-endian)
+serializers.registry.integer.class=org.apache.samza.serializers.IntegerSerdeFactory
+
+# For messages in the "PageViewEvent" topic, the key (the ID of the user 
viewing the page)
+# is encoded as a binary integer, and the message is encoded as JSON.
+systems.kafka.streams.PageViewEvent.samza.key.serde=integer
+systems.kafka.streams.PageViewEvent.samza.msg.serde=json
+
+# Define a key-value store which stores the most recent page view for each 
user ID.
+# Again, the key is an integer user ID, and the value is JSON.
+stores.LastPageViewPerUser.factory=org.apache.samza.storage.kv.KeyValueStorageEngineFactory
+stores.LastPageViewPerUser.changelog=kafka.last-page-view-per-user
+stores.LastPageViewPerUser.key.serde=integer
+stores.LastPageViewPerUser.msg.serde=json
+{% endhighlight %}
+
+Each serde is defined with a factory class. Samza comes with several built-in 
serdes for UTF-8 strings, binary-encoded integers, JSON (requires the 
samza-serializers dependency) and more. You can also create your own serializer 
by implementing the 
[SerdeFactory](../api/javadocs/org/apache/samza/serializers/SerdeFactory.html) 
interface.
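+
+For example, a custom string serde might look something like the sketch below. The class name is hypothetical, and the method names follow the Serde and SerdeFactory javadocs linked above; treat it as a starting point rather than a definitive implementation.
+
+{% highlight java %}
+import java.nio.charset.Charset;
+
+import org.apache.samza.config.Config;
+import org.apache.samza.serializers.Serde;
+import org.apache.samza.serializers.SerdeFactory;
+
+/** Illustration only: a serde that encodes strings as UTF-8 bytes. */
+public class MyStringSerdeFactory implements SerdeFactory<String> {
+  private static final Charset UTF8 = Charset.forName("UTF-8");
+
+  public Serde<String> getSerde(String name, Config config) {
+    return new Serde<String>() {
+      public byte[] toBytes(String object) {
+        return object == null ? null : object.getBytes(UTF8);
+      }
+
+      public String fromBytes(byte[] bytes) {
+        return bytes == null ? null : new String(bytes, UTF8);
+      }
+    };
+  }
+}
+{% endhighlight %}
+
+You would then register the factory under a name of your choosing, for example serializers.registry.mystring.class, in the same way as the json and integer serdes above.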
+
+The name you give to a serde (such as "json" and "integer" in the example 
above) is only for convenience in your job configuration; you can choose 
whatever name you like. For each stream and each state store, you can use the 
serde name to declare how messages should be serialized and deserialized.
+
+If you don't declare a serde, Samza simply passes objects through between your 
task instance and the system stream. In that case your task needs to send and 
receive whatever type of object the underlying client library uses.
+
+All the Samza APIs for sending and receiving messages are typed as *Object*. 
This means that you have to cast messages to the correct type before you can 
use them. It's a little bit more code, but it has the advantage that Samza is 
not restricted to any particular data model.
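+
+For example, with the configuration above, a task consuming PageViewEvent might cast its inputs like this. This is only a sketch: it assumes the json serde deserializes each message into a java.util.Map, that the integer serde yields Integer keys, and the class and field names are hypothetical.
+
+{% highlight java %}
+import java.util.Map;
+
+import org.apache.samza.system.IncomingMessageEnvelope;
+import org.apache.samza.task.MessageCollector;
+import org.apache.samza.task.StreamTask;
+import org.apache.samza.task.TaskCoordinator;
+
+public class PageViewTask implements StreamTask {
+  @SuppressWarnings("unchecked")
+  public void process(IncomingMessageEnvelope envelope,
+                      MessageCollector collector,
+                      TaskCoordinator coordinator) {
+    // With the "integer" key serde and "json" message serde configured above,
+    // the key arrives as an Integer and the message as a Map of JSON fields.
+    Integer userId = (Integer) envelope.getKey();
+    Map<String, Object> pageView = (Map<String, Object>) envelope.getMessage();
+    String pageUrl = (String) pageView.get("page-url"); // hypothetical field name
+    // ... process the page view for this user ...
+  }
+}
+{% endhighlight %}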
+
+## [Checkpointing &raquo;](checkpointing.html)
