Updated Branches:
  refs/heads/master bf1904d64 -> c12965ddc
http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/c12965dd/docs/learn/documentation/0.7.0/introduction/background.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/introduction/background.md b/docs/learn/documentation/0.7.0/introduction/background.md
index 52d8e41..87717e5 100644
--- a/docs/learn/documentation/0.7.0/introduction/background.md
+++ b/docs/learn/documentation/0.7.0/introduction/background.md
@@ -7,23 +7,23 @@ This page provides some background about stream processing, describes what Samza

 ### What is messaging?

-Messaging systems are a popular way of implementing near-realtime asynchronous computation. Messages can be added to a message queue (Active MQ, Rabbit MQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when something happens. Downstream "consumers" read messages from these systems, and process or take action based on the message contents.
+Messaging systems are a popular way of implementing near-realtime asynchronous computation. Messages can be added to a message queue (ActiveMQ, RabbitMQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when something happens. Downstream *consumers* read messages from these systems, and process them or take actions based on the message contents.

-Suppose that you have a server that's serving web pages. You can have the web server send a "user viewed page" event to a messaging system. You might then have consumers:
+Suppose you have a website, and every time someone loads a page, you send a "user viewed page" event to a messaging system. You might then have consumers which do any of the following:

-* Put the message into Hadoop
+* Store the message in Hadoop for future analysis
 * Count page views and update a dashboard
 * Trigger an alert if a page view fails
-* Send an email notification to another use
+* Send an email notification to another user
 * Join the page view event with the user's profile, and send the message back to the messaging system

 A messaging system lets you decouple all of this work from the actual web page serving.

 ### What is stream processing?

-A messaging system is a fairly low-level piece of infrastructure---it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer. Samza aims to help with these problems.
+A messaging system is a fairly low-level piece of infrastructure—it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer. Samza aims to help with these problems.

-Consider the counting example, above (count page views and update a dashboard). What happens when the machine that your consumer is running on fails, and your "current count" is lost. How do you recover? Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? Your counts will be off. What if you want to count page views grouped by the page URL? How can you do that in a distributed environment?
+Consider the counting example, above (count page views and update a dashboard). What happens when the machine that your consumer is running on fails, and your current counter values are lost? How do you recover? Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? (Unless you are careful, your counts will be incorrect.) What if you want to count page views grouped by the page URL? How do you distribute the computation across multiple machines if it's too much for a single machine to handle?

 Stream processing is a higher level of abstraction on top of messaging systems, and it's meant to address precisely this category of problems.

@@ -31,24 +31,24 @@ Stream processing is a higher level of abstraction on top of messaging systems,

 Samza is a stream processing framework with the following features:

-* **Simple API:** Samza provides a very simple call-back based "process message" API.
-* **Managed state:** Samza manages snapshotting and restoration of a stream processor's state. Samza will restore a stream processor's state to a snapshot consistent with the processor's last read messages when the processor is restarted. Samza is built to handle large amounts of state (even many gigabytes per partition).
-* **Fault tolerance:** Samza will work with YARN to transparently migrate your tasks whenever a machine in the cluster fails.
-* **Durability:** Samza uses Kafka to guarantee that no messages will ever be lost.
+* **Simple API:** Unlike most low-level messaging system APIs, Samza provides a very simple callback-based "process message" API comparable to MapReduce.
+* **Managed state:** Samza manages snapshotting and restoration of a stream processor's state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).
+* **Fault tolerance:** Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.
+* **Durability:** Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.
 * **Scalability:** Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
 * **Pluggable:** Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
-* **Processor isolation:** Samza works with Apache YARN, to give security and resource scheduling, and resource isolation through Linux CGroups.
+* **Processor isolation:** Samza works with Apache YARN, which supports Hadoop's security model, and resource isolation through Linux CGroups.

 ### Alternatives

-The open source stream processing systems that are available are actually quite young, and no single system offers a complete solution. Problems like how a stream processor's state should be managed, whether a stream should be buffered remotely on disk or not, what to do when duplicate messages are received or messages are lost, and how to model underlying messaging systems are all pretty new.
+The available open source stream processing systems are actually quite young, and no single system offers a complete solution. New problems in this area include: how a stream processor's state should be managed, whether or not a stream should be buffered remotely on disk, what to do when duplicate messages are received or messages are lost, and how to model underlying messaging systems.

 Samza's main differentiators are:

-* Samza supports fault-tolerant local state. State can be thought of as tables that are split up and maintained with the processing tasks. State is itself modeled as a stream. When a processor is restarted, the state stream is entirely replayed to restore it.
+* Samza supports fault-tolerant local state. State can be thought of as tables that are split up and co-located with the processing tasks. State is itself modeled as a stream. If the local state is lost due to machine failure, the state stream is replayed to restore it.
 * Streams are ordered, partitioned, replayable, and fault tolerant.
 * YARN is used for processor isolation, security, and fault tolerance.
-* All streams are materialized to disk.
+* Jobs are decoupled: if one job goes slow and builds up a backlog of unprocessed messages, the rest of the system is not affected.

 For a more in-depth discussion on Samza, and how it relates to other stream processing systems, have a look at Samza's [Comparisons](../comparisons/introduction.html) documentation.


http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/c12965dd/docs/learn/documentation/0.7.0/introduction/concepts.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/introduction/concepts.md b/docs/learn/documentation/0.7.0/introduction/concepts.md
index 2736bf0..76012e2 100644
--- a/docs/learn/documentation/0.7.0/introduction/concepts.md
+++ b/docs/learn/documentation/0.7.0/introduction/concepts.md
@@ -7,9 +7,9 @@ This page gives an introduction to the high-level concepts in Samza.

 ### Streams

-Samza processes *streams*. A stream is composed of immutable *messages* of a similar type or category. Example streams might include all the clicks on a website, or all the updates to a particular database table, or any other type of event data. Messages can be appended to a stream or read from a stream. A stream can have any number of readers and reading from a stream doesn't delete the message so a message written to a stream is effectively broadcast out to all readers. Messages can optionally have an associated key which is used for partitioning, which we'll talk about in a second.
+Samza processes *streams*. A stream is composed of immutable *messages* of a similar type or category. For example, a stream could be all the clicks on a website, or all the updates to a particular database table, or all the logs produced by a service, or any other type of event data. Messages can be appended to a stream or read from a stream. A stream can have any number of *consumers*, and reading from a stream doesn't delete the message (so each message is effectively broadcast to all consumers). Messages can optionally have an associated key which is used for partitioning, which we'll talk about in a second.

-Samza supports pluggable *systems* that implement the stream abstraction: in Kafka a stream is a topic, in a database we might read a stream by consuming updates from a table, in Hadoop we might tail a directory of files in HDFS.
+Samza supports pluggable *systems* that implement the stream abstraction: in [Kafka](https://kafka.apache.org/) a stream is a topic, in a database we might read a stream by consuming updates from a table, in Hadoop we might tail a directory of files in HDFS.

@@ -17,42 +17,40 @@ Samza supports pluggable *systems* that implement the stream abstraction: in Kaf

 ### Jobs

 A Samza *job* is code that performs a logical transformation on a set of input streams to append output messages to set of output streams.

-If scalability were not a concern streams and jobs would be all we would need. But to let us scale our jobs and streams we chop these two things up into smaller unit of parallelism below the stream and job, namely *partitions* and *tasks*.
+If scalability were not a concern, streams and jobs would be all we need. However, in order to scale the throughput of the stream processor, we chop streams and jobs up into smaller units of parallelism: *partitions* and *tasks*.

 ### Partitions

 Each stream is broken into one or more partitions. Each partition in the stream is a totally ordered sequence of messages.

-Each position in this sequence has a unique identifier called the *offset*. The offset can be a sequential integer, byte offset, or string depending on the underlying system implementation.
+Each message in this sequence has an identifier called the *offset*, which is unique per partition. The offset can be a sequential integer, byte offset, or string depending on the underlying system implementation.

-Each message appended to a stream is appended to only one of the streams partitions. The assignment of the message to its partition is done with a key chosen by the writer (in the click example above, data might be partitioned by user id).
+When a message is appended to a stream, it is appended to only one of the stream's partitions. The assignment of the message to its partition is done with a key chosen by the writer. For example, if the user ID is used as the key, that ensures that all messages related to a particular user end up in the same partition.

 ### Tasks

-A job is itself distributed by breaking it into multiple *tasks*. The *task* is the unit of parallelism of the job, just as the partition is to the stream. Each task consumes data from one partition for each of the job's input streams.
+A job is scaled by breaking it into multiple *tasks*. The *task* is the unit of parallelism of the job, just as the partition is to the stream. Each task consumes data from one partition for each of the job's input streams.

-The task processes messages from each of its input partitions *in order by offset*. There is no defined ordering between partitions.
+A task processes messages from each of its input partitions sequentially, in the order of message offset. There is no defined ordering across partitions. This allows each task to operate independently. The YARN scheduler assigns each task to a machine, so the job as a whole can be distributed across many machines.

-The position of the task in its input partitions can be represented by a set of offsets, one for each partition.
+The number of tasks in a job is determined by the number of input partitions (there cannot be more tasks than input partitions, or there would be some tasks with no input). However, you can change the computational resources assigned to the job (the amount of memory, number of CPU cores, etc.) to satisfy the job's needs. See notes on *containers* below.

-The number of tasks a job has is fixed and does not change (though the computational resources assigned to the job may go up and down). The number of tasks a job has also determines the maximum parallelism of the job as each task processes messages sequentially. There cannot be more tasks than input partitions (or there would be some tasks with no input).
-
-The partitions assigned to a task will never change: if a task is on a machine that fails the task will be restarted elsewhere still consuming the same stream partitions.
+The assignment of partitions to tasks never changes: if a task is on a machine that fails, the task is restarted elsewhere, still consuming the same stream partitions.

 ### Dataflow Graphs

-We can compose multiple jobs to create data flow graph where the nodes are streams containing data and the edges are jobs performing transformations. This composition is done purely through the streams the jobs take as input and output—the jobs are otherwise totally decoupled: They need not be implemented in the same code base, and adding, removing, or restarting a downstream job will not impact an upstream job.
+We can compose multiple jobs to create a dataflow graph, where the nodes are streams containing data, and the edges are jobs performing transformations. This composition is done purely through the streams the jobs take as input and output. The jobs are otherwise totally decoupled: they need not be implemented in the same code base, and adding, removing, or restarting a downstream job will not impact an upstream job.

-These graphs are often acyclic—that is, data usually doesn't flow from a job, through other jobs, back to itself. However this is not a requirement.
+These graphs are often acyclic—that is, data usually doesn't flow from a job, through other jobs, back to itself. However, it is possible to create cyclic graphs if you need to.

-
+<img src="/img/0.7.0/learn/documentation/introduction/dag.png" width="430" alt="Directed acyclic job graph">

 ### Containers

-Partitions and tasks are both *logical* units of parallelism, they don't actually correspond to any particular assignment of computational resources (CPU, memory, disk space, etc). Containers are the unit of physical parallelism, and a container is essentially just a unix process (or linux [cgroup](http://en.wikipedia.org/wiki/Cgroups)). Each container runs one or more tasks. The number of tasks is determined automatically from the number of partitions in the input and is fixed, but the number of containers (and the cpu and memory resources associated with them) is specified by the user at run time and can be changed at any time.
+Partitions and tasks are both *logical* units of parallelism—they don't correspond to any particular assignment of computational resources (CPU, memory, disk space, etc). Containers are the unit of physical parallelism, and a container is essentially a Unix process (or Linux [cgroup](http://en.wikipedia.org/wiki/Cgroups)). Each container runs one or more tasks. The number of tasks is determined automatically from the number of partitions in the input and is fixed, but the number of containers (and the CPU and memory resources associated with them) is specified by the user at run time and can be changed at any time.

 ## [Architecture »](architecture.html)
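The callback-based "process message" API and the page-view counting example discussed in the background.md changes above can be sketched as follows. This is a toy Python stand-in for illustration only: Samza's real API is a Java interface, and the names here (`PageViewCounterTask`, `process`) are simplified assumptions, not the actual Samza classes.

```python
from collections import defaultdict

class PageViewCounterTask:
    """Toy stand-in for a stream task: the framework would invoke
    process() once per incoming message, in offset order within a
    partition."""

    def __init__(self):
        # In Samza this state would be managed (snapshotted and restored
        # on restart); here it is just an in-memory dict.
        self.counts = defaultdict(int)

    def process(self, message):
        # Count page views grouped by page URL, as in the counting
        # example from the text.
        self.counts[message["page"]] += 1

task = PageViewCounterTask()
for msg in [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]:
    task.process(msg)
print(task.counts["/home"])  # 2
```

Note that the state in this sketch is lost if the process dies, which is exactly the failure mode the "Managed state" feature above addresses.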

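The key-based partition assignment described in concepts.md above (all messages for one user ID landing in the same partition) boils down to a stable hash of the key modulo the partition count. A minimal sketch, assuming an MD5-based hash; real systems such as Kafka use their own default partitioners.

```python
import hashlib

def assign_partition(key: str, num_partitions: int) -> int:
    # Use a stable hash (Python's built-in hash() is salted per process),
    # so the same key always maps to the same partition.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every message keyed by "user-42" goes to the same partition,
# giving a total order over that user's events within the stream.
assert assign_partition("user-42", 8) == assign_partition("user-42", 8)
```

Because each partition is consumed by exactly one task, this is also what makes grouped computations (like per-user counts) possible without coordination between tasks.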