[1/7] SAMZA-7: Rewrite Container section of docs to bring it up-to-date. Reviewed by Jakob Homan.

martinkl Mon, 09 Jun 2014 13:32:40 -0700

Repository: incubator-samza
Updated Branches:
  refs/heads/master a037d6f25 -> c72223f99



http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/c72223f9/docs/learn/documentation/0.7.0/jobs/configuration.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/configuration.md b/docs/learn/documentation/0.7.0/jobs/configuration.md
index d4a516e..3bb80ef 100644
--- a/docs/learn/documentation/0.7.0/jobs/configuration.md
+++ b/docs/learn/documentation/0.7.0/jobs/configuration.md
@@ -15,7 +15,7 @@ task.class=samza.task.example.MyJavaStreamerTask
 task.inputs=example-system.example-stream
 
 # Serializers
-serializers.registry.json.class=samza.serializers.JsonSerdeFactory
+serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory
 serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory
 
 # Systems
@@ -24,7 +24,12 @@ systems.example-system.samza.key.serde=string
 systems.example-system.samza.msg.serde=json
 ```
 
-There are four major sections to a configuration file. The job section defines 
things like the name of the job, and whether to use the YarnJobFactory or 
LocalJobFactory. The task section is where you specify the class name for your 
StreamTask. It's also where you define what the input streams are for your 
task. The serializers section defines the classes of the serdes used for 
serialization and deserialization of specific objects that are received and 
sent along different streams. The system section defines systems that your 
StreamTask can read from along with the types of serdes used for sending keys 
and messages from that system. Usually, you'll define a Kafka system, if you're 
reading from Kafka, although you can also specify your own self-implemented 
Samza-compatible systems. See the hello-samza example project's Wikipedia 
system for a good example of a self-implemented system.
+There are four major sections to a configuration file:
+
+1. The job section defines things like the name of the job, and whether to use 
the YarnJobFactory or LocalJobFactory.
+2. The task section is where you specify the class name for your 
[StreamTask](../api/overview.html). It's also where you define what the [input 
streams](../container/streams.html) are for your task.
+3. The serializers section defines the classes of the 
[serdes](../container/serialization.html) used for serialization and 
deserialization of specific objects that are received and sent along different 
streams.
+4. The system section defines systems that your StreamTask can read from along 
with the types of serdes used for sending keys and messages from that system. 
Usually, you'll define a Kafka system, if you're reading from Kafka, although 
you can also specify your own self-implemented Samza-compatible systems. See 
the [hello-samza example project](/startup/hello-samza/0.7.0)'s Wikipedia 
system for a good example of a self-implemented system.
 
 ### Required Configuration
 

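Note on the configuration.md hunk above: the four sections it lists correspond to the prefix of each configuration key. The following is a minimal sketch, not part of the commit, assuming only the key prefixes that appear in this diff (`job.`, `task.`, `serializers.`, `systems.`). The class `ConfigSections` and its methods are hypothetical names used for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper, not part of Samza: groups config keys by top-level section.
class ConfigSections {
    // Treat the text before the first dot as the section name.
    static String sectionOf(String key) {
        int dot = key.indexOf('.');
        return dot < 0 ? key : key.substring(0, dot);
    }

    // Count how many keys fall into each section, sorted by section name.
    static Map<String, Integer> countBySection(Properties props) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            counts.merge(sectionOf(key), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Sample keys copied from the diff context; values are examples only.
        String config = String.join("\n",
            "job.factory.class=org.apache.samza.job.yarn.YarnJobFactory",
            "task.class=samza.task.example.MyJavaStreamerTask",
            "task.inputs=example-system.example-stream",
            "serializers.registry.json.class=org.apache.samza.serializers.JsonSerdeFactory",
            "serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory",
            "systems.example-system.samza.key.serde=string",
            "systems.example-system.samza.msg.serde=json");
        Properties props = new Properties();
        props.load(new StringReader(config));
        System.out.println(countBySection(props));
    }
}
```

Splitting on the first dot is a simplification that happens to match the four sections named in the docs; it is not how Samza itself reads its configuration.
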
http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/c72223f9/docs/learn/documentation/0.7.0/jobs/job-runner.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/job-runner.md b/docs/learn/documentation/0.7.0/jobs/job-runner.md
index c73b234..b41a410 100644
--- a/docs/learn/documentation/0.7.0/jobs/job-runner.md
+++ b/docs/learn/documentation/0.7.0/jobs/job-runner.md
@@ -37,9 +37,7 @@ public interface StreamJob {
 }
 ```
 
-Once the JobRunner gets a job, it calls submit() on the job. This method is 
what tells the StreamJob implementation to start the TaskRunner. In the case of 
LocalJobRunner, it uses a run-container.sh script to execute the TaskRunner in 
a separate process, which will start one TaskRunner locally on the machine that 
you ran run-job.sh on.
-
-![diagram](/img/0.7.0/learn/documentation/container/job-flow.png)
+Once the JobRunner gets a job, it calls submit() on the job. This method is 
what tells the StreamJob implementation to start the SamzaContainer. In the 
case of LocalJobRunner, it uses a run-container.sh script to execute the 
SamzaContainer in a separate process, which will start one SamzaContainer 
locally on the machine that you ran run-job.sh on.
 
 This flow differs slightly when you use YARN, but we'll get to that later.
 

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/c72223f9/docs/learn/documentation/0.7.0/jobs/logging.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/logging.md b/docs/learn/documentation/0.7.0/jobs/logging.md
index 6bb6bf4..65a755c 100644
--- a/docs/learn/documentation/0.7.0/jobs/logging.md
+++ b/docs/learn/documentation/0.7.0/jobs/logging.md
@@ -7,7 +7,7 @@ Samza uses [SLF4J](http://www.slf4j.org/) for all of its logging. By default, Sa
 
 ### Log4j
 
-The [hello-samza](/startup/hello-samza/0.7.0) project shows how to use 
[log4j](http://logging.apache.org/log4j/1.2/) with Samza. To turn on log4j 
logging, you just need to make sure slf4j-log4j12 is in your Samza TaskRunner's 
classpath. In Maven, this can be done by adding the following dependency to 
your Samza package project.
+The [hello-samza](/startup/hello-samza/0.7.0) project shows how to use 
[log4j](http://logging.apache.org/log4j/1.2/) with Samza. To turn on log4j 
logging, you just need to make sure slf4j-log4j12 is in your SamzaContainer's 
classpath. In Maven, this can be done by adding the following dependency to 
your Samza package project.
 
     <dependency>
       <groupId>org.slf4j</groupId>
@@ -18,7 +18,7 @@ The [hello-samza](/startup/hello-samza/0.7.0) project shows how to use [log4j](h
 
 If you're not using Maven, just make sure that slf4j-log4j12 ends up in your 
Samza package's lib directory.
 
-#### log4j.xml
+#### Log4j configuration
 
 Samza's [run-class.sh](packaging.html) script will automatically set the 
following setting if log4j.xml exists in your [Samza package's](packaging.html) 
lib directory.
 
@@ -42,9 +42,7 @@ These settings are very useful if you're using a file-based appender. For exampl
 
 Setting up a file-based appender is recommended as a better alternative to 
using standard out. Standard out log files (see below) don't roll, and can get 
quite large if used for logging.
 
-<!-- TODO add notes showing how to use task.opts for gc logging
-#### task.opts
--->
+**NOTE:** If you use the task.opts configuration property, the log 
configuration is disrupted. This is a known bug; please see 
[SAMZA-109](https://issues.apache.org/jira/browse/SAMZA-109) for a workaround.
 
 ### Log Directory
 

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/c72223f9/docs/learn/documentation/0.7.0/jobs/packaging.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/packaging.md b/docs/learn/documentation/0.7.0/jobs/packaging.md
index 62c089a..4f06625 100644
--- a/docs/learn/documentation/0.7.0/jobs/packaging.md
+++ b/docs/learn/documentation/0.7.0/jobs/packaging.md
@@ -10,7 +10,7 @@ bin/run-am.sh
 bin/run-container.sh
 ```
 
-The run-container.sh script is responsible for starting the TaskRunner. The 
run-am.sh script is responsible for starting Samza's application master for 
YARN. Thus, the run-am.sh script is only used by the YarnJob, but both YarnJob 
and ProcessJob use run-container.sh.
+The run-container.sh script is responsible for starting the 
[SamzaContainer](../container/samza-container.html). The run-am.sh script is 
responsible for starting Samza's application master for YARN. Thus, the 
run-am.sh script is only used by the YarnJob, but both YarnJob and ProcessJob 
use run-container.sh.
 
 Typically, these two scripts are bundled into a tar.gz file that has a 
structure like this:
 

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/c72223f9/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md b/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md
index 3d971cd..5dbbe54 100644
--- a/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md
+++ b/docs/learn/documentation/0.7.0/jobs/yarn-jobs.md
@@ -3,7 +3,7 @@ layout: page
 title: YARN Jobs
 ---
 
-When you define job.factory.class=samza.job.yarn.YarnJobFactory in your job's 
configuration, Samza will use YARN to execute your job. The YarnJobFactory will 
use the YARN_HOME environment variable on the machine that run-job.sh is 
executed on to get the appropriate YARN configuration, which will define where 
the YARN resource manager is. The YarnJob will work with the resource manager 
to get your job started on the YARN cluster.
+When you define job.factory.class=org.apache.samza.job.yarn.YarnJobFactory in 
your job's configuration, Samza will use YARN to execute your job. The 
YarnJobFactory will use the YARN_HOME environment variable on the machine that 
run-job.sh is executed on to get the appropriate YARN configuration, which will 
define where the YARN resource manager is. The YarnJob will work with the 
resource manager to get your job started on the YARN cluster.
 
 If you want to use YARN to run your Samza job, you'll also need to define the 
location of your Samza job's package. For example, you might say:
 
@@ -11,6 +11,8 @@ If you want to use YARN to run your Samza job, you'll also need to define the lo
 yarn.package.path=http://my.http.server/jobs/ingraphs-package-0.0.55.tgz
 ```
 
-This .tgz file follows the conventions outlined on the 
[Packaging](packaging.html) page (it has bin/run-am.sh and 
bin/run-container.sh). YARN NodeManagers will take responsibility for 
downloading this .tgz file on the appropriate machines, and untar'ing them. 
From there, YARN will execute run-am.sh or run-container.sh for the Samza 
Application Master, and TaskRunner, respectively.
+This .tgz file follows the conventions outlined on the 
[Packaging](packaging.html) page (it has bin/run-am.sh and 
bin/run-container.sh). YARN NodeManagers will take responsibility for 
downloading this .tgz file on the appropriate machines, and untar'ing them. 
From there, YARN will execute run-am.sh or run-container.sh for the Samza 
Application Master, and SamzaContainer, respectively.
+
+<!-- TODO document yarn.container.count and other key configs -->
 
 ## [Logging &raquo;](logging.html)

http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/c72223f9/docs/learn/documentation/0.7.0/yarn/application-master.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/yarn/application-master.md b/docs/learn/documentation/0.7.0/yarn/application-master.md
index 0da6dc0..92e1e18 100644
--- a/docs/learn/documentation/0.7.0/yarn/application-master.md
+++ b/docs/learn/documentation/0.7.0/yarn/application-master.md
@@ -7,7 +7,7 @@ YARN is Hadoop's next-generation cluster manager. It allows developers to deploy
 
 ### Integration
 
-Samza's main integration with YARN comes in the form of a Samza 
ApplicationMaster. This is the chunk of code responsible for managing a Samza 
job in a YARN grid. It decides what to do when a stream processor fails, which 
machines a Samza job's [TaskRunner](../container/task-runner.html) should run 
on, and so on.
+Samza's main integration with YARN comes in the form of a Samza 
ApplicationMaster. This is the chunk of code responsible for managing a Samza 
job in a YARN grid. It decides what to do when a stream processor fails, which 
machines a Samza job's [containers](../container/samza-container.html) should 
run on, and so on.
 
 When the Samza ApplicationMaster starts up, it does the following:
 
@@ -25,11 +25,11 @@ From this point on, the ApplicationMaster just reacts to events from the RM.
 
 ### Fault Tolerance
 
-Whenever a container is allocated, the AM will work with the YARN NM to start 
a TaskRunner (with appropriate partitions assigned to it) in the container. If 
a container fails with a non-zero return code, the AM will request a new 
container, and restart the TaskRunner. If a TaskRunner fails too many times, 
too quickly, the ApplicationMaster will fail the whole Samza job with a 
non-zero return code. See the yarn.countainer.retry.count and 
yarn.container.retry.window.ms [configuration](../jobs/configuration.html) 
parameters for details.
+Whenever a container is allocated, the AM will work with the YARN NM to start 
a SamzaContainer (with appropriate partitions assigned to it) in the container. 
If a container fails with a non-zero return code, the AM will request a new 
container, and restart the SamzaContainer. If a SamzaContainer fails too many 
times, too quickly, the ApplicationMaster will fail the whole Samza job with a 
non-zero return code. See the yarn.container.retry.count and 
yarn.container.retry.window.ms [configuration](../jobs/configuration.html) 
parameters for details.
 
 When the AM receives a reboot signal from YARN, it will throw a 
SamzaException. This will trigger a clean and successful shutdown of the AM 
(YARN won't think the AM failed).
 
-If the AM, itself, fails, YARN will handle restarting the AM. When the AM is 
restarted, all containers that were running will be killed, and the AM will 
start from scratch. The same list of operations, shown above, will be executed. 
The AM will request new containers for its TaskRunners, and proceed as though 
it has just started for the first time. YARN has a 
yarn.resourcemanager.am.max-retries configuration parameter that's defined in 
[yarn-site.xml](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml).
 This configuration defaults to 1, which means that, by default, a single AM 
failure will cause your Samza job to stop running.
+If the AM, itself, fails, YARN will handle restarting the AM. When the AM is 
restarted, all containers that were running will be killed, and the AM will 
start from scratch. The same list of operations, shown above, will be executed. 
The AM will request new containers for its SamzaContainers, and proceed as 
though it has just started for the first time. YARN has a 
yarn.resourcemanager.am.max-retries configuration parameter that's defined in 
[yarn-site.xml](http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml).
 This configuration defaults to 1, which means that, by default, a single AM 
failure will cause your Samza job to stop running.
 
 ### Dashboard
 
@@ -42,7 +42,7 @@ Samza's ApplicationMaster comes with a dashboard to show useful information such
 
 You can find this dashboard by going to your YARN grid's ResourceManager page 
(usually something like 
[http://localhost:8088/cluster](http://localhost:8088/cluster)), and clicking 
on the "ApplicationMaster" link of a running Samza job.
 
-![diagram](/img/0.7.0/learn/documentation/yarn/samza-am-dashboard.png)
+<img src="/img/0.7.0/learn/documentation/yarn/samza-am-dashboard.png" 
alt="Screenshot of ApplicationMaster dashboard" class="diagram-large">
 
 ### Security
 

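Note on the Fault Tolerance hunk above: the "fails too many times, too quickly" rule can be read as a failure count checked against a sliding time window. The following is a minimal sketch of that reading, not the ApplicationMaster's actual code. The class `RetryWindow` is hypothetical, and the mapping of its two constructor arguments onto the retry count and retry window settings is an assumption about their semantics.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not Samza's ApplicationMaster implementation.
class RetryWindow {
    private final int retryCount;  // assumed: failures tolerated inside one window
    private final long windowMs;   // assumed: length of the sliding window in ms
    private final Deque<Long> failures = new ArrayDeque<>();

    RetryWindow(int retryCount, long windowMs) {
        this.retryCount = retryCount;
        this.windowMs = windowMs;
    }

    // Records a failure at nowMs; returns true if the container may be restarted,
    // false if the whole job should be failed.
    boolean recordFailure(long nowMs) {
        // Drop failures that are older than the window.
        while (!failures.isEmpty() && nowMs - failures.peekFirst() > windowMs) {
            failures.pollFirst();
        }
        failures.addLast(nowMs);
        // Restart only while recent failures stay within the allowed count.
        return failures.size() <= retryCount;
    }

    public static void main(String[] args) {
        RetryWindow window = new RetryWindow(2, 1000L);
        for (long t : new long[] {0L, 100L, 200L}) {
            System.out.println("failure at " + t + " ms, restart allowed: "
                + window.recordFailure(t));
        }
    }
}
```

With a window like this, failures that are spread far apart never accumulate, so only a rapid burst of failures brings the job down.
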
http://git-wip-us.apache.org/repos/asf/incubator-samza/blob/c72223f9/docs/learn/documentation/0.7.0/yarn/isolation.md
----------------------------------------------------------------------
diff --git a/docs/learn/documentation/0.7.0/yarn/isolation.md b/docs/learn/documentation/0.7.0/yarn/isolation.md
index c685729..1a4f315 100644
--- a/docs/learn/documentation/0.7.0/yarn/isolation.md
+++ b/docs/learn/documentation/0.7.0/yarn/isolation.md
@@ -13,7 +13,7 @@ YARN currently supports resource management for memory and CPU.
 
 YARN will automatically enforce memory limits for all containers that it 
executes. All containers must have a max-memory size defined when they're 
created. If the sum of all memory usage for processes associated with a single 
YARN container exceeds this maximum, YARN will kill the container.
 
-Samza supports memory limits using the yarn.container.memory.mb and 
yarn.am.container.memory.mb configuration parameters. Keep in mind that this is 
simply the amount of memory YARN will allow a Samza 
[TaskRunner](../container/task-runner.html) or 
[ApplicationMaster](application-master.html) to have. You'll still need to 
configure your heap settings appropriately using task.opts, when using Java 
(the default is -Xmx160M). See the [Configuration](../jobs/configuration.html) 
and [Packaging](../jobs/packaging.html) pages for details.
+Samza supports memory limits using the yarn.container.memory.mb and 
yarn.am.container.memory.mb configuration parameters. Keep in mind that this is 
simply the amount of memory YARN will allow a 
[SamzaContainer](../container/samza-container.html) or 
[ApplicationMaster](application-master.html) to have. You'll still need to 
configure your heap settings appropriately using task.opts, when using Java 
(the default is -Xmx160M). See the [Configuration](../jobs/configuration.html) 
and [Packaging](../jobs/packaging.html) pages for details.
 
 ### CPU

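Note on the isolation.md hunk above: the YARN container memory limit and the JVM heap size are set separately, so it can help to check that the heap fits inside the container. The following is a minimal sketch, not part of the commit, that reads an `-Xmx` flag from a task.opts-style string and compares it with a container limit in megabytes. The class `HeapCheck` is hypothetical; the 160 MB fallback is the default quoted in the docs text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Samza.
class HeapCheck {
    // Default heap size quoted in the docs when task.opts sets no -Xmx.
    static final long DEFAULT_HEAP_MB = 160;
    private static final Pattern XMX = Pattern.compile("-Xmx(\\d+)([kKmMgG]?)");

    // Returns the maximum heap in MB requested by a JVM options string.
    static long maxHeapMb(String taskOpts) {
        Matcher m = XMX.matcher(taskOpts);
        if (!m.find()) {
            return DEFAULT_HEAP_MB;
        }
        long value = Long.parseLong(m.group(1));
        switch (m.group(2).toLowerCase()) {
            case "g": return value * 1024;
            case "m": return value;
            case "k": return value / 1024;
            default:  return value / (1024 * 1024); // no suffix means bytes
        }
    }

    // The heap must be strictly smaller than the container limit, because the
    // JVM also uses memory outside the heap.
    static boolean heapFits(String taskOpts, int containerMemoryMb) {
        return maxHeapMb(taskOpts) < containerMemoryMb;
    }

    public static void main(String[] args) {
        System.out.println(heapFits("-Xmx512M -verbose:gc", 1024));
        System.out.println(heapFits("-Xmx1024M", 1024));
    }
}
```

How much headroom to leave between heap and container limit depends on the job; the strict inequality here is only a lower bound on a sensible setting.
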