[02/51] [partial] storm git commit: merge STORM-950 into asf-site. This closes #670

ptgoetz Fri, 11 Sep 2015 11:26:02 -0700

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Common-patterns.md
----------------------------------------------------------------------
diff --git a/documentation/Common-patterns.md b/documentation/Common-patterns.md
new file mode 100644
index 0000000..7b15cfb
--- /dev/null
+++ b/documentation/Common-patterns.md
@@ -0,0 +1,100 @@
+---
+title: Common Topology Patterns
+layout: documentation
+documentation: true
+---
+
+This page lists a variety of common patterns in Storm topologies.
+
+1. Streaming joins
+2. Batching
+3. BasicBolt
+4. In-memory caching + fields grouping combo
+5. Streaming top N
+6. TimeCacheMap for efficiently keeping a cache of things that have been 
recently updated
+7. CoordinatedBolt and KeyedFairBolt for Distributed RPC
+
+### Joins
+
+A streaming join combines two or more data streams together based on some 
common field. Whereas a normal database join has finite input and clear 
semantics for a join, a streaming join has infinite input and unclear semantics 
for what a join should be.
+
+The join type you need will vary per application. Some applications join all 
tuples for two streams over a finite window of time, whereas other applications 
expect exactly one tuple for each side of the join for each join field. Other 
applications may do the join completely differently. The common pattern among 
all these join types is partitioning multiple input streams in the same way. 
This is easily accomplished in Storm by using a fields grouping on the same 
fields for many input streams to the joiner bolt. For example:
+
+```java
+builder.setBolt("join", new MyJoiner(), parallelism)
+  .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
+  .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
+  .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));
+```
+
+The different streams don't have to have the same field names, of course.
+
+
+### Batching
+
+Oftentimes for efficiency reasons or otherwise, you want to process a group of 
tuples in batch rather than individually. For example, you may want to batch 
updates to a database or do a streaming aggregation of some sort.
+
+If you want reliability in your data processing, the right way to do this is 
to hold on to tuples in an instance variable while the bolt waits to do the 
batching. Once you do the batch operation, you then ack all the tuples you were 
holding onto.
+
+If the bolt emits tuples, then you may want to use multi-anchoring to ensure 
reliability. It all depends on the specific application. See [Guaranteeing 
message processing](Guaranteeing-message-processing.html) for more details on 
how reliability works.
+
+### BasicBolt
+Many bolts follow a similar pattern of reading an input tuple, emitting zero 
or more tuples based on that input tuple, and then acking that input tuple 
immediately at the end of the execute method. Bolts that match this pattern are 
things like functions and filters. This is such a common pattern that Storm 
exposes an interface called 
[IBasicBolt](/javadoc/apidocs/backtype/storm/topology/IBasicBolt.html) that 
automates this pattern for you. See [Guaranteeing message 
processing](Guaranteeing-message-processing.html) for more information.
+
+### In-memory caching + fields grouping combo
+
+It's common to keep caches in-memory in Storm bolts. Caching becomes 
particularly powerful when you combine it with a fields grouping. For example, 
suppose you have a bolt that expands short URLs (like bit.ly, t.co, etc.) into 
long URLs. You can increase performance by keeping an LRU cache of short URL to 
long URL expansions to avoid doing the same HTTP requests over and over. 
Suppose component "urls" emits short URLS, and component "expand" expands short 
URLs into long URLs and keeps a cache internally. Consider the difference 
between the two following snippets of code:
+
+```java
+builder.setBolt("expand", new ExpandUrl(), parallelism)
+  .shuffleGrouping(1);
+```
+
+```java
+builder.setBolt("expand", new ExpandUrl(), parallelism)
+  .fieldsGrouping("urls", new Fields("url"));
+```
+
+The second approach will have vastly more effective caches, since the same URL 
will always go to the same task. This avoids having duplication across any of 
the caches in the tasks and makes it much more likely that a short URL will hit 
the cache.
+
+### Streaming top N
+
+A common continuous computation done on Storm is a "streaming top N" of some 
sort. Suppose you have a bolt that emits tuples of the form ["value", "count"] 
and you want a bolt that emits the top N tuples based on count. The simplest 
way to do this is to have a bolt that does a global grouping on the stream and 
maintains a list in memory of the top N items.
+
+This approach obviously doesn't scale to large streams since the entire stream 
has to go through one task. A better way to do the computation is to do many 
top N's in parallel across partitions of the stream, and then merge those top 
N's together to get the global top N. The pattern looks like this:
+
+```java
+builder.setBolt("rank", new RankObjects(), parallelism)
+  .fieldsGrouping("objects", new Fields("value"));
+builder.setBolt("merge", new MergeObjects())
+  .globalGrouping("rank");
+```
+
+This pattern works because of the fields grouping done by the first bolt which 
gives the partitioning you need for this to be semantically correct. You can 
see an example of this pattern in storm-starter 
[here](https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/storm/starter/RollingTopWords.java).
+
+If however you have a known skew in the data being processed it can be 
advantageous to use partialKeyGrouping instead of fieldsGrouping.  This will 
distribute the load for each key between two downstream bolts instead of a 
single one.
+
+```java
+builder.setBolt("count", new CountObjects(), parallelism)
+  .partialKeyGrouping("objects", new Fields("value"));
+builder.setBolt("rank" new AggregateCountsAndRank(), parallelism)
+  .fieldsGrouping("count", new Fields("key"))
+builder.setBolt("merge", new MergeRanksObjects())
+  .globalGrouping("rank");
+``` 
+
+The topology needs an extra layer of processing to aggregate the partial 
counts from the upstream bolts but this only processes aggregated values now so 
the bolt it is not subject to the load caused by the skewed data. You can see 
an example of this pattern in storm-starter 
[here](https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/storm/starter/SkewedRollingTopWords.java).
+
+### TimeCacheMap for efficiently keeping a cache of things that have been 
recently updated
+
+You sometimes want to keep a cache in memory of items that have been recently 
"active" and have items that have been inactive for some time be automatically 
expires. 
[TimeCacheMap](/javadoc/apidocs/backtype/storm/utils/TimeCacheMap.html) is an 
efficient data structure for doing this and provides hooks so you can insert 
callbacks whenever an item is expired.
+
+### CoordinatedBolt and KeyedFairBolt for Distributed RPC
+
+When building distributed RPC applications on top of Storm, there are two 
common patterns that are usually needed. These are encapsulated by 
[CoordinatedBolt](/javadoc/apidocs/backtype/storm/task/CoordinatedBolt.html) 
and [KeyedFairBolt](/javadoc/apidocs/backtype/storm/task/KeyedFairBolt.html) 
which are part of the "standard library" that ships with the Storm codebase.
+
+`CoordinatedBolt` wraps the bolt containing your logic and figures out when 
your bolt has received all the tuples for any given request. It makes heavy use 
of direct streams to do this.
+
+`KeyedFairBolt` also wraps the bolt containing your logic and makes sure your 
topology processes multiple DRPC invocations at the same time, instead of doing 
them serially one at a time.
+
+See [Distributed RPC](Distributed-RPC.html) for more details.


http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Concepts.md
----------------------------------------------------------------------
diff --git a/documentation/Concepts.md b/documentation/Concepts.md
new file mode 100644
index 0000000..706d5b5
--- /dev/null
+++ b/documentation/Concepts.md
@@ -0,0 +1,118 @@
+---
+title: Concepts
+layout: documentation
+documentation: true
+---
+
+This page lists the main concepts of Storm and links to resources where you 
can find more information. The concepts discussed are:
+
+1. Topologies
+2. Streams
+3. Spouts
+4. Bolts
+5. Stream groupings
+6. Reliability
+7. Tasks
+8. Workers
+
+### Topologies
+
+The logic for a realtime application is packaged into a Storm topology. A 
Storm topology is analogous to a MapReduce job. One key difference is that a 
MapReduce job eventually finishes, whereas a topology runs forever (or until 
you kill it, of course). A topology is a graph of spouts and bolts that are 
connected with stream groupings. These concepts are described below.
+
+**Resources:**
+
+* 
[TopologyBuilder](/javadoc/apidocs/backtype/storm/topology/TopologyBuilder.html):
 use this class to construct topologies in Java
+* [Running topologies on a production 
cluster](Running-topologies-on-a-production-cluster.html)
+* [Local mode](Local-mode.html): Read this to learn how to develop and test 
topologies in local mode.
+
+### Streams
+
+The stream is the core abstraction in Storm. A stream is an unbounded sequence 
of tuples that is processed and created in parallel in a distributed fashion. 
Streams are defined with a schema that names the fields in the stream's tuples. 
By default, tuples can contain integers, longs, shorts, bytes, strings, 
doubles, floats, booleans, and byte arrays. You can also define your own 
serializers so that custom types can be used natively within tuples.
+
+Every stream is given an id when declared. Since single-stream spouts and 
bolts are so common, 
[OutputFieldsDeclarer](/javadoc/apidocs/backtype/storm/topology/OutputFieldsDeclarer.html)
 has convenience methods for declaring a single stream without specifying an 
id. In this case, the stream is given the default id of "default".
+
+
+**Resources:**
+
+* [Tuple](/javadoc/apidocs/backtype/storm/tuple/Tuple.html): streams are 
composed of tuples
+* 
[OutputFieldsDeclarer](/javadoc/apidocs/backtype/storm/topology/OutputFieldsDeclarer.html):
 used to declare streams and their schemas
+* [Serialization](Serialization.html): Information about Storm's dynamic 
typing of tuples and declaring custom serializations
+* 
[ISerialization](/javadoc/apidocs/backtype/storm/serialization/ISerialization.html):
 custom serializers must implement this interface
+* 
[CONFIG.TOPOLOGY_SERIALIZATIONS](/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_SERIALIZATIONS):
 custom serializers can be registered using this configuration
+
+### Spouts
+
+A spout is a source of streams in a topology. Generally spouts will read 
tuples from an external source and emit them into the topology (e.g. a Kestrel 
queue or the Twitter API). Spouts can either be __reliable__ or __unreliable__. 
A reliable spout is capable of replaying a tuple if it failed to be processed 
by Storm, whereas an unreliable spout forgets about the tuple as soon as it is 
emitted.
+
+Spouts can emit more than one stream. To do so, declare multiple streams using 
the `declareStream` method of 
[OutputFieldsDeclarer](/javadoc/apidocs/backtype/storm/topology/OutputFieldsDeclarer.html)
 and specify the stream to emit to when using the `emit` method on 
[SpoutOutputCollector](/javadoc/apidocs/backtype/storm/spout/SpoutOutputCollector.html).
+
+The main method on spouts is `nextTuple`. `nextTuple` either emits a new tuple 
into the topology or simply returns if there are no new tuples to emit. It is 
imperative that `nextTuple` does not block for any spout implementation, 
because Storm calls all the spout methods on the same thread.
+
+The other main methods on spouts are `ack` and `fail`. These are called when 
Storm detects that a tuple emitted from the spout either successfully completed 
through the topology or failed to be completed. `ack` and `fail` are only 
called for reliable spouts. See [the 
Javadoc](/javadoc/apidocs/backtype/storm/spout/ISpout.html) for more 
information.
+
+**Resources:**
+
+* [IRichSpout](/javadoc/apidocs/backtype/storm/topology/IRichSpout.html): this 
is the interface that spouts must implement.
+* [Guaranteeing message processing](Guaranteeing-message-processing.html)
+
+### Bolts
+
+All processing in topologies is done in bolts. Bolts can do anything from 
filtering, functions, aggregations, joins, talking to databases, and more. 
+
+Bolts can do simple stream transformations. Doing complex stream 
transformations often requires multiple steps and thus multiple bolts. For 
example, transforming a stream of tweets into a stream of trending images 
requires at least two steps: a bolt to do a rolling count of retweets for each 
image, and one or more bolts to stream out the top X images (you can do this 
particular stream transformation in a more scalable way with three bolts than 
with two). 
+
+Bolts can emit more than one stream. To do so, declare multiple streams using 
the `declareStream` method of 
[OutputFieldsDeclarer](/javadoc/apidocs/backtype/storm/topology/OutputFieldsDeclarer.html)
 and specify the stream to emit to when using the `emit` method on 
[OutputCollector](/javadoc/apidocs/backtype/storm/task/OutputCollector.html).
+
+When you declare a bolt's input streams, you always subscribe to specific 
streams of another component. If you want to subscribe to all the streams of 
another component, you have to subscribe to each one individually. 
[InputDeclarer](/javadoc/apidocs/backtype/storm/topology/InputDeclarer.html) 
has syntactic sugar for subscribing to streams declared on the default stream 
id. Saying `declarer.shuffleGrouping("1")` subscribes to the default stream on 
component "1" and is equivalent to `declarer.shuffleGrouping("1", 
DEFAULT_STREAM_ID)`.
+
+The main method in bolts is the `execute` method which takes in as input a new 
tuple. Bolts emit new tuples using the 
[OutputCollector](/javadoc/apidocs/backtype/storm/task/OutputCollector.html) 
object. Bolts must call the `ack` method on the `OutputCollector` for every 
tuple they process so that Storm knows when tuples are completed (and can 
eventually determine that its safe to ack the original spout tuples). For the 
common case of processing an input tuple, emitting 0 or more tuples based on 
that tuple, and then acking the input tuple, Storm provides an 
[IBasicBolt](/javadoc/apidocs/backtype/storm/topology/IBasicBolt.html) 
interface which does the acking automatically.
+
+Please note that 
[OutputCollector](/javadoc/apidocs/backtype/storm/task/OutputCollector.html) is 
not thread-safe, and all emits, acks, and fails must happen on the same thread. 
Please refer [Troubleshooting](Troubleshooting.html) for more details.
+
+**Resources:**
+
+* [IRichBolt](/javadoc/apidocs/backtype/storm/topology/IRichBolt.html): this 
is general interface for bolts.
+* [IBasicBolt](/javadoc/apidocs/backtype/storm/topology/IBasicBolt.html): this 
is a convenience interface for defining bolts that do filtering or simple 
functions.
+* 
[OutputCollector](/javadoc/apidocs/backtype/storm/task/OutputCollector.html): 
bolts emit tuples to their output streams using an instance of this class
+* [Guaranteeing message processing](Guaranteeing-message-processing.html)
+
+### Stream groupings
+
+Part of defining a topology is specifying for each bolt which streams it 
should receive as input. A stream grouping defines how that stream should be 
partitioned among the bolt's tasks.
+
+There are eight built-in stream groupings in Storm, and you can implement a 
custom stream grouping by implementing the 
[CustomStreamGrouping](/javadoc/apidocs/backtype/storm/grouping/CustomStreamGrouping.html)
 interface:
+
+1. **Shuffle grouping**: Tuples are randomly distributed across the bolt's 
tasks in a way such that each bolt is guaranteed to get an equal number of 
tuples.
+2. **Fields grouping**: The stream is partitioned by the fields specified in 
the grouping. For example, if the stream is grouped by the "user-id" field, 
tuples with the same "user-id" will always go to the same task, but tuples with 
different "user-id"'s may go to different tasks.
+3. **Partial Key grouping**: The stream is partitioned by the fields specified 
in the grouping, like the Fields grouping, but are load balanced between two 
downstream bolts, which provides better utilization of resources when the 
incoming data is skewed. [This 
paper](https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf)
 provides a good explanation of how it works and the advantages it provides.
+4. **All grouping**: The stream is replicated across all the bolt's tasks. Use 
this grouping with care.
+5. **Global grouping**: The entire stream goes to a single one of the bolt's 
tasks. Specifically, it goes to the task with the lowest id.
+6. **None grouping**: This grouping specifies that you don't care how the 
stream is grouped. Currently, none groupings are equivalent to shuffle 
groupings. Eventually though, Storm will push down bolts with none groupings to 
execute in the same thread as the bolt or spout they subscribe from (when 
possible).
+7. **Direct grouping**: This is a special kind of grouping. A stream grouped 
this way means that the __producer__ of the tuple decides which task of the 
consumer will receive this tuple. Direct groupings can only be declared on 
streams that have been declared as direct streams. Tuples emitted to a direct 
stream must be emitted using one of the 
[emitDirect](https://storm.apache.org/javadoc/apidocs/backtype/storm/task/OutputCollector.html#emitDirect-int-java.util.Collection-java.util.List-)
 methods. A bolt can get the task ids of its consumers by either using the 
provided 
[TopologyContext](/javadoc/apidocs/backtype/storm/task/TopologyContext.html) or 
by keeping track of the output of the `emit` method in 
[OutputCollector](/javadoc/apidocs/backtype/storm/task/OutputCollector.html) 
(which returns the task ids that the tuple was sent to).
+8. **Local or shuffle grouping**: If the target bolt has one or more tasks in 
the same worker process, tuples will be shuffled to just those in-process 
tasks. Otherwise, this acts like a normal shuffle grouping.
+
+**Resources:**
+
+* 
[TopologyBuilder](/javadoc/apidocs/backtype/storm/topology/TopologyBuilder.html):
 use this class to define topologies
+* 
[InputDeclarer](/javadoc/apidocs/backtype/storm/topology/InputDeclarer.html): 
this object is returned whenever `setBolt` is called on `TopologyBuilder` and 
is used for declaring a bolt's input streams and how those streams should be 
grouped
+* 
[CoordinatedBolt](/javadoc/apidocs/backtype/storm/task/CoordinatedBolt.html): 
this bolt is useful for distributed RPC topologies and makes heavy use of 
direct streams and direct groupings
+
+### Reliability
+
+Storm guarantees that every spout tuple will be fully processed by the 
topology. It does this by tracking the tree of tuples triggered by every spout 
tuple and determining when that tree of tuples has been successfully completed. 
Every topology has a "message timeout" associated with it. If Storm fails to 
detect that a spout tuple has been completed within that timeout, then it fails 
the tuple and replays it later. 
+
+To take advantage of Storm's reliability capabilities, you must tell Storm 
when new edges in a tuple tree are being created and tell Storm whenever you've 
finished processing an individual tuple. These are done using the 
[OutputCollector](/javadoc/apidocs/backtype/storm/task/OutputCollector.html) 
object that bolts use to emit tuples. Anchoring is done in the `emit` method, 
and you declare that you're finished with a tuple using the `ack` method.
+
+This is all explained in much more detail in [Guaranteeing message 
processing](Guaranteeing-message-processing.html). 
+
+### Tasks
+
+Each spout or bolt executes as many tasks across the cluster. Each task 
corresponds to one thread of execution, and stream groupings define how to send 
tuples from one set of tasks to another set of tasks. You set the parallelism 
for each spout or bolt in the `setSpout` and `setBolt` methods of 
[TopologyBuilder](/javadoc/apidocs/backtype/storm/topology/TopologyBuilder.html).
+
+### Workers
+
+Topologies execute across one or more worker processes. Each worker process is 
a physical JVM and executes a subset of all the tasks for the topology. For 
example, if the combined parallelism of the topology is 300 and 50 workers are 
allocated, then each worker will execute 6 tasks (as threads within the 
worker). Storm tries to spread the tasks evenly across all the workers.
+
+**Resources:**
+
+* 
[Config.TOPOLOGY_WORKERS](/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_WORKERS):
 this config sets the number of workers to allocate for executing the topology

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Configuration.md
----------------------------------------------------------------------
diff --git a/documentation/Configuration.md b/documentation/Configuration.md
new file mode 100644
index 0000000..6d85aa1
--- /dev/null
+++ b/documentation/Configuration.md
@@ -0,0 +1,31 @@
+---
+title: Configuration
+layout: documentation
+documentation: true
+---
+Storm has a variety of configurations for tweaking the behavior of nimbus, 
supervisors, and running topologies. Some configurations are system 
configurations and cannot be modified on a topology by topology basis, whereas 
other configurations can be modified per topology. 
+
+Every configuration has a default value defined in 
[defaults.yaml](https://github.com/apache/storm/blob/master/conf/defaults.yaml) 
in the Storm codebase. You can override these configurations by defining a 
storm.yaml in the classpath of Nimbus and the supervisors. Finally, you can 
define a topology-specific configuration that you submit along with your 
topology when using 
[StormSubmitter](/javadoc/apidocs/backtype/storm/StormSubmitter.html). However, 
the topology-specific configuration can only override configs prefixed with 
"TOPOLOGY".
+
+Storm 0.7.0 and onwards lets you override configuration on a 
per-bolt/per-spout basis. The only configurations that can be overriden this 
way are:
+
+1. "topology.debug"
+2. "topology.max.spout.pending"
+3. "topology.max.task.parallelism"
+4. "topology.kryo.register": This works a little bit differently than the 
other ones, since the serializations will be available to all components in the 
topology. More details on [Serialization](Serialization.html). 
+
+The Java API lets you specify component specific configurations in two ways:
+
+1. *Internally:* Override `getComponentConfiguration` in any spout or bolt and 
return the component-specific configuration map.
+2. *Externally:* `setSpout` and `setBolt` in `TopologyBuilder` return an 
object with methods `addConfiguration` and `addConfigurations` that can be used 
to override the configurations for the component.
+
+The preference order for configuration values is defaults.yaml < storm.yaml < 
topology specific configuration < internal component specific configuration < 
external component specific configuration. 
+
+
+**Resources:**
+
+* [Config](/javadoc/apidocs/backtype/storm/Config.html): a listing of all 
configurations as well as a helper class for creating topology specific 
configurations
+* 
[defaults.yaml](https://github.com/apache/storm/blob/master/conf/defaults.yaml):
 the default values for all configurations
+* [Setting up a Storm cluster](Setting-up-a-Storm-cluster.html): explains how 
to create and configure a Storm cluster
+* [Running topologies on a production 
cluster](Running-topologies-on-a-production-cluster.html): lists useful 
configurations when running topologies on a cluster
+* [Local mode](Local-mode.html): lists useful configurations when using local 
mode
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Creating-a-new-Storm-project.md
----------------------------------------------------------------------
diff --git a/documentation/Creating-a-new-Storm-project.md 
b/documentation/Creating-a-new-Storm-project.md
new file mode 100644
index 0000000..b6186b5
--- /dev/null
+++ b/documentation/Creating-a-new-Storm-project.md
@@ -0,0 +1,25 @@
+---
+title: Creating a New Storm Project
+layout: documentation
+documentation: true
+---
+This page outlines how to set up a Storm project for development. The steps 
are:
+
+1. Add Storm jars to classpath
+2. If using multilang, add multilang dir to classpath
+
+Follow along to see how to set up the 
[storm-starter](https://github.com/apache/storm/blob/master/examples/storm-starter)
 project in Eclipse.
+
+### Add Storm jars to classpath
+
+You'll need the Storm jars on your classpath to develop Storm topologies. 
Using [Maven](Maven.html) is highly recommended. [Here's an 
example](https://github.com/apache/storm/blob/master/examples/storm-starter/pom.xml)
 of how to setup your pom.xml for a Storm project. If you don't want to use 
Maven, you can include the jars from the Storm release on your classpath.
+
+To set up the classpath in Eclipse, create a new Java project, include 
`src/jvm/` as a source path, and make sure all the jars in `lib/` and 
`lib/dev/` are in the `Referenced Libraries` section of the project.
+
+### If using multilang, add multilang dir to classpath
+
+If you implement spouts or bolts in languages other than Java, then those 
implementations should be under the `multilang/resources/` directory of the 
project. For Storm to find these files in local mode, the `resources/` dir 
needs to be on the classpath. You can do this in Eclipse by adding `multilang/` 
as a source folder. You may also need to add multilang/resources as a source 
directory.
+
+For more information on writing topologies in other languages, see [Using 
non-JVM languages with Storm](Using-non-JVM-languages-with-Storm.html).
+
+To test that everything is working in Eclipse, you should now be able to `Run` 
the `WordCountTopology.java` file. You will see messages being emitted at the 
console for 10 seconds.

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/DSLs-and-multilang-adapters.md
----------------------------------------------------------------------
diff --git a/documentation/DSLs-and-multilang-adapters.md 
b/documentation/DSLs-and-multilang-adapters.md
new file mode 100644
index 0000000..0ed5450
--- /dev/null
+++ b/documentation/DSLs-and-multilang-adapters.md
@@ -0,0 +1,10 @@
+---
+title: Storm DSLs and Multi-Lang Adapters
+layout: documentation
+documentation: true
+---
+* [Scala DSL](https://github.com/velvia/ScalaStorm)
+* [JRuby DSL](https://github.com/colinsurprenant/redstorm)
+* [Clojure DSL](Clojure-DSL.html)
+* [Storm/Esper integration](https://github.com/tomdz/storm-esper): Streaming 
SQL on top of Storm
+* [io-storm](https://github.com/dan-blanchard/io-storm): Perl multilang adapter

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Defining-a-non-jvm-language-dsl-for-storm.md
----------------------------------------------------------------------
diff --git a/documentation/Defining-a-non-jvm-language-dsl-for-storm.md 
b/documentation/Defining-a-non-jvm-language-dsl-for-storm.md
new file mode 100644
index 0000000..77eb392
--- /dev/null
+++ b/documentation/Defining-a-non-jvm-language-dsl-for-storm.md
@@ -0,0 +1,38 @@
+---
+title: Defining a Non-JVM DSL for Storm
+layout: documentation
+documentation: true
+---
+The right place to start to learn how to make a non-JVM DSL for Storm is 
[storm-core/src/storm.thrift](https://github.com/apache/storm/blob/master/storm-core/src/storm.thrift).
 Since Storm topologies are just Thrift structures, and Nimbus is a Thrift 
daemon, you can create and submit topologies in any language.
+
+When you create the Thrift structs for spouts and bolts, the code for the 
spout or bolt is specified in the ComponentObject struct:
+
+```
+union ComponentObject {
+  1: binary serialized_java;
+  2: ShellComponent shell;
+  3: JavaObject java_object;
+}
+```
+
+For a Python DSL, you would want to make use of "2" and "3". ShellComponent 
lets you specify a script to run that component (e.g., your python code). And 
JavaObject lets you specify native java spouts and bolts for the component (and 
Storm will use reflection to create that spout or bolt).
+
+There's a "storm shell" command that will help with submitting a topology. Its 
usage is like this:
+
+```
+storm shell resources/ python topology.py arg1 arg2
+```
+
+storm shell will then package resources/ into a jar, upload the jar to Nimbus, 
and call your topology.py script like this:
+
+```
+python topology.py arg1 arg2 {nimbus-host} {nimbus-port} 
{uploaded-jar-location}
+```
+
+Then you can connect to Nimbus using the Thrift API and submit the topology, 
passing {uploaded-jar-location} into the submitTopology method. For reference, 
here's the submitTopology definition:
+
+```java
+void submitTopology(1: string name, 2: string uploadedJarLocation, 3: string 
jsonConf, 4: StormTopology topology) throws (1: AlreadyAliveException e, 2: 
InvalidTopologyException ite);
+```
+
+Finally, one of the key things to do in a non-JVM DSL is make it easy to 
define the entire topology in one file (the bolts, spouts, and the definition 
of the topology).
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Distributed-RPC.md
----------------------------------------------------------------------
diff --git a/documentation/Distributed-RPC.md b/documentation/Distributed-RPC.md
new file mode 100644
index 0000000..484178b
--- /dev/null
+++ b/documentation/Distributed-RPC.md
@@ -0,0 +1,199 @@
+---
+title: Distributed RPC
+layout: documentation
+documentation: true
+---
+The idea behind distributed RPC (DRPC) is to parallelize the computation of 
really intense functions on the fly using Storm. The Storm topology takes in as 
input a stream of function arguments, and it emits an output stream of the 
results for each of those function calls. 
+
+DRPC is not so much a feature of Storm as it is a pattern expressed from 
Storm's primitives of streams, spouts, bolts, and topologies. DRPC could have 
been packaged as a separate library from Storm, but it's so useful that it's 
bundled with Storm.
+
+### High level overview
+
+Distributed RPC is coordinated by a "DRPC server" (Storm comes packaged with 
an implementation of this). The DRPC server coordinates receiving an RPC 
request, sending the request to the Storm topology, receiving the results from 
the Storm topology, and sending the results back to the waiting client. From a 
client's perspective, a distributed RPC call looks just like a regular RPC 
call. For example, here's how a client would compute the results for the 
"reach" function with the argument "http://twitter.com":
+
+```java
+DRPCClient client = new DRPCClient("drpc-host", 3772);
+String result = client.execute("reach", "http://twitter.com";);
+```
+
+The distributed RPC workflow looks like this:
+
+![Tasks in a topology](images/drpc-workflow.png)
+
+A client sends the DRPC server the name of the function to execute and the 
arguments to that function. The topology implementing that function uses a 
`DRPCSpout` to receive a function invocation stream from the DRPC server. Each 
function invocation is tagged with a unique id by the DRPC server. The topology 
then computes the result and at the end of the topology a bolt called 
`ReturnResults` connects to the DRPC server and gives it the result for the 
function invocation id. The DRPC server then uses the id to match up that 
result with which client is waiting, unblocks the waiting client, and sends it 
the result.
+
+### LinearDRPCTopologyBuilder
+
+Storm comes with a topology builder called 
[LinearDRPCTopologyBuilder](/javadoc/apidocs/backtype/storm/drpc/LinearDRPCTopologyBuilder.html)
 that automates almost all the steps involved for doing DRPC. These include:
+
+1. Setting up the spout
+2. Returning the results to the DRPC server
+3. Providing functionality to bolts for doing finite aggregations over groups 
of tuples
+
+Let's look at a simple example. Here's the implementation of a DRPC topology 
that returns its input argument with a "!" appended:
+
+```java
+public static class ExclaimBolt extends BaseBasicBolt {
+    public void execute(Tuple tuple, BasicOutputCollector collector) {
+        String input = tuple.getString(1);
+        collector.emit(new Values(tuple.getValue(0), input + "!"));
+    }
+
+    public void declareOutputFields(OutputFieldsDeclarer declarer) {
+        declarer.declare(new Fields("id", "result"));
+    }
+}
+
+public static void main(String[] args) throws Exception {
+    LinearDRPCTopologyBuilder builder = new 
LinearDRPCTopologyBuilder("exclamation");
+    builder.addBolt(new ExclaimBolt(), 3);
+    // ...
+}
+```
+
+As you can see, there's very little to it. When creating the 
`LinearDRPCTopologyBuilder`, you tell it the name of the DRPC function for the 
topology. A single DRPC server can coordinate many functions, and the function 
name distinguishes the functions from one another. The first bolt you declare 
will take in as input 2-tuples, where the first field is the request id and the 
second field is the arguments for that request. `LinearDRPCTopologyBuilder` 
expects the last bolt to emit an output stream containing 2-tuples of the form 
[id, result]. Finally, all intermediate tuples must contain the request id as 
the first field.
+
+In this example, `ExclaimBolt` simply appends a "!" to the second field of the 
tuple. `LinearDRPCTopologyBuilder` handles the rest of the coordination of 
connecting to the DRPC server and sending results back.
+
+### Local mode DRPC
+
+DRPC can be run in local mode. Here's how to run the above example in local 
mode:
+
+```java
+LocalDRPC drpc = new LocalDRPC();
+LocalCluster cluster = new LocalCluster();
+
+cluster.submitTopology("drpc-demo", conf, builder.createLocalTopology(drpc));
+
+System.out.println("Results for 'hello':" + drpc.execute("exclamation", 
"hello"));
+
+cluster.shutdown();
+drpc.shutdown();
+```
+
+First you create a `LocalDRPC` object. This object simulates a DRPC server in 
process, just like how `LocalCluster` simulates a Storm cluster in process. 
Then you create the `LocalCluster` to run the topology in local mode. 
`LinearDRPCTopologyBuilder` has separate methods for creating local topologies 
and remote topologies. In local mode the `LocalDRPC` object does not bind to 
any ports so the topology needs to know about the object to communicate with 
it. This is why `createLocalTopology` takes in the `LocalDRPC` object as input.
+
+After launching the topology, you can do DRPC invocations using the `execute` 
method on `LocalDRPC`.
+
+### Remote mode DRPC
+
+Using DRPC on an actual cluster is also straightforward. There's three steps:
+
+1. Launch DRPC server(s)
+2. Configure the locations of the DRPC servers
+3. Submit DRPC topologies to Storm cluster
+
+Launching a DRPC server can be done with the `storm` script and is just like 
launching Nimbus or the UI:
+
+```
+bin/storm drpc
+```
+
+Next, you need to configure your Storm cluster to know the locations of the 
DRPC server(s). This is how `DRPCSpout` knows from where to read function 
invocations. This can be done through the `storm.yaml` file or the topology 
configurations. Configuring this through the `storm.yaml` looks something like 
this:
+
+```yaml
+drpc.servers:
+  - "drpc1.foo.com"
+  - "drpc2.foo.com"
+```
+
+Finally, you launch DRPC topologies using `StormSubmitter` just like you 
launch any other topology. To run the above example in remote mode, you do 
something like this:
+
+```java
+StormSubmitter.submitTopology("exclamation-drpc", conf, 
builder.createRemoteTopology());
+```
+
+`createRemoteTopology` is used to create topologies suitable for Storm 
clusters.
+
+### A more complex example
+
+The exclamation DRPC example was a toy example for illustrating the concepts 
of DRPC. Let's look at a more complex example which really needs the 
parallelism a Storm cluster provides for computing the DRPC function. The 
example we'll look at is computing the reach of a URL on Twitter.
+
+The reach of a URL is the number of unique people exposed to a URL on Twitter. 
To compute reach, you need to:
+
+1. Get all the people who tweeted the URL
+2. Get all the followers of all those people
+3. Unique the set of followers
+4. Count the unique set of followers
+
+A single reach computation can involve thousands of database calls and tens of 
millions of follower records during the computation. It's a really, really 
intense computation. As you're about to see, implementing this function on top 
of Storm is dead simple. On a single machine, reach can take minutes to 
compute; on a Storm cluster, you can compute reach for even the hardest URLs in 
a couple seconds.
+
+A sample reach topology is defined in storm-starter 
[here](https://github.com/apache/storm/blob/master/examples/storm-starter/src/jvm/storm/starter/ReachTopology.java).
 Here's how you define the reach topology:
+
+```java
+LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach");
+builder.addBolt(new GetTweeters(), 3);
+builder.addBolt(new GetFollowers(), 12)
+        .shuffleGrouping();
+builder.addBolt(new PartialUniquer(), 6)
+        .fieldsGrouping(new Fields("id", "follower"));
+builder.addBolt(new CountAggregator(), 2)
+        .fieldsGrouping(new Fields("id"));
+```
+
+The topology executes as four steps:
+
+1. `GetTweeters` gets the users who tweeted the URL. It transforms an input 
stream of `[id, url]` into an output stream of `[id, tweeter]`. Each `url` 
tuple will map to many `tweeter` tuples.
+2. `GetFollowers` gets the followers for the tweeters. It transforms an input 
stream of `[id, tweeter]` into an output stream of `[id, follower]`. Across all 
the tasks, there may of course be duplication of follower tuples when someone 
follows multiple people who tweeted the same URL.
+3. `PartialUniquer` groups the followers stream by the follower id. This has 
the effect of the same follower going to the same task. So each task of 
`PartialUniquer` will receive mutually independent sets of followers. Once 
`PartialUniquer` receives all the follower tuples directed at it for the 
request id, it emits the unique count of its subset of followers.
+4. Finally, `CountAggregator` receives the partial counts from each of the 
`PartialUniquer` tasks and sums them up to complete the reach computation.
+
+Let's take a look at the `PartialUniquer` bolt:
+
+```java
+public class PartialUniquer extends BaseBatchBolt {
+    BatchOutputCollector _collector;
+    Object _id;
+    Set<String> _followers = new HashSet<String>();
+    
+    @Override
+    public void prepare(Map conf, TopologyContext context, 
BatchOutputCollector collector, Object id) {
+        _collector = collector;
+        _id = id;
+    }
+
+    @Override
+    public void execute(Tuple tuple) {
+        _followers.add(tuple.getString(1));
+    }
+    
+    @Override
+    public void finishBatch() {
+        _collector.emit(new Values(_id, _followers.size()));
+    }
+
+    @Override
+    public void declareOutputFields(OutputFieldsDeclarer declarer) {
+        declarer.declare(new Fields("id", "partial-count"));
+    }
+}
+```
+
+`PartialUniquer` implements `IBatchBolt` by extending `BaseBatchBolt`. A batch 
bolt provides a first class API to processing a batch of tuples as a concrete 
unit. A new instance of the batch bolt is created for each request id, and 
Storm takes care of cleaning up the instances when appropriate. 
+
+When `PartialUniquer` receives a follower tuple in the `execute` method, it 
adds it to the set for the request id in an internal `HashSet`. 
+
+Batch bolts provide the `finishBatch` method which is called after all the 
tuples for this batch targeted at this task have been processed. In the 
callback, `PartialUniquer` emits a single tuple containing the unique count for 
its subset of follower ids.
+
+Under the hood, `CoordinatedBolt` is used to detect when a given bolt has 
received all of the tuples for any given request id. `CoordinatedBolt` makes 
use of direct streams to manage this coordination.
+
+The rest of the topology should be self-explanatory. As you can see, every 
single step of the reach computation is done in parallel, and defining the DRPC 
topology was extremely simple.
+
+### Non-linear DRPC topologies
+
+`LinearDRPCTopologyBuilder` only handles "linear" DRPC topologies, where the 
computation is expressed as a sequence of steps (like reach). It's not hard to 
imagine functions that would require a more complicated topology with branching 
and merging of the bolts. For now, to do this you'll need to drop down into 
using `CoordinatedBolt` directly. Be sure to talk about your use case for 
non-linear DRPC topologies on the mailing list to inform the construction of 
more general abstractions for DRPC topologies.
+
+### How LinearDRPCTopologyBuilder works
+
+* DRPCSpout emits [args, return-info]. return-info is the host and port of the 
DRPC server as well as the id generated by the DRPC server
+* constructs a topology comprising of:
+  * DRPCSpout
+  * PrepareRequest (generates a request id and creates a stream for the return 
info and a stream for the args)
+  * CoordinatedBolt wrappers and direct groupings
+  * JoinResult (joins the result with the return info)
+  * ReturnResult (connects to the DRPC server and returns the result)
+* LinearDRPCTopologyBuilder is a good example of a higher level abstraction 
built on top of Storm's primitives
+
+### Advanced
+* KeyedFairBolt for weaving the processing of multiple requests at the same 
time
+* How to use `CoordinatedBolt` directly
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/FAQ.md
----------------------------------------------------------------------
diff --git a/documentation/FAQ.md b/documentation/FAQ.md
new file mode 100644
index 0000000..a69862e
--- /dev/null
+++ b/documentation/FAQ.md
@@ -0,0 +1,126 @@
+---
+title: FAQ
+layout: documentation
+documentation: true
+---
+
+## Best Practices
+
+### What rules of thumb can you give me for configuring Storm+Trident?
+
+* number of workers a multiple of number of machines; parallelism a multiple 
of number of workers; number of kafka partitions a multiple of number of spout 
parallelism
+* Use one worker per topology per machine
+* Start with fewer, larger aggregators, one per machine with workers on it
+* Use the isolation scheduler
+* Use one acker per worker -- 0.9 makes that the default, but earlier versions 
do not.
+* enable GC logging; you should see very few major GCs if things are in 
reasonable shape.
+* set the trident batch millis to about 50% of your typical end-to-end latency.
+* Start with a max spout pending that is for sure too small -- one for 
trident, or the number of executors for storm -- and increase it until you stop 
seeing changes in the flow. You'll probably end up with something near 
`2*(throughput in recs/sec)*(end-to-end latency)` (2x the Little's law 
capacity).
+
+### What are some of the best ways to get a worker to mysteriously and 
bafflingly die?
+
+* Do you have write access to the log directory
+* Are you blowing out your heap?
+* Are all the right libraries installed on all of the workers?
+* Is the zookeeper hostname still set to localhost?
+* Did you supply a correct, unique hostname -- one that resolves back to the 
machine -- to each worker, and put it in the storm conf file?
+* Have you opened firewall/securitygroup permissions _bidirectionally_ among 
a) all the workers, b) the storm master, c) zookeeper? Also, from the workers 
to any kafka/kestrel/database/etc that your topology accesses? Use netcat to 
poke the appropriate ports and be sure. 
+
+### Halp! I cannot see:
+
+* **my logs** Logs by default go to $STORM_HOME/logs. Check that you have 
write permissions to that directory. They are configured in 
+    * log4j2/{cluster, worker}.xml (> 0.9);
+    * logback/cluster.xml (0.9);
+    * log4j/*.properties in earlier versions (< 0.9).
+* **final JVM settings** Add the `-XX+PrintFlagsFinal` commandline option in 
the childopts (see the conf file)
+* **final Java system properties** Add `Properties props = 
System.getProperties(); props.list(System.out);` near where you build your 
topology.
+
+### How many Workers should I use?
+
+The total number of workers is set by the supervisors -- there's some number 
of JVM slots each supervisor will superintend. The thing you set on the 
topology is how many worker slots it will try to claim.
+
+There's no great reason to use more than one worker per topology per machine.
+
+With one topology running on three 8-core nodes, and parallelism hint 24, each 
bolt gets 8 executors per machine, i.e. one for each core. There are three big 
benefits to running three workers (with 8 assigned executors each) compare to 
running say 24 workers (one assigned executor each).
+
+First, data that is repartitioned (shuffles or group-bys) to executors in the 
same worker will not have to hit the transfer buffer. Instead, tuples are 
deposited directly from send to receive buffer. That's a big win. By contrast, 
if the destination executor were on the same machine in a different worker, it 
would have to go send -> worker transfer -> local socket -> worker recv -> exec 
recv buffer. It doesn't hit the network card, but it's not as big a win as when 
executors are in the same worker.
+
+Second, you're typically better off with three aggregators having very large 
backing cache than having twenty-four aggregators having small backing caches. 
This reduces the effect of skew, and improves LRU efficiency.
+
+Lastly, fewer workers reduces control flow chatter.
+
+## Topology
+
+### Can a Trident topology have Multiple Streams?
+
+> Can a Trident Topology work like a workflow with conditional paths 
(if-else)? e.g. A Spout (S1) connects to a bolt (B0) which based on certain 
values in the incoming tuple routes them to either bolt (B1) or bolt (B2) but 
not both.
+
+A Trident "each" operator returns a Stream object, which you can store in a 
variable. You can then run multiple eaches on the same Stream to split it, 
e.g.: 
+
+        Stream s = topology.each(...).groupBy(...).aggregate(...) 
+        Stream branch1 = s.each(..., FilterA) 
+        Stream branch2 = s.each(..., FilterB) 
+
+You can join streams with join, merge or multiReduce.
+
+At time of writing, you can't emit to multiple output streams from Trident -- 
see [STORM-68](https://issues.apache.org/jira/browse/STORM-68)
+
+## Spouts
+
+### What is a coordinator, and why are there several?
+
+A trident-spout is actually run within a storm _bolt_. The storm-spout of a 
trident topology is the MasterBatchCoordinator -- it coordinates trident 
batches and is the same no matter what spouts you use. A batch is born when the 
MBC dispenses a seed tuple to each of the spout-coordinators. The 
spout-coordinator bolts know how your particular spouts should cooperate -- so 
in the kafka case, it's what helps figure out what partition and offset range 
each spout should pull from.
+
+### What can I store into the spout's metadata record?
+
+You should only store static data, and as little of it as possible, into the 
metadata record (note: maybe you _can_ store more interesting things; you 
shouldn't, though)
+
+### How often is the 'emitPartitionBatchNew' function called?
+
+Since the MBC is the actual spout, all the tuples in a batch are just members 
of its tupletree. That means storm's "max spout pending" config effectively 
defines the number of concurrent batches trident runs. The MBC emits a new 
batch if it has fewer than max-spending tuples pending and if at least one 
[trident batch 
interval](https://github.com/apache/storm/blob/master/conf/defaults.yaml#L115)'s
 worth of seconds has passed since the last batch.
+
+### If nothing was emitted does Trident slow down the calls?
+
+Yes, there's a pluggable "spout wait strategy"; the default is to sleep for a 
[configurable amount of 
time](https://github.com/apache/storm/blob/master/conf/defaults.yaml#L110)
+
+### OK, then what is the trident batch interval for?
+
+You know how computers of the 486 era had a [turbo 
button](http://en.wikipedia.org/wiki/Turbo_button) on them? It's like that. 
+
+Actually, it has two practical uses. One is to throttle spouts that poll a 
remote source without throttling processing. For example, we have a spout that 
looks in a given S3 bucket for a new batch-uploaded file to read, linebreak and 
emit. We don't want it hitting S3 more than every few seconds: files don't show 
up more than once every few minutes, and a batch takes a few seconds to process.
+
+The other is to limit overpressure on the internal queues during startup or 
under a heavy burst load -- if the spouts spring to life and suddenly jam ten 
batches' worth of records into the system, you could have a mass of less-urgent 
tuples from batch 7 clog up the transfer buffer and prevent the $commit tuple 
from batch 3 to get through (or even just the regular old tuples from batch 3). 
What we do is set the trident batch interval to about half the typical 
end-to-end processing latency -- if it takes 600ms to process a batch, it's OK 
to only kick off a batch every 300ms.
+
+Note that this is a cap, not an additional delay -- with a period of 300ms, if 
your batch takes 258ms Trident will only delay an additional 42ms.
+
+### How do you set the batch size?
+
+Trident doesn't place its own limits on the batch count. In the case of the 
Kafka spout, the max fetch bytes size divided by the average record size 
defines an effective records per subbatch partition.
+
+### How do I resize a batch?
+
+The trident batch is a somewhat overloaded facility. Together with the number 
of partitions, the batch size is constrained by or serves to define
+
+1. the unit of transactional safety (tuples at risk vs time)
+2. per partition, an effective windowing mechanism for windowed stream 
analytics
+3. per partition, the number of simultaneous queries that will be made by a 
partitionQuery, partitionPersist, etc;
+4. per partition, the number of records convenient for the spout to dispatch 
at the same time;
+
+You can't change the overall batch size once generated, but you can change the 
number of partitions -- do a shuffle and then change the parallelism hint
+
+## Time Series
+
+### How do I aggregate events by time?
+
+If have records with an immutable timestamp, and you would like to count, 
average or otherwise aggregate them into discrete time buckets, Trident is an 
excellent and scalable solution.
+
+Write an `Each` function that turns the timestamp into a time bucket: if the 
bucket size was "by hour", then the timestamp `2013-08-08 12:34:56` would be 
mapped to the `2013-08-08 12:00:00` time bucket, and so would everything else 
in the twelve o'clock hour. Then group on that timebucket and use a grouped 
persistentAggregate. The persistentAggregate uses a local cacheMap backed by a 
data store. Groups with many records require very few reads from the data 
store, and use efficient bulk reads and writes; as long as your data feed is 
relatively prompt Trident will make very efficient use of memory and network. 
Even if a server drops off line for a day, then delivers that full day's worth 
of data in a rush, the old results will be calmly retrieved and updated -- and 
without interfering with calculating the current results.
+
+### How can I know that all records for a time bucket have been received?
+
+You cannot know that all events are collected -- this is an epistemological 
challenge, not a distributed systems challenge. You can:
+
+* Set a time limit using domain knowledge
+* Introduce a _punctuation_: a record known to come after all records in the 
given time bucket. Trident uses this scheme to know when a batch is complete. 
If you for instance receive records from a set of sensors, each in order for 
that sensor, then once all sensors have sent you a 3:02:xx or later timestamp 
lets you know you can commit. 
+* When possible, make your process incremental: each value that comes in makes 
the answer more an more true. A Trident ReducerAggregator is an operator that 
takes a prior result and a set of new records and returns a new result. This 
lets the result be cached and serialized to a datastore; if a server drops off 
line for a day and then comes back with a full day's worth of data in a rush, 
the old results will be calmly retrieved and updated.
+* Lambda architecture: Record all events into an archival store (S3, HBase, 
HDFS) on receipt. in the fast layer, once the time window is clear, process the 
bucket to get an actionable answer, and ignore everything older than the time 
window. Periodically run a global aggregation to calculate a "correct" answer.

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Fault-tolerance.md
----------------------------------------------------------------------
diff --git a/documentation/Fault-tolerance.md b/documentation/Fault-tolerance.md
new file mode 100644
index 0000000..d70fd1d
--- /dev/null
+++ b/documentation/Fault-tolerance.md
@@ -0,0 +1,30 @@
+---
+title: Fault Tolerance
+layout: documentation
+documentation: true
+---
+This page explains the design details of Storm that make it a fault-tolerant 
system.
+
+## What happens when a worker dies?
+
+When a worker dies, the supervisor will restart it. If it continuously fails 
on startup and is unable to heartbeat to Nimbus, Nimbus will reassign the 
worker to another machine.
+
+## What happens when a node dies?
+
+The tasks assigned to that machine will time-out and Nimbus will reassign 
those tasks to other machines.
+
+## What happens when Nimbus or Supervisor daemons die?
+
+The Nimbus and Supervisor daemons are designed to be fail-fast (process 
self-destructs whenever any unexpected situation is encountered) and stateless 
(all state is kept in Zookeeper or on disk). As described in [Setting up a 
Storm cluster](Setting-up-a-Storm-cluster.html), the Nimbus and Supervisor 
daemons must be run under supervision using a tool like daemontools or monit. 
So if the Nimbus or Supervisor daemons die, they restart like nothing happened.
+
+Most notably, no worker processes are affected by the death of Nimbus or the 
Supervisors. This is in contrast to Hadoop, where if the JobTracker dies, all 
the running jobs are lost. 
+
+## Is Nimbus a single point of failure?
+
+If you lose the Nimbus node, the workers will still continue to function. 
Additionally, supervisors will continue to restart workers if they die. 
However, without Nimbus, workers won't be reassigned to other machines when 
necessary (like if you lose a worker machine). 
+
+So the answer is that Nimbus is "sort of" a SPOF. In practice, it's not a big 
deal since nothing catastrophic happens when the Nimbus daemon dies. There are 
plans to make Nimbus highly available in the future.
+
+## How does Storm guarantee data processing?
+
+Storm provides mechanisms to guarantee data processing even if nodes die or 
messages are lost. See [Guaranteeing message 
processing](Guaranteeing-message-processing.html) for the details.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Guaranteeing-message-processing.md
----------------------------------------------------------------------
diff --git a/documentation/Guaranteeing-message-processing.md 
b/documentation/Guaranteeing-message-processing.md
new file mode 100644
index 0000000..9ade3fa
--- /dev/null
+++ b/documentation/Guaranteeing-message-processing.md
@@ -0,0 +1,181 @@
+---
+title: Guaranteeing Message Processing
+layout: documentation
+documentation: true
+---
+Storm guarantees that each message coming off a spout will be fully processed. 
This page describes how Storm accomplishes this guarantee and what you have to 
do as a user to benefit from Storm's reliability capabilities.
+
+### What does it mean for a message to be "fully processed"?
+
+A tuple coming off a spout can trigger thousands of tuples to be created based 
on it. Consider, for example, the streaming word count topology:
+
+```java
+TopologyBuilder builder = new TopologyBuilder();
+builder.setSpout("sentences", new KestrelSpout("kestrel.backtype.com",
+                                               22133,
+                                               "sentence_queue",
+                                               new StringScheme()));
+builder.setBolt("split", new SplitSentence(), 10)
+        .shuffleGrouping("sentences");
+builder.setBolt("count", new WordCount(), 20)
+        .fieldsGrouping("split", new Fields("word"));
+```
+
+This topology reads sentences off of a Kestrel queue, splits the sentences 
into its constituent words, and then emits for each word the number of times it 
has seen that word before. A tuple coming off the spout triggers many tuples 
being created based on it: a tuple for each word in the sentence and a tuple 
for the updated count for each word. The tree of messages looks something like 
this:
+
+![Tuple tree](images/tuple_tree.png)
+
+Storm considers a tuple coming off a spout "fully processed" when the tuple 
tree has been exhausted and every message in the tree has been processed. A 
tuple is considered failed when its tree of messages fails to be fully 
processed within a specified timeout. This timeout can be configured on a 
topology-specific basis using the 
[Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS](/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_MESSAGE_TIMEOUT_SECS)
 configuration and defaults to 30 seconds.
+
+### What happens if a message is fully processed or fails to be fully 
processed?
+
+To understand this question, let's take a look at the lifecycle of a tuple 
coming off of a spout. For reference, here is the interface that spouts 
implement (see the [Javadoc](/javadoc/apidocs/backtype/storm/spout/ISpout.html) 
for more information):
+
+```java
+public interface ISpout extends Serializable {
+    void open(Map conf, TopologyContext context, SpoutOutputCollector 
collector);
+    void close();
+    void nextTuple();
+    void ack(Object msgId);
+    void fail(Object msgId);
+}
+```
+
+First, Storm requests a tuple from the `Spout` by calling the `nextTuple` 
method on the `Spout`. The `Spout` uses the `SpoutOutputCollector` provided in 
the `open` method to emit a tuple to one of its output streams. When emitting a 
tuple, the `Spout` provides a "message id" that will be used to identify the 
tuple later. For example, the `KestrelSpout` reads a message off of the kestrel 
queue and emits as the "message id" the id provided by Kestrel for the message. 
Emitting a message to the `SpoutOutputCollector` looks like this:
+
+```java
+_collector.emit(new Values("field1", "field2", 3) , msgId);
+```
+
+Next, the tuple gets sent to consuming bolts and Storm takes care of tracking 
the tree of messages that is created. If Storm detects that a tuple is fully 
processed, Storm will call the `ack` method on the originating `Spout` task 
with the message id that the `Spout` provided to Storm. Likewise, if the tuple 
times-out Storm will call the `fail` method on the `Spout`. Note that a tuple 
will be acked or failed by the exact same `Spout` task that created it. So if a 
`Spout` is executing as many tasks across the cluster, a tuple won't be acked 
or failed by a different task than the one that created it.
+
+Let's use `KestrelSpout` again to see what a `Spout` needs to do to guarantee 
message processing. When `KestrelSpout` takes a message off the Kestrel queue, 
it "opens" the message. This means the message is not actually taken off the 
queue yet, but instead placed in a "pending" state waiting for acknowledgement 
that the message is completed. While in the pending state, a message will not 
be sent to other consumers of the queue. Additionally, if a client disconnects 
all pending messages for that client are put back on the queue. When a message 
is opened, Kestrel provides the client with the data for the message as well as 
a unique id for the message. The `KestrelSpout` uses that exact id as the 
"message id" for the tuple when emitting the tuple to the 
`SpoutOutputCollector`. Sometime later on, when `ack` or `fail` are called on 
the `KestrelSpout`, the `KestrelSpout` sends an ack or fail message to Kestrel 
with the message id to take the message off the queue or have it put back on.
+
+### What is Storm's reliability API?
+
+There's two things you have to do as a user to benefit from Storm's 
reliability capabilities. First, you need to tell Storm whenever you're 
creating a new link in the tree of tuples. Second, you need to tell Storm when 
you have finished processing an individual tuple. By doing both these things, 
Storm can detect when the tree of tuples is fully processed and can ack or fail 
the spout tuple appropriately. Storm's API provides a concise way of doing both 
of these tasks. 
+
+Specifying a link in the tuple tree is called _anchoring_. Anchoring is done 
at the same time you emit a new tuple. Let's use the following bolt as an 
example. This bolt splits a tuple containing a sentence into a tuple for each 
word:
+
+```java
+public class SplitSentence extends BaseRichBolt {
+        OutputCollector _collector;
+        
+        public void prepare(Map conf, TopologyContext context, OutputCollector 
collector) {
+            _collector = collector;
+        }
+
+        public void execute(Tuple tuple) {
+            String sentence = tuple.getString(0);
+            for(String word: sentence.split(" ")) {
+                _collector.emit(tuple, new Values(word));
+            }
+            _collector.ack(tuple);
+        }
+
+        public void declareOutputFields(OutputFieldsDeclarer declarer) {
+            declarer.declare(new Fields("word"));
+        }        
+    }
+```
+
+Each word tuple is _anchored_ by specifying the input tuple as the first 
argument to `emit`. Since the word tuple is anchored, the spout tuple at the 
root of the tree will be replayed later on if the word tuple failed to be 
processed downstream. In contrast, let's look at what happens if the word tuple 
is emitted like this:
+
+```java
+_collector.emit(new Values(word));
+```
+
+Emitting the word tuple this way causes it to be _unanchored_. If the tuple 
fails be processed downstream, the root tuple will not be replayed. Depending 
on the fault-tolerance guarantees you need in your topology, sometimes it's 
appropriate to emit an unanchored tuple.
+
+An output tuple can be anchored to more than one input tuple. This is useful 
when doing streaming joins or aggregations. A multi-anchored tuple failing to 
be processed will cause multiple tuples to be replayed from the spouts. 
Multi-anchoring is done by specifying a list of tuples rather than just a 
single tuple. For example:
+
+```java
+List<Tuple> anchors = new ArrayList<Tuple>();
+anchors.add(tuple1);
+anchors.add(tuple2);
+_collector.emit(anchors, new Values(1, 2, 3));
+```
+
+Multi-anchoring adds the output tuple into multiple tuple trees. Note that 
it's also possible for multi-anchoring to break the tree structure and create 
tuple DAGs, like so:
+
+![Tuple DAG](images/tuple-dag.png)
+
+Storm's implementation works for DAGs as well as trees (pre-release it only 
worked for trees, and the name "tuple tree" stuck).
+
+Anchoring is how you specify the tuple tree -- the next and final piece to 
Storm's reliability API is specifying when you've finished processing an 
individual tuple in the tuple tree. This is done by using the `ack` and `fail` 
methods on the `OutputCollector`. If you look back at the `SplitSentence` 
example, you can see that the input tuple is acked after all the word tuples 
are emitted.
+
+You can use the `fail` method on the `OutputCollector` to immediately fail the 
spout tuple at the root of the tuple tree. For example, your application may 
choose to catch an exception from a database client and explicitly fail the 
input tuple. By failing the tuple explicitly, the spout tuple can be replayed 
faster than if you waited for the tuple to time-out.
+
+Every tuple you process must be acked or failed. Storm uses memory to track 
each tuple, so if you don't ack/fail every tuple, the task will eventually run 
out of memory. 
+
+A lot of bolts follow a common pattern of reading an input tuple, emitting 
tuples based on it, and then acking the tuple at the end of the `execute` 
method. These bolts fall into the categories of filters and simple functions. 
Storm has an interface called `BasicBolt` that encapsulates this pattern for 
you. The `SplitSentence` example can be written as a `BasicBolt` like follows:
+
+```java
+public class SplitSentence extends BaseBasicBolt {
+        public void execute(Tuple tuple, BasicOutputCollector collector) {
+            String sentence = tuple.getString(0);
+            for(String word: sentence.split(" ")) {
+                collector.emit(new Values(word));
+            }
+        }
+
+        public void declareOutputFields(OutputFieldsDeclarer declarer) {
+            declarer.declare(new Fields("word"));
+        }        
+    }
+```
+
+This implementation is simpler than the implementation from before and is 
semantically identical. Tuples emitted to `BasicOutputCollector` are 
automatically anchored to the input tuple, and the input tuple is acked for you 
automatically when the execute method completes.
+
+In contrast, bolts that do aggregations or joins may delay acking a tuple 
until after it has computed a result based on a bunch of tuples. Aggregations 
and joins will commonly multi-anchor their output tuples as well. These things 
fall outside the simpler pattern of `IBasicBolt`.
+
+### How do I make my applications work correctly given that tuples can be 
replayed?
+
+As always in software design, the answer is "it depends." Storm 0.7.0 
introduced the "transactional topologies" feature, which enables you to get 
fully fault-tolerant exactly-once messaging semantics for most computations. 
Read more about transactional topologies [here](Transactional-topologies.html). 
+
+
+### How does Storm implement reliability in an efficient way?
+
+A Storm topology has a set of special "acker" tasks that track the DAG of 
tuples for every spout tuple. When an acker sees that a DAG is complete, it 
sends a message to the spout task that created the spout tuple to ack the 
message. You can set the number of acker executors for a topology in the 
topology configuration using 
[Config.TOPOLOGY_ACKER_EXECUTORS](/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_ACKER_EXECUTORS).
 Storm defaults TOPOLOGY_ACKER_EXECUTORS to be equal to the number of workers 
configured in the topology -- you will need to increase this number for 
topologies processing large amounts of messages.
+
+The best way to understand Storm's reliability implementation is to look at 
the lifecycle of tuples and tuple DAGs. When a tuple is created in a topology, 
whether in a spout or a bolt, it is given a random 64 bit id. These ids are 
used by ackers to track the tuple DAG for every spout tuple.
+
+Every tuple knows the ids of all the spout tuples for which it exists in their 
tuple trees. When you emit a new tuple in a bolt, the spout tuple ids from the 
tuple's anchors are copied into the new tuple. When a tuple is acked, it sends 
a message to the appropriate acker tasks with information about how the tuple 
tree changed. In particular it tells the acker "I am now completed within the 
tree for this spout tuple, and here are the new tuples in the tree that were 
anchored to me". 
+
+For example, if tuples "D" and "E" were created based on tuple "C", here's how 
the tuple tree changes when "C" is acked: 
+
+![What happens on an ack](images/ack_tree.png)
+
+Since "C" is removed from the tree at the same time that "D" and "E" are added 
to it, the tree can never be prematurely completed.
+
+There are a few more details to how Storm tracks tuple trees. As mentioned 
already, you can have an arbitrary number of acker tasks in a topology. This 
leads to the following question: when a tuple is acked in the topology, how 
does it know to which acker task to send that information? 
+
+Storm uses mod hashing to map a spout tuple id to an acker task. Since every 
tuple carries with it the spout tuple ids of all the trees they exist within, 
they know which acker tasks to communicate with. 
+
+Another detail of Storm is how the acker tasks track which spout tasks are 
responsible for each spout tuple they're tracking. When a spout task emits a 
new tuple, it simply sends a message to the appropriate acker telling it that 
its task id is responsible for that spout tuple. Then when an acker sees a tree 
has been completed, it knows to which task id to send the completion message.
+
+Acker tasks do not track the tree of tuples explicitly. For large tuple trees 
with tens of thousands of nodes (or more), tracking all the tuple trees could 
overwhelm the memory used by the ackers. Instead, the ackers take a different 
strategy that only requires a fixed amount of space per spout tuple (about 20 
bytes). This tracking algorithm is the key to how Storm works and is one of its 
major breakthroughs.
+
+An acker task stores a map from a spout tuple id to a pair of values. The 
first value is the task id that created the spout tuple which is used later on 
to send completion messages. The second value is a 64 bit number called the 
"ack val". The ack val is a representation of the state of the entire tuple 
tree, no matter how big or how small.  It is simply the xor of all tuple ids 
that have been created and/or acked in the tree.
+
+When an acker task sees that an "ack val" has become 0, then it knows that the 
tuple tree is completed. Since tuple ids are random 64 bit numbers, the chances 
of an "ack val" accidentally becoming 0 is extremely small. If you work the 
math, at 10K acks per second, it will take 50,000,000 years until a mistake is 
made. And even then, it will only cause data loss if that tuple happens to fail 
in the topology.
+
+Now that you understand the reliability algorithm, let's go over all the 
failure cases and see how in each case Storm avoids data loss:
+
+- **A tuple isn't acked because the task died**: In this case the spout tuple 
ids at the root of the trees for the failed tuple will time out and be replayed.
+- **Acker task dies**: In this case all the spout tuples the acker was 
tracking will time out and be replayed.
+- **Spout task dies**: In this case the source that the spout talks to is 
responsible for replaying the messages. For example, queues like Kestrel and 
RabbitMQ will place all pending messages back on the queue when a client 
disconnects.
+
+As you have seen, Storm's reliability mechanisms are completely distributed, 
scalable, and fault-tolerant. 
+
+### Tuning reliability
+
+Acker tasks are lightweight, so you don't need very many of them in a 
topology. You can track their performance through the Storm UI (component id 
"__acker"). If the throughput doesn't look right, you'll need to add more acker 
tasks. 
+
+If reliability isn't important to you -- that is, you don't care about losing 
tuples in failure situations -- then you can improve performance by not 
tracking the tuple tree for spout tuples. Not tracking a tuple tree halves the 
number of messages transferred since normally there's an ack message for every 
tuple in the tuple tree. Additionally, it requires fewer ids to be kept in each 
downstream tuple, reducing bandwidth usage.
+
+There are three ways to remove reliability. The first is to set 
Config.TOPOLOGY_ACKERS to 0. In this case, Storm will call the `ack` method on 
the spout immediately after the spout emits a tuple. The tuple tree won't be 
tracked.
+
+The second way is to remove reliability on a message by message basis. You can 
turn off tracking for an individual spout tuple by omitting a message id in the 
`SpoutOutputCollector.emit` method.
+
+Finally, if you don't care if a particular subset of the tuples downstream in 
the topology fail to be processed, you can emit them as unanchored tuples. 
Since they're not anchored to any spout tuples, they won't cause any spout 
tuples to fail if they aren't acked.

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Home.md
----------------------------------------------------------------------
diff --git a/documentation/Home.md b/documentation/Home.md
new file mode 100644
index 0000000..452dfba
--- /dev/null
+++ b/documentation/Home.md
@@ -0,0 +1,69 @@
+---
+title: Storm Documentation
+layout: documentation
+---
+Storm is a distributed realtime computation system. Similar to how Hadoop 
provides a set of general primitives for doing batch processing, Storm provides 
a set of general primitives for doing realtime computation. Storm is simple, 
can be used with any programming language, [is used by many 
companies](/documentation/Powered-By.html), and is a lot of fun to use!
+
+### Read these first
+
+* [Rationale](Rationale.html)
+* [Tutorial](Tutorial.html)
+* [Setting up development environment](Setting-up-development-environment.html)
+* [Creating a new Storm project](Creating-a-new-Storm-project.html)
+
+### Documentation
+
+* [Documentation Index](/doc-index.html)
+* [Manual](Documentation.html)
+* [Javadoc](/javadoc/apidocs/index.html)
+* [FAQ](FAQ.html)
+
+### Getting help
+
+__NOTE:__ The google groups account [email protected] is now 
officially deprecated in favor of the Apache-hosted user/dev mailing lists.
+
+#### Storm Users
+Storm users should send messages and subscribe to 
[[email protected]](mailto:[email protected]).
+
+You can subscribe to this list by sending an email to 
[[email protected]](mailto:[email protected]). 
Likewise, you can cancel a subscription by sending an email to 
[[email protected]](mailto:[email protected]).
+
+You can view the archives of the mailing list 
[here](http://mail-archives.apache.org/mod_mbox/storm-user/).
+
+#### Storm Developers
+Storm developers should send messages and subscribe to 
[[email protected]](mailto:[email protected]).
+
+You can subscribe to this list by sending an email to 
[[email protected]](mailto:[email protected]). 
Likewise, you can cancel a subscription by sending an email to 
[[email protected]](mailto:[email protected]).
+
+You can view the archives of the mailing list 
[here](http://mail-archives.apache.org/mod_mbox/storm-dev/).
+
+#### Which list should I send/subscribe to?
+If you are using a pre-built binary distribution of Storm, then chances are 
you should send questions, comments, storm-related announcements, etc. to 
[[email protected]]([email protected]). 
+
+If you are building storm from source, developing new features, or otherwise 
hacking storm source code, then [[email protected]]([email protected]) 
is more appropriate. 
+
+#### What will happen with [email protected]?
+All existing messages will remain archived there, and can be accessed/searched 
[here](https://groups.google.com/forum/#!forum/storm-user).
+
+New messages sent to [email protected] will either be 
rejected/bounced or replied to with a message to direct the email to the 
appropriate Apache-hosted group.
+
+#### IRC
+You can also come to the #storm-user room on [freenode](http://freenode.net/). 
You can usually find a Storm developer there to help you out.
+
+
+
+### Related projects
+
+* [storm-contrib](https://github.com/nathanmarz/storm-contrib)
+* [storm-deploy](http://github.com/nathanmarz/storm-deploy): One click deploys 
for Storm clusters on AWS
+* [Spout implementations](Spout-implementations.html)
+* [DSLs and multilang adapters](DSLs-and-multilang-adapters.html)
+* [Serializers](Serializers.html)
+
+### Contributing to Storm
+
+* [Contributing to Storm](Contributing-to-Storm.html)
+* [Project ideas](Project-ideas.html)
+
+### Powered by Storm
+
+[Companies and projects powered by Storm](Powered-By.html)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Hooks.md
----------------------------------------------------------------------
diff --git a/documentation/Hooks.md b/documentation/Hooks.md
new file mode 100644
index 0000000..6a62422
--- /dev/null
+++ b/documentation/Hooks.md
@@ -0,0 +1,9 @@
+---
+title: Hooks
+layout: documentation
+documentation: true
+---
+Storm provides hooks with which you can insert custom code to run on any 
number of events within Storm. You create a hook by extending the 
[BaseTaskHook](/javadoc/apidocs/backtype/storm/hooks/BaseTaskHook.html) class 
and overriding the appropriate method for the event you want to catch. There 
are two ways to register your hook:
+
+1. In the open method of your spout or prepare method of your bolt using the 
[TopologyContext](/javadoc/apidocs/backtype/storm/task/TopologyContext.html#addTaskHook)
 method.
+2. Through the Storm configuration using the 
["topology.auto.task.hooks"](/javadoc/apidocs/backtype/storm/Config.html#TOPOLOGY_AUTO_TASK_HOOKS)
 config. These hooks are automatically registered in every spout or bolt, and 
are useful for doing things like integrating with a custom monitoring system.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Implementation-docs.md
----------------------------------------------------------------------
diff --git a/documentation/Implementation-docs.md 
b/documentation/Implementation-docs.md
new file mode 100644
index 0000000..9c3a427
--- /dev/null
+++ b/documentation/Implementation-docs.md
@@ -0,0 +1,20 @@
+---
+title: Storm Internal Implementation
+layout: documentation
+documentation: true
+---
+This section of the wiki is dedicated to explaining how Storm is implemented. 
You should have a good grasp of how to use Storm before reading these sections. 
+
+- [Structure of the codebase](Structure-of-the-codebase.html)
+- [Lifecycle of a topology](Lifecycle-of-a-topology.html)
+- [Message passing implementation](Message-passing-implementation.html)
+- [Acking framework implementation](Acking-framework-implementation.html)
+- [Metrics](Metrics.html)
+- How transactional topologies work
+   - subtopology for TransactionalSpout
+   - how state is stored in ZK
+   - subtleties around what to do when emitting batches out of order
+- Unit testing
+  - time simulation
+  - complete-topology
+  - tracker clusters

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Installing-native-dependencies.md
----------------------------------------------------------------------
diff --git a/documentation/Installing-native-dependencies.md 
b/documentation/Installing-native-dependencies.md
new file mode 100644
index 0000000..3207b8e
--- /dev/null
+++ b/documentation/Installing-native-dependencies.md
@@ -0,0 +1,38 @@
+---
+layout: documentation
+---
+The native dependencies are only needed on actual Storm clusters. When running 
Storm in local mode, Storm uses a pure Java messaging system so that you don't 
need to install native dependencies on your development machine.
+
+Installing ZeroMQ and JZMQ is usually straightforward. Sometimes, however, 
people run into issues with autoconf and get strange errors. If you run into 
any issues, please email the [Storm mailing 
list](http://groups.google.com/group/storm-user) or come get help in the 
#storm-user room on freenode. 
+
+Storm has been tested with ZeroMQ 2.1.7, and this is the recommended ZeroMQ 
release that you install. You can download a ZeroMQ release 
[here](http://download.zeromq.org/). Installing ZeroMQ should look something 
like this:
+
+```
+wget http://download.zeromq.org/zeromq-2.1.7.tar.gz
+tar -xzf zeromq-2.1.7.tar.gz
+cd zeromq-2.1.7
+./configure
+make
+sudo make install
+```
+
+JZMQ is the Java bindings for ZeroMQ. JZMQ doesn't have any releases (we're 
working with them on that), so there is risk of a regression if you always 
install from the master branch. To prevent a regression from happening, you 
should instead install from [this fork](http://github.com/nathanmarz/jzmq) 
which is tested to work with Storm. Installing JZMQ should look something like 
this:
+
+```
+#install jzmq
+git clone https://github.com/nathanmarz/jzmq.git
+cd jzmq
+./autogen.sh
+./configure
+make
+sudo make install
+```
+
+To get the JZMQ build to work, you may need to do one or all of the following:
+
+1. Set JAVA_HOME environment variable appropriately
+2. Install Java dev package (more info 
[here](http://codeslinger.posterous.com/getting-zeromq-and-jzmq-running-on-mac-os-x)
 for Mac OSX users)
+3. Upgrade autoconf on your machine
+4. Follow the instructions in [this blog 
post](http://blog.pmorelli.com/getting-zeromq-and-jzmq-running-on-mac-os-x)
+
+If you run into any errors when running `./configure`, [this 
thread](http://stackoverflow.com/questions/3522248/how-do-i-compile-jzmq-for-zeromq-on-osx)
 may provide a solution.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/storm/blob/1d09012e/documentation/Kestrel-and-Storm.md
----------------------------------------------------------------------
diff --git a/documentation/Kestrel-and-Storm.md 
b/documentation/Kestrel-and-Storm.md
new file mode 100644
index 0000000..d079b81
--- /dev/null
+++ b/documentation/Kestrel-and-Storm.md
@@ -0,0 +1,200 @@
+---
+title: Storm and Kestrel
+layout: documentation
+documentation: true
+---
+This page explains how to use to Storm to consume items from a Kestrel cluster.
+
+## Preliminaries
+### Storm
+This tutorial uses examples from the 
[storm-kestrel](https://github.com/nathanmarz/storm-kestrel) project and the 
[storm-starter](http://github.com/apache/storm/blob/master/examples/storm-starter)
 project. It's recommended that you clone those projects and follow along with 
the examples. Read [Setting up development 
environment](https://github.com/apache/storm/wiki/Setting-up-development-environment)
 and [Creating a new Storm 
project](https://github.com/apache/storm/wiki/Creating-a-new-Storm-project) to 
get your machine set up.
+### Kestrel
+It assumes you are able to run locally a Kestrel server as described 
[here](https://github.com/nathanmarz/storm-kestrel).
+
+## Kestrel Server and Queue
+A single kestrel server has a set of queues. A Kestrel queue is a very simple 
message queue that runs on the JVM and uses the memcache protocol (with some 
extensions) to talk to clients. For details, look at the implementation of the 
[KestrelThriftClient](https://github.com/nathanmarz/storm-kestrel/blob/master/src/jvm/backtype/storm/spout/KestrelThriftClient.java)
 class provided in [storm-kestrel](https://github.com/nathanmarz/storm-kestrel) 
project.
+
+Each queue is strictly ordered following the FIFO (first in, first out) 
principle. To keep up with performance items are cached in system memory; 
though, only the first 128MB is kept in memory. When stopping the server, the 
queue state is stored in a journal file.
+
+Further, details can be found 
[here](https://github.com/nathanmarz/kestrel/blob/master/docs/guide.md).
+
+Kestrel is:
+* fast
+* small
+* durable
+* reliable
+
+For instance, Twitter uses Kestrel as the backbone of its messaging 
infrastructure as described [here] 
(http://bhavin.directi.com/notes-on-kestrel-the-open-source-twitter-queue/).
+
+## Add items to Kestrel
+At first, we need to have a program that can add items to a Kestrel queue. The 
following method takes benefit of the KestrelClient implementation in 
[storm-kestrel](https://github.com/nathanmarz/storm-kestrel). It adds sentences 
into a Kestrel queue randomly chosen out of an array that holds five possible 
sentences.
+
+```
+    private static void queueSentenceItems(KestrelClient kestrelClient, String 
queueName)
+                       throws ParseError, IOException {
+
+               String[] sentences = new String[] {
+                   "the cow jumped over the moon",
+                   "an apple a day keeps the doctor away",
+                   "four score and seven years ago",
+                   "snow white and the seven dwarfs",
+                   "i am at two with nature"};
+
+               Random _rand = new Random();
+
+               for(int i=1; i<=10; i++){
+
+                       String sentence = 
sentences[_rand.nextInt(sentences.length)];
+
+                       String val = "ID " + i + " " + sentence;
+
+                       boolean queueSucess = kestrelClient.queue(queueName, 
val);
+
+                       System.out.println("queueSucess=" +queueSucess+ " [" + 
val +"]");
+               }
+       }
+```
+
+## Remove items from Kestrel
+
+This method dequeues items from a queue without removing them.
+```
+    private static void dequeueItems(KestrelClient kestrelClient, String 
queueName) throws IOException, ParseError
+                        {
+               for(int i=1; i<=12; i++){
+
+                       Item item = kestrelClient.dequeue(queueName);
+
+                       if(item==null){
+                               System.out.println("The queue (" + queueName + 
") contains no items.");
+                       }
+                       else
+                       {
+                               byte[] data = item._data;
+
+                               String receivedVal = new String(data);
+
+                               System.out.println("receivedItem=" + 
receivedVal);
+                       }
+               }
+```
+
+This method dequeues items from a queue and then removes them.
+```
+    private static void dequeueAndRemoveItems(KestrelClient kestrelClient, 
String queueName)
+    throws IOException, ParseError
+                {
+                       for(int i=1; i<=12; i++){
+
+                               Item item = kestrelClient.dequeue(queueName);
+
+
+                               if(item==null){
+                                       System.out.println("The queue (" + 
queueName + ") contains no items.");
+                               }
+                               else
+                               {
+                                       int itemID = item._id;
+
+
+                                       byte[] data = item._data;
+
+                                       String receivedVal = new String(data);
+
+                                       kestrelClient.ack(queueName, itemID);
+
+                                       System.out.println("receivedItem=" + 
receivedVal);
+                               }
+                       }
+       }
+```
+
+## Add Items continuously to Kestrel
+
+This is our final program to run in order to add continuously sentence items 
to a queue called **sentence_queue** of a locally running Kestrel server.
+
+In order to stop it type a closing bracket char ']' in console and hit 'Enter'.
+
+```
+    import java.io.IOException;
+    import java.io.InputStream;
+    import java.util.Random;
+
+    import backtype.storm.spout.KestrelClient;
+    import backtype.storm.spout.KestrelClient.Item;
+    import backtype.storm.spout.KestrelClient.ParseError;
+
+    public class AddSentenceItemsToKestrel {
+
+       /**
+        * @param args
+        */
+       public static void main(String[] args) {
+
+               InputStream is = System.in;
+
+                       char closing_bracket = ']';
+
+                       int val = closing_bracket;
+
+                       boolean aux = true;
+
+                       try {
+
+                               KestrelClient kestrelClient = null;
+                               String queueName = "sentence_queue";
+
+                               while(aux){
+
+                                       kestrelClient = new 
KestrelClient("localhost",22133);
+
+                                       queueSentenceItems(kestrelClient, 
queueName);
+
+                                       kestrelClient.close();
+
+                                       Thread.sleep(1000);
+
+                                       if(is.available()>0){
+                                        if(val==is.read())
+                                                aux=false;
+                                       }
+                               }
+                       } catch (IOException e) {
+                               // TODO Auto-generated catch block
+                               e.printStackTrace();
+                       }
+                       catch (ParseError e) {
+                               // TODO Auto-generated catch block
+                               e.printStackTrace();
+                       } catch (InterruptedException e) {
+                               // TODO Auto-generated catch block
+                               e.printStackTrace();
+                       }
+
+                       System.out.println("end");
+
+           }
+       }
+```
+## Using KestrelSpout
+
+This topology reads sentences off of a Kestrel queue using KestrelSpout, 
splits the sentences into its constituent words (Bolt: SplitSentence), and then 
emits for each word the number of times it has seen that word before (Bolt: 
WordCount). How data is processed is described in detail in [Guaranteeing 
message processing](Guaranteeing-message-processing.html).
+
+```
+    TopologyBuilder builder = new TopologyBuilder();
+    builder.setSpout("sentences", new 
KestrelSpout("localhost",22133,"sentence_queue",new StringScheme()));
+    builder.setBolt("split", new SplitSentence(), 10)
+               .shuffleGrouping("sentences");
+    builder.setBolt("count", new WordCount(), 20)
+               .fieldsGrouping("split", new Fields("word"));
+```
+
+## Execution
+
+At first, start your local kestrel server in production or development mode.
+
+Than, wait about 5 seconds in order to avoid a ConnectException.
+
+Now execute the program to add items to the queue and launch the Storm 
topology. The order in which you launch the programs is of no importance.
+
+If you run the topology with TOPOLOGY_DEBUG you should see tuples being 
emitted in the topology.
\ No newline at end of file

[02/51] [partial] storm git commit: merge STORM-950 into asf-site. This closes #670

Reply via email to