Modified: incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Avro-Files.html
URL: http://svn.apache.org/viewvc/incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Avro-Files.html?rev=1737551&r1=1737550&r2=1737551&view=diff
==============================================================================
--- incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Avro-Files.html (original)
+++ incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Avro-Files.html Sun Apr 3 08:17:59 2016
@@ -76,15 +76,15 @@
 <p>In this tutorial page we describe how to execute SAMOA with data files in the Apache Avro file format. Here is an outline of this tutorial:</p>
 <ol>
-<li>Overview of Apache Avro</li>
-<li>Avro Input Format for SAMOA</li>
-<li>SAMOA task execution with Avro</li>
-<li>Sample Avro Data for SAMOA</li>
+  <li>Overview of Apache Avro</li>
+  <li>Avro Input Format for SAMOA</li>
+  <li>SAMOA task execution with Avro</li>
+  <li>Sample Avro Data for SAMOA</li>
 </ol>
 <h3 id="overview-of-apache-avro">Overview of Apache Avro</h3>
-<p>Users of Apache SAMOA can now use Binary/JSON encoded Avro data as an alternate to the default ARFF file format as the data source. Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Avro specifies two serialization encodings for the data: Binary and JSON, default being Binary. However the meta-data is always in JSON. Avro data is always serialized with its schema. Files that store Avro data should also include the schema for that data in the same file. </p>
+<p>Users of Apache SAMOA can now use Binary/JSON encoded Avro data as an alternative to the default ARFF file format as the data source. Avro is a remote procedure call and data serialization framework developed within Apache's Hadoop project. 
It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Avro specifies two serialization encodings for the data: Binary and JSON, the default being Binary. However, the metadata is always in JSON. Avro data is always serialized with its schema. Files that store Avro data should also include the schema for that data in the same file.</p>

<p>You can find the latest Apache Avro documentation <a href="https://avro.apache.org/docs/current/">here</a> for more details.</p>

@@ -93,42 +93,53 @@
 <p>The input Avro files to the SAMOA framework must follow certain input format rules to work seamlessly with SAMOA instances. The first line of an Avro source file for SAMOA (irrespective of whether the data is encoded in binary or JSON) is the metadata (schema). By default, the data is one record per line following the schema, and each record is mapped to one SAMOA instance.</p>
 <ol>
-<li>Avro Primitive Types & Enums are allowed for the data as is. </li>
-<li>Avro Complex-types (e.g maps/arrays) may not be used with the exception of enum & union. I.e. no sub-structure will be allowed.</li>
-<li>Label (if any) would be the last attribute.</li>
-<li>Timestamps are not supported as of now within SAMOA.</li>
-<li>Avro Enums may be used to represent nominal attributes.</li>
-<li>Avro unions may be used to represent nullability of value. However unions may not be used for different data types.<br></li>
+  <li>Avro primitive types & enums are allowed for the data as is.</li>
+  <li>Avro complex types (e.g. maps/arrays) may not be used, with the exception of enum & union; i.e., no sub-structure is allowed.</li>
+  <li>The label (if any) must be the last attribute.</li>
+  <li>Timestamps are currently not supported within SAMOA.</li>
+  <li>Avro enums may be used to represent nominal attributes.</li>
+  <li>Avro unions may be used to represent nullability of a value. 
However, unions may not be used to mix different data types.</li>
 </ol>
-<div class="highlight"><pre><code class="language-" data-lang="">E.g Enums
+
+<p><code class="highlighter-rouge">
+E.g Enums
{"name":"species","type":{"type":"enum","name":"Labels","symbols":["setosa","versicolor","virginica"]}}
E.g Unions
{"name":"attribute1","type":["null","int"]} -Allowed to denote that value for attribute1 is optional
{"name":" attribute2","type":["string","int"]} -Not allowed
-</code></pre></div>
+</code></p>
+
 <h3 id="samoa-task-execution-with-avro">SAMOA task execution with Avro</h3>
-<p>You may execute a SAMOA task using the aforementioned <code>bin/samoa</code> script with the following format: <code>bin/samoa <platform> <jar> "<task>"</code>.
-Follow this <a href="Executing-SAMOA-with-Apache-S4">link</a> and this <a href="Executing-SAMOA-with-Apache-Storm">link</a> to learn more about deploying SAMOA on Apache S4 and Apache Storm respectively. The Avro files can be used as data sources for any of the aforementioned platforms. The only addition that needs to be made in the commands is as follows: <code>AvroFileStream <file_name> -e <file_format></code> . Examples are given below for different modes. Though the examples below use <a href="Prequential-Evaluation-Task">Prequential Evaluation task</a> the commands are applicable to all other tasks as well.</p>
+<p>You may execute a SAMOA task using the aforementioned <code class="highlighter-rouge">bin/samoa</code> script with the following format: <code class="highlighter-rouge">bin/samoa <platform> <jar> "<task>"</code>.
+Follow this <a href="Executing-SAMOA-with-Apache-S4">link</a> and this <a href="Executing-SAMOA-with-Apache-Storm">link</a> to learn more about deploying SAMOA on Apache S4 and Apache Storm respectively. The Avro files can be used as data sources for any of the aforementioned platforms. 
The only addition that needs to be made to the commands is as follows: <code class="highlighter-rouge">AvroFileStream <file_name> -e <file_format></code>. Examples are given below for different modes. Though the examples below use the <a href="Prequential-Evaluation-Task">Prequential Evaluation task</a>, the commands are applicable to all other tasks as well.</p>
+
+<h4 id="local---avro-json">Local - Avro JSON</h4>
+<p><code class="highlighter-rouge">
+bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_json.avro -e json) -f 100000"
+</code></p>
+
+<h4 id="local---avro-binary">Local - Avro Binary</h4>
+<p><code class="highlighter-rouge">
+bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_binary.avro -e binary) -f 100000"
+</code></p>
+
+<h4 id="storm---avro-json">Storm - Avro JSON</h4>
+<p><code class="highlighter-rouge">
+bin/samoa storm target/SAMOA-Storm-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_json.avro -e json) -f 100000"
+</code></p>
+
+<h4 id="storm---avro-binary">Storm - Avro Binary</h4>
+<p><code class="highlighter-rouge">
+bin/samoa storm target/SAMOA-Storm-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_binary.avro -e binary) -f 100000"
+</code></p>

-<h4 id="local-avro-json">Local - Avro JSON</h4>
-<div class="highlight"><pre><code class="language-" data-lang="">bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_json.avro -e json) -f 100000"
-</code></pre></div>
-<h4 id="local-avro-binary">Local - Avro Binary</h4>
-<div class="highlight"><pre><code class="language-" data-lang="">bin/samoa local target/SAMOA-Local-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l 
classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_binary.avro -e binary) -f 100000" -</code></pre></div> -<h4 id="storm-avro-json">Storm - Avro JSON</h4> -<div class="highlight"><pre><code class="language-" data-lang="">bin/samoa storm target/SAMOA-Storm-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_json.avro -e json) -f 100000" -</code></pre></div> -<h4 id="storm-avro-binary">Storm - Avro Binary</h4> -<div class="highlight"><pre><code class="language-" data-lang="">bin/samoa storm target/SAMOA-Storm-0.4.0-incubating-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging -s (AvroFileStream -f covtypeNorm_binary.avro -e binary) -f 100000" -</code></pre></div> <h3 id="sample-avro-data-for-samoa">Sample Avro Data for SAMOA</h3> <p>The samples below describe how the default ARFF file formats may be converted to JSON/Binary encoded Avro formats.</p> -<h4 id="iris-dataset-default-arff-format">Iris Dataset - Default ARFF Format</h4> -<div class="highlight"><pre><code class="language-" data-lang="">@RELATION iris +<h4 id="iris-dataset---default-arff-format">Iris Dataset - Default ARFF Format</h4> + +<p><code class="highlighter-rouge"> +@RELATION iris @ATTRIBUTE sepallength NUMERIC @ATTRIBUTE sepalwidth NUMERIC @ATTRIBUTE petallength NUMERIC @@ -139,20 +150,27 @@ Follow this <a href="Executing-SAMOA-wit 4.9,3.0,1.4,0.2,virginica 4.7,3.2,1.3,0.2,virginica 4.6,3.1,1.5,0.2,setosa -</code></pre></div> -<h4 id="iris-dataset-json-encoded-avro-format">Iris Dataset - JSON Encoded AVRO Format</h4> -<div class="highlight"><pre><code class="language-" data-lang=""><span class="p">{</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"record"</span><span class="p">,</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"Iris"</span><span class="p">,</span><span class="nt">"namespace"</span><span class="p">:</span><span 
class="s2">"org.apache.samoa.avro.iris"</span><span class="p">,</span><span class="nt">"fields"</span><span class="p">:[{</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"sepallength"</span><span class="p">,</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"double"</span><span class="p">},{</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"sepalwidth"</span><span class="p">,</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"double"</span><span class="p">},{</span><span class="nt">"name"</span><span class="p ">:</span><span class="s2">"petallength"</span><span class="p">,</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"double"</span><span class="p">},{</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"petalwidth"</span><span class="p">,</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"double"</span><span class="p">},{</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"class"</span><span class="p">,</span><span class="nt">"type"</span><span class="p">:{</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"enum"</span><span class="p">,</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"Labels"</span><span class="p">,</span><span class="nt">"symbols"</span><span class="p">:[</span><span class="s2">"setosa"</span><span class="p">,</span><span class="s2">"versicolor"</span><span class="p">,</span><span class="s2">"virginica"</sp an><span class="p">]}}]}</span><span class="w"> +</code></p> + +<h4 id="iris-dataset---json-encoded-avro-format">Iris Dataset - JSON Encoded AVRO Format</h4> + +<p><code class="highlighter-rouge"><span class="w"> +</span><span class="p">{</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"record"</span><span class="p">,</span><span 
class="nt">"name"</span><span class="p">:</span><span class="s2">"Iris"</span><span class="p">,</span><span class="nt">"namespace"</span><span class="p">:</span><span class="s2">"org.apache.samoa.avro.iris"</span><span class="p">,</span><span class="nt">"fields"</span><span class="p">:[{</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"sepallength"</span><span class="p">,</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"double"</span><span class="p">},{</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"sepalwidth"</span><span class="p">,</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"double"</span><span class="p">},{</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"petallength"</span><span class ="p">,</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"double"</span><span class="p">},{</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"petalwidth"</span><span class="p">,</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"double"</span><span class="p">},{</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"class"</span><span class="p">,</span><span class="nt">"type"</span><span class="p">:{</span><span class="nt">"type"</span><span class="p">:</span><span class="s2">"enum"</span><span class="p">,</span><span class="nt">"name"</span><span class="p">:</span><span class="s2">"Labels"</span><span class="p">,</span><span class="nt">"symbols"</span><span class="p">:[</span><span class="s2">"setosa"</span><span class="p">,</span><span class="s2">"versicolor"</span><span class="p">,</span><span class="s2">"virginica"</span><span class="p">]}}]}</span><span class="w"> </span><span class="p">{</span><span class="nt">"sepallength"</span><span class="p">:</span><span class="mf">5.1</span><span class="p">,</span><span 
class="nt">"sepalwidth"</span><span class="p">:</span><span class="mf">3.5</span><span class="p">,</span><span class="nt">"petallength"</span><span class="p">:</span><span class="mf">1.4</span><span class="p">,</span><span class="nt">"petalwidth"</span><span class="p">:</span><span class="mf">0.2</span><span class="p">,</span><span class="nt">"class"</span><span class="p">:</span><span class="s2">"setosa"</span><span class="p">}</span><span class="w"> </span><span class="p">{</span><span class="nt">"sepallength"</span><span class="p">:</span><span class="mf">3.0</span><span class="p">,</span><span class="nt">"sepalwidth"</span><span class="p">:</span><span class="mf">1.4</span><span class="p">,</span><span class="nt">"petallength"</span><span class="p">:</span><span class="mf">4.9</span><span class="p">,</span><span class="nt">"petalwidth"</span><span class="p">:</span><span class="mf">0.2</span><span class="p">,</span><span class="nt">"class"</span><span class="p">:</span><span class="s2">"virginica"</span><span class="p">}</span><span class="w"> </span><span class="p">{</span><span class="nt">"sepallength"</span><span class="p">:</span><span class="mf">4.7</span><span class="p">,</span><span class="nt">"sepalwidth"</span><span class="p">:</span><span class="mf">3.2</span><span class="p">,</span><span class="nt">"petallength"</span><span class="p">:</span><span class="mf">1.3</span><span class="p">,</span><span class="nt">"petalwidth"</span><span class="p">:</span><span class="mf">0.2</span><span class="p">,</span><span class="nt">"class"</span><span class="p">:</span><span class="s2">"virginica"</span><span class="p">}</span><span class="w"> </span><span class="p">{</span><span class="nt">"sepallength"</span><span class="p">:</span><span class="mf">3.1</span><span class="p">,</span><span class="nt">"sepalwidth"</span><span class="p">:</span><span class="mf">1.5</span><span class="p">,</span><span class="nt">"petallength"</span><span class="p">:</span><span 
class="mf">4.6</span><span class="p">,</span><span class="nt">"petalwidth"</span><span class="p">:</span><span class="mf">0.2</span><span class="p">,</span><span class="nt">"class"</span><span class="p">:</span><span class="s2">"setosa"</span><span class="p">}</span><span class="w"> -</span></code></pre></div> -<h4 id="iris-dataset-binary-encoded-avro-format">Iris Dataset - Binary Encoded AVRO Format</h4> -<div class="highlight"><pre><code class="language-" data-lang="">Objavro.schemaÎ {"type":"record","name":"Iris","namespace":"org.apache.samoa.avro.iris","fields":[{"name":"sepallength","type":"double"},{"name":"sepalwidth","type":"double"},{"name":"petallength","type":"double"},{"name":"petalwidth","type":"double"},{"name":"class","type":{"type":"enum","name":"Labels","symbols":["setosa","versicolor","virginica"]}}]} !<khCrÖ±Së¹§Þ©Èffffff@ @ffffffÙÙÉ¿ @ffffffÙÙ@ÚÙÙÉ¿ÎÍÍ@ÚÙÙ @ÎÍÍÙÙÉ¿ÎÍÍ@ 𿦦ffff@ÚÙÙÉ¿ !<khCrÖ±Së¹§Þ© -</code></pre></div> +</span></code></p> + +<h4 id="iris-dataset---binary-encoded-avro-format">Iris Dataset - Binary Encoded AVRO Format</h4> + +<p><code class="highlighter-rouge"> +Objavro.schemaÎ {"type":"record","name":"Iris","namespace":"org.apache.samoa.avro.iris","fields":[{"name":"sepallength","type":"double"},{"name":"sepalwidth","type":"double"},{"name":"petallength","type":"double"},{"name":"petalwidth","type":"double"},{"name":"class","type":{"type":"enum","name":"Labels","symbols":["setosa","versicolor","virginica"]}}]} !<khCrÖ±Së¹§Þ©Èffffff@ @ffffffÙÙÉ¿ @ffffffÙÙ@ÚÙÙÉ¿ÎÍÍ@ÚÙÙ @ÎÍÍÙÙÉ¿ÎÍÍ@ 𿦦ffff@ÚÙÙÉ¿ !<khCrÖ±Së¹§Þ© +</code></p> + <h4 id="forest-covertype-dataset">Forest CoverType Dataset</h4> +<p>The JSON & Binary encoded AVRO Files covtypeNorm_json.avro & covtypeNorm_binary.avro for the Forest CoverType dataset can be found at <a href="https://cwiki.apache.org/confluence/display/SAMOA/SAMOA+Home">Wiki</a></p> -<p>The JSON & Binary encoded AVRO Files covtypeNorm_json.avro & covtypeNorm_binary.avro for the Forest CoverType dataset can be 
found at <a href="https://cwiki.apache.org/confluence/display/SAMOA/SAMOA+Home">Wiki</a> </p> </article>
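The ARFF-to-Avro conversion illustrated by the iris samples above can also be sketched programmatically. The following Python snippet is a minimal, hypothetical sketch (it is not part of SAMOA, and the helper names are invented for illustration): it builds the schema-first, one-JSON-record-per-line layout that the JSON-encoded iris sample follows, using only the standard library.

```python
import json

# Field list taken from the iris ARFF sample above; names and order
# must match the ARFF attributes, with the label last.
FIELDS = [("sepallength", float), ("sepalwidth", float),
          ("petallength", float), ("petalwidth", float), ("class", str)]

# Avro schema: "double" for the numeric attributes, an enum for the
# nominal label, mirroring the schema line of the JSON-encoded sample.
SCHEMA = {
    "type": "record", "name": "Iris", "namespace": "org.apache.samoa.avro.iris",
    "fields": [
        {"name": name, "type": "double"} if conv is float else
        {"name": name, "type": {"type": "enum", "name": "Labels",
                                "symbols": ["setosa", "versicolor", "virginica"]}}
        for name, conv in FIELDS
    ],
}

def arff_row_to_record(row):
    """Map one comma-separated ARFF data row onto the schema's fields."""
    return {name: conv(value)
            for (name, conv), value in zip(FIELDS, row.split(","))}

def to_json_avro(arff_rows):
    """First line is the schema, then one JSON record per line."""
    lines = [json.dumps(SCHEMA, separators=(",", ":"))]
    lines += [json.dumps(arff_row_to_record(r), separators=(",", ":"))
              for r in arff_rows]
    return "\n".join(lines)

print(to_json_avro(["5.1,3.5,1.4,0.2,setosa", "4.9,3.0,1.4,0.2,virginica"]))
```

The binary-encoded variant shown above additionally requires an Avro container-file writer (the `Objavro.schema...` header and sync markers), which this sketch does not cover.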
Modified: incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-S4.html URL: http://svn.apache.org/viewvc/incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-S4.html?rev=1737551&r1=1737550&r2=1737551&view=diff ============================================================================== --- incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-S4.html (original) +++ incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-S4.html Sun Apr 3 08:17:59 2016 @@ -76,101 +76,115 @@ <p>In this tutorial page we describe how to execute SAMOA on top of Apache S4.</p> <h2 id="prerequisites">Prerequisites</h2> - <p>The following dependencies are needed to run SAMOA smoothly on Apache S4</p> <ul> -<li><a href="http://www.gradle.org/">Gradle</a></li> -<li><a href="https://incubator.apache.org/s4/">Apache S4</a></li> + <li><a href="http://www.gradle.org/">Gradle</a></li> + <li><a href="https://incubator.apache.org/s4/">Apache S4</a></li> </ul> <h2 id="gradle">Gradle</h2> - <p>Gradle is a build automation tool and is used to build Apache S4. 
The installation guide can be found <a href="http://www.gradle.org/docs/current/userguide/installation.html">here</a>. The following is a simplified installation guide.</p>
 <ol>
-<li>Download Gradle binaries from <a href="http://services.gradle.org/distributions/gradle-1.6-bin.zip">downloads</a>, or from the console type <code>wget http://services.gradle.org/distributions/gradle-1.6-bin.zip</code></li>
-<li>Unzip the file <code>unzip gradle-1.6-bin.zip</code></li>
-<li>Set the Gradle environment variable: <code>export GRADLE_HOME=/foo/bar/gradle-1.6</code></li>
-<li>Add to the systems path <code>export PATH=$PATH:$GRADLE_HOME/bin</code></li>
-<li>Install Gradle by running <code>gradle</code></li>
+  <li>Download Gradle binaries from <a href="http://services.gradle.org/distributions/gradle-1.6-bin.zip">downloads</a>, or from the console type <code class="highlighter-rouge">wget http://services.gradle.org/distributions/gradle-1.6-bin.zip</code></li>
+  <li>Unzip the file: <code class="highlighter-rouge">unzip gradle-1.6-bin.zip</code></li>
+  <li>Set the Gradle environment variable: <code class="highlighter-rouge">export GRADLE_HOME=/foo/bar/gradle-1.6</code></li>
+  <li>Add it to the system path: <code class="highlighter-rouge">export PATH=$PATH:$GRADLE_HOME/bin</code></li>
+  <li>Install Gradle by running <code class="highlighter-rouge">gradle</code></li>
 </ol>
 <p>Now you are all set to install Apache S4.</p>
 <h2 id="apache-s4">Apache S4</h2>
-
 <p>S4 is a general-purpose, distributed, scalable, fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. 
The installation process is as follows:</p> <ol> -<li>Download the latest Apache S4 release from <a href="http://www.apache.org/dist/incubator/s4/s4-0.6.0-incubating/apache-s4-0.6.0-incubating-src.zip">Apache S4 0.6.0</a> or from command line <code>wget http://www.apache.org/dist/incubator/s4/s4-0.6.0-incubating/apache-s4-0.6.0-incubating-src.zip</code> or clone from git. -<code>git clone https://git-wip-us.apache.org/repos/asf/incubator-s4.git</code>.</li> -<li>Unzip the file <code>unzip apache-s4-0.6.0-incubating-src.zip</code> or go in the cloned directory.</li> -<li>Set the Apache S4 environment variable <code>export S4_HOME=/foo/bar/apache-s4-0.6.0-incubating-src</code>.</li> -<li>Add the S4_HOME to the system PATH. <code>export PATH=$PATH:$S4_HOME</code>.</li> -<li>Once the previous steps are done we can proceed to build and install Apache S4.</li> -<li>You can have a look at the available build tasks by typing <code>gradle tasks</code>.</li> -<li>There are some dependencies issues, therefore you should run the wrapper task first by typing <code>gradle wrapper</code>.</li> -<li>Install the artifacts for Apache S4 by running <code>gradle install</code> in the S4_HOME directory.</li> -<li>Install the S4-TOOLS, <code>gradle s4-tools::installApp</code>.</li> + <li>Download the latest Apache S4 release from <a href="http://www.apache.org/dist/incubator/s4/s4-0.6.0-incubating/apache-s4-0.6.0-incubating-src.zip">Apache S4 0.6.0</a> or from command line <code class="highlighter-rouge">wget http://www.apache.org/dist/incubator/s4/s4-0.6.0-incubating/apache-s4-0.6.0-incubating-src.zip</code> or clone from git. 
+<code class="highlighter-rouge">git clone https://git-wip-us.apache.org/repos/asf/incubator-s4.git</code>.</li>
+  <li>Unzip the file <code class="highlighter-rouge">unzip apache-s4-0.6.0-incubating-src.zip</code> or go into the cloned directory.</li>
+  <li>Set the Apache S4 environment variable <code class="highlighter-rouge">export S4_HOME=/foo/bar/apache-s4-0.6.0-incubating-src</code>.</li>
+  <li>Add the S4_HOME to the system PATH: <code class="highlighter-rouge">export PATH=$PATH:$S4_HOME</code>.</li>
+  <li>Once the previous steps are done, we can proceed to build and install Apache S4.</li>
+  <li>You can have a look at the available build tasks by typing <code class="highlighter-rouge">gradle tasks</code>.</li>
+  <li>There are some dependency issues, so you should run the wrapper task first by typing <code class="highlighter-rouge">gradle wrapper</code>.</li>
+  <li>Install the artifacts for Apache S4 by running <code class="highlighter-rouge">gradle install</code> in the S4_HOME directory.</li>
+  <li>Install the S4-TOOLS: <code class="highlighter-rouge">gradle s4-tools::installApp</code>.</li>
 </ol>
 <p>Done. Now you can configure and run your Apache S4 cluster.</p>
-<hr>
+<hr />
 <h2 id="building-samoa">Building SAMOA</h2>
-
 <p>Once the S4 dependencies are installed, you can simply clone the repository and install SAMOA.</p>
-<div class="highlight"><pre><code class="language-bash" data-lang="bash">git clone http://git.apache.org/incubator-samoa.git
-<span class="nb">cd </span>incubator-samoa
+
+<p><code class="highlighter-rouge">
+git clone http://git.apache.org/incubator-samoa.git
+cd incubator-samoa
mvn -Ps4 package
-</code></pre></div>
-<p>The deployable jars for SAMOA will be in <code>target/SAMOA-<variant>-<version>-SNAPSHOT.jar</code>. 
For example, in our case for S4 <code>target/SAMOA-S4-0.3.0-SNAPSHOT.jar</code>.</p> +</code></p> -<hr> +<p>The deployable jars for SAMOA will be in <code class="highlighter-rouge">target/SAMOA-<variant>-<version>-SNAPSHOT.jar</code>. For example, in our case for S4 <code class="highlighter-rouge">target/SAMOA-S4-0.3.0-SNAPSHOT.jar</code>.</p> -<h2 id="samoa-s4-configuration">SAMOA-S4 Configuration</h2> +<hr /> -<p>This section will go through the <code>bin/samoa-s4.properties</code> file and how to configure it. +<h2 id="samoa-s4-configuration">SAMOA-S4 Configuration</h2> +<p>This section will go through the <code class="highlighter-rouge">bin/samoa-s4.properties</code> file and how to configure it. In order for SAMOA to run correctly in a distributed environment there are some variables that need to be defined. Since Apache S4 uses <a href="https://zookeeper.apache.org/">ZooKeeper</a> for cluster management we need to define where it is running.</p> -<div class="highlight"><pre><code class="language-" data-lang=""># Zookeeper Server + +<div class="highlighter-rouge"><pre class="highlight"><code># Zookeeper Server zookeeper.server=localhost zookeeper.port=2181 -</code></pre></div> +</code></pre> +</div> + <p>Apache S4 also distributes the application via HTTP, therefore the server and port which contains the S4 application must be provided.</p> -<div class="highlight"><pre><code class="language-" data-lang=""># Simple HTTP Server providing the packaged S4 jar + +<div class="highlighter-rouge"><pre class="highlight"><code># Simple HTTP Server providing the packaged S4 jar http.server.ip=localhost http.server.port=8000 -</code></pre></div> +</code></pre> +</div> + <p>Apache S4 uses the concept of logical clusters to define a group of machines, which are identified by an ID and start serving on a specific port.</p> -<div class="highlight"><pre><code class="language-" data-lang=""># Name of the S4 cluster + +<div class="highlighter-rouge"><pre 
class="highlight"><code># Name of the S4 cluster
cluster.name=cluster
cluster.port=12000
-</code></pre></div>
-<p>SAMOA can be deployed on a single machine using only one resource or in a cluster environments. The following property can be defined to deploy as a <code>local</code> application or on a <code>cluster</code>.</p>
-<div class="highlight"><pre><code class="language-" data-lang=""># Deployment strategy
+</code></pre>
+</div>
+
+<p>SAMOA can be deployed on a single machine using only one resource or in a cluster environment. The following property can be defined to deploy as a <code class="highlighter-rouge">local</code> application or on a <code class="highlighter-rouge">cluster</code>.</p>
+
+<div class="highlighter-rouge"><pre class="highlight"><code># Deployment strategy
samoa.deploy.mode=local
-</code></pre></div>
-<hr>
+</code></pre>
+</div>
+
+<hr />
 <h2 id="samoa-s4-deployment">SAMOA S4 Deployment</h2>
-<p>In order to deploy SAMOA in a distributed environment you <strong>MUST</strong> configure the <code>bin/samoa-s4.properties</code> file correctly. If you are running locally it is optional to modify the properties file.</p>
+<p>In order to deploy SAMOA in a distributed environment, you <strong>MUST</strong> configure the <code class="highlighter-rouge">bin/samoa-s4.properties</code> file correctly. If you are running locally, modifying the properties file is optional.</p>
-<p>The deployment is done by running the SAMOA execution script <code>bin/samoa</code> with some additional parameters.
+<p>The deployment is done by running the SAMOA execution script <code class="highlighter-rouge">bin/samoa</code> with some additional parameters. 
The execution syntax is as follows: -<code>bin/samoa <platform> <jar-location> <task & options></code></p> +<code class="highlighter-rouge">bin/samoa <platform> <jar-location> <task & options></code></p> <p>Example:</p> -<div class="highlight"><pre><code class="language-" data-lang="">bin/samoa S4 target/SAMOA-S4-0.0.1-SNAPSHOT.jar "ClusteringEvaluation" -</code></pre></div> + +<div class="highlighter-rouge"><pre class="highlight"><code>bin/samoa S4 target/SAMOA-S4-0.0.1-SNAPSHOT.jar "ClusteringEvaluation" +</code></pre> +</div> + <p>The <platform> can be s4 or storm.</p> <p>The <jar-location> must be the absolute path to the platform specific jar file.</p> <p>The <task & options> should be the name of a known task and the options belonging to that task.</p> + </article> <!-- </div> --> Modified: incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Samza.html URL: http://svn.apache.org/viewvc/incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Samza.html?rev=1737551&r1=1737550&r2=1737551&view=diff ============================================================================== --- incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Samza.html (original) +++ incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Samza.html Sun Apr 3 08:17:59 2016 @@ -77,221 +77,313 @@ The steps included in this tutorial are:</p> <ol> -<li><p>Setup and configure a cluster with the required dependencies. This applies for single-node (local) execution as well.</p></li> -<li><p>Build SAMOA deployables</p></li> -<li><p>Configure SAMOA-Samza</p></li> -<li><p>Deploy SAMOA-Samza and execute a task</p></li> -<li><p>Observe the execution and the result</p></li> + <li> + <p>Setup and configure a cluster with the required dependencies. 
This applies for single-node (local) execution as well.</p>
+  </li>
+  <li>
+    <p>Build SAMOA deployables</p>
+  </li>
+  <li>
+    <p>Configure SAMOA-Samza</p>
+  </li>
+  <li>
+    <p>Deploy SAMOA-Samza and execute a task</p>
+  </li>
+  <li>
+    <p>Observe the execution and the result</p>
+  </li>
 </ol>
 <h2 id="setup-cluster">Setup cluster</h2>
-
 <p>The following are needed to run SAMOA on top of Samza:</p>
 <ul>
-<li><a href="http://zookeeper.apache.org/">Apache Zookeeper</a></li>
-<li><a href="http://kafka.apache.org/">Apache Kafka</a></li>
-<li><a href="http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html">Apache Hadoop YARN and HDFS</a></li>
+  <li><a href="http://zookeeper.apache.org/">Apache Zookeeper</a></li>
+  <li><a href="http://kafka.apache.org/">Apache Kafka</a></li>
+  <li><a href="http://hadoop.apache.org/docs/r2.2.0/hadoop-yarn/hadoop-yarn-site/YARN.html">Apache Hadoop YARN and HDFS</a></li>
 </ul>
 <h3 id="zookeeper">Zookeeper</h3>
-
-<p>Zookeeper is used by Kafka to coordinate its brokers. The detail instructions to setup a Zookeeper cluster can be found <a href="http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html">here</a>. </p>
+<p>Zookeeper is used by Kafka to coordinate its brokers. 
The detailed instructions to set up a Zookeeper cluster can be found <a href="http://zookeeper.apache.org/doc/trunk/zookeeperStarted.html">here</a>.</p>

<p>To quickly set up a single-node Zookeeper cluster:</p>

<ol>
-<li><p>Download the binary release from the <a href="http://zookeeper.apache.org/releases.html">release page</a>.</p></li>
-<li><p>Untar the archive</p></li>
+  <li>
+    <p>Download the binary release from the <a href="http://zookeeper.apache.org/releases.html">release page</a>.</p>
+  </li>
+  <li>
+    <p>Untar the archive</p>
+  </li>
 </ol>
-<div class="highlight"><pre><code class="language-" data-lang="">tar -xf $DOWNLOAD_DIR/zookeeper-3.4.6.tar.gz -C ~/
-</code></pre></div>
+
+<p><code class="highlighter-rouge">
+tar -xf $DOWNLOAD_DIR/zookeeper-3.4.6.tar.gz -C ~/
+</code></p>
+
 <ol start="3">
-<li>Copy the default configuration file</li>
+  <li>Copy the default configuration file</li>
 </ol>
-<div class="highlight"><pre><code class="language-" data-lang="">cp zookeeper-3.4.6/conf/zoo_sample.cfg zookeeper-3.4.6/conf/zoo.cfg
-</code></pre></div>
+
+<p><code class="highlighter-rouge">
+cp zookeeper-3.4.6/conf/zoo_sample.cfg zookeeper-3.4.6/conf/zoo.cfg
+</code></p>
+
 <ol start="4">
-<li>Start the single-node cluster</li>
+  <li>Start the single-node cluster</li>
 </ol>
-<div class="highlight"><pre><code class="language-" data-lang="">~/zookeeper-3.4.6/bin/zkServer.sh start
-</code></pre></div>
-<h3 id="kafka">Kafka</h3>
-<p>Kafka is a distributed, partitioned, replicated commit log service which Samza uses as its default messaging system. </p>
+<p><code class="highlighter-rouge">
+~/zookeeper-3.4.6/bin/zkServer.sh start
+</code></p>
+
+<h3 id="kafka">Kafka</h3>
+<p>Kafka is a distributed, partitioned, replicated commit log service which Samza uses as its default messaging system.</p>
 <ol>
-<li><p>Download a binary release of Kafka <a href="http://kafka.apache.org/downloads.html">here</a>. As mentioned in the page, the Scala version does not matter. 
However, 2.10 is recommended as Samza has recently been moved to Scala 2.10.</p></li>
-<li><p>Untar the archive </p></li>
+  <li>
+    <p>Download a binary release of Kafka <a href="http://kafka.apache.org/downloads.html">here</a>. As mentioned on that page, the Scala version does not matter. However, 2.10 is recommended as Samza has recently been moved to Scala 2.10.</p>
+  </li>
+  <li>
+    <p>Untar the archive</p>
+  </li>
</ol>
-<div class="highlight"><pre><code class="language-" data-lang="">tar -xzf $DOWNLOAD_DIR/kafka_2.10-0.8.1.tgz -C ~/
-</code></pre></div>
+
+<p><code class="highlighter-rouge">
+tar -xzf $DOWNLOAD_DIR/kafka_2.10-0.8.1.tgz -C ~/
+</code></p>
+
<p>If you are running in local mode or a single-node cluster, you can now start Kafka with the command:</p>
-<div class="highlight"><pre><code class="language-" data-lang="">~/kafka_2.10-0.8.1/bin/kafka-server-start.sh kafka_2.10-0.8.1/config/server.properties
-</code></pre></div>
-<p>In multi-node cluster, it is typical and convenient to have a Kafka broker on each node (although you can totally have a smaller Kafka cluster, or even a single-node Kafka cluster). The number of brokers in Kafka cluster will affect disk bandwidth and space (the more brokers we have, the higher value we will get for the two). In each node, you need to set the following properties in <code>~/kafka_2.10-0.8.1/config/server.properties</code> before starting Kafka service.</p>
-<div class="highlight"><pre><code class="language-" data-lang="">broker.id=a-unique-number-for-each-node
+
+<p><code class="highlighter-rouge">
+~/kafka_2.10-0.8.1/bin/kafka-server-start.sh kafka_2.10-0.8.1/config/server.properties
+</code></p>
+
+<p>In a multi-node cluster, it is typical and convenient to have a Kafka broker on each node (although you can certainly have a smaller Kafka cluster, or even a single-node Kafka cluster). 
The number of brokers in the Kafka cluster will affect disk bandwidth and space (the more brokers we have, the more of both we get). On each node, you need to set the following properties in <code class="highlighter-rouge">~/kafka_2.10-0.8.1/config/server.properties</code> before starting the Kafka service.</p>
+
+<p><code class="highlighter-rouge">
+broker.id=a-unique-number-for-each-node
zookeeper.connect=zookeeper-host0-url:2181[,zookeeper-host1-url:2181,...]
-</code></pre></div>
+</code></p>
+
<p>You might want to change the retention hours or retention bytes of the logs to keep the log size from growing too big.</p>
+
+<p><code class="highlighter-rouge">
+log.retention.hours=number-of-hours-to-keep-the-logs
log.retention.bytes=number-of-bytes-to-keep-in-the-logs
-</code></pre></div>
+</code></p>
+<h3 id="hadoop-yarn-and-hdfs">Hadoop YARN and HDFS</h3>
<blockquote>
-<p>Hadoop YARN and HDFS are <strong>not</strong> required to run SAMOA in Samza local mode. </p>
+  <p>Hadoop YARN and HDFS are <strong>not</strong> required to run SAMOA in Samza local mode.</p>
</blockquote>

<p>To set up a YARN cluster, first download a binary release of Hadoop <a href="http://www.apache.org/dyn/closer.cgi/hadoop/common/">here</a> on each node in the cluster and untar the archive
-<code>tar -xf $DOWNLOAD_DIR/hadoop-2.2.0.tar.gz -C ~/</code>. We have tested SAMOA with Hadoop 2.2.0 but Hadoop 2.3.0 should work too.</p>
+<code class="highlighter-rouge">tar -xf $DOWNLOAD_DIR/hadoop-2.2.0.tar.gz -C ~/</code>. 
We have tested SAMOA with Hadoop 2.2.0 but Hadoop 2.3.0 should work too.</p> <p><strong>HDFS</strong></p> -<p>Set the following properties in <code>~/hadoop-2.2.0/etc/hadoop/hdfs-site.xml</code> in all nodes.</p> -<div class="highlight"><pre><code class="language-" data-lang=""><configuration> - <property> - <name>dfs.datanode.data.dir</name> - <value>file:///home/username/hadoop-2.2.0/hdfs/datanode</value> - <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description> - </property> - - <property> - <name>dfs.namenode.name.dir</name> - <value>file:///home/username/hadoop-2.2.0/hdfs/namenode</value> - <description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description> - </property> -</configuration> -</code></pre></div> -<p>Add this property in <code>~/hadoop-2.2.0/etc/hadoop/core-site.xml</code> in all nodes.</p> -<div class="highlight"><pre><code class="language-" data-lang=""><configuration> - <property> - <name>fs.defaultFS</name> - <value>hdfs://localhost:9000/</value> - <description>NameNode URI</description> - </property> - - <property> - <name>fs.hdfs.impl</name> - <value>org.apache.hadoop.hdfs.DistributedFileSystem</value> - </property> -</configuration> -</code></pre></div> -<p>For a multi-node cluster, change the hostname ("localhost") to the correct host name of your namenode server.</p> +<p>Set the following properties in <code class="highlighter-rouge">~/hadoop-2.2.0/etc/hadoop/hdfs-site.xml</code> in all nodes.</p> + +<p>```</p> +<configuration> + <property> + <name>dfs.datanode.data.dir</name> + <value>file:///home/username/hadoop-2.2.0/hdfs/datanode</value> + <description>Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks.</description> + </property> + + <property> + <name>dfs.namenode.name.dir</name> + <value>file:///home/username/hadoop-2.2.0/hdfs/namenode</value> + 
<description>Path on the local filesystem where the NameNode stores the namespace and transaction logs persistently.</description>
+  </property>
+</configuration>
+<p>```</p>

+<p>Add this property in <code class="highlighter-rouge">~/hadoop-2.2.0/etc/hadoop/core-site.xml</code> in all nodes.</p>
+
+<p>```</p>
+<configuration>
+  <property>
+    <name>fs.defaultFS</name>
+    <value>hdfs://localhost:9000/</value>
+    <description>NameNode URI</description>
+  </property>
+
+  <property>
+    <name>fs.hdfs.impl</name>
+    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
+  </property>
+</configuration>
+<p>```
+For a multi-node cluster, change the hostname ("localhost") to the correct host name of your namenode server.</p>

<p>Format the HDFS directory (only perform this if you are running it for the very first time)</p>
+
+<p><code class="highlighter-rouge">
+~/hadoop-2.2.0/bin/hdfs namenode -format
+</code></p>
+
<p>Start the namenode daemon on one of the nodes</p>
+
+<p><code class="highlighter-rouge">
+~/hadoop-2.2.0/sbin/hadoop-daemon.sh start namenode
+</code></p>
+
<p>Start the datanode daemon on all nodes</p>
+
+<p><code class="highlighter-rouge">
+~/hadoop-2.2.0/sbin/hadoop-daemon.sh start datanode
+</code></p>
+
<p><strong>YARN</strong></p>
-<p>If you are running in multi-node cluster, set the resource manager hostname in <code>~/hadoop-2.2.0/etc/hadoop/yarn-site.xml</code> in all nodes as follow:</p>
-<div class="highlight"><pre><code class="language-" data-lang=""><configuration>
- <property>
- <name>yarn.resourcemanager.hostname</name>
- <value>resourcemanager-url</value>
- 
<description>The hostname of the RM.</description>
- </property>
-</configuration>
-</code></pre></div>
+<p>If you are running in a multi-node cluster, set the resource manager hostname in <code class="highlighter-rouge">~/hadoop-2.2.0/etc/hadoop/yarn-site.xml</code> in all nodes as follows:</p>
+
+<p>```</p>
+<configuration>
+  <property>
+    <name>yarn.resourcemanager.hostname</name>
+    <value>resourcemanager-url</value>
+    <description>The hostname of the RM.</description>
+  </property>
+</configuration>
+<p>```</p>
+
<p><strong>Other configurations</strong>
Now we need to tell Samza where to find the configuration of the YARN cluster. To do this, first create a new directory in all nodes:</p>
+
+<p><code class="highlighter-rouge">
+mkdir ~/.samza
mkdir ~/.samza/conf
-</code></pre></div>
-<p>Copy (or soft link) <code>core-site.xml</code>, <code>hdfs-site.xml</code>, <code>yarn-site.xml</code> in <code>~/hadoop-2.2.0/etc/hadoop</code> to the new directory </p>
-<div class="highlight"><pre><code class="language-" data-lang="">ln -s ~/.samza/conf/core-site.xml ~/hadoop-2.2.0/etc/hadoop/core-site.xml
+</code></p>
+
+<p>Copy (or soft link) <code class="highlighter-rouge">core-site.xml</code>, <code class="highlighter-rouge">hdfs-site.xml</code>, <code class="highlighter-rouge">yarn-site.xml</code> in <code class="highlighter-rouge">~/hadoop-2.2.0/etc/hadoop</code> to the new directory</p>
+
+<p><code class="highlighter-rouge">
+ln -s ~/hadoop-2.2.0/etc/hadoop/core-site.xml ~/.samza/conf/core-site.xml
ln -s ~/hadoop-2.2.0/etc/hadoop/hdfs-site.xml ~/.samza/conf/hdfs-site.xml
ln -s ~/hadoop-2.2.0/etc/hadoop/yarn-site.xml ~/.samza/conf/yarn-site.xml
-</code></pre></div>
+</code></p>
+
<p>Export the environment variable YARN_HOME (in ~/.bashrc) so Samza knows where to find these YARN configuration files.</p>
+
+<p><code class="highlighter-rouge">
+export YARN_HOME=$HOME/.samza 
-</code></pre></div>
+
+<p><code class="highlighter-rouge">
+export YARN_HOME=$HOME/.samza
+</code></p>
+
<p><strong>Start the YARN cluster</strong>
Start the resource manager on the master node</p>
+
+<p><code class="highlighter-rouge">
+~/hadoop-2.2.0/sbin/yarn-daemon.sh start resourcemanager
+</code></p>
+
<p>Start the node manager on all worker nodes</p>
-<div class="highlight"><pre><code class="language-" data-lang="">~/hadoop-2.2.0/sbin/yarn-daemon.sh start nodemanager
-</code></pre></div>
-<h2 id="build-samoa">Build SAMOA</h2>
+<p><code class="highlighter-rouge">
+~/hadoop-2.2.0/sbin/yarn-daemon.sh start nodemanager
+</code></p>
+
+<h2 id="build-samoa">Build SAMOA</h2>
<p>Perform the following steps on one of the nodes in the cluster. Here we assume git and maven are installed on this node.</p>

<p>Since Samza is not yet released on Maven, we will have to clone the Samza project, build it, and publish it to the local Maven repository:</p>
+
+<p><code class="highlighter-rouge">
+git clone -b 0.7.0 https://github.com/apache/incubator-samza.git
cd incubator-samza
./gradlew clean build
./gradlew publishToMavenLocal
-</code></pre></div>
-<p>Here we cloned and installed Samza version 0.7.0, the current released version (July 2014). 
</p>
+</code></p>
+
+<p>Here we cloned and installed Samza version 0.7.0, the current released version (July 2014).</p>

<p>Now we can clone the repository and install SAMOA.</p>
+
+<p><code class="highlighter-rouge">
+git clone http://git.apache.org/incubator-samoa.git
cd incubator-samoa
mvn -Psamza package
-</code></pre></div>
-<p>The deployable jars for SAMOA will be in <code>target/SAMOA-<variant>-<version>-SNAPSHOT.jar</code>. For example, in our case for Samza <code>target/SAMOA-Samza-0.2.0-SNAPSHOT.jar</code>.</p>
+</code></p>

-<h2 id="configure-samoa-samza-execution">Configure SAMOA-Samza execution</h2>
+<p>The deployable jars for SAMOA will be in <code class="highlighter-rouge">target/SAMOA-<variant>-<version>-SNAPSHOT.jar</code>. For example, in our case for Samza <code class="highlighter-rouge">target/SAMOA-Samza-0.2.0-SNAPSHOT.jar</code>.</p>

-<p>This section explains the configuration parameters in <code>bin/samoa-samza.properties</code> that are required to run SAMOA on top of Samza.</p>
+<h2 id="configure-samoa-samza-execution">Configure SAMOA-Samza execution</h2>
+<p>This section explains the configuration parameters in <code class="highlighter-rouge">bin/samoa-samza.properties</code> that are required to run SAMOA on top of Samza.</p>

<p><strong>Samza execution mode</strong></p>
-<div class="highlight"><pre><code class="language-" data-lang="">samoa.samza.mode=[yarn|local]
-</code></pre></div>
-<p>This parameter specify which mode to execute the task: <code>local</code> for local execution and <code>yarn</code> for cluster execution.</p>
+
+<p><code class="highlighter-rouge">
+samoa.samza.mode=[yarn|local]
+</code>
+This parameter specifies the mode in which to execute the task: <code class="highlighter-rouge">local</code> for local execution and <code class="highlighter-rouge">yarn</code> for cluster execution.</p>
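Putting the parameters described in this section together, a minimal sketch of a bin/samoa-samza.properties file for local-mode testing might look like the following. The host names and values here are illustrative assumptions for a single-machine setup, not defaults shipped with SAMOA:

```properties
# Illustrative local-mode configuration (adjust values to your own setup)
# Run the task in-process; use "yarn" for cluster execution
samoa.samza.mode=local

# Zookeeper endpoint used for coordination
zookeeper.connect=localhost
zookeeper.port=2181

# Comma-separated host:port list of the Kafka brokers
kafka.broker.list=localhost:9092
# Must not exceed the number of brokers in the Kafka cluster
kafka.replication.factor=1
```

With a file like this in place, switching to cluster execution is mostly a matter of setting samoa.samza.mode=yarn and filling in the YARN-related properties covered later in this section.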
<p><strong>Zookeeper</strong></p>
-<div class="highlight"><pre><code class="language-" data-lang="">zookeeper.connect=localhost
+
+<p><code class="highlighter-rouge">
+zookeeper.connect=localhost
zookeeper.port=2181
-</code></pre></div>
-<p>The default setting above applies for local mode execution. For cluster mode, change <code>zookeeper.host</code> to the correct URL of your zookeeper host.</p>
+</code>
+The default setting above applies for local mode execution. For cluster mode, change <code class="highlighter-rouge">zookeeper.connect</code> to the correct URL of your zookeeper host.</p>

<p><strong>Kafka</strong></p>
-<div class="highlight"><pre><code class="language-" data-lang="">kafka.broker.list=localhost:9092
-</code></pre></div>
-<p><code>kafka.broker.list</code> is a comma separated list of host:port of all the brokers in Kafka cluster.</p>
-<div class="highlight"><pre><code class="language-" data-lang="">kafka.replication.factor=1
-</code></pre></div>
-<p><code>kafka.replication.factor</code> specifies the number of replicas for each stream in Kafka. This number must be less than or equal to the number of brokers in Kafka cluster.</p>
-<p><strong>YARN</strong></p>
+<p><code class="highlighter-rouge">
+kafka.broker.list=localhost:9092
+</code>
+<code class="highlighter-rouge">kafka.broker.list</code> is a comma-separated list of the host:port pairs of all the brokers in the Kafka cluster.</p>
+
+<p><code class="highlighter-rouge">
+kafka.replication.factor=1
+</code>
+<code class="highlighter-rouge">kafka.replication.factor</code> specifies the number of replicas for each stream in Kafka. 
This number must be less than or equal to the number of brokers in the Kafka cluster.</p>

-<blockquote>
-<p>The below settings do not apply for local mode execution, you can leave them as they are.</p>
-</blockquote>
+<p><strong>YARN</strong></p>
+<blockquote>
+  <p>The below settings do not apply for local mode execution; you can leave them as they are.</p>
+</blockquote>
+
+<p><code class="highlighter-rouge">yarn.am.memory</code> and <code class="highlighter-rouge">yarn.container.memory</code> specify the memory requirement for the Application Master container and the worker containers, respectively.</p>

-<p><code>yarn.am.memory</code> and <code>yarn.container.memory</code> specify the memory requirement for the Application Master container and the worker containers, respectively. </p>
-<div class="highlight"><pre><code class="language-" data-lang="">yarn.am.memory=1024
+<p><code class="highlighter-rouge">
+yarn.am.memory=1024
yarn.container.memory=1024
-</code></pre></div>
+</code></p>
+
+<p><code class="highlighter-rouge">yarn.package.path</code> specifies the path (typically an HDFS path) of the package to be distributed to all YARN containers to execute the task.</p>
-<div class="highlight"><pre><code class="language-" data-lang="">yarn.package.path=hdfs://samoa/SAMOA-Samza-0.2.0-SNAPSHOT.jar
-</code></pre></div>
+
+<p><code class="highlighter-rouge">
+yarn.package.path=hdfs://samoa/SAMOA-Samza-0.2.0-SNAPSHOT.jar
+</code></p>
+
<p><strong>Samza</strong>
-<code>max.pi.per.container</code> specifies the number of PI instances allowed in one YARN container. 
</p> -<div class="highlight"><pre><code class="language-" data-lang="">max.pi.per.container=1 -</code></pre></div> -<p><code>kryo.register.file</code> specifies the registration file for Kryo serializer.</p> -<div class="highlight"><pre><code class="language-" data-lang="">kryo.register.file=samza-kryo -</code></pre></div> -<p><code>checkpoint.commit.ms</code> specifies the frequency for PIs to commit their checkpoints (in ms). The default value is 1 minute.</p> -<div class="highlight"><pre><code class="language-" data-lang="">checkpoint.commit.ms=60000 -</code></pre></div> -<h2 id="deploy-samoa-samza-task">Deploy SAMOA-Samza task</h2> +<code class="highlighter-rouge">max.pi.per.container</code> specifies the number of PI instances allowed in one YARN container.</p> + +<p><code class="highlighter-rouge"> +max.pi.per.container=1 +</code></p> + +<p><code class="highlighter-rouge">kryo.register.file</code> specifies the registration file for Kryo serializer.</p> +<p><code class="highlighter-rouge"> +kryo.register.file=samza-kryo +</code></p> + +<p><code class="highlighter-rouge">checkpoint.commit.ms</code> specifies the frequency for PIs to commit their checkpoints (in ms). The default value is 1 minute.</p> + +<p><code class="highlighter-rouge"> +checkpoint.commit.ms=60000 +</code></p> + +<h2 id="deploy-samoa-samza-task">Deploy SAMOA-Samza task</h2> <p>Execute SAMOA task with the following command:</p> -<div class="highlight"><pre><code class="language-" data-lang="">bin/samoa samza target/SAMOA-Samza-0.2.0-SNAPSHOT.jar "<task> & <options>" -</code></pre></div> -<h2 id="observe-execution-and-result">Observe execution and result</h2> -<p>In local mode, all the log will be printed out to stdout. 
If you execute the task on YARN cluster, the output is written to stdout files in YARN's containers' log folder ($HADOOP_HOME/logs/userlogs/application_<application-id>/container_<container-id>).</p>
+<p><code class="highlighter-rouge">
+bin/samoa samza target/SAMOA-Samza-0.2.0-SNAPSHOT.jar "<task> & <options>"
+</code></p>
+
+<h2 id="observe-execution-and-result">Observe execution and result</h2>
+<p>In local mode, all the logs will be printed out to stdout. If you execute the task on a YARN cluster, the output is written to stdout files in YARN's containers' log folder ($HADOOP_HOME/logs/userlogs/application_<application-id>/container_<container-id>).</p>
</article>

Modified: incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Storm.html
URL: http://svn.apache.org/viewvc/incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Storm.html?rev=1737551&r1=1737550&r2=1737551&view=diff
==============================================================================
--- incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Storm.html (original)
+++ incubator/samoa/site/documentation/Executing-SAMOA-with-Apache-Storm.html Sun Apr 3 08:17:59 2016
@@ -76,103 +76,104 @@
 <p>In this tutorial page we describe how to execute SAMOA on top of Apache Storm. 
Here is an outline of what we want to do:</p>

<ol>
-<li>Ensure that you have necessary Storm cluster and configuration to execute SAMOA</li>
-<li>Ensure that you have all the SAMOA deployables for execution in the cluster</li>
-<li>Configure samoa-storm.properties</li>
-<li>Execute SAMOA classification task</li>
-<li>Observe the task execution</li>
+  <li>Ensure that you have the necessary Storm cluster and configuration to execute SAMOA</li>
+  <li>Ensure that you have all the SAMOA deployables for execution in the cluster</li>
+  <li>Configure samoa-storm.properties</li>
+  <li>Execute a SAMOA classification task</li>
+  <li>Observe the task execution</li>
</ol>

<h3 id="storm-configuration">Storm Configuration</h3>
-
<p>Before we start the tutorial, please ensure that you already have a Storm cluster (preferably Storm 0.8.2) running. You can follow this <a href="http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/">tutorial</a> to set up a Storm cluster.</p>

-<p>You also need to install Storm at the machine where you initiate the deployment, and configure Storm (at least) with this configuration in <code>~/.storm/storm.yaml</code>:</p>
-<div class="highlight"><pre><code class="language-" data-lang="">########### These MUST be filled in for a storm configuration
-nimbus.host: "<enter your nimbus host name here>"
+<p>You also need to install Storm on the machine where you initiate the deployment, and configure Storm (at least) with this configuration in <code class="highlighter-rouge">~/.storm/storm.yaml</code>:</p>
+
+<p>```
+########### These MUST be filled in for a storm configuration
+nimbus.host: "&lt;enter your nimbus host name here&gt;"</p>

-## List of custom serializations
-kryo.register:
+<p>## List of custom serializations
+kryo.register:
 - org.apache.samoa.learners.classifiers.trees.AttributeContentEvent: 
org.apache.samoa.learners.classifiers.trees.AttributeContentEvent$AttributeCEFullPrecSerializer
 - org.apache.samoa.learners.classifiers.trees.ComputeContentEvent: org.apache.samoa.learners.classifiers.trees.ComputeContentEvent$ComputeCEFullPrecSerializer
-</code></pre></div>
+<code class="highlighter-rouge">
+<!--
Or, if you are using SAMOA with optimized VHT, you should use this following configuration file:
+</code>
########### These MUST be filled in for a storm configuration
-nimbus.host: "<enter your nimbus host name here>"
+nimbus.host: "&lt;enter your nimbus host name here&gt;"</p>

-## List of custom serializations
-kryo.register:
+<h2 id="list-of-custom-serializations-1">List of custom serializations</h2>
+<p>kryo.register:
 - org.apache.samoa.learners.classifiers.trees.NaiveAttributeContentEvent: org.apache.samoa.classifiers.trees.NaiveAttributeContentEvent$NaiveAttributeCEFullPrecSerializer
 - org.apache.samoa.learners.classifiers.trees.ComputeContentEvent: org.apache.samoa.classifiers.trees.ComputeContentEvent$ComputeCEFullPrecSerializer
```
-->
+--></p>

-<p>Alternatively, if you don't have Storm cluster running, you can execute SAMOA with Storm in local mode as explained in section <a href="#samoa-storm-properties">samoa-storm.properties Configuration</a>.</p>
+<p>Alternatively, if you don't have a Storm cluster running, you can execute SAMOA with Storm in local mode as explained in section <a href="#samoa-storm-properties">samoa-storm.properties Configuration</a>.</p>

<h3 id="samoa-deployables">SAMOA deployables</h3>
-
<p>There are three deployables for executing SAMOA on top of Storm. They are:</p>

<ol>
-<li><code>bin/samoa</code> is the main script to execute SAMOA. You do not need to change anything in this script.</li>
-<li><code>target/SAMOA-Storm-x.x.x-SNAPSHOT.jar</code> is the deployed jar file. <code>x.x.x</code> is the version number of SAMOA. 
</li>
-<li><code>bin/samoa-storm.properties</code> contains deployment configurations. You need to set the parameters in this properties file correctly. </li>
+  <li><code class="highlighter-rouge">bin/samoa</code> is the main script to execute SAMOA. You do not need to change anything in this script.</li>
+  <li><code class="highlighter-rouge">target/SAMOA-Storm-x.x.x-SNAPSHOT.jar</code> is the deployed jar file. <code class="highlighter-rouge">x.x.x</code> is the version number of SAMOA.</li>
+  <li><code class="highlighter-rouge">bin/samoa-storm.properties</code> contains deployment configurations. You need to set the parameters in this properties file correctly.</li>
</ol>

-<h3 id="samoa-storm-properties-configuration"><a name="samoa-storm-properties"> samoa-storm.properties Configuration</a></h3>
-
+<h3 id="a-namesamoa-storm-properties-samoa-stormproperties-configurationa"><a name="samoa-storm-properties"> samoa-storm.properties Configuration</a></h3>
<p>Currently, the properties file contains two configurations:</p>

<ol>
-<li><code>samoa.storm.mode</code> determines whether the task is executed locally (using Storm's <code>LocalCluster</code>) or executed in a Storm cluster. Use <code>local</code> if you want to test SAMOA and you do not have a Storm cluster for deployment. Use <code>cluster</code> if you want to test SAMOA on your Storm cluster.</li>
-<li><code>samoa.storm.numworker</code> determines the number of worker to execute the SAMOA tasks in the Storm cluster. This field must be an integer, less than or equal to the number of available slots in you Storm cluster. If you are using local mode, this property corresponds to the number of thread used by Storm's LocalCluster to execute your SAMOA task.</li>
+  <li><code class="highlighter-rouge">samoa.storm.mode</code> determines whether the task is executed locally (using Storm's <code class="highlighter-rouge">LocalCluster</code>) or executed in a Storm cluster. 
Use <code class="highlighter-rouge">local</code> if you want to test SAMOA and you do not have a Storm cluster for deployment. Use <code class="highlighter-rouge">cluster</code> if you want to test SAMOA on your Storm cluster.</li>
+  <li><code class="highlighter-rouge">samoa.storm.numworker</code> determines the number of workers used to execute the SAMOA tasks in the Storm cluster. This field must be an integer, less than or equal to the number of available slots in your Storm cluster. If you are using local mode, this property corresponds to the number of threads used by Storm's LocalCluster to execute your SAMOA task.</li>
</ol>

<p>Here is an example of a complete properties file:</p>
+
+<p>```
+# SAMOA Storm properties file
# This file contains specific configurations for SAMOA deployment in the Storm platform
# Note that you still need to configure Storm client in your machine,
-# including setting up Storm configuration file (~/.storm/storm.yaml) with correct settings
+# including setting up Storm configuration file (~/.storm/storm.yaml) with correct settings</p>

-# samoa.storm.mode corresponds to the execution mode of the Task in Storm
-# possible values:
+<p># samoa.storm.mode corresponds to the execution mode of the Task in Storm
+# possible values:
# 1. cluster: the Task will be sent into nimbus. The nimbus is configured by Storm configuration file
# 2. 
local: the Task will be sent using local Storm cluster
-samoa.storm.mode=cluster
+samoa.storm.mode=cluster</p>

-# samoa.storm.numworker corresponds to the number of worker processes allocated in Storm cluster
-# possible values: any integer greater than 0
+<p># samoa.storm.numworker corresponds to the number of worker processes allocated in Storm cluster
+# possible values: any integer greater than 0<br />
samoa.storm.numworker=7
```
+```</p>
+
<h3 id="samoa-task-execution">SAMOA task execution</h3>
-<p>You can execute a SAMOA task using the aforementioned <code>bin/samoa</code> script with this following format:
-<code>bin/samoa <platform> <jar> "<task>"</code>.</p>
+<p>You can execute a SAMOA task using the aforementioned <code class="highlighter-rouge">bin/samoa</code> script in the following format:
+<code class="highlighter-rouge">bin/samoa <platform> <jar> "<task>"</code>.</p>

-<p><code><platform></code> can be <code>storm</code> or <code>s4</code>. Using <code>storm</code> option means you are deploying SAMOA on a Storm environment. 
In this configuration, the script uses the aforementioned yaml file (<code class="highlighter-rouge">~/.storm/storm.yaml</code>) and <code class="highlighter-rouge">samoa-storm.properties</code> to perform the deployment. Using <code class="highlighter-rouge">s4</code> option means you are deploying SAMOA on an Apache S4 environment. Follow this <a href="Executing-SAMOA-with-Apache-S4">link</a> to learn more about deploying SAMOA on Apache S4.</p> -<p><code><jar></code> is the location of the deployed jar file (<code>SAMOA-Storm-x.x.x-SNAPSHOT.jar</code>) in your file system. The location can be a relative path or an absolute path into the jar file. </p> +<p><code class="highlighter-rouge"><jar></code> is the location of the deployed jar file (<code class="highlighter-rouge">SAMOA-Storm-x.x.x-SNAPSHOT.jar</code>) in your file system. The location can be a relative path or an absolute path into the jar file.</p> -<p><code>"<task>"</code> is the SAMOA task command line such as <code>PrequentialEvaluation</code> or <code>ClusteringTask</code>. This command line for SAMOA task follows the format of <a href="http://moa.cms.waikato.ac.nz/details/classification/command-line/">Massive Online Analysis (MOA)</a>.</p> +<p><code class="highlighter-rouge">"<task>"</code> is the SAMOA task command line such as <code class="highlighter-rouge">PrequentialEvaluation</code> or <code class="highlighter-rouge">ClusteringTask</code>. 
This command line for a SAMOA task follows the format of <a href="http://moa.cms.waikato.ac.nz/details/classification/command-line/">Massive Online Analysis (MOA)</a>.</p>

<p>The complete command to execute SAMOA is:</p>
-<div class="highlight"><pre><code class="language-" data-lang="">bin/samoa storm target/SAMOA-Storm-0.0.1-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s (org.apache.samoa.moa.streams.generators.RandomTreeGenerator -c 2 -o 10 -u 10)"
-</code></pre></div>
-<p>The example above uses <a href="Prequential-Evaluation-Task">Prequential Evaluation task</a> and <a href="Vertical-Hoeffding-Tree-Classifier">Vertical Hoeffding Tree</a> classifier. </p>
-<h3 id="observing-task-execution">Observing task execution</h3>
+<p><code class="highlighter-rouge">
+bin/samoa storm target/SAMOA-Storm-0.0.1-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (org.apache.samoa.learners.classifiers.trees.VerticalHoeffdingTree -p 4) -s (org.apache.samoa.moa.streams.generators.RandomTreeGenerator -c 2 -o 10 -u 10)"
+</code>
+The example above uses the <a href="Prequential-Evaluation-Task">Prequential Evaluation task</a> and the <a href="Vertical-Hoeffding-Tree-Classifier">Vertical Hoeffding Tree</a> classifier.</p>

+<h3 id="observing-task-execution">Observing task execution</h3>
+<p>There are two ways to observe the task execution: using the Storm UI and monitoring the dump file of the SAMOA task. 
Notice that the dump file will be created on the cluster if you are executing your task in <code class="highlighter-rouge">cluster</code> mode.</p>

<h4 id="using-storm-ui">Using Storm UI</h4>
-
<p>Go to the web address of the Storm UI and check whether the SAMOA task executes as intended. Use this UI to kill the associated Storm topology if necessary.</p>

<h4 id="monitoring-the-dump-file">Monitoring the dump file</h4>
-
-<p>Several tasks have options to specify a dump file, which is a file that represents the task output. In our example, <a href="Prequential-Evaluation-Task">Prequential Evaluation task</a> has <code>-d</code> option which specifies the path to the dump file. Since Storm performs the allocation of Storm tasks, you should set the dump file into a file on a shared filesystem if you want to access it from the machine submitting the task.</p>
+<p>Several tasks have options to specify a dump file, which is a file that represents the task output. In our example, the <a href="Prequential-Evaluation-Task">Prequential Evaluation task</a> has the <code class="highlighter-rouge">-d</code> option, which specifies the path to the dump file. Since Storm performs the allocation of Storm tasks, you should point the dump file to a file on a shared filesystem if you want to access it from the machine submitting the task.</p>

</article>

Modified: incubator/samoa/site/documentation/Getting-Started.html
URL: http://svn.apache.org/viewvc/incubator/samoa/site/documentation/Getting-Started.html?rev=1737551&r1=1737550&r2=1737551&view=diff
==============================================================================
--- incubator/samoa/site/documentation/Getting-Started.html (original)
+++ incubator/samoa/site/documentation/Getting-Started.html Sun Apr 3 08:17:59 2016
@@ -76,26 +76,40 @@
 <p>We start by showing how simple it is to run a first large-scale machine learning task in SAMOA. 
We will evaluate a bagging ensemble method using decision trees on the Forest Covertype dataset.</p> <ul> -<li>1. Download SAMOA </li> + <li> + <ol> + <li>Download SAMOA</li> + </ol> + </li> </ul> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">git clone http://git.apache.org/incubator-samoa.git -<span class="nb">cd </span>incubator-samoa -mvn package <span class="c">#Local mode</span> -</code></pre></div> -<ul> -<li>2. Download the Forest CoverType dataset </li> -</ul> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">wget <span class="s2">"http://downloads.sourceforge.net/project/moa-datastream/Datasets/Classification/covtypeNorm.arff.zip"</span> + +<p><code class="highlighter-rouge">bash +git clone http://git.apache.org/incubator-samoa.git +cd incubator-samoa +mvn package #Local mode +</code> +* 2. Download the Forest CoverType dataset</p> + +<p><code class="highlighter-rouge">bash +wget "http://downloads.sourceforge.net/project/moa-datastream/Datasets/Classification/covtypeNorm.arff.zip" unzip covtypeNorm.arff.zip -</code></pre></div> +</code></p> + <p><em>Forest Covertype</em> contains the forest cover type for 30 x 30 meter cells obtained from the US Forest Service (USFS) Region 2 Resource Information System (RIS) data. It contains 581,012 instances and 54 attributes, and it has been used in several articles on data stream classification.</p> <ul> -<li>3. 
Run an example: classifying the CoverType dataset with the bagging algorithm</li> + <li> + <ol> + <li>Run an example: classifying the CoverType dataset with the bagging algorithm</li> + </ol> + </li> </ul> -<div class="highlight"><pre><code class="language-bash" data-lang="bash">bin/samoa <span class="nb">local </span>target/SAMOA-Local-0.3.0-SNAPSHOT.jar <span class="s2">"PrequentialEvaluation -l classifiers.ensemble.Bagging - -s (ArffFileStream -f covtypeNorm.arff) -f 100000"</span> -</code></pre></div> + +<p><code class="highlighter-rouge">bash +bin/samoa local target/SAMOA-Local-0.3.0-SNAPSHOT.jar "PrequentialEvaluation -l classifiers.ensemble.Bagging + -s (ArffFileStream -f covtypeNorm.arff) -f 100000" +</code></p> + <p>The output will be a list of the evaluation results, plotted every 100,000 instances.</p> </article> Modified: incubator/samoa/site/documentation/Home.html URL: http://svn.apache.org/viewvc/incubator/samoa/site/documentation/Home.html?rev=1737551&r1=1737550&r2=1737551&view=diff ============================================================================== --- incubator/samoa/site/documentation/Home.html (original) +++ incubator/samoa/site/documentation/Home.html Sun Apr 3 08:17:59 2016 @@ -81,58 +81,62 @@ SAMOA is similar to Mahout in spirit, bu <p>Apache SAMOA is simple and fun to use! This documentation is intended to give an introduction on how to use SAMOA in different ways. As a user you can run SAMOA algorithms on several stream processing engines: local mode, Storm, S4, Samza, and Flink.
As a developer you can create new algorithms only once and test them in all of these distributed stream processing engines.</p> <h2 id="getting-started">Getting Started</h2> - <ul> -<li><a href="Getting-Started.html">0 Hands-on with SAMOA: Getting Started!</a></li> + <li><a href="Getting-Started.html">0 Hands-on with SAMOA: Getting Started!</a></li> </ul> <h2 id="users">Users</h2> - -<ul> -<li><a href="Scalable-Advanced-Massive-Online-Analysis.html">1 Building and Executing SAMOA</a> - -<ul> -<li><a href="Building-SAMOA.html">1.0 Building SAMOA</a></li> -<li><a href="Executing-SAMOA-with-Apache-Storm.html">1.1 Executing SAMOA with Apache Storm</a></li> -<li><a href="Executing-SAMOA-with-Apache-S4.html">1.2 Executing SAMOA with Apache S4</a></li> -<li><a href="Executing-SAMOA-with-Apache-Samza.html">1.3 Executing SAMOA with Apache Samza</a></li> -<li><a href="Executing-SAMOA-with-Apache-Avro-Files.html">1.4 Executing SAMOA with Apache Avro Files</a></li> -</ul></li> -<li><a href="SAMOA-and-Machine-Learning.html">2 Machine Learning Methods in SAMOA</a> - <ul> -<li><a href="Prequential-Evaluation-Task.html">2.1 Prequential Evaluation Task</a></li> -<li><a href="Vertical-Hoeffding-Tree-Classifier.html">2.2 Vertical Hoeffding Tree Classifier</a></li> -<li><a href="Adaptive-Model-Rules-Regressor.html">2.3 Adaptive Model Rules Regressor</a></li> -<li><a href="Bagging-and-Boosting.html">2.4 Bagging and Boosting</a></li> -<li><a href="Distributed-Stream-Clustering.html">2.5 Distributed Stream Clustering</a></li> -<li><a href="Distributed-Stream-Frequent-Itemset-Mining.html">2.6 Distributed Stream Frequent Itemset Mining</a></li> -<li><a href="SAMOA-for-MOA-users.html">2.7 SAMOA for MOA users</a></li> -</ul></li> + <li><a href="Scalable-Advanced-Massive-Online-Analysis.html">1 Building and Executing SAMOA</a> + <ul> + <li><a href="Building-SAMOA.html">1.0 Building SAMOA</a></li> + <li><a href="Executing-SAMOA-with-Apache-Storm.html">1.1 Executing SAMOA with Apache 
Storm</a></li> + <li><a href="Executing-SAMOA-with-Apache-S4.html">1.2 Executing SAMOA with Apache S4</a></li> + <li><a href="Executing-SAMOA-with-Apache-Samza.html">1.3 Executing SAMOA with Apache Samza</a></li> + <li><a href="Executing-SAMOA-with-Apache-Avro-Files.html">1.4 Executing SAMOA with Apache Avro Files</a></li> + </ul> + </li> + <li><a href="SAMOA-and-Machine-Learning.html">2 Machine Learning Methods in SAMOA</a> + <ul> + <li><a href="Prequential-Evaluation-Task.html">2.1 Prequential Evaluation Task</a></li> + <li><a href="Vertical-Hoeffding-Tree-Classifier.html">2.2 Vertical Hoeffding Tree Classifier</a></li> + <li><a href="Adaptive-Model-Rules-Regressor.html">2.3 Adaptive Model Rules Regressor</a></li> + <li><a href="Bagging-and-Boosting.html">2.4 Bagging and Boosting</a></li> + <li><a href="Distributed-Stream-Clustering.html">2.5 Distributed Stream Clustering</a></li> + <li><a href="Distributed-Stream-Frequent-Itemset-Mining.html">2.6 Distributed Stream Frequent Itemset Mining</a></li> + <li><a href="SAMOA-for-MOA-users.html">2.7 SAMOA for MOA users</a></li> + </ul> + </li> </ul> <h2 id="developers">Developers</h2> - <ul> -<li><a href="SAMOA-Topology.html">3 Understanding SAMOA Topologies</a> - -<ul> -<li><a href="Processor.html">3.1 Processor</a></li> -<li><a href="Content-Event.html">3.2 Content Event</a></li> -<li><a href="Stream.html">3.3 Stream</a></li> -<li><a href="Task.html">3.4 Task</a></li> -<li><a href="Topology-Builder.html">3.5 Topology Builder</a></li> -<li><a href="Learner.html">3.6 Learner</a></li> -<li><a href="Processing-Item.html">3.7 Processing Item</a></li> -</ul></li> -<li><a href="Developing-New-Tasks-in-SAMOA.html">4 Developing New Tasks in SAMOA</a></li> + <li><a href="SAMOA-Topology.html">3 Understanding SAMOA Topologies</a> + <ul> + <li><a href="Processor.html">3.1 Processor</a></li> + <li><a href="Content-Event.html">3.2 Content Event</a></li> + <li><a href="Stream.html">3.3 Stream</a></li> + <li><a href="Task.html">3.4 
Task</a></li> + <li><a href="Topology-Builder.html">3.5 Topology Builder</a></li> + <li><a href="Learner.html">3.6 Learner</a></li> + <li><a href="Processing-Item.html">3.7 Processing Item</a></li> + </ul> + </li> + <li><a href="Developing-New-Tasks-in-SAMOA.html">4 Developing New Tasks in SAMOA</a></li> </ul> <h3 id="getting-help">Getting help</h3> +<p>Discussion about SAMOA happens on the Apache development mailing list <a href="mailto:dev@samoa.incubator.org">dev@samoa.incubator.org</a></p> -<p>Discussion about SAMOA happens on the Apache development mailing list <a href="mailto:[email protected]">[email protected]</a></p> - -<p>[ <a href="mailto:[email protected]">subscribe</a> | <a href="mailto:[email protected]">unsubscribe</a> | <a href="http://mail-archives.apache.org/mod_mbox/incubator-samoa-dev">archives</a> ]</p> +<table> + <tbody> + <tr> + <td>[ <a href="mailto:dev-subscribe@samoa.incubator.org">subscribe</a></td> + <td><a href="mailto:dev-unsubscribe@samoa.incubator.org">unsubscribe</a></td> + <td><a href="http://mail-archives.apache.org/mod_mbox/incubator-samoa-dev">archives</a> ]</td> + </tr> + </tbody> +</table> </article> Modified: incubator/samoa/site/documentation/Learner.html URL: http://svn.apache.org/viewvc/incubator/samoa/site/documentation/Learner.html?rev=1737551&r1=1737550&r2=1737551&view=diff ============================================================================== --- incubator/samoa/site/documentation/Learner.html (original) +++ incubator/samoa/site/documentation/Learner.html Sun Apr 3 08:17:59 2016 @@ -74,18 +74,19 @@ <article class="post-content"> <p>Learners are implemented in SAMOA as sub-topologies.</p> -<div class="highlight"><pre><code class="language-" data-lang="">public interface Learner extends Serializable{ - public void init(TopologyBuilder topologyBuilder, Instances dataset); +<p>``` +public interface Learner extends Serializable{</p> - public Processor getInputProcessor(); +<div class="highlighter-rouge"><pre 
class="highlight"><code>public void init(TopologyBuilder topologyBuilder, Instances dataset); - public Stream getResultStream(); -} -</code></pre></div> -<p>When a <code>Task</code> object is initiated via <code>init()</code>, the method <code>init(...)</code> of <code>Learner</code> is called, and the topology is added to the global topology of the task.</p> +public Processor getInputProcessor(); -<p>To create a new learner, it is only needed to add streams, processors and their connections to the topology in <code>init(...)</code>, specify what is the processor that will manage the input stream of the learner in <code>getInputProcessor()</code>, and finally, specify what is going to be the output stream of the learner with <code>getResultStream()</code>.</p> +public Stream getResultStream(); } ``` When a `Task` object is initiated via `init()`, the method `init(...)` of `Learner` is called, and the topology is added to the global topology of the task. +</code></pre> +</div> + +<p>To create a new learner, it is only needed to add streams, processors and their connections to the topology in <code class="highlighter-rouge">init(...)</code>, specify what is the processor that will manage the input stream of the learner in <code class="highlighter-rouge">getInputProcessor()</code>, and finally, specify what is going to be the output stream of the learner with <code class="highlighter-rouge">getResultStream()</code>.</p> </article> Modified: incubator/samoa/site/documentation/Prequential-Evaluation-Task.html URL: http://svn.apache.org/viewvc/incubator/samoa/site/documentation/Prequential-Evaluation-Task.html?rev=1737551&r1=1737550&r2=1737551&view=diff ============================================================================== --- incubator/samoa/site/documentation/Prequential-Evaluation-Task.html (original) +++ incubator/samoa/site/documentation/Prequential-Evaluation-Task.html Sun Apr 3 08:17:59 2016 @@ -73,26 +73,29 @@ </header> <article class="post-content"> - 
<p>In data stream mining, the most used evaluation scheme is the prequential or interleaved-test-then-train evolution. The idea is very simple: we use each instance first to test the model, and then to train the model. The Prequential Evaluation task evaluates the performance of online classifiers doing this. It supports two classification performance evaluators: the basic one which measures the accuracy of the classifier model since the start of the evaluation, and a window based one which measures the accuracy on the current sliding window of recent instances. </p> + <p>In data stream mining, the most commonly used evaluation scheme is the prequential or interleaved-test-then-train evaluation. The idea is very simple: we use each instance first to test the model, and then to train the model. The Prequential Evaluation task evaluates the performance of online classifiers in this way. It supports two classification performance evaluators: the basic one, which measures the accuracy of the classifier model since the start of the evaluation, and a window based one, which measures the accuracy on the current sliding window of recent instances.</p> <p>An example of the Prequential Evaluation task on the SAMOA command line when deploying into Storm:</p> -<div class="highlight"><pre><code class="language-" data-lang="">bin/samoa storm target/SAMOA-Storm-0.0.1-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4) -s (generators.RandomTreeGenerator -c 2 -o 10 -u 10)" -</code></pre></div> + +<p><code class="highlighter-rouge"> +bin/samoa storm target/SAMOA-Storm-0.0.1-SNAPSHOT.jar "PrequentialEvaluation -d /tmp/dump.csv -i 1000000 -f 100000 -l (classifiers.trees.VerticalHoeffdingTree -p 4) -s (generators.RandomTreeGenerator -c 2 -o 10 -u 10)" +</code></p> + <p>Parameters:</p> <ul> -<li><code>-l</code>: classifier to train</li> -<li><code>-s</code>: stream to learn from</li> -<li><code>-e</code>: classification performance
evaluation method</li> -<li><code>-i</code>: maximum number of instances to test/train on (-1 = no limit)</li> -<li><code>-f</code>: number of instances between samples of the learning performance</li> -<li><code>-n</code>: evaluation name (default: PrequentialEvaluation_TimeStamp)</li> -<li><code>-d</code>: file to append intermediate csv results to</li> + <li><code class="highlighter-rouge">-l</code>: classifier to train</li> + <li><code class="highlighter-rouge">-s</code>: stream to learn from</li> + <li><code class="highlighter-rouge">-e</code>: classification performance evaluation method</li> + <li><code class="highlighter-rouge">-i</code>: maximum number of instances to test/train on (-1 = no limit)</li> + <li><code class="highlighter-rouge">-f</code>: number of instances between samples of the learning performance</li> + <li><code class="highlighter-rouge">-n</code>: evaluation name (default: PrequentialEvaluation_TimeStamp)</li> + <li><code class="highlighter-rouge">-d</code>: file to append intermediate csv results to</li> </ul> -<p>In terms of SAMOA API, the Prequential Evaluation Task consists of a source <code>Entrance Processor</code>, a <code>Classifier</code>, and an <code>Evaluator Processor</code> as shown below. The <code>Entrance Processor</code> sends instances to the <code>Classifier</code> using the <code>source</code> stream. The classifier sends the classification results to the <code>Evaluator Processor</code> via the <code>result</code> stream. 
The <code>Entrance Processor</code> corresponds to the <code>-s</code> option of Prequential Evaluation, the <code>Classifier</code> corresponds to the <code>-l</code> option, and the <code>Evaluator Processor</code> corresponds to the <code>-e</code> option.</p> +<p>In terms of the SAMOA API, the Prequential Evaluation Task consists of a source <code class="highlighter-rouge">Entrance Processor</code>, a <code class="highlighter-rouge">Classifier</code>, and an <code class="highlighter-rouge">Evaluator Processor</code> as shown below. The <code class="highlighter-rouge">Entrance Processor</code> sends instances to the <code class="highlighter-rouge">Classifier</code> using the <code class="highlighter-rouge">source</code> stream. The classifier sends the classification results to the <code class="highlighter-rouge">Evaluator Processor</code> via the <code class="highlighter-rouge">result</code> stream. The <code class="highlighter-rouge">Entrance Processor</code> corresponds to the <code class="highlighter-rouge">-s</code> option of Prequential Evaluation, the <code class="highlighter-rouge">Classifier</code> corresponds to the <code class="highlighter-rouge">-l</code> option, and the <code class="highlighter-rouge">Evaluator Processor</code> corresponds to the <code class="highlighter-rouge">-e</code> option.</p> -<p><img src="images/PrequentialEvaluation.png" alt="Prequential Evaluation Task"></p> +<p><img src="images/PrequentialEvaluation.png" alt="Prequential Evaluation Task" /></p> </article> Modified: incubator/samoa/site/documentation/Processing-Item.html URL: http://svn.apache.org/viewvc/incubator/samoa/site/documentation/Processing-Item.html?rev=1737551&r1=1737550&r2=1737551&view=diff ============================================================================== --- incubator/samoa/site/documentation/Processing-Item.html (original) +++ incubator/samoa/site/documentation/Processing-Item.html Sun Apr 3 08:17:59 2016 @@ -82,30 +82,33 @@ It is used internally,
and it is not acc There are two types of Processing Items.</p> <ol> -<li>Simple Processing Item (PI)</li> -<li>Entrance Processing Item (EntrancePI)</li> + <li>Simple Processing Item (PI)</li> + <li>Entrance Processing Item (EntrancePI)</li> </ol> -<h4 id="1-simple-processing-item-pi">1. Simple Processing Item (PI)</h4> +<h4 id="simple-processing-item-pi">1. Simple Processing Item (PI)</h4> +<p>Once a Processor is wrapped in a PI, it becomes an executable component of the topology. All physical topology units are created with the help of a <code class="highlighter-rouge">TopologyBuilder</code>. The following code snippet shows the creation of a Processing Item.</p> -<p>Once a Processor is wrapped in a PI, it becomes an executable component of the topology. All physical topology units are created with the help of a <code>TopologyBuilder</code>. Following code snippet shows the creation of a Processing Item.</p> -<div class="highlight"><pre><code class="language-" data-lang="">builder.initTopology("MyTopology"); +<p><code class="highlighter-rouge"> +builder.initTopology("MyTopology"); Processor samplerProcessor = new Sampler(); ProcessingItem samplerPI = builder.createPI(samplerProcessor,3); -</code></pre></div> -<p>The <code>createPI()</code> method of <code>TopologyBuilder</code> is used to create a PI. Its first argument is the instance of a Processor which needs to be wrapped-in. Its second argument is the parallelism hint. It tells the underlying platforms how many parallel instances of this PI should be created on different nodes.</p> - -<h4 id="2-entrance-processing-item-entrancepi">2. Entrance Processing Item (EntrancePI)</h4> +</code> +The <code class="highlighter-rouge">createPI()</code> method of <code class="highlighter-rouge">TopologyBuilder</code> is used to create a PI. Its first argument is the Processor instance to be wrapped. Its second argument is the parallelism hint.
It tells the underlying platforms how many parallel instances of this PI should be created on different nodes.</p> +<h4 id="entrance-processing-item-entrancepi">2. Entrance Processing Item (EntrancePI)</h4> <p>Entrance Processing Item is different from a PI in only one way: it accepts an Entrance Processor which can generate its own stream. It is mostly used as the source of a topology. It connects to external sources, pulls data and provides it to the topology in the form of streams. -All physical topology units are created with the help of a <code>TopologyBuilder</code>. +All physical topology units are created with the help of a <code class="highlighter-rouge">TopologyBuilder</code>. The following code snippet shows the creation of an Entrance Processing Item.</p> -<div class="highlight"><pre><code class="language-" data-lang="">builder.initTopology("MyTopology"); + +<p><code class="highlighter-rouge"> +builder.initTopology("MyTopology"); EntranceProcessor sourceProcessor = new Source(); EntranceProcessingItem sourcePi = builder.createEntrancePi(sourceProcessor); -</code></pre></div> +</code></p> + </article> <!-- </div> -->
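To make the parallelism hint concrete, the following self-contained Java sketch shows one way a Processing Item created with parallelism 3 could fan events out round-robin across three replicas of its processor. This is an illustrative sketch only: the Processor, CountingProcessor, and ProcessingItem types here are hypothetical stand-ins, not SAMOA's actual TopologyBuilder or ProcessingItem API, and real platforms may use other groupings than round-robin.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch (not SAMOA API): a parallelism hint fans one
 * processor out into several replicas; events are distributed
 * round-robin as a simple stand-in for shuffle grouping.
 */
public class ParallelismSketch {

    interface Processor {
        void process(int event);
    }

    /** Counts how many events a single replica receives. */
    static final class CountingProcessor implements Processor {
        int processed = 0;

        @Override
        public void process(int event) {
            processed++;
        }
    }

    /** Wraps 'parallelism' replicas of a processor behind one entry point. */
    static final class ProcessingItem {
        final List<CountingProcessor> replicas = new ArrayList<>();
        private int next = 0;

        ProcessingItem(int parallelism) {
            for (int i = 0; i < parallelism; i++) {
                replicas.add(new CountingProcessor());
            }
        }

        void dispatch(int event) {
            replicas.get(next).process(event); // round-robin over replicas
            next = (next + 1) % replicas.size();
        }
    }

    public static void main(String[] args) {
        ProcessingItem pi = new ProcessingItem(3); // parallelism hint = 3
        for (int e = 0; e < 9; e++) {
            pi.dispatch(e);
        }
        for (CountingProcessor p : pi.replicas) {
            System.out.println("replica processed " + p.processed + " events");
            // each replica processed 3 of the 9 events
        }
    }
}
```

With 9 events and parallelism 3, each replica ends up processing 3 events, which is the load-spreading effect the parallelism hint is meant to achieve on different nodes.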

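The interleaved test-then-train loop described in the Prequential Evaluation section above can be sketched in a few lines of plain Java. This is an illustrative sketch, not SAMOA code: the MajorityClassLearner below is a hypothetical stand-in for a real online classifier, and the accuracy computed corresponds to the "basic" evaluator that measures accuracy since the start of the evaluation.

```java
import java.util.Arrays;
import java.util.List;

/**
 * Illustrative sketch of prequential (interleaved test-then-train)
 * evaluation. Not SAMOA code: MajorityClassLearner is a hypothetical
 * stand-in for a real online classifier.
 */
public class PrequentialSketch {

    /** Trivial learner: predicts the class seen most often so far. */
    static final class MajorityClassLearner {
        private final int[] counts;

        MajorityClassLearner(int numClasses) {
            counts = new int[numClasses];
        }

        int predict() {
            int best = 0;
            for (int c = 1; c < counts.length; c++) {
                if (counts[c] > counts[best]) best = c;
            }
            return best;
        }

        void train(int label) {
            counts[label]++;
        }
    }

    /** Each instance is used first to test the model, then to train it. */
    static double prequentialAccuracy(List<Integer> labels, int numClasses) {
        MajorityClassLearner learner = new MajorityClassLearner(numClasses);
        int correct = 0;
        for (int label : labels) {
            if (learner.predict() == label) correct++; // test first...
            learner.train(label);                      // ...then train
        }
        return (double) correct / labels.size();
    }

    public static void main(String[] args) {
        List<Integer> stream = Arrays.asList(0, 0, 1, 0, 0, 1, 0, 0);
        System.out.println("prequential accuracy = " + prequentialAccuracy(stream, 2));
        // prints: prequential accuracy = 0.75
    }
}
```

A windowed evaluator, as supported by the Prequential Evaluation task, would differ only in computing the accuracy over a sliding window of recent instances instead of over the whole stream.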