[7/9] mahout git commit: WEBSITE Triage of Old Site Migration

rawkintrevo Sat, 29 Apr 2017 16:25:04 -0700

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md
 
b/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md
new file mode 100644
index 0000000..c72a7ae
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md
@@ -0,0 +1,51 @@
+---
+layout: default
+title: 
+theme:
+   name: retro-mahout
+---
+
+# Introduction
+
+This document provides an overview of how the Mahout Samsara environment is 
implemented over the H2O backend engine. The document is aimed at Mahout 
developers, to give a high level description of the design so that one can 
explore the code inside `h2o/` with some context.
+
+## H2O Overview
+
+H2O is a distributed, scalable machine learning system. Its internal 
architecture consists of a distributed math engine (h2o-core) and a separate 
layer on top for algorithms and UI. The Mahout integration requires only the 
math engine (h2o-core).
+
+## H2O Data Model
+
+The data model of the H2O math engine is a distributed columnar store (of 
primarily numbers, but also strings). A column of numbers is called a Vector 
(Vec), which is broken into Chunks of a few thousand elements each. Chunks are 
distributed across the cluster based on a deterministic hash, so any member of 
the cluster knows where a particular Chunk of a Vector is homed. Each Chunk is 
separately compressed in memory, and elements are individually decompressed on 
the fly upon access with purely register operations (thereby achieving high 
memory throughput). An ordered set of similarly partitioned Vectors is composed 
into a Frame. A Frame is therefore a large two-dimensional table of numbers. 
All elements of a logical row in the Frame are guaranteed to be homed on the 
same server of the cluster. Generally speaking, H2O works well on "tall skinny" 
data, i.e., many rows (hundreds of millions) and a modest number of columns 
(tens of thousands).
+
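The homing rule can be sketched in a few lines of plain Scala. This is an illustration, not H2O code: `ChunkHoming`, `chunkIndex`, `homeNode` and the chunk size of 4096 are hypothetical, and the hash used by h2o-core may well differ. The sketch assumes the home node depends only on the chunk index, which is one simple way to keep all columns of a row range on the same node.

```scala
object ChunkHoming {
  // Hypothetical chunk size; the text only says "a few thousand elements".
  val ChunkSize = 4096

  /** Index of the chunk that holds a given row. */
  def chunkIndex(row: Long): Long = row / ChunkSize

  /** Deterministic home node for a chunk; any node can evaluate this locally.
    * Assumption: the hash depends only on the chunk index, so the chunks of
    * all similarly partitioned columns covering the same rows land together. */
  def homeNode(chunk: Long, clusterSize: Int): Int =
    java.lang.Math.floorMod(java.lang.Long.hashCode(chunk), clusterSize)

  def main(args: Array[String]): Unit = {
    val nodes = 4
    for (row <- Seq(0L, 4095L, 4096L, 1000000L)) {
      val c = chunkIndex(row)
      println(s"row $row -> chunk $c -> node ${homeNode(c, nodes)}")
    }
  }
}
```

Because `homeNode` takes no column argument, looking up any element of a logical row resolves to the same node, which mirrors the row co-location guarantee described above.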
+
+## Mahout DRM
+
+The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a 
large matrix of numbers in-memory in a cluster by distributing logical rows 
among servers. Mahout's Scala DSL provides an abstract API on DRMs for backend 
engines to provide implementations of this API. Examples are the Spark and H2O 
backend engines. Each engine has its own design of mapping the abstract API 
onto its data model and provides implementations for algebraic operators over 
that mapping.
+
+
+## H2O Environment Engine
+
+The H2O backend implements the abstract DRM as an H2O Frame. Each logical 
column in the DRM is an H2O Vector. All elements of a logical DRM row are 
guaranteed to be homed on the same server. The set of rows stored on a server 
is presented as a read-only virtual in-core Matrix (i.e., a BlockMatrix) to the 
closure method in the `mapBlock(...)` API.
+
+H2O provides a flexible execution framework called `MRTask`. The `MRTask` 
framework typically executes over a Frame (or even a single Vector), supports 
various types of map() methods, can optionally modify the Frame or Vector 
(though this never happens in the Mahout integration), and can optionally 
create a new Vector or set of Vectors (which are combined into a new Frame, and 
consequently a new DRM).
+
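The block-wise execution model described above can be modelled without H2O or Mahout on the classpath. The following is a minimal sketch in plain Scala: `MapBlockSketch`, `partition` and this `mapBlock` are hypothetical stand-ins, not the Mahout or `MRTask` API. It only shows the shape of the computation: rows are grouped per server, a closure sees one block at a time, the input is never modified, and the outputs form a new matrix.

```scala
object MapBlockSketch {
  // Immutable rows, so a closure cannot modify its input block.
  type Block = Vector[Vector[Double]]

  /** Split a matrix into contiguous row blocks, one per (hypothetical) server. */
  def partition(rows: Block, servers: Int): Vector[Block] = {
    val size = math.max(1, math.ceil(rows.length.toDouble / servers).toInt)
    rows.grouped(size).toVector
  }

  /** Apply a closure to every block and concatenate the results into a new matrix. */
  def mapBlock(blocks: Vector[Block])(f: Block => Block): Block =
    blocks.flatMap(f)

  def main(args: Array[String]): Unit = {
    val a: Block = Vector(Vector(1.0, 2.0), Vector(3.0, 4.0), Vector(5.0, 6.0))
    val doubled = mapBlock(partition(a, servers = 2))(_.map(_.map(_ * 2)))
    println(doubled)
  }
}
```

In the real backend the blocks live on different servers and the closure runs where the data is homed; here everything runs in one JVM purely to make the data flow visible.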
+
+## Source Layout
+
+Within mahout.git, the top level directory `h2o/` holds all the source code 
related to the H2O backend engine. Part of the code (that interfaces with the 
rest of the Mahout components) is in Scala, and part of the code (that 
interfaces with h2o-core and implements algebraic operators) is in Java. Here 
is a brief overview of what functionality can be found where within `h2o/`.
+
+  h2o/ - top level directory containing all H2O related code
+
+  h2o/src/main/java/org/apache/mahout/h2obindings/ops/*.java - Physical 
operator code for the various DSL algebra
+
+  h2o/src/main/java/org/apache/mahout/h2obindings/drm/*.java - DRM backing 
(onto Frame) and Broadcast implementation
+
+  h2o/src/main/java/org/apache/mahout/h2obindings/H2OHdfs.java - Read / Write 
between DRM (Frame) and files on HDFS
+
+  h2o/src/main/java/org/apache/mahout/h2obindings/H2OBlockMatrix.java - A 
vertical block matrix of DRM presented as a virtual copy-on-write in-core 
Matrix. Used in mapBlock() API
+
+  h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java - A 
collection of various functionality and helpers, for example conversion between 
in-core Matrix and DRM, and various summary statistics on DRM/Frame.
+
+  h2o/src/main/scala/org/apache/mahout/h2obindings/H2OEngine.scala - DSL 
operator graph evaluator and various abstract API implementations for a 
distributed engine
+
+  h2o/src/main/scala/org/apache/mahout/h2obindings/* - Various abstract API 
implementations ("glue work")
\ No newline at end of file


http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/environment/spark-internals.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/environment/spark-internals.md
 
b/website/old_site_migration/needs_work_convenience/environment/spark-internals.md
new file mode 100644
index 0000000..f5d72a4
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/environment/spark-internals.md
@@ -0,0 +1,25 @@
+---
+layout: default
+title: 
+theme:
+   name: retro-mahout
+---
+
+# Introduction
+
+This document provides an overview of how the Mahout Scala DSL (distributed 
algebraic operators) is implemented over the Spark back end engine. The 
document is aimed at Mahout developers, to give a high level description of the 
design. 
+
+## Spark Overview
+
+## Spark Data Model
+
+
+## Mahout DRM
+
+Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large 
matrix of numbers in-memory in a cluster by distributing logical rows among 
servers. The DSL provides an abstract API on DRMs for backend engines to 
provide implementations of this API. Examples are the Spark and H2O backend 
engines. Each engine has its own design of mapping the abstract API onto its 
data model and provides implementations for algebraic operators over that 
mapping.
+
+
+## Spark DSL Engine
+
+
+## Source Layout

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/faq.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/faq.md 
b/website/old_site_migration/needs_work_convenience/faq.md
new file mode 100644
index 0000000..8e1e592
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/faq.md
@@ -0,0 +1,105 @@
+---
+layout: default
+title: FAQ
+theme:
+    name: retro-mahout
+---
+
+# The Official Mahout FAQ
+
+*General*
+
+1. [What is Apache Mahout?](#whatis)
+1. [What does the name mean?](#mean)
+1. [How is the name pronounced?](#pronounce)
+1. [Where can I find the origins of the Mahout project?](#historical)
+1. [Where can I download the Mahout logo?](#downloadlogo)
+1. [Where can I download Mahout slide presentations?](#presentations)
+
+*Algorithms*
+
+1. [What algorithms are implemented in Mahout?](#algos)
+1. [What algorithms are missing from Mahout?](#todo)
+1. [Do I need Hadoop to run Mahout?](#hadoop)
+
+*Hadoop specific questions*
+
+1. [Mahout just won't run in parallel on my dataset. Why?](#split)
+
+
+# *Answers*
+
+
+## General
+
+
+<a name="whatis"></a>
+#### What is Apache Mahout?
+
+Apache Mahout is a suite of machine learning libraries designed to be
+scalable and robust.
+
+<a name="mean"></a>
+#### What does the name mean?
+
+The name [Mahout](http://en.wikipedia.org/wiki/Mahout)
+ was originally chosen for its association with the [Apache 
Hadoop](http://hadoop.apache.org)
+ project.  A Mahout is a person who drives an elephant (hint: Hadoop's logo
+is an elephant).  We just wanted a name that complemented Hadoop but we see
+our project as a good driver of Hadoop in the sense that we will be using
+and testing it.  We are not, however, implying that we are controlling
+Hadoop's development.
+
+Prior to coming to the ASF, those of us working on the project plan voted 
between [Howdah](http://en.wikipedia.org/wiki/Howdah) (the carriage on top of 
an elephant) and Mahout.
+
+<a name="historical"></a>
+#### Where can I find the origins of the Mahout project?
+
+See 
[http://ml-site.grantingersoll.com](http://web.archive.org/web/20080101233917/http://ml-site.grantingersoll.com/index.php?title=Main_Page)
+ for old wiki and mailing list archives (all read-only)
+
+Mahout was started by [Isabel Drost, Grant Ingersoll and Karl 
Wettin](http://web.archive.org/web/20071228055210/http://ml-site.grantingersoll.com/index.php?title=Main_Page). 
It 
[started](http://web.archive.org/web/20080201093120/http://lucene.apache.org/#22+January+2008+-+Lucene+PMC+Approves+Mahout+Machine+Learning+Project) 
as part of the [Lucene](http://lucene.apache.org) project (see the [original 
proposal](http://web.archive.org/web/20080102151102/http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal)) 
and went on to become a top level project in April of 2010.
+
+The original goal was to implement all 10 algorithms from Andrew Ng's paper 
"[Map-Reduce for Machine Learning on 
Multicore](http://ai.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf)".
+
+<a name="pronounce"></a>
+#### How is the name pronounced?
+
+There are some disagreements about how to pronounce the name. Webster's has it 
as muh-hout (as in ["out"](http://dictionary.reference.com/browse/mahout)), but 
the Sanskrit/Hindi origins pronounce it as "muh-hoot". The second pronunciation 
suggests a nice pun on the Hebrew word מהות meaning "essence or truth".
+
+<a name="downloadlogo"></a>
+#### Where can I download the Mahout logo?
+
+See [MAHOUT-335](https://issues.apache.org/jira/browse/MAHOUT-335)
+
+
+<a name="presentations"></a>
+#### Where can I download Mahout slide presentations?
+
+The [Books, Tutorials and 
Talks](https://mahout.apache.org/general/books-tutorials-and-talks.html)
+ page contains an overview of a wide variety of presentations with links to 
slides where available.
+
+## Algorithms
+
+<a name="algos"></a>
+#### What algorithms are implemented in Mahout?
+
+We are interested in a wide variety of machine learning algorithms, many of
+which are already implemented in Mahout. You can find a list 
[here](https://mahout.apache.org/users/basics/algorithms.html).
+
+<a name="todo"></a>
+#### What algorithms are missing from Mahout?
+
+There are many machine learning algorithms that we would like to have in
+Mahout. If you have an algorithm or an improvement to an algorithm that you 
would
+like to implement, start a discussion on our [mailing 
list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html).
+
+<a name="hadoop"></a>
+#### Do I need Hadoop to use Mahout?
+
+A number of algorithm implementations have no Hadoop dependencies whatsoever; 
consult the [algorithms 
list](https://mahout.apache.org/users/basics/algorithms.html). In the future, 
we might provide more algorithm implementations on platforms more suitable for 
machine learning, such as [Apache Spark](http://spark.apache.org).
+
+## Hadoop specific questions
+<a name="split"></a>
+#### Mahout just won't run in parallel on my dataset. Why?
+
+If you are running training on a Hadoop cluster, keep in mind that the number 
of mappers started is governed by the size of the input data and the configured 
split/block size of your cluster. As a rule of thumb,
+anything below 100MB in size won't be split by default. 
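The arithmetic behind this rule of thumb can be sketched as follows. The sketch assumes one mapper per input split and a single splittable input file; `MapperEstimate`, `estimateMappers` and the 128 MB split size in the example are illustrative assumptions, not values read from any cluster configuration.

```scala
object MapperEstimate {
  /** Rough number of map tasks for one splittable input file:
    * one mapper per split, i.e. ceil(inputBytes / splitBytes), and at least 1. */
  def estimateMappers(inputBytes: Long, splitBytes: Long): Long = {
    require(inputBytes >= 0 && splitBytes > 0)
    math.max(1L, (inputBytes + splitBytes - 1) / splitBytes)
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val split = 128 * mb // assumed split size, for illustration only
    println(estimateMappers(50 * mb, split))   // prints 1: a small file stays in one split
    println(estimateMappers(1024 * mb, split)) // prints 8: 1 GB is spread over several splits
  }
}
```

If a small dataset has to be processed in parallel, the usual levers are a smaller split size in the job configuration or several input files instead of one.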
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/flinkbindings/flink-internals.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/flinkbindings/flink-internals.md
 
b/website/old_site_migration/needs_work_convenience/flinkbindings/flink-internals.md
new file mode 100644
index 0000000..8c8145a
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/flinkbindings/flink-internals.md
@@ -0,0 +1,50 @@
+---
+layout: default
+title: 
+theme:
+   name: retro-mahout
+---
+
+# Introduction
+
+This document provides an overview of how the Mahout Samsara environment is 
implemented over the Apache Flink backend engine. It also gives an overview of 
the code layout for the Flink backend engine, the source code for which can be 
found under the `flink/` directory in the Mahout codebase.
+
+Apache Flink is a distributed big data streaming engine that supports both 
Streaming and Batch interfaces. Batch processing is an extension of Flink's 
Stream processing engine.
+
+The Mahout Flink integration presently supports Flink's batch processing 
capabilities leveraging the DataSet API.
+
+The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a 
large matrix of numbers in-memory in a cluster by distributing logical rows 
among servers. Mahout's Scala DSL provides an abstract API on DRMs for backend 
engines to provide implementations of this API. An example is the Spark backend 
engine. Each engine has its own design of mapping the abstract API onto its 
data model and provides implementations for algebraic operators over that 
mapping.
+
+# Flink Overview
+
+Apache Flink is an open source, distributed Stream and Batch Processing 
Framework. At its core, Flink is a Stream Processing engine, and Batch 
processing is an extension of Stream Processing. 
+
+Flink includes several APIs for building applications with the Flink Engine:
+
+ <ol>
+<li><b>DataSet API</b> for Batch data in Java, Scala and Python</li>
+<li><b>DataStream API</b> for Stream Processing in Java and Scala</li>
+<li><b>Table API</b> with a SQL-like expression language in Java and 
Scala</li>
+<li><b>Gelly</b> Graph Processing API in Java and Scala</li>
+<li><b>CEP API</b>, a complex event processing library</li>
+<li><b>FlinkML</b>, a Machine Learning library</li>
+</ol>
+
+# Flink Environment Engine
+
+The Flink backend implements the abstract DRM as a Flink DataSet. A Flink job 
runs in the context of an ExecutionEnvironment (from the Flink Batch processing 
API).
+
+# Source Layout
+
+Within mahout.git, the top level directory `flink/` holds all the source code 
for the Flink backend engine. Sections of code that interface with the rest of 
the Mahout components are in Scala, and sections of the code that interface 
with the Flink DataSet API and implement algebraic operators are in Java. Here 
is a brief overview of what functionality can be found within the `flink/` 
folder.
+
+flink/ - top level directory containing all Flink related code
+
+flink/src/main/scala/org/apache/mahout/flinkbindings/blas/*.scala - Physical 
operator code for the Samsara DSL algebra
+
+flink/src/main/scala/org/apache/mahout/flinkbindings/drm/*.scala - Flink 
Dataset DRM and broadcast implementation
+
+flink/src/main/scala/org/apache/mahout/flinkbindings/io/*.scala - Read / Write 
between DRMDataSet and files on HDFS
+
+flink/src/main/scala/org/apache/mahout/flinkbindings/FlinkEngine.scala - DSL 
operator graph evaluator and various abstract API implementations for a 
distributed engine.
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/flinkbindings/playing-with-samsara-flink.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/flinkbindings/playing-with-samsara-flink.md
 
b/website/old_site_migration/needs_work_convenience/flinkbindings/playing-with-samsara-flink.md
new file mode 100644
index 0000000..4bbcd33
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/flinkbindings/playing-with-samsara-flink.md
@@ -0,0 +1,111 @@
+---
+layout: default
+title: 
+theme:
+   name: retro-mahout
+---
+
+## Getting Started 
+
+To get started, add the following dependency to the pom:
+
+    <dependency>
+      <groupId>org.apache.mahout</groupId>
+      <artifactId>mahout-flink_2.10</artifactId>
+      <version>0.12.0</version>
+    </dependency>
+
+Here is how to use the Flink backend:
+
+       import org.apache.flink.api.scala._
+       import org.apache.mahout.math.drm._
+       import org.apache.mahout.math.drm.RLikeDrmOps._
+       import org.apache.mahout.flinkbindings._
+
+       object ReadCsvExample {
+
+         def main(args: Array[String]): Unit = {
+           val filePath = "path/to/the/input/file"
+
+           val env = ExecutionEnvironment.getExecutionEnvironment
+           implicit val ctx = new FlinkDistributedContext(env)
+
+           val drm = readCsv(filePath, delim = "\t", comment = "#")
+           val C = drm.t %*% drm
+           println(C.collect)
+         }
+
+       }
+
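The expression `drm.t %*% drm` in the example computes A'A. A useful way to see why this distributes well is that A'A equals the sum of the outer products of the rows of A, so each partition can sum over its own rows and the partial sums can then be added. The sketch below shows that identity in plain Scala on in-memory vectors; `AtASketch` and its helpers are hypothetical and are not the Flink bindings' implementation.

```scala
object AtASketch {
  type Matrix = Vector[Vector[Double]]

  /** Outer product of a row with itself: an n x n matrix. */
  def outer(row: Vector[Double]): Matrix =
    row.map(x => row.map(y => x * y))

  /** Element-wise sum of two matrices of equal shape. */
  def add(a: Matrix, b: Matrix): Matrix =
    a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } }

  /** A'A accumulated one row at a time; partial sums from different
    * partitions could be combined with `add` in any order. */
  def ata(a: Matrix): Matrix = {
    val n = a.headOption.map(_.length).getOrElse(0)
    val zero: Matrix = Vector.fill(n, n)(0.0)
    a.map(outer).foldLeft(zero)(add)
  }

  def main(args: Array[String]): Unit = {
    val a: Matrix = Vector(Vector(1.0, 2.0), Vector(3.0, 4.0))
    println(ata(a)) // Vector(Vector(10.0, 14.0), Vector(14.0, 20.0))
  }
}
```

The result is symmetric, so an implementation only needs to accumulate the upper triangle and mirror it at the end.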
+## Current Status
+
+The top JIRA for Flink backend is 
[MAHOUT-1570](https://issues.apache.org/jira/browse/MAHOUT-1570) which has been 
fully implemented.
+
+### Implemented
+
+* [MAHOUT-1701](https://issues.apache.org/jira/browse/MAHOUT-1701) Mahout DSL 
for Flink: implement AtB ABt and AtA operators
+* [MAHOUT-1702](https://issues.apache.org/jira/browse/MAHOUT-1702) implement 
element-wise operators (like `A + 2` or `A + B`) 
+* [MAHOUT-1703](https://issues.apache.org/jira/browse/MAHOUT-1703) implement 
`cbind` and `rbind`
+* [MAHOUT-1709](https://issues.apache.org/jira/browse/MAHOUT-1709) implement 
slicing (like `A(1 to 10, ::)`)
+* [MAHOUT-1710](https://issues.apache.org/jira/browse/MAHOUT-1710) implement 
right in-core matrix multiplication (`A %*% B` when `B` is in-core) 
+* [MAHOUT-1711](https://issues.apache.org/jira/browse/MAHOUT-1711) implement 
broadcasting
+* [MAHOUT-1712](https://issues.apache.org/jira/browse/MAHOUT-1712) implement 
operators `At`, `Ax`, `Atx` - `Ax` and `At` are implemented
+* [MAHOUT-1734](https://issues.apache.org/jira/browse/MAHOUT-1734) implement 
I/O - should be able to read results of Flink bindings
+* [MAHOUT-1747](https://issues.apache.org/jira/browse/MAHOUT-1747) add support 
for different types of indexes (String, long, etc) - now supports `Int`, `Long` 
and `String`
+* [MAHOUT-1748](https://issues.apache.org/jira/browse/MAHOUT-1748) switch to 
Flink Scala API 
+* [MAHOUT-1749](https://issues.apache.org/jira/browse/MAHOUT-1749) Implement 
`Atx`
+* [MAHOUT-1750](https://issues.apache.org/jira/browse/MAHOUT-1750) Implement 
`ABt`
+* [MAHOUT-1751](https://issues.apache.org/jira/browse/MAHOUT-1751) Implement 
`AtA` 
+* [MAHOUT-1755](https://issues.apache.org/jira/browse/MAHOUT-1755) Flush 
intermediate results to FS - Flink, unlike Spark, does not store intermediate 
results in memory.
+* [MAHOUT-1764](https://issues.apache.org/jira/browse/MAHOUT-1764) Add 
standard backend tests for Flink
+* [MAHOUT-1765](https://issues.apache.org/jira/browse/MAHOUT-1765) Add 
documentation about Flink backend
+* [MAHOUT-1776](https://issues.apache.org/jira/browse/MAHOUT-1776) Refactor 
common Engine agnostic classes to Math-Scala module
+* [MAHOUT-1777](https://issues.apache.org/jira/browse/MAHOUT-1777) move 
HDFSUtil classes into the HDFS module
+* [MAHOUT-1804](https://issues.apache.org/jira/browse/MAHOUT-1804) Implement 
drmParallelizeWithRowLabels(..) in Flink
+* [MAHOUT-1805](https://issues.apache.org/jira/browse/MAHOUT-1805) Implement 
allReduceBlock(..) in Flink bindings
+* [MAHOUT-1809](https://issues.apache.org/jira/browse/MAHOUT-1809) Failing 
tests in flink-bindings: dals and dspca
+* [MAHOUT-1810](https://issues.apache.org/jira/browse/MAHOUT-1810) Failing 
test in flink-bindings: A + B Identically partitioned (mapBlock Checkpointing 
issue)
+* [MAHOUT-1812](https://issues.apache.org/jira/browse/MAHOUT-1812) Implement 
drmParallelizeWithEmptyLong(..) in flink bindings
+* [MAHOUT-1814](https://issues.apache.org/jira/browse/MAHOUT-1814) Implement 
drm2intKeyed in flink bindings
+* [MAHOUT-1815](https://issues.apache.org/jira/browse/MAHOUT-1815) 
dsqDist(X,Y) and dsqDist(X) failing in flink tests
+* [MAHOUT-1816](https://issues.apache.org/jira/browse/MAHOUT-1816) Implement 
newRowCardinality in CheckpointedFlinkDrm
+* [MAHOUT-1817](https://issues.apache.org/jira/browse/MAHOUT-1817) Implement 
caching in Flink Bindings
+* [MAHOUT-1818](https://issues.apache.org/jira/browse/MAHOUT-1818) dals test 
failing in Flink Bindings
+* [MAHOUT-1819](https://issues.apache.org/jira/browse/MAHOUT-1819) Set the 
default Parallelism for Flink execution in FlinkDistributedContext
+* [MAHOUT-1820](https://issues.apache.org/jira/browse/MAHOUT-1820) Add a 
method to generate `Tuple<PartitionId, Partition elements count>` to support 
Flink backend
+* [MAHOUT-1821](https://issues.apache.org/jira/browse/MAHOUT-1821) Use a 
mahout-flink-conf.yaml configuration file for Mahout specific Flink 
configuration
+* [MAHOUT-1822](https://issues.apache.org/jira/browse/MAHOUT-1822) Update 
NOTICE.txt, License.txt to add Apache Flink
+* [MAHOUT-1823](https://issues.apache.org/jira/browse/MAHOUT-1823) Modify 
MahoutFlinkTestSuite to implement FlinkTestBase
+* [MAHOUT-1824](https://issues.apache.org/jira/browse/MAHOUT-1824) Optimize 
FlinkOpAtA to use upper triangular matrices
+* [MAHOUT-1825](https://issues.apache.org/jira/browse/MAHOUT-1825) Add List of 
Flink algorithms to Mahout wiki page
+
+### Tests 
+
+There is a set of standard tests that all engines should pass (see 
[MAHOUT-1764](https://issues.apache.org/jira/browse/MAHOUT-1764)).  
+
+* `DistributedDecompositionsSuite` 
+* `DrmLikeOpsSuite` 
+* `DrmLikeSuite` 
+* `RLikeDrmOpsSuite` 
+
+
+These are Flink-backend specific tests, e.g.
+
+* `DrmLikeOpsSuite` for operations like `norm`, `rowSums`, `rowMeans`
+* `RLikeOpsSuite` for basic LA like `A.t %*% A`, `A.t %*% x`, etc
+* `LATestSuite` tests for specific operators like `AtB`, `Ax`, etc
+* `UseCasesSuite` has more complex examples, like power iteration, ridge 
regression, etc
+
+## Environment 
+
+For development the minimal supported configuration is 
+
+* [JDK 
1.7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html)
+* Scala 2.10
+
+When using Mahout, please import the following modules: 
+
+* `mahout-math`
+* `mahout-math-scala`
+* `mahout-flink_2.10`
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md
new file mode 100644
index 0000000..846a4ce
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md
@@ -0,0 +1,53 @@
+---
+layout: default
+title:
+theme:
+    name: retro-mahout
+---
+
+Notice:    Licensed to the Apache Software Foundation (ASF) under one
+           or more contributor license agreements.  See the NOTICE file
+           distributed with this work for additional information
+           regarding copyright ownership.  The ASF licenses this file
+           to you under the Apache License, Version 2.0 (the
+           "License"); you may not use this file except in compliance
+           with the License.  You may obtain a copy of the License at
+           .
+             http://www.apache.org/licenses/LICENSE-2.0
+           .
+           Unless required by applicable law or agreed to in writing,
+           software distributed under the License is distributed on an
+           "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+           KIND, either express or implied.  See the License for the
+           specific language governing permissions and limitations
+           under the License.
+
+# Bank Marketing Example
+
+### Introduction
+
+This page describes how to run Mahout's SGD classifier on the [UCI Bank 
Marketing dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing).
+The goal is to predict whether the client will subscribe to a term deposit 
offered via a phone call. The features in the dataset consist
+of information such as age, job and marital status, as well as information 
about the last contacts from the bank.
+
+### Code & Data
+
+The bank marketing example code lives under 
+
+*mahout-examples/src/main/java/org.apache.mahout.classifier.sgd.bankmarketing*
+
+The data can be found at 
+
+*mahout-examples/src/main/resources/bank-full.csv*
+
+### Code details
+
+This example consists of 3 classes:
+
+  - BankMarketingClassificationMain
+  - TelephoneCall
+  - TelephoneCallParser
+
+When you run the main method of BankMarketingClassificationMain, it parses the 
dataset using the TelephoneCallParser and trains
+a logistic regression model with 20 runs and 20 passes. The 
TelephoneCallParser uses Mahout's feature vector encoder
+to encode the features in the dataset into a vector. Afterwards the model is 
tested, and the learning rate, AUC and accuracy are printed to standard 
output.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md
new file mode 100644
index 0000000..51a5c74
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md
@@ -0,0 +1,147 @@
+---
+layout: default
+title:
+theme:
+    name: retro-mahout
+---
+
+# Naive Bayes
+
+
+## Intro
+
+Mahout currently has two Naive Bayes implementations.  The first is standard 
Multinomial Naive Bayes. The second is an implementation of Transformed 
Weight-normalized Complement Naive Bayes as introduced by Rennie et al. 
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to 
the former as Bayes and the latter as CBayes.
+
+Where Bayes has long been a standard in text classification, CBayes is an 
extension of Bayes that performs particularly well on datasets with skewed 
classes and has been shown to be competitive with algorithms of higher 
complexity such as Support Vector Machines. 
+
+
+## Implementations
+Both Bayes and CBayes are currently trained via MapReduce Jobs. Testing and 
classification can be done via a MapReduce Job or sequentially.  Mahout 
provides CLI drivers for preprocessing, training and testing. A Spark 
implementation is currently in the works 
([MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)).
+
+## Preprocessing and Algorithm
+
+As described in 
[[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive 
Bayes is broken down into the following steps (assignments are over all 
possible index values):  
+
+- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; 
`\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
+- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels.
+- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary; 
let `\(\alpha=\sum_i{\alpha_i}\)`. 
+- **Preprocessing** (via `seq2sparse`): TF-IDF transformation and L2 length 
normalization of `\(\vec{d}\)`
+    1. `\(d_{ij} = \sqrt{d_{ij}}\)` 
+    2. `\(d_{ij} = 
d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)` 
+    3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)` 
+- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights 
`\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)`
+    2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
+- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights 
`\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq 
c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)`
+    2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
+    3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
+- **Label Assignment/Testing:**
+    1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document; let `\(t_i\)` be 
the count of word `\(i\)` in it.
+    2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i 
w_{ci}\)`
+
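The three preprocessing steps listed above can be written out as a minimal sketch in plain Scala. It is not the `seq2sparse` implementation: `NbPreprocess` and `tfIdfL2` are hypothetical names, the data is indexed as `docs(j)(i)` (document first, then word), and `\(\delta_{ik}\)` is read as "word `i` occurs in document `k`", which is an interpretation of the formula rather than something the text states.

```scala
object NbPreprocess {
  // docs(j)(i) is the raw count of word i in document j
  type Docs = Vector[Vector[Double]]

  def tfIdfL2(docs: Docs): Docs = {
    val nDocs = docs.length
    val nWords = docs.headOption.map(_.length).getOrElse(0)
    // Document frequency: in how many documents does word i occur at all?
    val df = Vector.tabulate(nWords)(i => docs.count(d => d(i) > 0))
    docs.map { d =>
      val weighted = d.zipWithIndex.map { case (count, i) =>
        val tf = math.sqrt(count)                             // step 1
        tf * (math.log(nDocs.toDouble / (df(i) + 1)) + 1.0)   // step 2
      }
      val norm = math.sqrt(weighted.map(x => x * x).sum)      // step 3
      if (norm == 0.0) weighted else weighted.map(_ / norm)   // leave empty documents as-is
    }
  }

  def main(args: Array[String]): Unit = {
    val docs: Docs = Vector(Vector(4.0, 0.0), Vector(1.0, 1.0))
    tfIdfL2(docs).foreach(println)
  }
}
```

After this transformation every non-empty document has unit L2 length, so long documents no longer dominate the per-class sums used in training.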
+As we can see, the main difference between Bayes and CBayes is the weight 
calculation step.  Where Bayes weighs terms more heavily based on the 
likelihood that they belong to class `\(c\)`, CBayes seeks to maximize term 
weights on the likelihood that they do not belong to any other class.  
+
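The difference between the two weight calculations is easier to see side by side. The following sketch implements the Bayes and CBayes training formulas and the label assignment rule from the list above in plain Scala. `NbWeights` and its methods are hypothetical names, the same smoothing value `alphaI` is assumed for every word (so `\(\alpha\)` is `alphaI` times the vocabulary size), and class labels are assumed to be integers `0 until nClasses`. It is not Mahout's `trainnb` code.

```scala
object NbWeights {
  // docs(j)(i): (preprocessed) weight of word i in document j
  type Docs = Vector[Vector[Double]]

  /** Per-word totals over the documents whose label satisfies `keep`. */
  private def totals(docs: Docs, labels: Vector[Int], nWords: Int)(keep: Int => Boolean): Vector[Double] =
    Vector.tabulate(nWords)(i => docs.zip(labels).collect { case (d, y) if keep(y) => d(i) }.sum)

  /** Bayes: w_ci = log(theta_ci), with theta estimated from documents in class c. */
  def bayes(docs: Docs, labels: Vector[Int], nClasses: Int, alphaI: Double = 1.0): Vector[Vector[Double]] = {
    val nWords = docs.head.length
    Vector.tabulate(nClasses) { c =>
      val t = totals(docs, labels, nWords)(_ == c)
      val denom = t.sum + alphaI * nWords
      t.map(x => math.log((x + alphaI) / denom))
    }
  }

  /** CBayes: theta estimated from documents NOT in class c, negated log, then L1-normalized. */
  def cbayes(docs: Docs, labels: Vector[Int], nClasses: Int, alphaI: Double = 1.0): Vector[Vector[Double]] = {
    val nWords = docs.head.length
    Vector.tabulate(nClasses) { c =>
      val t = totals(docs, labels, nWords)(_ != c)
      val denom = t.sum + alphaI * nWords
      val w = t.map(x => -math.log((x + alphaI) / denom))
      val l1 = w.map(x => math.abs(x)).sum
      w.map(_ / l1)
    }
  }

  /** Label assignment: argmax over classes c of sum_i t_i * w_ci. */
  def classify(weights: Vector[Vector[Double]], t: Vector[Double]): Int =
    weights.indices.maxBy(c => weights(c).zip(t).map { case (w, x) => w * x }.sum)

  def main(args: Array[String]): Unit = {
    val docs: Docs = Vector(Vector(3.0, 0.0), Vector(0.0, 2.0))
    val labels = Vector(0, 1)
    val test = Vector(1.0, 0.0)
    println(classify(bayes(docs, labels, nClasses = 2), test))  // prints 0
    println(classify(cbayes(docs, labels, nClasses = 2), test)) // prints 0
  }
}
```

Both variants share `classify`; only the way the weight table is estimated differs, which matches the observation above that the weight calculation step is the main difference.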
+## Running from the command line
+
+Mahout provides CLI drivers for all above steps.  Here we will give a simple 
overview of Mahout CLI commands used to preprocess the data, train the model 
and assign labels to the training set. An [example 
script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
 is given for the full process from data acquisition through classification of 
the classic [20 Newsgroups 
corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html). 
 
+
+- **Preprocessing:**
+For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the 
[mahout 
seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html)
 command performs the TF-IDF transformations (-wt tfidf option) and L2 length 
normalization (-n 2 option) as follows:
+
+        mahout seq2sparse \
+          -i ${PATH_TO_SEQUENCE_FILES} \
+          -o ${PATH_TO_TFIDF_VECTORS} \
+          -nv \
+          -n 2 \
+          -wt tfidf
+
+- **Training:**
+The model is then trained using `mahout trainnb` .  The default is to train a 
Bayes model. The -c option is given to train a CBayes model:
+
+        mahout trainnb \
+          -i ${PATH_TO_TFIDF_VECTORS} \
+          -o ${PATH_TO_MODEL}/model \
+          -li ${PATH_TO_MODEL}/labelindex \
+          -ow \
+          -c
+
+- **Label Assignment/Testing:**
+Classification and testing on a holdout set can then be performed via `mahout 
testnb`. Again, the -c option indicates that the model is CBayes.  The -seq 
option tells `mahout testnb` to run sequentially:
+
+        mahout testnb \
+          -i ${PATH_TO_TFIDF_TEST_VECTORS} \
+          -m ${PATH_TO_MODEL}/model \
+          -l ${PATH_TO_MODEL}/labelindex \
+          -ow \
+          -o ${PATH_TO_OUTPUT} \
+          -c \
+          -seq
+
+## Command line options
+
+- **Preprocessing:**
+  
+  Only the parameters relevant to Bayes/CBayes as detailed above are shown. 
Several other transformations can be performed by `mahout seq2sparse` and used 
as input to Bayes/CBayes.  For a full list of `mahout seq2sparse` options see 
the [Creating vectors from 
text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) 
page.
+
+        mahout seq2sparse
+          --output (-o) output             The directory pathname for output.
+          --input (-i) input               Path to job input directory.
+          --weight (-wt) weight            The kind of weight to use. Currently TF
+                                               or TFIDF. Default: TFIDF
+          --norm (-n) norm                 The norm to use, expressed as either a
+                                               float or "INF" if you want to use the
+                                               Infinite norm.  Must be greater or equal
+                                               to 0.  The default is not to normalize
+          --overwrite (-ow)                If set, overwrite the output directory
+          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should
+                                               be SequentialAccessVectors. If set true
+                                               else false
+          --namedVector (-nv)              (Optional) Whether output vectors should
+                                               be NamedVectors. If set true else false
+
+- **Training:**
+
+        mahout trainnb
+          --input (-i) input               Path to job input directory.
+          --output (-o) output             The directory pathname for output.
+          --alphaI (-a) alphaI             Smoothing parameter. Default is 1.0
+          --trainComplementary (-c)        Train complementary? Default is false.
+          --labelIndex (-li) labelIndex    The path to store the label index in
+          --overwrite (-ow)                If present, overwrite the output directory
+                                           before running job
+          --help (-h)                      Print out help
+          --tempDir tempDir                Intermediate output directory
+          --startPhase startPhase          First phase to run
+          --endPhase endPhase              Last phase to run
+
+- **Testing:**
+
+        mahout testnb
+          --input (-i) input               Path to job input directory.
+          --output (-o) output             The directory pathname for output.
+          --overwrite (-ow)                If present, overwrite the output directory
+                                           before running job
+          --model (-m) model               The path to the model built during training
+          --testComplementary (-c)         Test complementary? Default is false.
+          --runSequential (-seq)           Run sequential?
+          --labelIndex (-l) labelIndex     The path to the location of the label index
+          --help (-h)                      Print out help
+          --tempDir tempDir                Intermediate output directory
+          --startPhase startPhase          First phase to run
+          --endPhase endPhase              Last phase to run
+
+
+## Examples
+
+Mahout provides an example for Naive Bayes classification:
+
+1. [Classify 20 Newsgroups](twenty-newsgroups.html)
+ 
+## References
+
+[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003). 
[Tackling the Poor Assumptions of Naive Bayes Text 
Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). 
Proceedings of the Twentieth International Conference on Machine Learning 
(ICML-2003).
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md
new file mode 100644
index 0000000..d8d049e
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md
@@ -0,0 +1,67 @@
+---
+layout: default
+title: Breiman Example
+theme:
+    name: retro-mahout
+---
+
+# Breiman Example
+
+#### Introduction
+
+This page describes how to run the Breiman example, which implements the test 
procedure described in [Leo Breiman's 
paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3999&rep=rep1&type=pdf).
 The basic algorithm is as follows:
+
+ * repeat *I* iterations
+ * in each iteration do
+    * keep 10% of the dataset apart as a testing set
+    * build two forests using the training set, one with *m = int(log2(M) + 1)* (called Random-Input) and one with *m = 1* (called Single-Input)
+    * choose the forest that gave the lowest out-of-bag (oob) error estimate to compute the test set error
+    * compute the test set error using the Single-Input forest (test error); this demonstrates that even with *m = 1*, Decision Forests give results comparable to those obtained with greater values of *m*
+    * compute the mean test set error using every tree of the chosen forest (tree error); this should indicate how well a single Decision Tree performs
+ * compute the mean test error for all iterations
+ * compute the mean tree error for all iterations
+
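+Two of the numeric choices in the procedure above can be sketched in a few lines of Python. This is only an illustration, not the code of `BreimanExample`: it assumes *M* is the number of input attributes, and the helper names and the row count of 214 are hypothetical.
+

```python
import math
import random


def random_input_m(num_attributes):
    """m = int(log2(M) + 1), the value of m used for the Random-Input forest."""
    return int(math.log2(num_attributes) + 1)


def holdout_split(rows, test_fraction=0.1, seed=0):
    """Shuffle the rows and keep test_fraction of them apart as a testing set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# 214 rows is an illustrative dataset size, not a value taken from this page.
train, test = holdout_split(range(214))
print(random_input_m(9), random_input_m(60), len(train), len(test))
```

+With 9 numerical attributes the Random-Input forest would use *m = 4*, and with 60 attributes *m = 6*; the Single-Input forest always uses *m = 1*.
+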
+
+#### Running the Example
+
+The current implementation is compatible with the [UCI 
repository](http://archive.ics.uci.edu/ml/) file format. We'll show how to run 
this example on two datasets:
+
+First, we deal with [Glass 
Identification](http://archive.ics.uci.edu/ml/datasets/Glass+Identification): 
download the 
[dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data)
 file called **glass.data** and store it onto your local machine. Next, we must 
generate the descriptor file **glass.info** for this dataset with the following 
command:
+
+    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/glass.data -f /path/to/glass.info -d I 9 N L
+
+Substitute */path/to/* with the folder where you downloaded the dataset. The 
argument "I 9 N L" indicates the nature of the variables: here it means 1
+ignored (I) attribute, followed by 9 numerical (N) attributes, followed by
+the label (L).
+
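+To make the descriptor notation concrete, the following Python sketch expands a descriptor string into one type code per attribute. It only illustrates how to read the notation (a number repeats the type code that follows it); it is not the parser used by the `Describe` tool, and the function name is hypothetical.
+

```python
def expand_descriptor(descriptor):
    """Expand a descriptor such as "I 9 N L" into one type code per attribute."""
    types = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            # A number applies to the type code that follows it.
            repeat = int(token)
        else:
            types.extend([token] * repeat)
            repeat = 1
    return types


print(expand_descriptor("I 9 N L"))
```

+For "I 9 N L" this yields 11 entries: one ignored attribute, nine numerical attributes and the label.
+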
+Finally, we build and evaluate our random forest classifier as follows:
+
+    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/glass.data -ds /path/to/glass.info -i 10 -t 100
+
+This builds 100 trees (-t argument) and repeats the test for 10 iterations (-i
+argument).
+
+The example outputs the following results:
+
+ * Selection error: mean test error for the selected forest on all iterations
+ * Single Input error: mean test error for the single input forest on all
+iterations
+ * One Tree error: mean single tree error on all iterations
+ * Mean Random Input Time: mean build time for random input forests on all
+iterations
+ * Mean Single Input Time: mean build time for single input forests on all
+iterations
+
+We can repeat this for a 
[Sonar](http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar,+Mines+vs.+Rocks%29)
 use case: download the 
[dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data)
 file called **sonar.all-data** and store it onto your local machine. Generate 
the descriptor file **sonar.info** for this dataset with the following command:
+
+    bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/sonar.all-data -f /path/to/sonar.info -d 60 N L
+
+The argument "60 N L" means 60 numerical (N) attributes, followed by the label 
(L). Analogous to the previous case, we run the evaluation as follows:
+
+    bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/sonar.all-data -ds /path/to/sonar.info -i 10 -t 100
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md
new file mode 100644
index 0000000..a24cc14
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md
@@ -0,0 +1,155 @@
+---
+layout: default
+title: Class Discovery
+theme:
+    name: retro-mahout
+---
+<a name="ClassDiscovery-ClassDiscovery"></a>
+# Class Discovery
+
+See http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-417.pdf
+
+CDGA uses a Genetic Algorithm to discover a classification rule for a given
+dataset. 
+A dataset can be seen as a table:
+
+<table>
+<tr><th> </th><th>attribute 1</th><th>attribute 
2</th><th>...</th><th>attribute N</th></tr>
+<tr><td>row 
1</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+<tr><td>row 
2</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>
+<tr><td>row 
M</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+</table>
+
+An attribute can be numerical, for example a "temperature" attribute, or
+categorical, for example a "color" attribute. For classification purposes,
+one of the categorical attributes is designated as a *label*, which means
+that its value defines the *class* of the rows.
+A classification rule can be represented as follows:
+<table>
+<tr><th> </th><th>attribute 1</th><th>attribute 
2</th><th>...</th><th>attribute N</th></tr>
+<tr><td>weight</td><td>w1</td><td>w2</td><td>...</td><td>wN</td></tr>
+<tr><td>operator</td><td>op1</td><td>op2</td><td>...</td><td>opN</td></tr>
+<tr><td>value</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr>
+</table>
+
+For a given *target* class and a weight *threshold*, the classification
+rule can be read as follows:
+
+
+    for each row of the dataset
+      if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1 rule.value1)) &&
+         (rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2 rule.value2)) &&
+         ...
+         (rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN rule.valueN)) then
+        row is part of the target class
+
+
+*Important:* The label attribute is not evaluated by the rule.
+
+The threshold parameter allows some conditions of the rule to be skipped if
+their weight is too small. The operators available depend on the attribute
+types:
+* for numerical attributes, the available operators are '<' and '>='
+* for categorical attributes, the available operators are '!=' and '=='
+
+The "threshold" and "target" are user defined parameters. Because the
+label is always a categorical attribute, the target is the (zero based)
+index of the class label value in all the possible values of the label. For
+example, if the label attribute can have the following values (blue, brown,
+green), then a target of 1 means the "brown" class.
+
+For example, we have the following dataset (the label attribute is "Eyes
+Color"):
+<table>
+<tr><th> </th><th>Age</th><th>Eyes Color</th><th>Hair Color</th></tr>
+<tr><td>row 1</td><td>16</td><td>brown</td><td>dark</td></tr>
+<tr><td>row 2</td><td>25</td><td>green</td><td>light</td></tr>
+<tr><td>row 3</td><td>12</td><td>blue</td><td>light</td></tr>
+</table>
+
+and a classification rule:
+
+<table>
+<tr><th> </th><th>Age</th><th>Hair Color</th></tr>
+<tr><td>weight</td><td>0</td><td>1</td></tr>
+<tr><td>operator</td><td>&lt;</td><td>!=</td></tr>
+<tr><td>value</td><td>20</td><td>light</td></tr>
+</table>
+
+and the following parameters: threshold = 1 and target = 0 (brown).
+
+This rule can be read as follows:
+
+    for each row of the dataset
+      if (0 < 1 || (0 >= 1 && row.value1 < 20)) &&
+         (1 < 1 || (1 >= 1 && row.value2 != light)) then
+        row is part of the "brown Eye Color" class
+
+
+Please note how the rule skipped the label attribute (Eye Color), and how
+the first condition is ignored because its weight is < threshold.
+
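+The worked example above can be checked with a short Python sketch. This is an illustration of how the rule is read, not CDGA's implementation; the tuple layout of the rule and the helper name are assumptions made for the example.
+

```python
import operator

OPERATORS = {"<": operator.lt, ">=": operator.ge, "!=": operator.ne, "==": operator.eq}


def matches(rule, row, threshold):
    """Return True when every condition that is not skipped holds for the row."""
    for weight, op, value, attribute in rule:
        if weight < threshold:
            continue  # condition skipped: its weight is below the threshold
        if not OPERATORS[op](row[attribute], value):
            return False
    return True


# The label attribute (Eyes Color) is left out because the rule never evaluates it.
rows = [
    {"Age": 16, "Hair Color": "dark"},
    {"Age": 25, "Hair Color": "light"},
    {"Age": 12, "Hair Color": "light"},
]
# One (weight, operator, value, attribute) tuple per non-label attribute.
rule = [(0, "<", 20, "Age"), (1, "!=", "light", "Hair Color")]

selected = [i + 1 for i, row in enumerate(rows) if matches(rule, row, threshold=1)]
print(selected)
```

+Only row 1 satisfies the rule: the Age condition is skipped for every row, and row 1 is the only one whose Hair Color is not "light".
+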
+<a name="ClassDiscovery-Runningtheexample:"></a>
+# Running the example:
+NOTE: Substitute in the appropriate version for the Mahout JOB jar
+
+1. `cd <MAHOUT_HOME>/examples`
+1. `ant job`
+1. `<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc wdbc`
+1. `<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc.infos wdbc.infos`
+1. `<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.ga.watchmaker.cd.CDGA <MAHOUT_HOME>/examples/src/test/resources/wdbc 1 0.9 1 0.033 0.1 0 100 10`
+
+    CDGA needs 9 parameters:
+    * param 1 : path of the directory that contains the dataset and its infos
+file
+    * param 2 : target class
+    * param 3 : threshold
+    * param 4 : number of crossover points for the multi-point crossover
+    * param 5 : mutation rate
+    * param 6 : mutation range
+    * param 7 : mutation precision
+    * param 8 : population size
+    * param 9 : number of generations before the program stops
+    
+    For more information about the 4th parameter, please see [Multi-point Crossover](http://www.geatbx.com/docu/algindex-03.html#P616_36571).
+    For a detailed explanation of the 5th, 6th and 7th parameters, please see [Real Valued Mutation](http://www.geatbx.com/docu/algindex-04.html#P659_42386).
+    
+    *TODO*: Fill in where to find the output and what it means.
+    
+# The info file
+
+To run properly, CDGA needs some information about the dataset. Each
+dataset should be accompanied by an .infos file that contains the needed
+information. For each attribute, a corresponding line in the info file
+describes it. The line can be one of the following:
+
+* IGNORED
+  if the attribute is ignored
+* LABEL, val1, val2,...
+  if the attribute is the label (class), and its possible values
+* CATEGORICAL, val1, val2,...
+  if the attribute is categorical (nominal), and its possible values
+* NUMERICAL, min, max
+  if the attribute is numerical, and its min and max values
+
+This file can be generated automatically using a special tool available with
+CDGA.
+    
+
+
+* The tool searches for an existing infos file (*must be filled by the
+user*), in the same directory as the dataset, with the same name and with
+the ".infos" extension, that contains the type of the attributes:
+    * 'N' numerical attribute
+    * 'C' categorical attribute
+    * 'L' label (this is also a categorical attribute)
+    * 'I' to ignore the attribute
+
+  each attribute is in a separate 
+* A Hadoop job is used to parse the dataset and collect the information.
+This means that *the dataset can be distributed over HDFS*.
+* The results are written back in the same .info file, with the correct
+format needed by CDGA.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/classifyingyourdata.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/classification/classifyingyourdata.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/classifyingyourdata.md
new file mode 100644
index 0000000..c2099c0
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/classifyingyourdata.md
@@ -0,0 +1,27 @@
+---
+layout: default
+title: ClassifyingYourData
+theme:
+    name: retro-mahout
+---
+
+# Classifying data from the command line
+
+
+After you've done the [Quickstart](../basics/quickstart.html) and are familiar 
with the basics of Mahout, it is time to build a
+classifier from your own data. The following pieces *may* be useful for 
getting started:
+
+<a name="ClassifyingYourData-Input"></a>
+# Input
+
+For starters, you will need your data in an appropriate Vector format: See 
[Creating Vectors](../basics/creating-vectors.html) as well as [Creating 
Vectors from Text](../basics/creating-vectors-from-text.html).
+
+<a name="ClassifyingYourData-RunningtheProcess"></a>
+# Running the Process
+
+* Logistic regression [background](logistic-regression.html)
+* [Naive Bayes background](naivebayes.html) and 
[commandline](bayesian-commandline.html) options.
+* [Complementary naive bayes background](complementary-naive-bayes.html), 
[design](https://issues.apache.org/jira/browse/mahout-60.html), and 
[c-bayes-commandline](c-bayes-commandline.html)
+* [Random Forests 
Classification](https://cwiki.apache.org/confluence/display/MAHOUT/Random+Forests)
 comes with a [Breiman example](breiman-example.html). There is some really 
great documentation
+over at [Mark Needham's 
blog](http://www.markhneedham.com/blog/2012/10/27/kaggle-digit-recognizer-mahout-random-forest-attempt/).
 Also check out the description on [Xiaomeng Shawn Wan's](http://shawnwan.wordpress.com/2012/06/01/mahout-0-7-random-forest-examples/)
 blog.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/collocations.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/classification/collocations.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/collocations.md
new file mode 100644
index 0000000..f107850
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/classification/collocations.md
@@ -0,0 +1,385 @@
+---
+layout: default
+title: Collocations
+theme:
+    name: retro-mahout
+---
+
+
+
+<a name="Collocations-CollocationsinMahout"></a>
+# Collocations in Mahout
+
+A collocation is defined as a sequence of words or terms which co-occur
+more often than would be expected by chance. Statistically relevant
+combinations of terms identify additional lexical units which can be
+treated as features in a vector-based representation of a text. A detailed
+discussion of collocations can be found on 
[Wikipedia](http://en.wikipedia.org/wiki/Collocation).
+
+See also the more detailed discussion of collocations in the [Reuters 
example](http://comments.gmane.org/gmane.comp.apache.mahout.user/5685).
+
+<a name="Collocations-Log-LikelihoodbasedCollocationIdentification"></a>
+## Theory behind implementation: Log-Likelihood based Collocation 
Identification
+
+Mahout provides an implementation of a collocation identification algorithm
+which scores collocations using the log-likelihood ratio (LLR). The log-likelihood
+score indicates the relative usefulness of a collocation compared with other
+term combinations in the text. Collocations with the highest scores in a
+particular corpus will generally be more useful as features.
+
+Calculating the LLR is very straightforward and is described concisely in
+[Ted Dunning's blog 
post](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html).
+Ted describes the series of counts required to calculate the LLR for two
+events A and B in order to determine if they co-occur more often than pure
+chance. These counts include the number of times the events co-occur (k11),
+the number of times the events occur without each other (k12 and k21), and
+the number of times neither event occurs (k22). These counts are summarized in the
+following table:
+
+<table>
+<tr><td> </td><td> Event A </td><td> Everything but Event A </td></tr>
+<tr><td> Event B </td><td> A and B together (k11) </td><td>  B but not A (k12) 
</td></tr>
+<tr><td> Everything but Event B </td><td> A but not B (k21) </td><td> Neither 
B nor A (k22) </td></tr>
+</table>
+
+For the purposes of collocation identification, it is useful to begin by
+thinking in word pairs, bigrams. In this case the leading or head term from
+the pair corresponds to A from the table above, B corresponds to the
+trailing or tail term, while neither B nor A is the total number of word
+pairs in the corpus less those containing B, A or both B and A.
+
+Given the word pair 'oscillation overthruster', the log-likelihood ratio
+is computed by looking at the number of occurrences of that word pair in the
+corpus, the number of word pairs that begin with 'oscillation' but end with
+something other than 'overthruster', the number of word pairs that end with
+'overthruster' but begin with something other than 'oscillation', and the number
+of word pairs in the corpus that contain neither 'oscillation' nor
+'overthruster'.
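+
+As an illustration of how the four counts combine into a score, here is a minimal Python sketch of the standard log-likelihood ratio (G²) for a 2x2 table, 2 * sum(k_ij * ln(k_ij / expected_ij)). It is a sketch under that textbook definition, not a copy of Mahout's code; the function name and the sample counts are hypothetical.
+

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table of counts."""
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    score = 0.0
    for count, r, c in cells:
        if count > 0:  # a zero cell contributes nothing to the sum
            # Count expected in this cell if the two events were independent.
            expected = rows[r] * cols[c] / total
            score += count * math.log(count / expected)
    return 2.0 * score


print(llr(10, 10, 10, 10))   # counts consistent with independence
print(llr(100, 1, 1, 100))   # events that almost always occur together
```

+When the counts are exactly what independence predicts, the score is 0; the more strongly the two terms are associated, the larger the score.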
+
+This can be extended from bigrams to trigrams, 4-grams and beyond. In these
+cases, the current algorithm uses the first token of the ngram as the head
+of the ngram and the remaining n-1 tokens from the ngram, the n-1gram as it
+were, as the tail. Given the trigram 'hong kong cavaliers', 'hong' is
+treated as the head while 'kong cavaliers' is treated as the tail. Future
+versions of this algorithm will allow for variations in which tokens of the
+ngram are treated as the head and tail.
+
+Beyond ngrams, it is often useful to inspect cases where individual words
+occur around other interesting features of the text such as sentence
+boundaries.
+
+<a name="Collocations-GeneratingNGrams"></a>
+## Generating NGrams
+
+The tools in which the collocation identification algorithm is embedded
+either consume tokenized text as input or provide the ability to specify an
+implementation of the Lucene Analyzer class to perform tokenization in order
+to form ngrams. The tokens are passed through a Lucene ShingleFilter to
+produce NGrams of the desired length. 
+
+Given the text "Alice was beginning to get very tired" as an example,
+Lucene's StandardAnalyzer produces the tokens 'alice', 'beginning', 'get',
+'very' and 'tired', while the ShingleFilter with a max NGram size set to 3
+produces the shingles 'alice beginning', 'alice beginning get', 'beginning
+get', 'beginning get very', 'get very', 'get very tired' and 'very tired'.
+Note that both bigrams and trigrams are produced here. A future enhancement
+to the existing algorithm would involve limiting the output to a particular
+gram size as opposed to solely specifying a max ngram size.
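+
+The shingling step can be mimicked in plain Python to see where the list above comes from. This sketch does not call Lucene; it starts from the already-analyzed tokens given in the example, and the function name is hypothetical.
+

```python
def shingles(tokens, max_size, min_size=2):
    """Return every run of min_size..max_size consecutive tokens, joined by spaces."""
    result = []
    for start in range(len(tokens)):
        for size in range(min_size, max_size + 1):
            if start + size <= len(tokens):
                result.append(" ".join(tokens[start:start + size]))
    return result


# Tokens as listed above for "Alice was beginning to get very tired".
tokens = ["alice", "beginning", "get", "very", "tired"]
print(shingles(tokens, max_size=3))
```

+With a max size of 3 this produces the same seven bigrams and trigrams listed in the paragraph above.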
+
+<a name="Collocations-RunningtheCollocationIdentificationAlgorithm."></a>
+## Running the Collocation Identification Algorithm.
+
+There are a couple of ways to run the LLR-based collocation algorithm in
+Mahout:
+
+<a name="Collocations-Whencreatingvectorsfromasequencefile"></a>
+### When creating vectors from a sequence file
+
+The LLR collocation identifier is integrated into the process that is used
+to create vectors from sequence files of text keys and values. Collocations
+are generated when the --maxNGramSize (-ng) option is left unspecified (it
+defaults to 2) or is set to 2 or greater. The --minLLR option
+can be used to control the cutoff that prevents collocations below the
+specified LLR score from being emitted, and the --minSupport argument can
+be used to filter out collocations that appear fewer than a certain number of
+times. 
+
+
+    bin/mahout seq2sparse
+
+    Usage:
+         [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize>
+          --output <output> --input <input> --minDF <minDF>
+          --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR>
+          --numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help
+          --sequentialAccessVector]
+    Options
+
+      --minSupport (-s) minSupport        (Optional) Minimum Support. Default Value: 2
+
+      --analyzerName (-a) analyzerName    The class name of the analyzer
+
+      --chunkSize (-chunk) chunkSize      The chunkSize in MegaBytes. 100-10000MB
+
+      --output (-o) output                The output directory
+
+      --input (-i) input                  Input dir containing the documents in sequence file format
+
+      --minDF (-md) minDF                 The minimum document frequency. Default is 1
+
+      --maxDFPercent (-x) maxDFPercent    The max percentage of docs for the DF. Can be used to remove
+                                          really high frequency terms. Expressed as an
+                                          integer between 0 and 100. Default is 99.
+
+      --weight (-wt) weight               The kind of weight to use. Currently TF
+                                          or TFIDF
+
+      --norm (-n) norm                    The norm to use, expressed as either a
+                                          float or "INF" if you want to use the
+                                          Infinite norm.  Must be greater or equal
+                                          to 0.  The default is not to normalize
+
+      --minLLR (-ml) minLLR               (Optional) The minimum Log Likelihood
+                                          Ratio (Float). Default is 1.0
+
+      --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
+                                          Default Value: 1
+
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
+                                          create (2 = bigrams, 3 = trigrams, etc)
+                                          Default Value: 2
+
+      --overwrite (-w)                    If set, overwrite the output directory
+      --help (-h)                         Print out help
+      --sequentialAccessVector (-seq)     (Optional) Whether output vectors should
+                                          be SequentialAccessVectors. If set true
+                                          else false
+
+
+<a name="Collocations-CollocDriver"></a>
+### CollocDriver
+
+
+    bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver
+
+    Usage:
+     [--input <input> --output <output> --maxNGramSize <ngramSize> --overwrite
+      --minSupport <minSupport> --minLLR <minLLR> --numReducers <numReducers>
+      --analyzerName <analyzerName> --preprocess --unigram --help]
+
+    Options
+
+      --input (-i) input                  The Path for input files.
+
+      --output (-o) output                The Path write output to
+
+      --maxNGramSize (-ng) ngramSize      (Optional) The maximum size of ngrams to
+                                          create (2 = bigrams, 3 = trigrams, etc)
+                                          Default Value: 2
+
+      --overwrite (-w)                    If set, overwrite the output directory
+
+      --minSupport (-s) minSupport        (Optional) Minimum Support. Default
+                                          Value: 2
+
+      --minLLR (-ml) minLLR               (Optional) The minimum Log Likelihood
+                                          Ratio (Float). Default is 1.0
+
+      --numReducers (-nr) numReducers     (Optional) Number of reduce tasks.
+                                          Default Value: 1
+
+      --analyzerName (-a) analyzerName    The class name of the analyzer
+
+      --preprocess (-p)                   If set, input is SequenceFile<Text,Text>
+                                          where the value is the document, which
+                                          will be tokenized using the specified
+                                          analyzer.
+
+      --unigram (-u)                      If set, unigrams will be emitted in the
+                                          final output alongside collocations
+
+      --help (-h)                         Print out help
+
+
+<a name="Collocations-Algorithmdetails"></a>
+## Algorithm details
+
+This section describes the implementation of the collocation identification
+algorithm in terms of the map-reduce phases that are used to generate
+ngrams and count the frequencies required to perform the log-likelihood
+calculation. Unless otherwise noted, classes that are indicated in
+CamelCase can be found in the mahout-utils module under the package
+org.apache.mahout.utils.nlp.collocations.llr
+
+The algorithm is implemented in two map-reduce passes:
+
+<a name="Collocations-Pass1:CollocDriver.generateCollocations(...)"></a>
+### Pass 1: CollocDriver.generateCollocations(...)
+
+Generates NGrams and counts frequencies for ngrams, head and tail subgrams.
+
+<a name="Collocations-Map:CollocMapper"></a>
+#### Map: CollocMapper
+
+Input k: Text (documentId), v: StringTuple (tokens) 
+
+Each call to the mapper passes in the full set of tokens for the
+corresponding document using a StringTuple. The ShingleFilter is run across
+these tokens to produce ngrams of the desired length. ngrams and
+frequencies are collected across the entire document.
+
+Once this is done, ngrams are split into head and tail portions. A key of type 
GramKey is generated which is used later to join ngrams with their heads and 
tails in the reducer phase. The GramKey is a composite key made up of a string 
n-gram fragment as the primary key and a secondary key used for grouping and 
sorting in the reduce phase. The secondary key will either be EMPTY in the case 
where we are collecting either the head or tail of an ngram as the value or it 
will contain the byte[] form of the ngram when collecting an ngram as the value.
+
+
+    head_key(EMPTY) -> (head subgram, head frequency)
+
+    head_key(ngram) -> (ngram, ngram frequency) 
+
+    tail_key(EMPTY) -> (tail subgram, tail frequency)
+
+    tail_key(ngram) -> (ngram, ngram frequency)
+
+
+subgram and ngram values are packaged in Gram objects.
+
+For each ngram found, the Count.NGRAM_TOTAL counter is incremented. When
+the pass is complete, this counter will hold the total number of ngrams
+encountered in the input which is used as a part of the LLR calculation.
+
+Output k: GramKey (head or tail subgram), v: Gram (head, tail or ngram with
+frequency)
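+
+The four kinds of records listed above can be pictured with a small Python sketch. It is a simplification under stated assumptions: plain tuples stand in for GramKey and Gram, an empty string stands in for the EMPTY secondary key, and each subgram record is assumed to carry the frequency of the ngram it came from. None of the names below are Mahout APIs.
+

```python
from collections import Counter

EMPTY = ""  # stands in for the EMPTY secondary key


def map_document(ngrams):
    """Emit ((primary key, secondary key), (kind, gram, frequency)) records for one document."""
    records = []
    for ngram, freq in Counter(ngrams).items():
        # The first token is the head, the remaining tokens form the tail.
        head, _, tail = ngram.partition(" ")
        for kind, subgram in (("head", head), ("tail", tail)):
            # Subgram record: sorts first within its primary key because of EMPTY.
            records.append(((subgram, EMPTY), (kind, subgram, freq)))
            # Ngram record filed under the same primary key, for the later join.
            records.append(((subgram, ngram), ("ngram", ngram, freq)))
    return records


records = map_document(["hong kong cavaliers", "hong kong", "hong kong"])
for key, value in records:
    print(key, "->", value)
```

+Every distinct ngram in the document yields four records, two keyed by its head and two keyed by its tail.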
+
+<a name="Collocations-Combiner:CollocCombiner"></a>
+#### Combiner: CollocCombiner
+
+Input k: GramKey, v:Gram (as above)
+
+This phase merges the counts for unique ngrams or ngram fragments across
+multiple documents. The combiner treats the entire GramKey as the key and
+as such, identical tuples from separate documents are passed into a single
+call to the combiner's reduce method, their frequencies are summed and a
+single tuple is passed out via the collector.
+
+Output k: GramKey, v:Gram
+
+<a name="Collocations-Reduce:CollocReducer"></a>
+#### Reduce: CollocReducer
+
+Input k: GramKey, v: Gram (as above)
+
+The CollocReducer employs the Hadoop secondary sort strategy to avoid
+caching ngram tuples in memory in order to calculate total ngram and
+subgram frequencies. The GramKeyPartitioner ensures that tuples with the
+same primary key are sent to the same reducer while the
+GramKeyGroupComparator ensures that the iterator provided by the reduce method
+first returns the subgram and then returns ngram values grouped by ngram.
+This eliminates the need to cache the values returned by the iterator in
+order to calculate total frequencies for both subgrams and ngrams. The
+input will consist of multiple frequencies for each (subgram_key, subgram)
+or (subgram_key, ngram) tuple; one from each map task executed in which the
+particular subgram was found.
+The input will be traversed in the following order:
+
+
+    (head subgram, frequency 1)
+    (head subgram, frequency 2)
+    ... 
+    (head subgram, frequency N)
+    (ngram 1, frequency 1)
+    (ngram 1, frequency 2)
+    ...
+    (ngram 1, frequency N)
+    (ngram 2, frequency 1)
+    (ngram 2, frequency 2)
+    ...
+    (ngram 2, frequency N)
+    ...
+    (ngram N, frequency 1)
+    (ngram N, frequency 2)
+    ...
+    (ngram N, frequency N)
+
+
+Where all of the ngrams above share the same head. Data is presented in the
+same manner for the tail subgrams.
+
+As the values for a subgram or ngram are traversed, frequencies are
+accumulated. Once all values for a subgram or ngram are processed, the
+resulting key/value pairs are passed to the collector as long as the ngram
+frequency is equal to or greater than the specified minSupport. When an
+ngram is skipped because its frequency falls below minSupport, the
+Skipped.LESS_THAN_MIN_SUPPORT counter is incremented.
+
+Pairs are passed to the collector in the following format:
+
+
+    ngram, ngram frequency -> subgram, subgram frequency
+
+
+In this manner, the output becomes an unsorted version of the following:
+
+
+    ngram 1, frequency -> ngram 1 head, head frequency
+    ngram 1, frequency -> ngram 1 tail, tail frequency
+    ngram 2, frequency -> ngram 2 head, head frequency
+    ngram 2, frequency -> ngram 2 tail, tail frequency
+    ngram N, frequency -> ngram N head, head frequency
+    ngram N, frequency -> ngram N tail, tail frequency
+
+
+Output is in the format k:Gram (ngram, frequency), v:Gram (subgram,
+frequency)
+
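The traversal order above is what lets the reducer keep only running totals. The following is a minimal Python sketch of that idea, assuming the values arrive already sorted as listed (subgram first, then each ngram's frequencies contiguously). The function name, the tuple layout and the "subgram"/"ngram" tags are illustrative assumptions, not Mahout's API.

```python
from itertools import groupby


def reduce_subgram_group(values, min_support):
    """Sum frequencies for one subgram key without caching the values.

    `values` is an iterable of (gram, kind, freq) tuples, where kind is
    "subgram" or "ngram" (hypothetical layout). The subgram entries must
    come first, followed by the ngram entries grouped by ngram.
    Returns (pairs, skipped): pairs are ((ngram, ngram_freq),
    (subgram, subgram_freq)); skipped counts ngrams below min_support.
    """
    subgram, subgram_freq = None, 0
    pairs, skipped = [], 0
    # groupby only merges *consecutive* equal keys, which is exactly what
    # the secondary sort guarantees; nothing is buffered beyond one run.
    for (gram, kind), run in groupby(values, key=lambda v: (v[0], v[1])):
        total = sum(freq for _, _, freq in run)
        if kind == "subgram":
            subgram, subgram_freq = gram, total
        elif total >= min_support:
            pairs.append(((gram, total), (subgram, subgram_freq)))
        else:
            skipped += 1  # plays the role of Skipped.LESS_THAN_MIN_SUPPORT
    return pairs, skipped
```

Because the subgram total is complete before the first ngram run starts, each ngram can be emitted as soon as its own run ends.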
+<a name="Collocations-Pass2:CollocDriver.computeNGramsPruneByLLR(...)"></a>
+### Pass 2: CollocDriver.computeNGramsPruneByLLR(...)
+
+Pass 1 has calculated full frequencies for ngrams and subgrams; Pass 2
+performs the LLR calculation.
+
+<a name="Collocations-MapPhase:IdentityMapper(org.apache.hadoop.mapred.lib.IdentityMapper)"></a>
+#### Map Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper)
+
+This phase is a no-op. The data is passed through unchanged. The rest of
+the work for llr calculation is done in the reduce phase.
+
+<a name="Collocations-ReducePhase:LLRReducer"></a>
+#### Reduce Phase: LLRReducer
+
+Input is k:Gram, v:Gram (as above)
+
+This phase receives the head and tail subgrams and their frequencies for
+each ngram (with frequency) produced for the input:
+
+
+    ngram 1, frequency -> ngram 1 head, frequency; ngram 1 tail, frequency
+    ngram 2, frequency -> ngram 2 head, frequency; ngram 2 tail, frequency
+    ...
+    ngram N, frequency -> ngram N head, frequency; ngram N tail, frequency
+
+
+It also reads the full ngram count obtained from the first pass, passed in
+as a configuration option. The parameters to the llr calculation are
+calculated as follows:
+
+    k11 = f_n
+    k12 = f_h - f_n
+    k21 = f_t - f_n
+    k22 = N - ((f_h + f_t) - f_n)
+
+Where f_n is the ngram frequency, f_h and f_t the frequency of head and
+tail and N is the total number of ngrams.
+
+Tokens with an LLR below the specified minimum LLR are dropped and
+the Skipped.LESS_THAN_MIN_LLR counter is incremented.
+
+Output is k: Text (ngram), v: DoubleWritable (llr score)
+
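As a rough illustration, the contingency counts k11..k22 described above can be fed into the entropy-based form of the log-likelihood ratio (the G² statistic). The sketch below is Python rather than Mahout's Java, the function names are hypothetical, and it assumes natural logarithms with 0 log 0 treated as 0.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) is 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 contingency table."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def ngram_llr(f_n, f_h, f_t, n_total):
    """Build the table from ngram, head and tail frequencies, then score it."""
    k11 = f_n
    k12 = f_h - f_n
    k21 = f_t - f_n
    k22 = n_total - ((f_h + f_t) - f_n)
    return llr(k11, k12, k21, k22)
```

A head and tail that co-occur no more often than chance score close to zero, while a pair that always appears together scores high, which is why a minimum LLR threshold can be used to prune uninteresting ngrams.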
+<a name="Collocations-Unigrampass-through."></a>
+### Unigram pass-through.
+
+By default in seq2sparse, or if the -u option is provided to the
+CollocDriver, unigrams (single tokens) will be passed through the job and
+each token's frequency will be calculated. As with ngrams, unigrams are
+subject to filtering with minSupport and minLLR.
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/gaussian-discriminative-analysis.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/gaussian-discriminative-analysis.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/gaussian-discriminative-analysis.md
new file mode 100644
index 0000000..e8a54af
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/gaussian-discriminative-analysis.md
@@ -0,0 +1,20 @@
+---
+layout: default
+title: Gaussian Discriminative Analysis
+theme:
+    name: retro-mahout
+---
+
+<a name="GaussianDiscriminativeAnalysis-GaussianDiscriminativeAnalysis"></a>
+# Gaussian Discriminative Analysis
+
+Gaussian Discriminative Analysis is a tool for multigroup classification
+based on extending linear discriminant analysis. The paper on the approach
+is located at http://citeseer.ist.psu.edu/4617.html (note that, for some
+reason, the paper is backwards: page 1 is at the end).
+
+<a name="GaussianDiscriminativeAnalysis-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+<a name="GaussianDiscriminativeAnalysis-Designofpackages"></a>
+## Design of packages

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/hidden-markov-models.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/hidden-markov-models.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/hidden-markov-models.md
new file mode 100644
index 0000000..7321493
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/hidden-markov-models.md
@@ -0,0 +1,102 @@
+---
+layout: default
+title: Hidden Markov Models
+theme:
+    name: retro-mahout
+---
+
+# Hidden Markov Models
+
+<a name="HiddenMarkovModels-IntroductionandUsage"></a>
+## Introduction and Usage
+
+Hidden Markov Models are used in multiple areas of Machine Learning, such
+as speech recognition, handwritten letter recognition or natural language
+processing. 
+
+<a name="HiddenMarkovModels-FormalDefinition"></a>
+## Formal Definition
+
+A Hidden Markov Model (HMM) is a statistical model of a process consisting
+of two (in our case discrete) random variables O and Y, which change their
+state sequentially. The variable Y with states \{y_1, ... , y_n\} is called
+the "hidden variable", since its state is not directly observable. The
+state of Y changes sequentially with a so-called (in our case first-order)
+Markov Property. This means that the state change probability of Y depends
+only on its current state and does not change in time. Formally we
+write: P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)).
+The variable O with states \{o_1, ... , o_m\} is called the "observable
+variable", since its state can be directly observed. O does not have a
+Markov Property, but its state probability depends statically on the
+current state of Y.
+
+Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number
+of hidden states, m is the number of observable states, P is an
+n-dimensional vector containing initial hidden state probabilities, A is
+the nxn-dimensional "transition matrix" containing the transition
+probabilities such that A\[i,j\]=P(Y(t)=y_i|Y(t-1)=y_j) and B is the
+mxn-dimensional "emission matrix" containing the observation probabilities
+such that B\[i,j\]=P(O=o_i|Y=y_j).
+
+<a name="HiddenMarkovModels-Problems"></a>
+## Problems
+
+Rabiner \[1\] defined three main problems for HMMs:
+
+1. Evaluation: Given a sequence O of observations and a model M, what is
+the probability P(O|M) that sequence O was generated by model M. The
+Evaluation problem can be efficiently solved using the Forward algorithm.
+2. Decoding: Given a sequence O of observations and a model M, what is
+the most likely sequence Y*=argmax(Y) P(O|M,Y) of hidden variables to
+generate this sequence. The Decoding problem can be efficiently solved
+using the Viterbi algorithm.
+3. Learning: Given a sequence O of observations, what is the most likely
+model M*=argmax(M)P(O|M) to generate this sequence. The Learning problem
+can be efficiently solved using the Baum-Welch algorithm.
+
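The Evaluation problem can be illustrated with a short sketch of the Forward algorithm. It is plain Python, not Mahout's implementation, and it follows the index convention of the formal definition above: `transition[i][j]` is the probability of moving to state i from state j, and `emission[o][j]` is the probability of observing o in state j. The function and parameter names are assumptions made for the example, and the observation sequence is assumed to be non-empty.

```python
def forward_probability(observations, initial, transition, emission):
    """Return P(O|M) for a discrete HMM using the Forward algorithm.

    initial[i]       = P(Y(0) = y_i)
    transition[i][j] = P(Y(t) = y_i | Y(t-1) = y_j)
    emission[o][j]   = P(O = o | Y = y_j)
    """
    n = len(initial)
    # alpha[i] = P(observations so far, current hidden state = y_i)
    alpha = [initial[i] * emission[observations[0]][i] for i in range(n)]
    for obs in observations[1:]:
        alpha = [
            emission[obs][i] * sum(transition[i][j] * alpha[j] for j in range(n))
            for i in range(n)
        ]
    # Marginalize over the final hidden state.
    return sum(alpha)
```

The cost is proportional to the sequence length times n squared, instead of growing exponentially as it would if every hidden state sequence were enumerated.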
+<a name="HiddenMarkovModels-Example"></a>
+## Example
+
+To build a Hidden Markov Model and use it to make some predictions, try a
+simple example like this:
+
+Create an input file to train the model.  Here we have a sequence drawn from
+the set of states 0, 1, 2, and 3, separated by space characters.
+
+    $ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" > hmm-input
+
+Now run the baumwelch job to train your model, after first setting
+MAHOUT_LOCAL to true, to use your local file system.
+
+    $ export MAHOUT_LOCAL=true
+    $ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000
+
+Output like the following should appear in the console.
+
+    Initial probabilities: 
+    0 1 2 
+    1.0 0.0 3.5659361683006626E-251 
+    Transition matrix:
+      0 1 2 
+    0 6.098919959130616E-5 0.9997275322964165 2.1147850399214744E-4 
+    1 7.404648706054873E-37 0.9086408633885092 0.09135913661149081 
+    2 0.2284374545687356 7.01786289571088E-11 0.7715625453610858 
+    Emission matrix: 
+      0 1 2 3 
+    0 0.9999997858591223 2.0536163836449762E-39 2.1414087769942127E-7 1.052441093535389E-27
+    1 7.495656581383351E-34 0.2241269055449904 0.4510889999455847 0.32478409450942497
+    2 0.815051477991782 0.18494852200821799 8.465660634827592E-33 2.8603899591778015E-36
+    14/03/22 09:52:21 INFO driver.MahoutDriver: Program took 180 ms (Minutes: 0.003)
+
+The model trained with the input set is now in the file 'hmm-model', which we
+can use to build a predicted sequence.
+
+    $ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10
+
+To see the predictions:
+
+    $ cat hmm-predictions 
+    0 1 3 3 2 2 2 2 1 2
+
+
+<a name="HiddenMarkovModels-Resources"></a>
+## Resources
+
+\[1\] Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models
+and selected applications in speech recognition". Proceedings of the IEEE
+77 (2): 257-286. doi:10.1109/5.18626.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/independent-component-analysis.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/independent-component-analysis.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/independent-component-analysis.md
new file mode 100644
index 0000000..6035b54
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/independent-component-analysis.md
@@ -0,0 +1,17 @@
+---
+layout: default
+title: Independent Component Analysis
+theme:
+    name: retro-mahout
+---
+
+<a name="IndependentComponentAnalysis-IndependentComponentAnalysis"></a>
+# Independent Component Analysis
+
+See also: Principal Component Analysis.
+
+<a name="IndependentComponentAnalysis-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+<a name="IndependentComponentAnalysis-Designofpackages"></a>
+## Design of packages

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/locally-weighted-linear-regression.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/locally-weighted-linear-regression.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/locally-weighted-linear-regression.md
new file mode 100644
index 0000000..7b23d85
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/locally-weighted-linear-regression.md
@@ -0,0 +1,25 @@
+---
+layout: default
+title: Locally Weighted Linear Regression
+theme:
+    name: retro-mahout
+---
+
+<a name="LocallyWeightedLinearRegression-LocallyWeightedLinearRegression"></a>
+# Locally Weighted Linear Regression
+
+Model-based methods, such as SVM, Naive Bayes and the mixture of Gaussians,
+use the data to build a parameterized model. After training, the model is
+used for predictions and the data are generally discarded. In contrast,
+"memory-based" methods are non-parametric approaches that explicitly retain
+the training data, and use it each time a prediction needs to be made.
+Locally weighted regression (LWR) is a memory-based method that performs a
+regression around a point of interest using only training data that are
+"local" to that point. Source:
+http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/cohn96a-html/node7.html
+
+<a name="LocallyWeightedLinearRegression-Strategyforparallelregression"></a>
+## Strategy for parallel regression
+
+<a name="LocallyWeightedLinearRegression-Designofpackages"></a>
+## Design of packages

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/logistic-regression.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/logistic-regression.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/logistic-regression.md
new file mode 100644
index 0000000..b066fda
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/logistic-regression.md
@@ -0,0 +1,129 @@
+---
+layout: default
+title: Logistic Regression
+theme:
+    name: retro-mahout
+---
+
+<a name="LogisticRegression-LogisticRegression(SGD)"></a>
+# Logistic Regression (SGD)
+
+Logistic regression is a model used for prediction of the probability of
+occurrence of an event. It makes use of several predictor variables that
+may be either numerical or categorical.
+
+Logistic regression is the standard industry workhorse that underlies many
+production fraud detection and advertising quality and targeting products.
+The Mahout implementation uses Stochastic Gradient Descent (SGD) to allow
+large training sets to be used.
+
+For a more detailed analysis of the approach, have a look at the
+[thesis of Paul Komarek](http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en) [1].
+
+See MAHOUT-228 for the main JIRA issue for SGD.
+
+A more detailed overview of the Mahout Logistic Regression classifier and a
+[detailed description of building a Logistic Regression classifier](http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/)
+for the classic [Iris flower dataset](http://en.wikipedia.org/wiki/Iris_flower_data_set)
+is also available [2].
+
+An example of training a Logistic Regression classifier for the
+[UCI Bank Marketing Dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing)
+can be found
+[on the Mahout website](http://mahout.apache.org/users/classification/bankmarketing-example.html) [3].
+
+An example of training and testing a Logistic Regression document classifier
+for the classic
+[20 newsgroups corpus](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) [4]
+is also available.
+
+<a name="LogisticRegression-Parallelizationstrategy"></a>
+## Parallelization strategy
+
+The bad news is that SGD is an inherently sequential algorithm.  The good
+news is that it is blazingly fast and thus it is not a problem for Mahout's
+implementation to handle training sets of tens of millions of examples. 
+With the down-sampling typical in many data-sets, this is equivalent to a
+dataset with billions of raw training examples.
+
+The SGD system in Mahout is an online learning algorithm which means that
+you can learn models in an incremental fashion and that you can do
+performance testing as your system runs.  Often this means that you can
+stop training when a model reaches a target level of performance.  The SGD
+framework includes classes to do on-line evaluation using cross validation
+(the CrossFoldLearner) and an evolutionary system to do learning
+hyper-parameter optimization on the fly (the AdaptiveLogisticRegression). 
+The AdaptiveLogisticRegression system makes heavy use of threads to
+increase machine utilization.  The way it works is that it runs 20
+CrossFoldLearners in separate threads, each with slightly different
+learning parameters.  As better settings are found, these new settings are
+propagated to the other learners.
+
+<a name="LogisticRegression-Designofpackages"></a>
+## Design of packages
+
+There are three packages that are used in Mahout's SGD system. These
+include
+
+* The vector encoding package (found in org.apache.mahout.vectorizer.encoders)
+
+* The SGD learning package (found in org.apache.mahout.classifier.sgd)
+
+* The evolutionary optimization system (found in org.apache.mahout.ep)
+
+<a name="LogisticRegression-Featurevectorencoding"></a>
+## Feature vector encoding
+
+Because the SGD algorithms need to have fixed length feature vectors and
+because it is a pain to build a dictionary ahead of time, most SGD
+applications use the hashed feature vector encoding system that is rooted
+at FeatureVectorEncoder.
+
+The basic idea is that you create a vector, typically a
+RandomAccessSparseVector, and then you use various feature encoders to
+progressively add features to that vector.  The size of the vector should
+be large enough to avoid feature collisions as features are hashed.
+
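The hashed encoding idea can be sketched in a few lines. The example below is Python and is not Mahout's FeatureVectorEncoder: the function names are hypothetical, MD5 is used only as a convenient deterministic hash, and a dict stands in for a sparse vector. It assumes each feature is given as a (name, value, weight) triple.

```python
import hashlib


def hashed_index(name, value, size):
    """Map a (feature name, value) pair to a slot in a vector of `size`."""
    digest = hashlib.md5(f"{name}:{value}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def encode(features, size):
    """Add each feature's weight into its hashed slot of a sparse vector.

    No dictionary of known features is needed; distinct features may
    collide, which is why `size` should be chosen generously.
    """
    vector = {}  # sparse representation: index -> accumulated weight
    for name, value, weight in features:
        idx = hashed_index(name, value, size)
        vector[idx] = vector.get(idx, 0.0) + weight
    return vector
```

The vector length is fixed up front, so new feature values seen during online training never change the shape of the model.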
+There are specialized encoders for a variety of data types.  You can
+normally encode either a string representation of the value you want to
+encode or you can encode a byte level representation to avoid string
+conversion.  In the case of ContinuousValueEncoder and
+ConstantValueEncoder, it is also possible to encode a null value and pass
+the real value in as a weight. This avoids numerical parsing entirely in
+case you are getting your training data from a system like Avro.
+
+Here is a class diagram for the encoders package:
+
+![class diagram](../../images/vector-class-hierarchy.png)
+
+<a name="LogisticRegression-SGDLearning"></a>
+## SGD Learning
+
+For the simplest applications, you can construct an
+OnlineLogisticRegression and be off and running.  Typically, though, it is
+nice to have running estimates of performance on held out data.  To do
+that, you should use a CrossFoldLearner which keeps a stable of five (by
+default) OnlineLogisticRegression objects.  Each time you pass a training
+example to a CrossFoldLearner, it passes this example to all but one of its
+children as training and passes the example to the last child to evaluate
+current performance.  The children are used for evaluation in a round-robin
+fashion so, if you are using the default 5 way split, all of the children
+get 80% of the training data for training and get 20% of the data for
+evaluation.
+
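The round-robin split described above can be sketched as follows. This is a Python illustration, not the CrossFoldLearner code: both class names are hypothetical, and the child learner only counts examples so that the 80%/20% split is easy to see.

```python
class CountingLearner:
    """Stand-in for a real online learner; it only counts what it sees."""

    def __init__(self):
        self.trained = 0
        self.evaluated = 0

    def train(self, example):
        self.trained += 1

    def evaluate(self, example):
        self.evaluated += 1


class RoundRobinCrossFold:
    """Hold each example out from exactly one child, in rotation."""

    def __init__(self, folds=5):
        self.children = [CountingLearner() for _ in range(folds)]
        self.seen = 0

    def train(self, example):
        held_out = self.seen % len(self.children)
        for i, child in enumerate(self.children):
            if i == held_out:
                child.evaluate(example)  # unseen by this child: fair test
            else:
                child.train(example)
        self.seen += 1
```

Each child is always evaluated on examples it has not been trained on, so the running performance estimate behaves like held-out accuracy even though training never stops.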
+To avoid the pesky need to configure learning rates, regularization
+parameters and annealing schedules, you can use the
+AdaptiveLogisticRegression.  This class maintains a pool of
+CrossFoldLearners and adapts learning rates and regularization on the fly
+so that you don't have to.
+
+Here is a class diagram for the classifier.sgd package.  As you can see,
+the number of twiddlable knobs is pretty large.  For some examples, see the
+[TrainNewsGroups](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainNewsGroups.java) example code.
+
+![sgd class diagram](../../images/sgd-class-hierarchy.png)
+
+## References
+
+[1] [Thesis of Paul Komarek](http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en)
+
+[2] [An Introduction To Mahout's Logistic Regression SGD Classifier](http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/)
+
+## Examples
+
+[3] [SGD Bank Marketing Example](http://mahout.apache.org/users/classification/bankmarketing-example.html)
+
+[4] [SGD 20 newsgroups classification](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/mahout-collections.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/mahout-collections.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/mahout-collections.md
new file mode 100644
index 0000000..99f22f6
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/mahout-collections.md
@@ -0,0 +1,60 @@
+---
+layout: default
+title: mahout-collections
+theme:
+    name: retro-mahout
+---
+
+# Mahout collections
+
+<a name="mahout-collections-Introduction"></a>
+## Introduction
+
+The Mahout Collections library is a set of container classes that address
+some limitations of the standard collections in Java.
+[This presentation](http://domino.research.ibm.com/comm/research_people.nsf/pages/sevitsky.pubs.html/$FILE/oopsla08%20memory-efficient%20java%20slides.pdf)
+describes a number of performance problems with the standard collections.
+
+Mahout Collections addresses two of the more glaring: the lack of support
+for primitive types and the lack of open addressing.
+
+<a name="mahout-collections-PrimitiveTypes"></a>
+## Primitive Types
+
+The most visible feature of Mahout Collections is the large collection of
+primitive type collections. Given Java's asymmetrical support for the
+primitive types, the only efficient way to handle them is with many
+classes. So, there are ArrayList-like containers for all of the primitive
+types, and hash maps for all the useful combinations of primitive type and
+object keys and values.
+
+These classes do not, in general, implement interfaces from *java.util*.
+Even when the *java.util* interfaces could be type-compatible, they tend
+to include requirements that are not consistent with efficient use of
+primitive types.
+
+<a name="mahout-collections-OpenAddressing"></a>
+# Open Addressing
+
+All of the sets and maps in Mahout Collections are open-addressed hash
+tables. Open addressing has a much smaller memory footprint than chaining.
+Since the purpose of these collections is to avoid the memory cost of
+autoboxing, open addressing is a consistent design choice.
+
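To make the memory argument concrete, here is a minimal sketch of an open-addressed set of primitive integers using linear probing. It is written in Python with a typed array standing in for a Java `int[]`; the class name is hypothetical, the capacity is fixed (a real implementation would rehash when the table fills), keys are assumed to be non-negative, and it does not reproduce the probing scheme used by Mahout Collections.

```python
from array import array


class OpenIntSet:
    """Fixed-capacity open-addressed set of non-negative ints."""

    EMPTY = -1  # sentinel marking an unused slot

    def __init__(self, capacity=16):
        # One flat array of primitives: no per-entry node objects or boxes.
        self.slots = array("q", [self.EMPTY] * capacity)
        self.size = 0

    def _probe(self, key):
        """Return the slot holding `key`, or the empty slot where it belongs."""
        i = key % len(self.slots)
        while self.slots[i] != self.EMPTY and self.slots[i] != key:
            i = (i + 1) % len(self.slots)  # linear probing
        return i

    def add(self, key):
        i = self._probe(key)
        if self.slots[i] == key:
            return False  # already present
        # Keep at least one slot empty so that probing always terminates.
        if self.size + 1 >= len(self.slots):
            raise OverflowError("table full; a real implementation would rehash")
        self.slots[i] = key
        self.size += 1
        return True

    def __contains__(self, key):
        return self.slots[self._probe(key)] == key
```

With chaining, every entry needs its own linked node in addition to the key; here a collision simply moves the key to the next free slot of the same array.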
+<a name="mahout-collections-Sets"></a>
+## Sets
+
+Mahout Collections includes open hash sets. Unlike *java.util*, a set is
+not a recycled hash table; the sets are separately implemented and do not
+have any additional storage usage for unused keys.
+
+<a name="mahout-collections-CreditwhereCreditisdue"></a>
+# Credit where Credit is due
+
+The implementation of Mahout Collections is derived from
+[Cern Colt](http://acs.lbl.gov/~hoschek/colt/).
+
+
+
+
+
+
