http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md b/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md new file mode 100644 index 0000000..c72a7ae --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/environment/h2o-internals.md @@ -0,0 +1,51 @@ +--- +layout: default +title: +theme: + name: retro-mahout +--- + +# Introduction + +This document provides an overview of how the Mahout Samsara environment is implemented over the H2O backend engine. The document is aimed at Mahout developers, to give a high level description of the design so that one can explore the code inside `h2o/` with some context. + +## H2O Overview + +H2O is a distributed scalable machine learning system. Internal architecture of H2O has a distributed math engine (h2o-core) and a separate layer on top for algorithms and UI. The Mahout integration requires only the math engine (h2o-core). + +## H2O Data Model + +The data model of the H2O math engine is a distributed columnar store (of primarily numbers, but also strings). A column of numbers is called a Vector, which is broken into Chunks (of a few thousand elements). Chunks are distributed across the cluster based on a deterministic hash. Therefore, any member of the cluster knows where a particular Chunk of a Vector is homed. Each Chunk is separately compressed in memory and elements are individually decompressed on the fly upon access with purely register operations (thereby achieving high memory throughput). An ordered set of similarly partitioned Vecs are composed into a Frame. A Frame is therefore a large two dimensional table of numbers. All elements of a logical row in the Frame are guaranteed to be homed in the same server of the cluster. 
Generally speaking, H2O works well on "tall skinny" data, i.e., lots of rows (100s of millions) and a modest number of columns (10s of thousands). + + +## Mahout DRM + +The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's Scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. Examples are the Spark and H2O backend engines. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping. + + +## H2O Environment Engine + +The H2O backend implements the abstract DRM as an H2O Frame. Each logical column in the DRM is an H2O Vector. All elements of a logical DRM row are guaranteed to be homed on the same server. A set of rows stored on a server is presented as a read-only virtual in-core Matrix (i.e., BlockMatrix) for the closure method in the `mapBlock(...)` API. + +H2O provides a flexible execution framework called `MRTask`. The `MRTask` framework typically executes over a Frame (or even a Vector), supports various types of map() methods, can optionally modify the Frame or Vector (though this never happens in the Mahout integration), and can optionally create a new Vector or set of Vectors (to combine them into a new Frame, and consequently a new DRM). + + +## Source Layout + +Within mahout.git, the top level directory, `h2o/` holds all the source code related to the H2O backend engine. Part of the code (that interfaces with the rest of the Mahout components) is in Scala, and part of the code (that interfaces with h2o-core and implements algebraic operators) is in Java. Here is a brief overview of what functionality can be found where within `h2o/`. 
+ + h2o/ - top level directory containing all H2O related code + + h2o/src/main/java/org/apache/mahout/h2obindings/ops/*.java - Physical operator code for the various DSL algebra + + h2o/src/main/java/org/apache/mahout/h2obindings/drm/*.java - DRM backing (onto Frame) and Broadcast implementation + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OHdfs.java - Read / Write between DRM (Frame) and files on HDFS + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OBlockMatrix.java - A vertical block matrix of DRM presented as a virtual copy-on-write in-core Matrix. Used in mapBlock() API + + h2o/src/main/java/org/apache/mahout/h2obindings/H2OHelper.java - A collection of various functionality and helpers, e.g., converting between in-core Matrix and DRM, and various summary statistics on DRM/Frame. + + h2o/src/main/scala/org/apache/mahout/h2obindings/H2OEngine.scala - DSL operator graph evaluator and various abstract API implementations for a distributed engine + + h2o/src/main/scala/org/apache/mahout/h2obindings/* - Various abstract API implementations ("glue work") \ No newline at end of file
http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/environment/spark-internals.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/environment/spark-internals.md b/website/old_site_migration/needs_work_convenience/environment/spark-internals.md new file mode 100644 index 0000000..f5d72a4 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/environment/spark-internals.md @@ -0,0 +1,25 @@ +--- +layout: default +title: +theme: + name: retro-mahout +--- + +# Introduction + +This document provides an overview of how the Mahout Scala DSL (distributed algebraic operators) is implemented over the Spark back end engine. The document is aimed at Mahout developers, to give a high level description of the design. + +## Spark Overview + +## Spark Data Model + + +## Mahout DRM + +Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. The DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. Examples are Spark and H2O backend engines. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping. 
+ + +## Spark DSL Engine + + +## Source Layout http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/faq.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/faq.md b/website/old_site_migration/needs_work_convenience/faq.md new file mode 100644 index 0000000..8e1e592 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/faq.md @@ -0,0 +1,105 @@ +--- +layout: default +title: FAQ +theme: + name: retro-mahout +--- + +# The Official Mahout FAQ + +*General* + +1. [What is Apache Mahout?](#whatis) +1. [What does the name mean?](#mean) +1. [How is the name pronounced?](#pronounce) +1. [Where can I find the origins of the Mahout project?](#historical) +1. [Where can I download the Mahout logo?](#downloadlogo) +1. [Where can I download Mahout slide presentations?](#presentations) + +*Algorithms* + +1. [What algorithms are implemented in Mahout?](#algos) +1. [What algorithms are missing from Mahout?](#todo) +1. [Do I need Hadoop to run Mahout?](#hadoop) + +*Hadoop specific questions* + +1. [Mahout just won't run in parallel on my dataset. Why?](#split) + + +# *Answers* + + +## General + + +<a name="whatis"></a> +#### What is Apache Mahout? + +Apache Mahout is a suite of machine learning libraries designed to be +scalable and robust. + +<a name="mean"></a> +#### What does the name mean? + +The name [Mahout](http://en.wikipedia.org/wiki/Mahout) + was originally chosen for its association with the [Apache Hadoop](http://hadoop.apache.org) + project. A Mahout is a person who drives an elephant (hint: Hadoop's logo +is an elephant). We just wanted a name that complemented Hadoop but we see +our project as a good driver of Hadoop in the sense that we will be using +and testing it. We are not, however, implying that we are controlling +Hadoop's development. 
+ +Prior to coming to the ASF, those of us working on the project plan voted between [Howdah](http://en.wikipedia.org/wiki/Howdah) (the carriage on top of an elephant) and Mahout. + +<a name="historical"></a> +#### Where can I find the origins of the Mahout project? + +See [http://ml-site.grantingersoll.com](http://web.archive.org/web/20080101233917/http://ml-site.grantingersoll.com/index.php?title=Main_Page) + for old wiki and mailing list archives (all read-only) + +Mahout was started by <a href="http://web.archive.org/web/20071228055210/http://ml-site.grantingersoll.com/index.php?title=Main_Page" class="external-link" rel="nofollow">Isabel Drost, Grant Ingersoll and Karl Wettin</a>. It <a href="http://web.archive.org/web/20080201093120/http://lucene.apache.org/#22+January+2008+-+Lucene+PMC+Approves+Mahout+Machine+Learning+Project" class="external-link" rel="nofollow">started</a> as part of the <a href="http://lucene.apache.org" class="external-link" rel="nofollow">Lucene</a> project (see the <a href="http://web.archive.org/web/20080102151102/http://ml-site.grantingersoll.com/index.php?title=Incubator_proposal" class="external-link" rel="nofollow">original proposal</a>) and went on to become a top level project in April of 2010.</p><p style="text-align: left;">The original goal was to implement all 10 algorithms from Andrew Ng's paper "<a href="http://ai.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf" class="external-link" rel="nofollow">Map-Reduce for Machine Learning on Multicore</a>"</p> + +<a name="pronounce"></a> +#### How is the name pronounced? + +There are some disagreements about how to pronounce the name. Webster's has it as muh-hout (as in ["out"](http://dictionary.reference.com/browse/mahout)), but the Sanskrit/Hindi origins pronounce it as "muh-hoot". The second pronunciation suggests a nice pun on the Hebrew word מהות meaning "essence or truth". + +<a name="downloadlogo"></a> +#### Where can I download the Mahout logo? 
+ +See [MAHOUT-335](https://issues.apache.org/jira/browse/MAHOUT-335) + + +<a name="presentations"></a> +#### Where can I download Mahout slide presentations? + +The [Books, Tutorials and Talks](https://mahout.apache.org/general/books-tutorials-and-talks.html) + page contains an overview of a wide variety of presentations with links to slides where available. + +## Algorithms + +<a name="algos"></a> +#### What algorithms are implemented in Mahout? + +We are interested in a wide variety of machine learning algorithms, many of +which are already implemented in Mahout. You can find a list [here](https://mahout.apache.org/users/basics/algorithms.html). + +<a name="todo"></a> +#### What algorithms are missing from Mahout? + +There are many machine learning algorithms that we would like to have in +Mahout. If you have an algorithm or an improvement to an algorithm that you would +like to implement, start a discussion on our [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html). + +<a name="hadoop"></a> +#### Do I need Hadoop to run Mahout? + +There are a number of algorithm implementations that require no Hadoop dependencies whatsoever; consult the [algorithms list](https://mahout.apache.org/users/basics/algorithms.html). In the future, we might provide more algorithm implementations on platforms more suitable for machine learning, such as [Apache Spark](http://spark.apache.org). + +## Hadoop specific questions +<a name="split"></a> +#### Mahout just won't run in parallel on my dataset. Why? + +If you are running training on a Hadoop cluster, keep in mind that the number of mappers started is governed by the size of the input data and the configured split/block size of your cluster. As a rule of thumb, +anything below 100MB in size won't be split by default. 
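To make the rule of thumb concrete, here is a minimal sketch in plain Scala of how the number of input splits, and so roughly the number of mappers, grows with input size. It assumes one mapper per split and ignores finer details of how Hadoop sizes the last split. The helper `estimateSplits` and the 128 MB split size are illustrative assumptions, not Mahout or Hadoop APIs; check your own cluster's configured split/block size.

```scala
object SplitEstimate {
  // Hypothetical helper: ceil(inputBytes / splitBytes), or 0 for empty input.
  def estimateSplits(inputBytes: Long, splitBytes: Long): Long = {
    require(inputBytes >= 0 && splitBytes > 0, "input size must be >= 0 and split size > 0")
    if (inputBytes == 0) 0L else (inputBytes + splitBytes - 1) / splitBytes
  }

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val splitSize = 128 * mb // assumed split size, not a value taken from this FAQ
    for (sizeMb <- Seq(50L, 100L, 1000L)) {
      println(s"$sizeMb MB input -> ${estimateSplits(sizeMb * mb, splitSize)} split(s)")
    }
  }
}
```

Under these assumptions a 100 MB input yields a single split, which is why a small dataset ends up on one mapper no matter how large the cluster is.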
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/flinkbindings/flink-internals.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/flinkbindings/flink-internals.md b/website/old_site_migration/needs_work_convenience/flinkbindings/flink-internals.md new file mode 100644 index 0000000..8c8145a --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/flinkbindings/flink-internals.md @@ -0,0 +1,50 @@ +--- +layout: default +title: +theme: + name: retro-mahout +--- + +#Introduction + +This document provides an overview of how the Mahout Samsara environment is implemented over the Apache Flink backend engine. This document gives an overview of the code layout for the Flink backend engine, the source code for which can be found under the /flink directory in the Mahout codebase. + +Apache Flink is a distributed big data streaming engine that supports both Streaming and Batch interfaces. Batch processing is an extension of Flink's Stream processing engine. + +The Mahout Flink integration presently supports Flink's batch processing capabilities leveraging the DataSet API. + +The Mahout DRM, or Distributed Row Matrix, is an abstraction for storing a large matrix of numbers in-memory in a cluster by distributing logical rows among servers. Mahout's Scala DSL provides an abstract API on DRMs for backend engines to provide implementations of this API. An example is the Spark backend engine. Each engine has its own design of mapping the abstract API onto its data model and provides implementations for algebraic operators over that mapping. + +#Flink Overview + +Apache Flink is an open source, distributed Stream and Batch Processing Framework. At its core, Flink is a Stream Processing engine and Batch processing is an extension of Stream Processing. 
+ +Flink includes several APIs for building applications with the Flink Engine: + + <ol> +<li><b>DataSet API</b> for Batch data in Java, Scala and Python</li> +<li><b>DataStream API</b> for Stream Processing in Java and Scala</li> +<li><b>Table API</b> with SQL-like regular expression language in Java and Scala</li> +<li><b>Gelly</b> Graph Processing API in Java and Scala</li> +<li><b>CEP API</b>, a complex event processing library</li> +<li><b>FlinkML</b>, a Machine Learning library</li> +</ol> +#Flink Environment Engine + +The Flink backend implements the abstract DRM as a Flink DataSet. A Flink job runs in the context of an ExecutionEnvironment (from the Flink Batch processing API). + +#Source Layout + +Within mahout.git, the top level directory, flink/ holds all the source code for the Flink backend engine. Sections of code that interface with the rest of the Mahout components are in Scala, and sections of the code that interface with Flink DataSet API and implement algebraic operators are in Java. Here is a brief overview of what functionality can be found within flink/ folder. + +flink/ - top level directory containing all Flink related code + +flink/src/main/scala/org/apache/mahout/flinkbindings/blas/*.scala - Physical operator code for the Samsara DSL algebra + +flink/src/main/scala/org/apache/mahout/flinkbindings/drm/*.scala - Flink Dataset DRM and broadcast implementation + +flink/src/main/scala/org/apache/mahout/flinkbindings/io/*.scala - Read / Write between DRMDataSet and files on HDFS + +flink/src/main/scala/org/apache/mahout/flinkbindings/FlinkEngine.scala - DSL operator graph evaluator and various abstract API implementations for a distributed engine. 
+ + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/flinkbindings/playing-with-samsara-flink.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/flinkbindings/playing-with-samsara-flink.md b/website/old_site_migration/needs_work_convenience/flinkbindings/playing-with-samsara-flink.md new file mode 100644 index 0000000..4bbcd33 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/flinkbindings/playing-with-samsara-flink.md @@ -0,0 +1,111 @@ +--- +layout: default +title: +theme: + name: retro-mahout +--- + +## Getting Started + +To get started, add the following dependency to the pom: + + <dependency> + <groupId>org.apache.mahout</groupId> + <artifactId>mahout-flink_2.10</artifactId> + <version>0.12.0</version> + </dependency> + +Here is how to use the Flink backend: + + import org.apache.flink.api.scala._ + import org.apache.mahout.math.drm._ + import org.apache.mahout.math.drm.RLikeDrmOps._ + import org.apache.mahout.flinkbindings._ + + object ReadCsvExample { + + def main(args: Array[String]): Unit = { + val filePath = "path/to/the/input/file" + + val env = ExecutionEnvironment.getExecutionEnvironment + implicit val ctx = new FlinkDistributedContext(env) + + val drm = readCsv(filePath, delim = "\t", comment = "#") + val C = drm.t %*% drm + println(C.collect) + } + + } + +## Current Status + +The top JIRA for Flink backend is [MAHOUT-1570](https://issues.apache.org/jira/browse/MAHOUT-1570) which has been fully implemented. 
+ +### Implemented + +* [MAHOUT-1701](https://issues.apache.org/jira/browse/MAHOUT-1701) Mahout DSL for Flink: implement AtB ABt and AtA operators +* [MAHOUT-1702](https://issues.apache.org/jira/browse/MAHOUT-1702) implement element-wise operators (like `A + 2` or `A + B`) +* [MAHOUT-1703](https://issues.apache.org/jira/browse/MAHOUT-1703) implement `cbind` and `rbind` +* [MAHOUT-1709](https://issues.apache.org/jira/browse/MAHOUT-1709) implement slicing (like `A(1 to 10, ::)`) +* [MAHOUT-1710](https://issues.apache.org/jira/browse/MAHOUT-1710) implement right in-core matrix multiplication (`A %*% B` when `B` is in-core) +* [MAHOUT-1711](https://issues.apache.org/jira/browse/MAHOUT-1711) implement broadcasting +* [MAHOUT-1712](https://issues.apache.org/jira/browse/MAHOUT-1712) implement operators `At`, `Ax`, `Atx` - `Ax` and `At` are implemented +* [MAHOUT-1734](https://issues.apache.org/jira/browse/MAHOUT-1734) implement I/O - should be able to read results of Flink bindings +* [MAHOUT-1747](https://issues.apache.org/jira/browse/MAHOUT-1747) add support for different types of indexes (String, long, etc) - now supports `Int`, `Long` and `String` +* [MAHOUT-1748](https://issues.apache.org/jira/browse/MAHOUT-1748) switch to Flink Scala API +* [MAHOUT-1749](https://issues.apache.org/jira/browse/MAHOUT-1749) Implement `Atx` +* [MAHOUT-1750](https://issues.apache.org/jira/browse/MAHOUT-1750) Implement `ABt` +* [MAHOUT-1751](https://issues.apache.org/jira/browse/MAHOUT-1751) Implement `AtA` +* [MAHOUT-1755](https://issues.apache.org/jira/browse/MAHOUT-1755) Flush intermediate results to FS - Flink, unlike Spark, does not store intermediate results in memory. 
+* [MAHOUT-1764](https://issues.apache.org/jira/browse/MAHOUT-1764) Add standard backend tests for Flink +* [MAHOUT-1765](https://issues.apache.org/jira/browse/MAHOUT-1765) Add documentation about Flink backend +* [MAHOUT-1776](https://issues.apache.org/jira/browse/MAHOUT-1776) Refactor common Engine agnostic classes to Math-Scala module +* [MAHOUT-1777](https://issues.apache.org/jira/browse/MAHOUT-1777) move HDFSUtil classes into the HDFS module +* [MAHOUT-1804](https://issues.apache.org/jira/browse/MAHOUT-1804) Implement drmParallelizeWithRowLabels(..) in Flink +* [MAHOUT-1805](https://issues.apache.org/jira/browse/MAHOUT-1805) Implement allReduceBlock(..) in Flink bindings +* [MAHOUT-1809](https://issues.apache.org/jira/browse/MAHOUT-1809) Failing tests in flin-bindings: dals and dspca +* [MAHOUT-1810](https://issues.apache.org/jira/browse/MAHOUT-1810) Failing test in flink-bindings: A + B Identically partitioned (mapBlock Checkpointing issue) +* [MAHOUT-1812](https://issues.apache.org/jira/browse/MAHOUT-1812) Implement drmParallelizeWithEmptyLong(..) 
in flink bindings +* [MAHOUT-1814](https://issues.apache.org/jira/browse/MAHOUT-1814) Implement drm2intKeyed in flink bindings +* [MAHOUT-1815](https://issues.apache.org/jira/browse/MAHOUT-1815) dsqDist(X,Y) and dsqDist(X) failing in flink tests +* [MAHOUT-1816](https://issues.apache.org/jira/browse/MAHOUT-1816) Implement newRowCardinality in CheckpointedFlinkDrm +* [MAHOUT-1817](https://issues.apache.org/jira/browse/MAHOUT-1817) Implement caching in Flink Bindings +* [MAHOUT-1818](https://issues.apache.org/jira/browse/MAHOUT-1818) dals test failing in Flink Bindings +* [MAHOUT-1819](https://issues.apache.org/jira/browse/MAHOUT-1819) Set the default Parallelism for Flink execution in FlinkDistributedContext +* [MAHOUT-1820](https://issues.apache.org/jira/browse/MAHOUT-1820) Add a method to generate Tuple<PartitionId, Partition elements count>> to support Flink backend +* [MAHOUT-1821](https://issues.apache.org/jira/browse/MAHOUT-1821) Use a mahout-flink-conf.yaml configuration file for Mahout specific Flink configuration +* [MAHOUT-1822](https://issues.apache.org/jira/browse/MAHOUT-1822) Update NOTICE.txt, License.txt to add Apache Flink +* [MAHOUT-1823](https://issues.apache.org/jira/browse/MAHOUT-1823) Modify MahoutFlinkTestSuite to implement FlinkTestBase +* [MAHOUT-1824](https://issues.apache.org/jira/browse/MAHOUT-1824) Optimize FlinkOpAtA to use upper triangular matrices +* [MAHOUT-1825](https://issues.apache.org/jira/browse/MAHOUT-1825) Add List of Flink algorithms to Mahout wiki page + +### Tests + +There is a set of standard tests that all engines should pass (see [MAHOUT-1764](https://issues.apache.org/jira/browse/MAHOUT-1764)). + +* `DistributedDecompositionsSuite` +* `DrmLikeOpsSuite` +* `DrmLikeSuite` +* `RLikeDrmOpsSuite` + + +These are Flink-backend specific tests, e.g. 
+ +* `DrmLikeOpsSuite` for operations like `norm`, `rowSums`, `rowMeans` +* `RLikeOpsSuite` for basic LA like `A.t %*% A`, `A.t %*% x`, etc +* `LATestSuite` tests for specific operators like `AtB`, `Ax`, etc +* `UseCasesSuite` has more complex examples, like power iteration, ridge regression, etc + +## Environment + +For development the minimal supported configuration is + +* [JDK 1.7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html) +* [Scala 2.10] + +When using mahout, please import the following modules: + +* `mahout-math` +* `mahout-math-scala` +* `mahout-flink_2.10` +* \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md new file mode 100644 index 0000000..846a4ce --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bankmarketing-example.md @@ -0,0 +1,53 @@ +--- +layout: default +title: +theme: + name: retro-mahout +--- + +Notice: Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + . + http://www.apache.org/licenses/LICENSE-2.0 + . 
+ Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + +#Bank Marketing Example + +### Introduction + +This page describes how to run Mahout's SGD classifier on the [UCI Bank Marketing dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing). +The goal is to predict if the client will subscribe to a term deposit offered via a phone call. The features in the dataset consist +of information such as age, job, marital status as well as information about the last contacts from the bank. + +### Code & Data + +The bank marketing example code lives under + +*mahout-examples/src/main/java/org.apache.mahout.classifier.sgd.bankmarketing* + +The data can be found at + +*mahout-examples/src/main/resources/bank-full.csv* + +### Code details + +This example consists of 3 classes: + + - BankMarketingClassificationMain + - TelephoneCall + - TelephoneCallParser + +When you run the main method of BankMarketingClassificationMain it parses the dataset using the TelephoneCallParser and trains +a logistic regression model with 20 runs and 20 passes. The TelephoneCallParser uses Mahout's feature vector encoder +to encode the features in the dataset into a vector. Afterwards the model is tested, and the learning rate and AUC are printed to standard output. 
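As a rough illustration of what encoding features into a vector can look like, the sketch below hashes each feature into a slot of a fixed-size array. It is a generic hashing-trick example written in plain Scala under stated assumptions, not the code of `TelephoneCallParser` or of Mahout's encoder classes; the object `HashedEncodingSketch`, its `encode` method, the feature names and the vector size of 16 are all hypothetical.

```scala
object HashedEncodingSketch {
  // Maps a feature key to a slot in a fixed-size vector (floorMod keeps the index non-negative).
  private def slot(key: String, size: Int): Int = java.lang.Math.floorMod(key.hashCode, size)

  /** Categorical features add 1.0 at the slot of "name=value"; numeric features add their value at the slot of "name". */
  def encode(categorical: Map[String, String], numeric: Map[String, Double], size: Int): Array[Double] = {
    val vector = new Array[Double](size)
    for ((name, value) <- categorical) vector(slot(s"$name=$value", size)) += 1.0
    for ((name, value) <- numeric) vector(slot(name, size)) += value
    vector
  }

  def main(args: Array[String]): Unit = {
    // Illustrative record only; these values are not taken from bank-full.csv.
    val call = encode(
      categorical = Map("job" -> "technician", "marital" -> "married"),
      numeric = Map("age" -> 41.0),
      size = 16)
    println(call.mkString("[", ", ", "]"))
  }
}
```

The vector length stays fixed however many distinct job or marital-status values appear, at the cost of occasional collisions between features that hash to the same slot.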
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md new file mode 100644 index 0000000..51a5c74 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/bayesian.md @@ -0,0 +1,147 @@ +--- +layout: default +title: +theme: + name: retro-mahout +--- + +# Naive Bayes + + +## Intro + +Mahout currently has two Naive Bayes implementations. The first is standard Multinomial Naive Bayes. The second is an implementation of Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et al. [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to the former as Bayes and the latter as CBayes. + +Where Bayes has long been a standard in text classification, CBayes is an extension of Bayes that performs particularly well on datasets with skewed classes and has been shown to be competitive with algorithms of higher complexity such as Support Vector Machines. + + +## Implementations +Both Bayes and CBayes are currently trained via MapReduce Jobs. Testing and classification can be done via a MapReduce Job or sequentially. Mahout provides CLI drivers for preprocessing, training and testing. A Spark implementation is currently in the works ([MAHOUT-1493](https://issues.apache.org/jira/browse/MAHOUT-1493)). 
+ +## Preprocessing and Algorithm + +As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values): + +- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; `\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`. +- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels. +- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary; let `\(\alpha=\sum_i{\alpha_i}\)`. +- **Preprocessing** (via seq2Sparse) TF-IDF transformation and L2 length normalization of `\(\vec{d}\)` + 1. `\(d_{ij} = \sqrt{d_{ij}}\)` + 2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)` + 3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)` +- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as: + 1. `\(\hat\theta_{ci}=\frac{d_{ic}+\alpha_i}{\sum_k{d_{kc}}+\alpha}\)` + 2. `\(w_{ci}=\log{\hat\theta_{ci}}\)` +- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as: + 1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)` + 2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)` + 3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)` +- **Label Assignment/Testing:** + 1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document; let `\(t_i\)` be the count of word `\(i\)`. + 2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)` + +As we can see, the main difference between Bayes and CBayes is the weight calculation step. Where Bayes weighs terms more heavily based on the likelihood that they belong to class `\(c\)`, CBayes seeks to maximize term weights on the likelihood that they do not belong to any other class. + +## Running from the command line + +Mahout provides CLI drivers for all above steps. 
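As an aside before the individual commands, the weight calculations and the label assignment listed under "Preprocessing and Algorithm" can be sketched in plain Scala. This is a minimal illustration, not Mahout's MapReduce implementation. It assumes dense arrays, document vectors that have already been preprocessed, a single smoothing value `alphaI` shared by every term, and it reads `d_ic` as the summed weight of term `i` over the documents labelled `c`. The object `NaiveBayesSketch` and all of its method names are hypothetical.

```scala
object NaiveBayesSketch {
  type Matrix = Array[Array[Double]]

  /** sums(c)(i): total weight of term i over the documents labelled c. */
  def classTermSums(docs: Matrix, labels: Array[Int], numClasses: Int): Matrix = {
    val numTerms = docs.head.length
    val sums = Array.fill(numClasses, numTerms)(0.0)
    for (j <- docs.indices; i <- 0 until numTerms) sums(labels(j))(i) += docs(j)(i)
    sums
  }

  /** Bayes: w(c)(i) = log((d_ic + alpha_i) / (sum_k d_kc + alpha)). */
  def bayesWeights(sums: Matrix, alphaI: Double): Matrix = {
    val alpha = alphaI * sums.head.length // alpha = sum of alpha_i, assuming uniform smoothing
    sums.map { row =>
      val total = row.sum
      row.map(d => math.log((d + alphaI) / (total + alpha)))
    }
  }

  /** CBayes: the same estimate over every class except c, negated and then L1-normalized. */
  def cbayesWeights(sums: Matrix, alphaI: Double): Matrix = {
    val numTerms = sums.head.length
    val alpha = alphaI * numTerms
    val termTotals = Array.tabulate(numTerms)(i => sums.map(_(i)).sum)
    val grandTotal = termTotals.sum
    sums.map { row =>
      val complementTotal = grandTotal - row.sum
      val w = Array.tabulate(numTerms) { i =>
        -math.log((termTotals(i) - row(i) + alphaI) / (complementTotal + alpha))
      }
      val norm = w.map(x => math.abs(x)).sum
      w.map(_ / norm)
    }
  }

  /** Label assignment: argmax over classes c of sum_i t_i * w(c)(i); ties go to the lowest class index. */
  def classify(weights: Matrix, t: Array[Double]): Int = {
    val scores = weights.map(w => w.zip(t).map { case (wi, ti) => wi * ti }.sum)
    scores.indexOf(scores.max)
  }
}
```

With two classes whose documents use disjoint terms, for example `docs = Array(Array(2.0, 0.0), Array(0.0, 2.0))` and `labels = Array(0, 1)`, both weightings assign a test document containing only the first term to class 0. The sketch follows the formulas literally, including the `arg max` for CBayes.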
Here we will give a simple overview of Mahout CLI commands used to preprocess the data, train the model and assign labels to the training set. An [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) is given for the full process from data acquisition through classification of the classic [20 Newsgroups corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html). + +- **Preprocessing:** +For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization (-n 2 option) as follows: + + mahout seq2sparse + -i ${PATH_TO_SEQUENCE_FILES} + -o ${PATH_TO_TFIDF_VECTORS} + -nv + -n 2 + -wt tfidf + +- **Training:** +The model is then trained using `mahout trainnb` . The default is to train a Bayes model. The -c option is given to train a CBayes model: + + mahout trainnb + -i ${PATH_TO_TFIDF_VECTORS} + -o ${PATH_TO_MODEL}/model + -li ${PATH_TO_MODEL}/labelindex + -ow + -c + +- **Label Assignment/Testing:** +Classification and testing on a holdout set can then be performed via `mahout testnb`. Again, the -c option indicates that the model is CBayes. The -seq option tells `mahout testnb` to run sequentially: + + mahout testnb + -i ${PATH_TO_TFIDF_TEST_VECTORS} + -m ${PATH_TO_MODEL}/model + -l ${PATH_TO_MODEL}/labelindex + -ow + -o ${PATH_TO_OUTPUT} + -c + -seq + +## Command line options + +- **Preprocessing:** + + Only relevant parameters used for Bayes/CBayes as detailed above are shown. Several other transformations can be performed by `mahout seq2sparse` and used as input to Bayes/CBayes. For a full list of `mahout seq2Sparse` options see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) page. 
+ + mahout seq2sparse + --output (-o) output The directory pathname for output. + --input (-i) input Path to job input directory. + --weight (-wt) weight The kind of weight to use. Currently TF + or TFIDF. Default: TFIDF + --norm (-n) norm The norm to use, expressed as either a + float or "INF" if you want to use the + Infinite norm. Must be greater or equal + to 0. The default is not to normalize + --overwrite (-ow) If set, overwrite the output directory + --sequentialAccessVector (-seq) (Optional) Whether output vectors should + be SequentialAccessVectors. If set true + else false + --namedVector (-nv) (Optional) Whether output vectors should + be NamedVectors. If set true else false + +- **Training:** + + mahout trainnb + --input (-i) input Path to job input directory. + --output (-o) output The directory pathname for output. + --alphaI (-a) alphaI Smoothing parameter. Default is 1.0 + --trainComplementary (-c) Train complementary? Default is false. + --labelIndex (-li) labelIndex The path to store the label index in + --overwrite (-ow) If present, overwrite the output directory + before running job + --help (-h) Print out help + --tempDir tempDir Intermediate output directory + --startPhase startPhase First phase to run + --endPhase endPhase Last phase to run + +- **Testing:** + + mahout testnb + --input (-i) input Path to job input directory. + --output (-o) output The directory pathname for output. + --overwrite (-ow) If present, overwrite the output directory + before running job + + + --model (-m) model The path to the model built during training + --testComplementary (-c) Test complementary? Default is false. + --runSequential (-seq) Run sequential? 
+ --labelIndex (-l) labelIndex The path to the location of the label index + --help (-h) Print out help + --tempDir tempDir Intermediate output directory + --startPhase startPhase First phase to run + --endPhase endPhase Last phase to run + + +## Examples + +Mahout provides an example for Naive Bayes classification: + +1. [Classify 20 Newsgroups](twenty-newsgroups.html) + +## References + +[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003). [Tackling the Poor Assumptions of Naive Bayes Text Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003). + + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md new file mode 100644 index 0000000..d8d049e --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/breiman-example.md @@ -0,0 +1,67 @@ +--- +layout: default +title: Breiman Example +theme: + name: retro-mahout +--- + +# Breiman Example + +#### Introduction + +This page describes how to run the Breiman example, which implements the test procedure described in [Leo Breiman's paper](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.23.3999&rep=rep1&type=pdf). 
The basic algorithm is as follows: + + * repeat *I* iterations + * in each iteration do + * keep 10% of the dataset apart as a testing set + * build two forests using the training set, one with *m = int(log2(M) + 1)* (called Random-Input) and one with *m = 1* (called Single-Input), where *M* is the number of input attributes + * choose the forest that gave the lowest out-of-bag (oob) error estimate to compute +the test set error + * compute the test set error using the Single-Input forest (test error); +this demonstrates that even with *m = 1*, Decision Forests give results +comparable to those for greater values of *m* + * compute the mean test set error using every tree of the chosen forest +(tree error). This should indicate how well a single Decision Tree performs + * compute the mean test error for all iterations + * compute the mean tree error for all iterations + + +#### Running the Example + +The current implementation is compatible with the [UCI repository](http://archive.ics.uci.edu/ml/) file format. We'll show how to run this example on two datasets: + +First, we deal with [Glass Identification](http://archive.ics.uci.edu/ml/datasets/Glass+Identification): download the [dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data) file called **glass.data** and store it on your local machine. Next, we must generate the descriptor file **glass.info** for this dataset with the following command: + + bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/glass.data -f /path/to/glass.info -d I 9 N L + +Substitute */path/to/* with the folder where you downloaded the dataset. The argument "I 9 N L" indicates the nature of the variables. Here it means 1 +ignored (I) attribute, followed by 9 numerical (N) attributes, followed by +the label (L). 
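To make the descriptor notation concrete, here is a minimal Python sketch of how a string such as "I 9 N L" can be read as one type code per attribute. It is only an illustration of the notation described above, not Mahout's `Describe` tool; the function name `expand_descriptor` is hypothetical, and the sketch assumes that a number applies only to the type code that immediately follows it.

```python
def expand_descriptor(descriptor):
    """Expand a descriptor such as "I 9 N L" into one type code per attribute.

    Assumption: a number is a repeat count for the next type code only.
    """
    expanded = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            # Remember the count and apply it to the next type code.
            repeat = int(token)
        else:
            expanded.extend([token] * repeat)
            repeat = 1
    return expanded


glass = expand_descriptor("I 9 N L")
print(len(glass), glass)
```

Read this way, the glass descriptor covers 11 columns: one ignored attribute, nine numerical attributes and the label.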
+ +Finally, we build and evaluate our random forest classifier as follows: + + bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/glass.data -ds /path/to/glass.info -i 10 -t 100 + +This builds 100 trees (-t argument) and repeats the test for 10 iterations (-i +argument). + +The example outputs the following results: + + * Selection error: mean test error for the selected forest on all iterations + * Single Input error: mean test error for the single input forest on all +iterations + * One Tree error: mean single tree error on all iterations + * Mean Random Input Time: mean build time for random input forests on all +iterations + * Mean Single Input Time: mean build time for single input forests on all +iterations + +We can repeat this for a [Sonar](http://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+%28Sonar,+Mines+vs.+Rocks%29) use case: download the [dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data) file called **sonar.all-data** and store it on your local machine. Generate the descriptor file **sonar.info** for this dataset with the following command: + + bin/mahout org.apache.mahout.classifier.df.tools.Describe -p /path/to/sonar.all-data -f /path/to/sonar.info -d 60 N L + +The argument "60 N L" means 60 numerical (N) attributes, followed by the label (L). 
Analogous to the previous case, we run the evaluation as follows: + + bin/mahout org.apache.mahout.classifier.df.BreimanExample -d /path/to/sonar.all-data -ds /path/to/sonar.info -i 10 -t 100 + + + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md new file mode 100644 index 0000000..a24cc14 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/class-discovery.md @@ -0,0 +1,155 @@ +--- +layout: default +title: Class Discovery +theme: + name: retro-mahout +--- +<a name="ClassDiscovery-ClassDiscovery"></a> +# Class Discovery + +See http://www.cs.bham.ac.uk/~wbl/biblio/gecco1999/GP-417.pdf + +CDGA uses a Genetic Algorithm to discover a classification rule for a given +dataset. +A dataset can be seen as a table: + +<table> +<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr> +<tr><td>row 1</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr> +<tr><td>row 2</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr> +<tr><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr> +<tr><td>row M</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr> +</table> + +An attribute can be numerical, for example a "temperature" attribute, or +categorical, for example a "color" attribute. For classification purposes, +one of the categorical attributes is designated as a *label*, which means +that its value defines the *class* of the rows. 
+A classification rule can be represented as follows: +<table> +<tr><th> </th><th>attribute 1</th><th>attribute 2</th><th>...</th><th>attribute N</th></tr> +<tr><td>weight</td><td>w1</td><td>w2</td><td>...</td><td>wN</td></tr> +<tr><td>operator</td><td>op1</td><td>op2</td><td>...</td><td>opN</td></tr> +<tr><td>value</td><td>value1</td><td>value2</td><td>...</td><td>valueN</td></tr> +</table> + +For a given *target* class and a weight *threshold*, the classification +rule can be read as: + + + for each row of the dataset + if (rule.w1 < threshold || (rule.w1 >= threshold && row.value1 rule.op1 +rule.value1)) && + (rule.w2 < threshold || (rule.w2 >= threshold && row.value2 rule.op2 +rule.value2)) && + ... + (rule.wN < threshold || (rule.wN >= threshold && row.valueN rule.opN +rule.valueN)) then + row is part of the target class + + +*Important:* The label attribute is not evaluated by the rule. + +The threshold parameter allows some conditions of the rule to be skipped if +their weight is too small. The operators available depend on the attribute +types: +* for numerical attributes, the available operators are '<' and '>=' +* for categorical attributes, the available operators are '!=' and '==' + +The "threshold" and "target" are user-defined parameters. Because the +label is always a categorical attribute, the target is the (zero-based) +index of the class label value in all the possible values of the label. For +example, if the label attribute can have the following values (blue, brown, +green), then a target of 1 means the "brown" class. 
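The way a rule is read above can be sketched in a few lines of Python. This is a minimal illustration and not the CDGA implementation: the names `OPERATORS` and `rule_matches` are hypothetical, the sample rows and rule are made up, and each row is assumed to hold only the non-label attributes, in the same order as the rule's conditions.

```python
import operator

# The four operators named in the text, mapped to Python comparisons.
OPERATORS = {
    "<": operator.lt,
    ">=": operator.ge,
    "!=": operator.ne,
    "==": operator.eq,
}


def rule_matches(row, weights, ops, values, threshold):
    """Return True if every condition whose weight reaches the threshold holds."""
    for row_value, weight, op, rule_value in zip(row, weights, ops, values):
        if weight < threshold:
            continue  # weight too small: this condition is skipped
        if not OPERATORS[op](row_value, rule_value):
            return False
    return True


# Made-up rows of (age, hair color); the label column is left out.
rows = [(16, "dark"), (25, "light"), (12, "light")]
rule = dict(weights=(0, 1), ops=("<", "!="), values=(20, "light"))
for row in rows:
    print(row, rule_matches(row, threshold=1, **rule))
```

Skipping a condition whose weight is below the threshold is equivalent to the `w < threshold || (w >= threshold && ...)` form in the pseudocode, because the skipped condition counts as satisfied.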
+ +For example, we have the following dataset (the label attribute is "Eye +Color"): +<table> +<tr><th> </th><th>Age</th><th>Eye Color</th><th>Hair Color</th></tr> +<tr><td>row 1</td><td>16</td><td>brown</td><td>dark</td></tr> +<tr><td>row 2</td><td>25</td><td>green</td><td>light</td></tr> +<tr><td>row 3</td><td>12</td><td>blue</td><td>light</td></tr> +</table> + +and a classification rule: +<table> +<tr><th> </th><th>Age</th><th>Hair Color</th></tr> +<tr><td>weight</td><td>0</td><td>1</td></tr> +<tr><td>operator</td><td>&lt;</td><td>!=</td></tr> +<tr><td>value</td><td>20</td><td>light</td></tr> +</table> + +and the following parameters: threshold = 1 and target = 0 (brown). + +This rule can be read as follows: + + for each row of the dataset + if (0 < 1 || (0 >= 1 && row.value1 < 20)) && + (1 < 1 || (1 >= 1 && row.value2 != light)) then + row is part of the "brown Eye Color" class + + +Please note how the rule skipped the label attribute (Eye Color), and how +the first condition is ignored because its weight is < threshold. + +<a name="ClassDiscovery-Runningtheexample:"></a> +# Running the example: +NOTE: Substitute in the appropriate version for the Mahout JOB jar + +1. `cd <MAHOUT_HOME>/examples` +1. `ant job` +1. `<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc wdbc` +1. `<HADOOP_HOME>/bin/hadoop dfs -put <MAHOUT_HOME>/examples/src/test/resources/wdbc.infos wdbc.infos` +1. 
`<HADOOP_HOME>/bin/hadoop jar <MAHOUT_HOME>/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.ga.watchmaker.cd.CDGA <MAHOUT_HOME>/examples/src/test/resources/wdbc 1 0.9 1 0.033 0.1 0 100 10` + + CDGA needs 9 parameters: + * param 1 : path of the directory that contains the dataset and its infos +file + * param 2 : target class + * param 3 : threshold + * param 4 : number of crossover points for the multi-point crossover + * param 5 : mutation rate + * param 6 : mutation range + * param 7 : mutation precision + * param 8 : population size + * param 9 : number of generations before the program stops + + For more information about the 4th parameter, please see [Multi-point Crossover](http://www.geatbx.com/docu/algindex-03.html#P616_36571). + For a detailed explanation of the 5th, 6th and 7th parameters, please +see [Real Valued Mutation](http://www.geatbx.com/docu/algindex-04.html#P659_42386). + + *TODO*: Fill in where to find the output and what it means. + + # The info file: + To run properly, CDGA needs some information about the dataset. Each +dataset should be accompanied by an .infos file that contains the needed +information. For each attribute, a corresponding line in the info file +describes it. The line can be one of the following: + * IGNORED + if the attribute is ignored + * LABEL, val1, val2,... + if the attribute is the label (class), and its possible values + * CATEGORICAL, val1, val2,... + if the attribute is categorical (nominal), and its possible values + * NUMERICAL, min, max + if the attribute is numerical, and its min and max values + + This file can be generated automatically using a special tool available with +CDGA. 
+ + + +* the tool searches for an existing infos file (*must be filled by the +user*), in the same directory as the dataset, with the same name and with +the ".infos" extension, that contains the types of the attributes: + * 'N' numerical attribute + * 'C' categorical attribute + * 'L' label (this is also a categorical attribute) + * 'I' to ignore the attribute + each attribute is in a separate +* A Hadoop job is used to parse the dataset and collect the information. +This means that *the dataset can be distributed over HDFS*. +* the results are written back in the same .info file, with the correct +format needed by CDGA. http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/classifyingyourdata.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/classifyingyourdata.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/classifyingyourdata.md new file mode 100644 index 0000000..c2099c0 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/classifyingyourdata.md @@ -0,0 +1,27 @@ +--- +layout: default +title: ClassifyingYourData +theme: + name: retro-mahout +--- + +# Classifying data from the command line + + +After you've done the [Quickstart](../basics/quickstart.html) and are familiar with the basics of Mahout, it is time to build a +classifier from your own data. The following pieces *may* be useful in getting started: + +<a name="ClassifyingYourData-Input"></a> +# Input + +For starters, you will need your data in an appropriate Vector format: see [Creating Vectors](../basics/creating-vectors.html) as well as [Creating Vectors from Text](../basics/creating-vectors-from-text.html). 
+ +<a name="ClassifyingYourData-RunningtheProcess"></a> +# Running the Process + +* Logistic regression [background](logistic-regression.html) +* [Naive Bayes background](naivebayes.html) and [commandline](bayesian-commandline.html) options. +* [Complementary naive bayes background](complementary-naive-bayes.html), [design](https://issues.apache.org/jira/browse/mahout-60.html), and [c-bayes-commandline](c-bayes-commandline.html) +* [Random Forests Classification](https://cwiki.apache.org/confluence/display/MAHOUT/Random+Forests) comes with a [Breiman example](breiman-example.html). There is some really great documentation +over at [Mark Needham's blog](http://www.markhneedham.com/blog/2012/10/27/kaggle-digit-recognizer-mahout-random-forest-attempt/). Also check out the description on [Xiaomeng Shawn Wan's](http://shawnwan.wordpress.com/2012/06/01/mahout-0-7-random-forest-examples/) blog. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/collocations.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/collocations.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/collocations.md new file mode 100644 index 0000000..f107850 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/collocations.md @@ -0,0 +1,385 @@ +--- +layout: default +title: Collocations +theme: + name: retro-mahout +--- + + + +<a name="Collocations-CollocationsinMahout"></a> +# Collocations in Mahout + +A collocation is defined as a sequence of words or terms which co-occur +more often than would be expected by chance. Statistically relevant +combinations of terms identify additional lexical units which can be +treated as features in a vector-based representation of a text. 
A detailed +discussion of collocations can be found on [Wikipedia](http://en.wikipedia.org/wiki/Collocation). + +See also the discussion of collocations in the [Reuters example](http://comments.gmane.org/gmane.comp.apache.mahout.user/5685). + +<a name="Collocations-Log-LikelihoodbasedCollocationIdentification"></a> +## Theory behind implementation: Log-Likelihood based Collocation Identification + +Mahout provides an implementation of a collocation identification algorithm +which scores collocations using the log-likelihood ratio (LLR). The log-likelihood +score indicates the relative usefulness of a collocation with regard to other +term combinations in the text. Collocations with the highest scores in a +particular corpus will generally be more useful as features. + +Calculating the LLR is very straightforward and is described concisely in +[Ted Dunning's blog post](http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html). +Ted describes the series of counts required to calculate the LLR for two +events A and B in order to determine if they co-occur more often than pure +chance. These counts include the number of times the events co-occur (k11), +the number of times the events occur without each other (k12 and k21), and +the number of times anything occurs. These counts are summarized in the +following table: + +<table> +<tr><td> </td><td> Event A </td><td> Everything but Event A </td></tr> +<tr><td> Event B </td><td> A and B together (k11) </td><td> B but not A (k12) </td></tr> +<tr><td> Everything but Event B </td><td> A but not B (k21) </td><td> Neither B nor A (k22) </td></tr> +</table> + +For the purposes of collocation identification, it is useful to begin by +thinking in word pairs (bigrams). In this case the leading or head term from +the pair corresponds to A from the table above, B corresponds to the +trailing or tail term, while neither B nor A is the total number of word +pairs in the corpus less those containing B, A or both B and A. 
+ +Given the word pair 'oscillation overthruster', the Log-Likelihood ratio +is computed by looking at the number of occurrences of that word pair in the +corpus, the number of word pairs that begin with 'oscillation' but end with +something other than 'overthruster', the number of word pairs that end with +'overthruster' but begin with something other than 'oscillation', and the number +of word pairs in the corpus that contain neither 'oscillation' nor +'overthruster'. + +This can be extended from bigrams to trigrams, 4-grams and beyond. In these +cases, the current algorithm uses the first token of the ngram as the head +of the ngram and the remaining n-1 tokens from the ngram, the n-1gram as it +were, as the tail. Given the trigram 'hong kong cavaliers', 'hong' is +treated as the head while 'kong cavaliers' is treated as the tail. Future +versions of this algorithm will allow for variations in which tokens of the +ngram are treated as the head and tail. + +Beyond ngrams, it is often useful to inspect cases where individual words +occur around other interesting features of the text such as sentence +boundaries. + +<a name="Collocations-GeneratingNGrams"></a> +## Generating NGrams + +The tools in which the collocation identification algorithm is embedded +either consume tokenized text as input or provide the ability to specify an +implementation of the Lucene Analyzer class to perform tokenization in order +to form ngrams. The tokens are passed through a Lucene ShingleFilter to +produce NGrams of the desired length. + +Given the text "Alice was beginning to get very tired" as an example, +Lucene's StandardAnalyzer produces the tokens 'alice', 'beginning', 'get', +'very' and 'tired', while the ShingleFilter with a max NGram size set to 3 +produces the shingles 'alice beginning', 'alice beginning get', 'beginning +get', 'beginning get very', 'get very', 'get very tired' and 'very tired'. +Note that both bigrams and trigrams are produced here. 
A future enhancement +to the existing algorithm would involve limiting the output to a particular +gram size as opposed to solely specifying a max ngram size. + +<a name="Collocations-RunningtheCollocationIdentificationAlgorithm."></a> +## Running the Collocation Identification Algorithm. + +There are two ways to run the LLR-based collocation algorithm in +Mahout: + +<a name="Collocations-Whencreatingvectorsfromasequencefile"></a> +### When creating vectors from a sequence file + +The LLR collocation identifier is integrated into the process that is used +to create vectors from sequence files of text keys and values. Collocations +are generated when the --maxNGramSize (-ng) option is not specified and +defaults to 2 or is set to a number of 2 or greater. The --minLLR option +can be used to control the cutoff that prevents collocations below the +specified LLR score from being emitted, and the --minSupport argument can +be used to filter out collocations that appear fewer than a certain number of +times. + + + bin/mahout seq2sparse + + Usage: + [--minSupport <minSupport> --analyzerName <analyzerName> --chunkSize <chunkSize> + --output <output> --input <input> --minDF <minDF> + --maxDFPercent <maxDFPercent> --weight <weight> --norm <norm> --minLLR <minLLR> + --numReducers <numReducers> --maxNGramSize <ngramSize> --overwrite --help + --sequentialAccessVector] + Options + + --minSupport (-s) minSupport (Optional) Minimum Support. Default Value: 2 + + --analyzerName (-a) analyzerName The class name of the analyzer + + --chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. 100-10000MB + + --output (-o) output The output directory + + --input (-i) input Input dir containing the documents in sequence file format + + --minDF (-md) minDF The minimum document frequency. Default is 1 + + --maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF. Can be used to remove + really high frequency terms. Expressed as an + integer between 0 and 100. Default is 99. 
+ + --weight (-wt) weight The kind of weight to use. Currently TF + or TFIDF + + --norm (-n) norm The norm to use, expressed as either a + float or "INF" if you want to use the + Infinite norm. Must be greater orequal + to 0. The default is not to normalize + + --minLLR (-ml) minLLR (Optional)The minimum Log Likelihood + Ratio(Float) Default is 1.0 + + --numReducers (-nr) numReducers (Optional) Number of reduce tasks. + Default Value: 1 + + --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams to + create (2 = bigrams, 3 = trigrams, etc) + Default Value:2 + + --overwrite (-w) If set, overwrite the output directory + --help (-h) Print out help + --sequentialAccessVector (-seq) (Optional) Whether output vectors should + be SequentialAccessVectors If set true + else false + + +<a name="Collocations-CollocDriver"></a> +### CollocDriver + + + bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver + + Usage: + [--input <input> --output <output> --maxNGramSize <ngramSize> --overwrite + --minSupport <minSupport> --minLLR <minLLR> --numReducers <numReducers> + --analyzerName <analyzerName> --preprocess --unigram --help] + + Options + + --input (-i) input The Path for input files. + + --output (-o) output The Path write output to + + --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngramsto + create (2 = bigrams, 3 = trigrams,etc) + Default Value:2 + + --overwrite (-w) If set, overwrite the outputdirectory + + --minSupport (-s) minSupport (Optional) Minimum Support. Default + Value: 2 + + --minLLR (-ml) minLLR (Optional)The minimum Log Likelihood + Ratio(Float) Default is 1.0 + + --numReducers (-nr) numReducers (Optional) Number of reduce tasks. + Default Value: 1 + + --analyzerName (-a) analyzerName The class name of the analyzer + + --preprocess (-p) If set, input is SequenceFile<Text,Text> + where the value is the document, which + will be tokenized using the specified + analyzer. 
+ + --unigram (-u) If set, unigrams will be emitted in the + final output alongside collocations + + --help (-h) Print out help + + +<a name="Collocations-Algorithmdetails"></a> +## Algorithm details + +This section describes the implementation of the collocation identification +algorithm in terms of the map-reduce phases that are used to generate +ngrams and count the frequencies required to perform the log-likelihood +calculation. Unless otherwise noted, classes that are indicated in +CamelCase can be found in the mahout-utils module under the package +org.apache.mahout.utils.nlp.collocations.llr + +The algorithm is implemented in two map-reduce passes: + +<a name="Collocations-Pass1:CollocDriver.generateCollocations(...)"></a> +### Pass 1: CollocDriver.generateCollocations(...) + +Generates NGrams and counts frequencies for ngrams, head and tail subgrams. + +<a name="Collocations-Map:CollocMapper"></a> +#### Map: CollocMapper + +Input k: Text (documentId), v: StringTuple (tokens) + +Each call to the mapper passes in the full set of tokens for the +corresponding document using a StringTuple. The ShingleFilter is run across +these tokens to produce ngrams of the desired length. Ngrams and +frequencies are collected across the entire document. + +Once this is done, ngrams are split into head and tail portions. A key of type GramKey is generated, which is used later to join ngrams with their heads and tails in the reducer phase. The GramKey is a composite key made up of a string n-gram fragment as the primary key and a secondary key used for grouping and sorting in the reduce phase. The secondary key will either be EMPTY, in the case where we are collecting either the head or tail of an ngram as the value, or it will contain the byte[] +form of the ngram when collecting an ngram as the value. 
+ + + head_key(EMPTY) -> (head subgram, head frequency) + + head_key(ngram) -> (ngram, ngram frequency) + + tail_key(EMPTY) -> (tail subgram, tail frequency) + + tail_key(ngram) -> (ngram, ngram frequency) + + +Subgram and ngram values are packaged in Gram objects. + +For each ngram found, the Count.NGRAM_TOTAL counter is incremented. When +the pass is complete, this counter will hold the total number of ngrams +encountered in the input, which is used as a part of the LLR calculation. + +Output k: GramKey (head or tail subgram), v: Gram (head, tail or ngram with +frequency) + +<a name="Collocations-Combiner:CollocCombiner"></a> +#### Combiner: CollocCombiner + +Input k: GramKey, v:Gram (as above) + +This phase merges the counts for unique ngrams or ngram fragments across +multiple documents. The combiner treats the entire GramKey as the key and +as such, identical tuples from separate documents are passed into a single +call to the combiner's reduce method, their frequencies are summed and a +single tuple is passed out via the collector. + +Output k: GramKey, v:Gram + +<a name="Collocations-Reduce:CollocReducer"></a> +#### Reduce: CollocReducer + +Input k: GramKey, v: Gram (as above) + +The CollocReducer employs the Hadoop secondary sort strategy to avoid +caching ngram tuples in memory in order to calculate total ngram and +subgram frequencies. The GramKeyPartitioner ensures that tuples with the +same primary key are sent to the same reducer, while the +GramKeyGroupComparator ensures that the iterator provided by the reduce method +first returns the subgram and then returns ngram values grouped by ngram. +This eliminates the need to cache the values returned by the iterator in +order to calculate total frequencies for both subgrams and ngrams. The +input will consist of multiple frequencies for each (subgram_key, subgram) +or (subgram_key, ngram) tuple; one from each map task executed in which the +particular subgram was found. 
+The input will be traversed in the following order: + + + (head subgram, frequency 1) + (head subgram, frequency 2) + ... + (head subgram, frequency N) + (ngram 1, frequency 1) + (ngram 1, frequency 2) + ... + (ngram 1, frequency N) + (ngram 2, frequency 1) + (ngram 2, frequency 2) + ... + (ngram 2, frequency N) + ... + (ngram N, frequency 1) + (ngram N, frequency 2) + ... + (ngram N, frequency N) + + +Where all of the ngrams above share the same head. Data is presented in the +same manner for the tail subgrams. + +As the values for a subgram or ngram are traversed, frequencies are +accumulated. Once all values for a subgram or ngram are processed, the +resulting key/value pairs are passed to the collector as long as the ngram +frequency is equal to or greater than the specified minSupport. When an +ngram is skipped in this way, the Skipped.LESS_THAN_MIN_SUPPORT counter is +incremented. + +Pairs are passed to the collector in the following format: + + + ngram, ngram frequency -> subgram, subgram frequency + + +In this manner, the output becomes an unsorted version of the following: + + + ngram 1, frequency -> ngram 1 head, head frequency + ngram 1, frequency -> ngram 1 tail, tail frequency + ngram 2, frequency -> ngram 2 head, head frequency + ngram 2, frequency -> ngram 2 tail, tail frequency + ngram N, frequency -> ngram N head, head frequency + ngram N, frequency -> ngram N tail, tail frequency + + +Output is in the format k:Gram (ngram, frequency), v:Gram (subgram, +frequency) + +<a name="Collocations-Pass2:CollocDriver.computeNGramsPruneByLLR(...)"></a> +### Pass 2: CollocDriver.computeNGramsPruneByLLR(...) + +Pass 1 has calculated full frequencies for ngrams and subgrams; Pass 2 +performs the LLR calculation. + +<a name="Collocations-MapPhase:IdentityMapper(org.apache.hadoop.mapred.lib.IdentityMapper)"></a> +#### Map Phase: IdentityMapper (org.apache.hadoop.mapred.lib.IdentityMapper) + +This phase is a no-op. The data is passed through unchanged. 
The rest of +the work for the LLR calculation is done in the reduce phase. + +<a name="Collocations-ReducePhase:LLRReducer"></a> +#### Reduce Phase: LLRReducer + +Input is k:Gram, v:Gram (as above) + +This phase receives the head and tail subgrams and their frequencies for +each ngram (with frequency) produced for the input: + + + ngram 1, frequency -> ngram 1 head, frequency; ngram 1 tail, frequency + ngram 2, frequency -> ngram 2 head, frequency; ngram 2 tail, frequency + ... + ngram N, frequency -> ngram N head, frequency; ngram N tail, frequency + + +It also reads the full ngram count obtained from the first pass, passed in +as a configuration option. The parameters to the LLR calculation are +calculated as follows: + + k11 = f_n + k12 = f_h - f_n + k21 = f_t - f_n + k22 = N - ((f_h + f_t) - f_n) + +Where f_n is the ngram frequency, f_h and f_t are the frequencies of the head and +tail, and N is the total number of ngrams. + +Ngrams with an LLR below the specified minimum LLR are dropped and +the Skipped.LESS_THAN_MIN_LLR counter is incremented. + +Output is k: Text (ngram), v: DoubleWritable (llr score) + +<a name="Collocations-Unigrampass-through."></a> +### Unigram pass-through. + +By default in seq2sparse, or if the -u option is provided to the +CollocDriver, unigrams (single tokens) will be passed through the job and +each token's frequency will be calculated. As with ngrams, unigrams are +subject to filtering with minSupport and minLLR. 
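To make the LLRReducer step above more concrete, the following Python sketch derives k11, k12, k21 and k22 from the frequencies and then computes a log-likelihood ratio with the entropy-based formulation commonly used for Dunning's test. It is a sketch under stated assumptions, not Mahout's code: the function names `contingency_counts`, `x_log_x`, `entropy` and `log_likelihood_ratio` are hypothetical, and the frequencies in the usage lines are made up.

```python
import math


def contingency_counts(f_n, f_h, f_t, total):
    """Build k11, k12, k21, k22 from ngram, head and tail frequencies."""
    k11 = f_n
    k12 = f_h - f_n
    k21 = f_t - f_n
    k22 = total - ((f_h + f_t) - f_n)
    return k11, k12, k21, k22


def x_log_x(x):
    # By convention 0 * log(0) is taken as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Guard against tiny negative values caused by round-off.
    return max(0.0, 2.0 * (row + col - mat))


# Made-up frequencies: ngram seen 5 times, head 8, tail 7, 100 ngrams in total.
counts = contingency_counts(f_n=5, f_h=8, f_t=7, total=100)
print(counts, log_likelihood_ratio(*counts))
```

A score near zero means the head and tail co-occur about as often as chance would predict; an ngram whose score falls below minLLR would be dropped as described above.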
+ http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/gaussian-discriminative-analysis.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/gaussian-discriminative-analysis.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/gaussian-discriminative-analysis.md new file mode 100644 index 0000000..e8a54af --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/gaussian-discriminative-analysis.md @@ -0,0 +1,20 @@ +--- +layout: default +title: Gaussian Discriminative Analysis +theme: + name: retro-mahout +--- + +<a name="GaussianDiscriminativeAnalysis-GaussianDiscriminativeAnalysis"></a> +# Gaussian Discriminative Analysis + +Gaussian Discriminative Analysis is a tool for multigroup classification +based on extending linear discriminant analysis. 
The paper on the approach +is located at http://citeseer.ist.psu.edu/4617.html (note, for some reason +the paper is backwards, in that page 1 is at the end). + +<a name="GaussianDiscriminativeAnalysis-Parallelizationstrategy"></a> +## Parallelization strategy + +<a name="GaussianDiscriminativeAnalysis-Designofpackages"></a> +## Design of packages http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/hidden-markov-models.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/hidden-markov-models.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/hidden-markov-models.md new file mode 100644 index 0000000..7321493 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/hidden-markov-models.md @@ -0,0 +1,102 @@ +--- +layout: default +title: Hidden Markov Models +theme: + name: retro-mahout +--- + +# Hidden Markov Models + +<a name="HiddenMarkovModels-IntroductionandUsage"></a> +## Introduction and Usage + +Hidden Markov Models are used in multiple areas of Machine Learning, such +as speech recognition, handwritten letter recognition or natural language +processing. + +<a name="HiddenMarkovModels-FormalDefinition"></a> +## Formal Definition + +A Hidden Markov Model (HMM) is a statistical model of a process consisting +of two (in our case discrete) random variables O and Y, which change their +state sequentially. The variable Y with states \{y_1, ... , y_n\} is called +the "hidden variable", since its state is not directly observable. The +state of Y changes sequentially according to a so-called Markov Property +(first-order, in our case). This means that the state change probability of Y only +depends on its current state and does not change in time. 
Formally we +write: P(Y(t+1)=y_i|Y(0)...Y(t)) = P(Y(t+1)=y_i|Y(t)) = P(Y(2)=y_i|Y(1)). +The variable O with states \{o_1, ... , o_m\} is called the "observable +variable", since its state can be directly observed. O does not have a +Markov Property, but its state probability depends statically on the +current state of Y. + +Formally, an HMM is defined as a tuple M=(n,m,P,A,B), where n is the number of hidden states, m is the number of observable states, P is an n-dimensional vector containing initial hidden state probabilities, A is the nxn-dimensional "transition matrix" containing the transition probabilities such that A\[i,j\]= +P(Y(t)=y_i|Y(t-1)=y_j) and B is the mxn-dimensional "emission matrix" +containing the observation probabilities such that B\[i,j\]= +P(O=o_i|Y=y_j). + +<a name="HiddenMarkovModels-Problems"></a> +## Problems + +Rabiner \[1\] defined three main problems for HMMs: + +1. Evaluation: Given a sequence O of observations and a model M, what is +the probability P(O|M) that sequence O was generated by model M? The +Evaluation problem can be efficiently solved using the Forward algorithm. +2. Decoding: Given a sequence O of observations and a model M, what is +the most likely sequence Y*=argmax(Y) P(O|M,Y) of hidden variables to +generate this sequence? The Decoding problem can be efficiently solved +using the Viterbi algorithm. +3. Learning: Given a sequence O of observations, what is the most likely +model M*=argmax(M)P(O|M) to generate this sequence? The Learning problem +can be efficiently solved using the Baum-Welch algorithm. + +<a name="HiddenMarkovModels-Example"></a> +## Example + +To build a Hidden Markov Model and use it to generate some predictions, try a simple example like this: + +Create an input file to train the model. Here we have a sequence drawn from the set of states 0, 1, 2, and 3, separated by space characters.
+ + $ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3 2 2 3 2 1 1 0" > hmm-input + +Now run the baumwelch job to train your model, after first setting MAHOUT_LOCAL to true, to use your local file system. + + $ export MAHOUT_LOCAL=true + $ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4 -e .0001 -m 1000 + +Output like the following should appear in the console. + + Initial probabilities: + 0 1 2 + 1.0 0.0 3.5659361683006626E-251 + Transition matrix: + 0 1 2 + 0 6.098919959130616E-5 0.9997275322964165 2.1147850399214744E-4 + 1 7.404648706054873E-37 0.9086408633885092 0.09135913661149081 + 2 0.2284374545687356 7.01786289571088E-11 0.7715625453610858 + Emission matrix: + 0 1 2 3 + 0 0.9999997858591223 2.0536163836449762E-39 2.1414087769942127E-7 1.052441093535389E-27 + 1 7.495656581383351E-34 0.2241269055449904 0.4510889999455847 0.32478409450942497 + 2 0.815051477991782 0.18494852200821799 8.465660634827592E-33 2.8603899591778015E-36 + 14/03/22 09:52:21 INFO driver.MahoutDriver: Program took 180 ms (Minutes: 0.003) + +The model trained with the input set now is in the file 'hmm-model', which we can use to build a predicted sequence. + + $ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10 + +To see the predictions: + + $ cat hmm-predictions + 0 1 3 3 2 2 2 2 1 2 + + +<a name="HiddenMarkovModels-Resources"></a> +## Resources + +\[1\] + Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models +and selected applications in speech recognition". Proceedings of the IEEE +77 (2): 257-286. doi:10.1109/5.18626. 
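
The Evaluation problem from the Problems section can be illustrated with a small Forward algorithm. This is a minimal sketch in plain Python, not Mahout's implementation: the function name `forward_probability` and the toy model values are assumptions made for illustration. It follows the index convention of the Formal Definition above (`A[i][j] = P(Y(t)=y_i|Y(t-1)=y_j)`, `B[i][j] = P(O=o_i|Y=y_j)`); the console output shown in the example appears to print rows indexed by the current hidden state, which would be the transposed orientation.

```python
def forward_probability(observations, initial, transition, emission):
    """Return P(O|M) for a discrete HMM using the Forward algorithm.

    initial[j]       = P(Y(0) = y_j)
    transition[i][j] = P(Y(t) = y_i | Y(t-1) = y_j)
    emission[i][j]   = P(O = o_i | Y = y_j)
    """
    n = len(initial)
    # alpha[j] = P(o_1 .. o_t, Y(t) = y_j) for the prefix seen so far
    alpha = [initial[j] * emission[observations[0]][j] for j in range(n)]
    for obs in observations[1:]:
        alpha = [
            emission[obs][i] * sum(transition[i][j] * alpha[j] for j in range(n))
            for i in range(n)
        ]
    return sum(alpha)


# Hypothetical toy model: 2 hidden states, 2 observable states.
# Each column sums to 1 because columns are indexed by the conditioning state.
INITIAL = [0.6, 0.4]
TRANSITION = [[0.7, 0.4],
              [0.3, 0.6]]
EMISSION = [[0.9, 0.2],
            [0.1, 0.8]]

if __name__ == "__main__":
    print(forward_probability([0, 1, 1], INITIAL, TRANSITION, EMISSION))
```

The loop keeps only the current `alpha` vector, so the cost is proportional to the sequence length times n squared, rather than enumerating every hidden state sequence.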
\ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/independent-component-analysis.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/independent-component-analysis.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/independent-component-analysis.md new file mode 100644 index 0000000..6035b54 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/independent-component-analysis.md @@ -0,0 +1,17 @@ +--- +layout: default +title: Independent Component Analysis +theme: + name: retro-mahout +--- + +<a name="IndependentComponentAnalysis-IndependentComponentAnalysis"></a> +# Independent Component Analysis + +See also: Principal Component Analysis. + +<a name="IndependentComponentAnalysis-Parallelizationstrategy"></a> +## Parallelization strategy + +<a name="IndependentComponentAnalysis-Designofpackages"></a> +## Design of packages http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/locally-weighted-linear-regression.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/locally-weighted-linear-regression.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/locally-weighted-linear-regression.md new file mode 100644 index 0000000..7b23d85 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/locally-weighted-linear-regression.md @@ -0,0 +1,25 @@ +--- +layout: default +title: Locally Weighted Linear Regression +theme: + name: retro-mahout +--- + +<a name="LocallyWeightedLinearRegression-LocallyWeightedLinearRegression"></a> +# 
Locally Weighted Linear Regression + +Model-based methods, such as SVM, Naive Bayes and the mixture of Gaussians, +use the data to build a parameterized model. After training, the model is +used for predictions and the data are generally discarded. In contrast, +"memory-based" methods are non-parametric approaches that explicitly retain +the training data, and use it each time a prediction needs to be made. +Locally weighted regression (LWR) is a memory-based method that performs a +regression around a point of interest using only training data that are +"local" to that point. Source: +http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/cohn96a-html/node7.html + +<a name="LocallyWeightedLinearRegression-Strategyforparallelregression"></a> +## Strategy for parallel regression + +<a name="LocallyWeightedLinearRegression-Designofpackages"></a> +## Design of packages http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/logistic-regression.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/logistic-regression.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/logistic-regression.md new file mode 100644 index 0000000..b066fda --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/logistic-regression.md @@ -0,0 +1,129 @@ +--- +layout: default +title: Logistic Regression +theme: + name: retro-mahout +--- + +<a name="LogisticRegression-LogisticRegression(SGD)"></a> +# Logistic Regression (SGD) + +Logistic regression is a model used to predict the probability that an +event occurs. It makes use of several predictor variables that +may be either numerical or categorical.
+ +Logistic regression is the standard industry workhorse that underlies many +production fraud detection and advertising quality and targeting products. +The Mahout implementation uses Stochastic Gradient Descent (SGD) to allow +large training sets to be used. + +For a more detailed analysis of the approach, have a look at the [thesis of +Paul Komarek](http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en) [1]. + +See MAHOUT-228 for the main JIRA issue for SGD. + +A more detailed overview of the Mahout Logistic Regression classifier and a [detailed description of building a Logistic Regression classifier](http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/) for the classic [Iris flower dataset](http://en.wikipedia.org/wiki/Iris_flower_data_set) are also available [2]. + +An example of training a Logistic Regression classifier for the [UCI Bank Marketing Dataset](http://mlr.cs.umass.edu/ml/datasets/Bank+Marketing) can be found [on the Mahout website](http://mahout.apache.org/users/classification/bankmarketing-example.html) [3]. + +An example of training and testing a Logistic Regression document classifier for the classic [20 newsgroups corpus](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) [4] is also available. + +<a name="LogisticRegression-Parallelizationstrategy"></a> +## Parallelization strategy + +The bad news is that SGD is an inherently sequential algorithm. The good +news is that it is blazingly fast and thus it is not a problem for Mahout's +implementation to handle training sets of tens of millions of examples. +With the down-sampling typical in many datasets, this is equivalent to a +dataset with billions of raw training examples. + +The SGD system in Mahout is an online learning algorithm, which means that +you can learn models in an incremental fashion and that you can do +performance testing as your system runs.
Often this means that you can +stop training when a model reaches a target level of performance. The SGD +framework includes classes to do online evaluation using cross validation +(the CrossFoldLearner) and an evolutionary system to optimize learning +hyper-parameters on the fly (the AdaptiveLogisticRegression). +The AdaptiveLogisticRegression system makes heavy use of threads to +increase machine utilization. It runs 20 +CrossFoldLearners in separate threads, each with slightly different +learning parameters. As better settings are found, these new settings are +propagated to the other learners. + +<a name="LogisticRegression-Designofpackages"></a> +## Design of packages + +There are three packages that are used in Mahout's SGD system. These +are + +* The vector encoding package (found in org.apache.mahout.vectorizer.encoders) + +* The SGD learning package (found in org.apache.mahout.classifier.sgd) + +* The evolutionary optimization system (found in org.apache.mahout.ep) + +<a name="LogisticRegression-Featurevectorencoding"></a> +## Feature vector encoding + +Because the SGD algorithms need to have fixed-length feature vectors and +because it is a pain to build a dictionary ahead of time, most SGD +applications use the hashed feature vector encoding system that is rooted +at FeatureVectorEncoder. + +The basic idea is that you create a vector, typically a +RandomAccessSparseVector, and then you use various feature encoders to +progressively add features to that vector. The size of the vector should +be large enough to avoid feature collisions as features are hashed. + +There are specialized encoders for a variety of data types. You can +normally encode either a string representation of the value you want to +encode or you can encode a byte-level representation to avoid string +conversion.
In the case of ContinuousValueEncoder and +ConstantValueEncoder, it is also possible to encode a null value and pass +the real value in as a weight. This avoids numerical parsing entirely in +case you are getting your training data from a system like Avro. + +Here is a class diagram for the encoders package: + + + +<a name="LogisticRegression-SGDLearning"></a> +## SGD Learning + +For the simplest applications, you can construct an +OnlineLogisticRegression and be off and running. Typically, though, it is +nice to have running estimates of performance on held-out data. To do +that, you should use a CrossFoldLearner which keeps a stable of five (by +default) OnlineLogisticRegression objects. Each time you pass a training +example to a CrossFoldLearner, it passes this example to all but one of its +children as training and passes the example to the last child to evaluate +current performance. The children are used for evaluation in a round-robin +fashion so, if you are using the default five-way split, all of the children +get 80% of the training data for training and get 20% of the data for +evaluation. + +To avoid the pesky need to configure learning rates, regularization +parameters and annealing schedules, you can use the +AdaptiveLogisticRegression. This class maintains a pool of +CrossFoldLearners and adapts learning rates and regularization on the fly +so that you don't have to. + +Here is a class diagram for the classifier.sgd package. As you can see, +the number of twiddlable knobs is pretty large. For some examples, see the +[TrainNewsGroups](https://github.com/apache/mahout/blob/master/examples/src/main/java/org/apache/mahout/classifier/sgd/TrainNewsGroups.java) example code.
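
The hashed feature encoding and the round-robin hold-out scheme described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Mahout's API: `encode`, `OnlineLogistic` and `CrossFold` are hypothetical stand-ins for the ideas behind FeatureVectorEncoder, OnlineLogisticRegression and CrossFoldLearner, and the CRC32 hash, dense list vector and fixed learning rate are simplifications chosen for brevity.

```python
import math
import zlib


def encode(features, size):
    """Hash each "name=value" feature into one slot of a fixed-length vector."""
    vector = [0.0] * size
    for name, value in features.items():
        slot = zlib.crc32(f"{name}={value}".encode("utf-8")) % size
        vector[slot] += 1.0  # colliding features simply share a slot
    return vector


class OnlineLogistic:
    """Plain SGD logistic regression (hypothetical stand-in, no regularization)."""

    def __init__(self, size, rate=0.1):
        self.weights = [0.0] * size
        self.rate = rate
        self.updates = 0

    def predict(self, x):
        z = sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    def train(self, x, y):
        error = y - self.predict(x)  # gradient of the log-likelihood
        for i, xi in enumerate(x):
            if xi:
                self.weights[i] += self.rate * error * xi
        self.updates += 1


class CrossFold:
    """Each example trains all folds but one; the remaining fold scores it."""

    def __init__(self, size, folds=5):
        self.models = [OnlineLogistic(size) for _ in range(folds)]
        self.seen = 0
        self.correct = 0

    def train(self, x, y):
        held_out = self.seen % len(self.models)  # round-robin evaluator
        for k, model in enumerate(self.models):
            if k == held_out:
                self.correct += int((model.predict(x) >= 0.5) == (y == 1))
            else:
                model.train(x, y)
        self.seen += 1

    def accuracy(self):
        return self.correct / self.seen if self.seen else 0.0
```

With five folds, every model is trained on four out of every five examples and evaluated on the fifth, which is where the 80%/20% split mentioned above comes from. The accuracy estimate is always computed on examples the scoring model has not yet trained on.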
+ + + +## References + +[1] [Thesis of +Paul Komarek](http://www.autonlab.org/autonweb/14709/version/4/part/5/data/komarek:lr_thesis.pdf?branch=main&language=en) + +[2] [An Introduction To Mahout's Logistic Regression SGD Classifier](http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/) + +## Examples + +[3] [SGD Bank Marketing Example](http://mahout.apache.org/users/classification/bankmarketing-example.html) + +[4] [SGD 20 newsgroups classification](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) + http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/classification/mahout-collections.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/classification/mahout-collections.md b/website/old_site_migration/needs_work_convenience/map-reduce/classification/mahout-collections.md new file mode 100644 index 0000000..99f22f6 --- /dev/null +++ b/website/old_site_migration/needs_work_convenience/map-reduce/classification/mahout-collections.md @@ -0,0 +1,60 @@ +--- +layout: default +title: mahout-collections +theme: + name: retro-mahout +--- + +# Mahout collections + +<a name="mahout-collections-Introduction"></a> +## Introduction + +The Mahout Collections library is a set of container classes that address +some limitations of the standard collections in Java. [This presentation](http://domino.research.ibm.com/comm/research_people.nsf/pages/sevitsky.pubs.html/$FILE/oopsla08%20memory-efficient%20java%20slides.pdf) + describes a number of performance problems with the standard collections. + +Mahout Collections addresses two of the more glaring ones: the lack of support +for primitive types and the lack of open addressing.
+ +<a name="mahout-collections-PrimitiveTypes"></a> +## Primitive Types + +The most visible feature of Mahout Collections is the large collection of +primitive type collections. Given Java's asymmetrical support for the +primitive types, the only efficient way to handle them is with many +classes. So, there are ArrayList-like containers for all of the primitive +types, and hash maps for all the useful combinations of primitive type and +object keys and values. + +These classes do not, in general, implement interfaces from *java.util*. +Even when the *java.util* interfaces could be type-compatible, they tend +to include requirements that are not consistent with efficient use of +primitive types. + +<a name="mahout-collections-OpenAddressing"></a> +# Open Addressing + +All of the sets and maps in Mahout Collections are open-addressed hash +tables. Open addressing has a much smaller memory footprint than chaining. +Since the purpose of these collections is to avoid the memory cost of +autoboxing, open addressing is a consistent design choice. + +<a name="mahout-collections-Sets"></a> +## Sets + +Mahout Collections includes open hash sets. Unlike *java.util*, a set is +not a recycled hash table; the sets are separately implemented and do not +have any additional storage usage for unused keys. + +<a name="mahout-collections-CreditwhereCreditisdue"></a> +# Credit where Credit is due + +The implementation of Mahout Collections is derived from [Cern Colt](http://acs.lbl.gov/~hoschek/colt/). +
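
The open addressing idea described in this document can be illustrated with a small primitive-backed set. This is a minimal sketch, not the Mahout Collections implementation: the class name `IntOpenHashSet`, the use of linear probing, the 0.5 load factor and the restriction to non-negative keys are all assumptions made to keep the example short, and removal is left out entirely.

```python
from array import array


class IntOpenHashSet:
    """Hypothetical open-addressed set of non-negative ints (linear probing)."""

    _EMPTY = -1  # sentinel marking a free slot

    def __init__(self, capacity=8):
        # One flat array of 64-bit ints: no per-entry node objects, no boxing.
        self._slots = array("q", [self._EMPTY] * capacity)
        self._size = 0

    def _probe(self, key):
        """Return the slot holding key, or the free slot where it belongs."""
        capacity = len(self._slots)
        i = key % capacity
        while self._slots[i] != self._EMPTY and self._slots[i] != key:
            i = (i + 1) % capacity  # step to the next slot on collision
        return i

    def add(self, key):
        if key < 0:
            raise ValueError("this sketch only stores non-negative ints")
        if (self._size + 1) * 2 > len(self._slots):
            self._grow()  # keep the table at most half full
        i = self._probe(key)
        if self._slots[i] == self._EMPTY:
            self._slots[i] = key
            self._size += 1
            return True
        return False

    def __contains__(self, key):
        return key >= 0 and self._slots[self._probe(key)] == key

    def __len__(self):
        return self._size

    def _grow(self):
        old = self._slots
        self._slots = array("q", [self._EMPTY] * (len(old) * 2))
        self._size = 0
        for key in old:
            if key != self._EMPTY:
                self.add(key)
```

All keys live directly in one contiguous array, so the memory cost per element is the slot itself plus the unused slack, with no linked entry objects as in a chained table. Keeping the table at most half full guarantees that every probe sequence reaches a free slot.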
