[07/13] mahout git commit: WEBSITE Porting Old Website

rawkintrevo Sat, 29 Apr 2017 20:24:34 -0700

http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/flinkbindings/playing-with-samsara-flink.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/flinkbindings/playing-with-samsara-flink.md b/website/old_site_migration/completed/flinkbindings/playing-with-samsara-flink.md
new file mode 100644
index 0000000..4bbcd33
--- /dev/null
+++ b/website/old_site_migration/completed/flinkbindings/playing-with-samsara-flink.md
@@ -0,0 +1,111 @@
+---
+layout: default
+title: 
+theme:
+   name: retro-mahout
+---
+
+## Getting Started 
+
+To get started, add the following dependency to the pom:
+
+    <dependency>
+      <groupId>org.apache.mahout</groupId>
+      <artifactId>mahout-flink_2.10</artifactId>
+      <version>0.12.0</version>
+    </dependency>
+
+Here is how to use the Flink backend:
+
+       import org.apache.flink.api.scala._
+       import org.apache.mahout.math.drm._
+       import org.apache.mahout.math.drm.RLikeDrmOps._
+       import org.apache.mahout.flinkbindings._
+
+       object ReadCsvExample {
+
+         def main(args: Array[String]): Unit = {
+           val filePath = "path/to/the/input/file"
+
+           val env = ExecutionEnvironment.getExecutionEnvironment
+           implicit val ctx = new FlinkDistributedContext(env)
+
+           val drm = readCsv(filePath, delim = "\t", comment = "#")
+           val C = drm.t %*% drm
+           println(C.collect)
+         }
+
+       }
+
+## Current Status
+
+The top-level JIRA for the Flink backend is [MAHOUT-1570](https://issues.apache.org/jira/browse/MAHOUT-1570), which has been fully implemented.
+
+### Implemented
+
+* [MAHOUT-1701](https://issues.apache.org/jira/browse/MAHOUT-1701) Mahout DSL for Flink: implement AtB ABt and AtA operators
+* [MAHOUT-1702](https://issues.apache.org/jira/browse/MAHOUT-1702) implement element-wise operators (like `A + 2` or `A + B`)
+* [MAHOUT-1703](https://issues.apache.org/jira/browse/MAHOUT-1703) implement `cbind` and `rbind`
+* [MAHOUT-1709](https://issues.apache.org/jira/browse/MAHOUT-1709) implement slicing (like `A(1 to 10, ::)`)
+* [MAHOUT-1710](https://issues.apache.org/jira/browse/MAHOUT-1710) implement right in-core matrix multiplication (`A %*% B` when `B` is in-core)
+* [MAHOUT-1711](https://issues.apache.org/jira/browse/MAHOUT-1711) implement broadcasting
+* [MAHOUT-1712](https://issues.apache.org/jira/browse/MAHOUT-1712) implement operators `At`, `Ax`, `Atx` - `Ax` and `At` are implemented
+* [MAHOUT-1734](https://issues.apache.org/jira/browse/MAHOUT-1734) implement I/O - should be able to read results of Flink bindings
+* [MAHOUT-1747](https://issues.apache.org/jira/browse/MAHOUT-1747) add support for different types of indexes (String, long, etc) - now supports `Int`, `Long` and `String`
+* [MAHOUT-1748](https://issues.apache.org/jira/browse/MAHOUT-1748) switch to Flink Scala API
+* [MAHOUT-1749](https://issues.apache.org/jira/browse/MAHOUT-1749) Implement `Atx`
+* [MAHOUT-1750](https://issues.apache.org/jira/browse/MAHOUT-1750) Implement `ABt`
+* [MAHOUT-1751](https://issues.apache.org/jira/browse/MAHOUT-1751) Implement `AtA`
+* [MAHOUT-1755](https://issues.apache.org/jira/browse/MAHOUT-1755) Flush intermediate results to FS - Flink, unlike Spark, does not store intermediate results in memory.
+* [MAHOUT-1764](https://issues.apache.org/jira/browse/MAHOUT-1764) Add standard backend tests for Flink
+* [MAHOUT-1765](https://issues.apache.org/jira/browse/MAHOUT-1765) Add documentation about Flink backend
+* [MAHOUT-1776](https://issues.apache.org/jira/browse/MAHOUT-1776) Refactor common Engine agnostic classes to Math-Scala module
+* [MAHOUT-1777](https://issues.apache.org/jira/browse/MAHOUT-1777) move HDFSUtil classes into the HDFS module
+* [MAHOUT-1804](https://issues.apache.org/jira/browse/MAHOUT-1804) Implement drmParallelizeWithRowLabels(..) in Flink
+* [MAHOUT-1805](https://issues.apache.org/jira/browse/MAHOUT-1805) Implement allReduceBlock(..) in Flink bindings
+* [MAHOUT-1809](https://issues.apache.org/jira/browse/MAHOUT-1809) Failing tests in flink-bindings: dals and dspca
+* [MAHOUT-1810](https://issues.apache.org/jira/browse/MAHOUT-1810) Failing test in flink-bindings: A + B Identically partitioned (mapBlock Checkpointing issue)
+* [MAHOUT-1812](https://issues.apache.org/jira/browse/MAHOUT-1812) Implement drmParallelizeWithEmptyLong(..) in flink bindings
+* [MAHOUT-1814](https://issues.apache.org/jira/browse/MAHOUT-1814) Implement drm2intKeyed in flink bindings
+* [MAHOUT-1815](https://issues.apache.org/jira/browse/MAHOUT-1815) dsqDist(X,Y) and dsqDist(X) failing in flink tests
+* [MAHOUT-1816](https://issues.apache.org/jira/browse/MAHOUT-1816) Implement newRowCardinality in CheckpointedFlinkDrm
+* [MAHOUT-1817](https://issues.apache.org/jira/browse/MAHOUT-1817) Implement caching in Flink Bindings
+* [MAHOUT-1818](https://issues.apache.org/jira/browse/MAHOUT-1818) dals test failing in Flink Bindings
+* [MAHOUT-1819](https://issues.apache.org/jira/browse/MAHOUT-1819) Set the default Parallelism for Flink execution in FlinkDistributedContext
+* [MAHOUT-1820](https://issues.apache.org/jira/browse/MAHOUT-1820) Add a method to generate Tuple<PartitionId, Partition elements count>> to support Flink backend
+* [MAHOUT-1821](https://issues.apache.org/jira/browse/MAHOUT-1821) Use a mahout-flink-conf.yaml configuration file for Mahout specific Flink configuration
+* [MAHOUT-1822](https://issues.apache.org/jira/browse/MAHOUT-1822) Update NOTICE.txt, License.txt to add Apache Flink
+* [MAHOUT-1823](https://issues.apache.org/jira/browse/MAHOUT-1823) Modify MahoutFlinkTestSuite to implement FlinkTestBase
+* [MAHOUT-1824](https://issues.apache.org/jira/browse/MAHOUT-1824) Optimize FlinkOpAtA to use upper triangular matrices
+* [MAHOUT-1825](https://issues.apache.org/jira/browse/MAHOUT-1825) Add List of Flink algorithms to Mahout wiki page
+
+### Tests 
+
+There is a set of standard tests that all engines should pass (see [MAHOUT-1764](https://issues.apache.org/jira/browse/MAHOUT-1764)).
+
+* `DistributedDecompositionsSuite` 
+* `DrmLikeOpsSuite` 
+* `DrmLikeSuite` 
+* `RLikeDrmOpsSuite` 
+
+
+These are Flink-backend specific tests, e.g.
+
+* `DrmLikeOpsSuite` for operations like `norm`, `rowSums`, `rowMeans`
+* `RLikeOpsSuite` for basic LA like `A.t %*% A`, `A.t %*% x`, etc
+* `LATestSuite` tests for specific operators like `AtB`, `Ax`, etc
+* `UseCasesSuite` has more complex examples, like power iteration, ridge regression, etc
+
+## Environment 
+
+For development the minimal supported configuration is 
+
+* [JDK 1.7](http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html)
+* [Scala 2.10]
+
+When using mahout, please import the following modules: 
+
+* `mahout-math`
+* `mahout-math-scala`
+* `mahout-flink_2.10`
+*
\ No newline at end of file
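
For readers who want to sanity-check the `drm.t %*% drm` step of the Flink example in the page above without a cluster, here is a minimal plain-Scala sketch of the same product on a small in-core matrix. It does not use the Mahout or Flink APIs; `GramSketch` and `gram` are hypothetical names introduced only for illustration.

```scala
// Plain-Scala sketch of A' * A (no Mahout or Flink dependency).
// `GramSketch` and `gram` are hypothetical names, not part of the Mahout API.
object GramSketch {
  type Matrix = Array[Array[Double]]

  // Computes C = A' * A, where C(i)(j) is the dot product of columns i and j of A.
  def gram(a: Matrix): Matrix = {
    val cols = if (a.isEmpty) 0 else a.head.length
    Array.tabulate(cols, cols) { (i, j) =>
      a.map(row => row(i) * row(j)).sum
    }
  }

  def main(args: Array[String]): Unit = {
    val a: Matrix = Array(
      Array(1.0, 2.0),
      Array(3.0, 4.0),
      Array(5.0, 6.0)
    )
    // Print one row of the product per line, tab-separated.
    gram(a).foreach(row => println(row.mkString("\t")))
  }
}
```

For the 3x2 matrix in `main`, the product is the symmetric 2x2 matrix with rows (35, 44) and (44, 56). That symmetry is presumably what the upper-triangular optimization in MAHOUT-1824 relies on.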


http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/spark-naive-bayes.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/spark-naive-bayes.md b/website/old_site_migration/completed/spark-naive-bayes.md
new file mode 100644
index 0000000..8823812
--- /dev/null
+++ b/website/old_site_migration/completed/spark-naive-bayes.md
@@ -0,0 +1,132 @@
+---
+layout: default
+title: Spark Naive Bayes
+theme:
+    name: retro-mahout
+---
+
+# Spark Naive Bayes
+
+
+## Intro
+
+Mahout currently has two flavors of Naive Bayes.  The first is standard Multinomial Naive Bayes. The second is an implementation of Transformed Weight-normalized Complement Naive Bayes as introduced by Rennie et al. [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). We refer to the former as Bayes and the latter as CBayes.
+
+Where Bayes has long been a standard in text classification, CBayes is an extension of Bayes that performs particularly well on datasets with skewed classes and has been shown to be competitive with algorithms of higher complexity such as Support Vector Machines.
+
+
+## Implementations
+The Mahout `math-scala` library has an implementation of both Bayes and CBayes, which is further optimized in the `spark` module. Currently the Spark-optimized version provides CLI drivers for training and testing. Mahout Spark-Naive-Bayes models can also be trained, tested and saved to the filesystem from the Mahout Spark Shell.
+
+## Preprocessing and Algorithm
+
+As described in [[1]](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf) Mahout Naive Bayes is broken down into the following steps (assignments are over all possible index values):
+
+- Let `\(\vec{d}=(\vec{d_1},...,\vec{d_n})\)` be a set of documents; `\(d_{ij}\)` is the count of word `\(i\)` in document `\(j\)`.
+- Let `\(\vec{y}=(y_1,...,y_n)\)` be their labels.
+- Let `\(\alpha_i\)` be a smoothing parameter for all words in the vocabulary; let `\(\alpha=\sum_i{\alpha_i}\)`.
+- **Preprocessing** (via seq2Sparse): TF-IDF transformation and L2 length normalization of `\(\vec{d}\)`
+    1. `\(d_{ij} = \sqrt{d_{ij}}\)`
+    2. `\(d_{ij} = d_{ij}\left(\log{\frac{\sum_k1}{\sum_k\delta_{ik}+1}}+1\right)\)`
+    3. `\(d_{ij} =\frac{d_{ij}}{\sqrt{\sum_k{d_{kj}^2}}}\)`
+- **Training: Bayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci}=\frac{\sum_{j:y_j=c}d_{ij}+\alpha_i}{\sum_{j:y_j=c}{\sum_k{d_{kj}}}+\alpha}\)`
+    2. `\(w_{ci}=\log{\hat\theta_{ci}}\)`
+- **Training: CBayes**`\((\vec{d},\vec{y})\)` calculate term weights `\(w_{ci}\)` as:
+    1. `\(\hat\theta_{ci} = \frac{\sum_{j:y_j\neq c}d_{ij}+\alpha_i}{\sum_{j:y_j\neq c}{\sum_k{d_{kj}}}+\alpha}\)`
+    2. `\(w_{ci}=-\log{\hat\theta_{ci}}\)`
+    3. `\(w_{ci}=\frac{w_{ci}}{\sum_i \lvert w_{ci}\rvert}\)`
+- **Label Assignment/Testing:**
+    1. Let `\(\vec{t}= (t_1,...,t_n)\)` be a test document; let `\(t_i\)` be the count of word `\(i\)`.
+    2. Label the document according to `\(l(t)=\arg\max_c \sum\limits_{i} t_i w_{ci}\)`
+
+As we can see, the main difference between Bayes and CBayes is the weight calculation step.  Where Bayes weighs terms more heavily based on the likelihood that they belong to class `\(c\)`, CBayes seeks to maximize term weights on the likelihood that they do not belong to any other class.
+
+## Running from the command line
+
+Mahout provides CLI drivers for all above steps.  Here we will give a simple overview of Mahout CLI commands used to preprocess the data, train the model and assign labels to the training set. An [example script](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh) is given for the full process from data acquisition through classification of the classic [20 Newsgroups corpus](https://mahout.apache.org/users/classification/twenty-newsgroups.html).
+
+- **Preprocessing:**
+For a set of Sequence File Formatted documents in PATH_TO_SEQUENCE_FILES the [mahout seq2sparse](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) command performs the TF-IDF transformations (-wt tfidf option) and L2 length normalization (-n 2 option) as follows:
+
+        $ mahout seq2sparse \
+          -i ${PATH_TO_SEQUENCE_FILES} \
+          -o ${PATH_TO_TFIDF_VECTORS} \
+          -nv \
+          -n 2 \
+          -wt tfidf
+
+- **Training:**
+The model is then trained using `mahout spark-trainnb`.  The default is to train a Bayes model. The -c option is given to train a CBayes model:
+
+        $ mahout spark-trainnb \
+          -i ${PATH_TO_TFIDF_VECTORS} \
+          -o ${PATH_TO_MODEL} \
+          -ow \
+          -c
+
+- **Label Assignment/Testing:**
+Classification and testing on a holdout set can then be performed via `mahout spark-testnb`. Again, the -c option indicates that the model is CBayes:
+
+        $ mahout spark-testnb \
+          -i ${PATH_TO_TFIDF_TEST_VECTORS} \
+          -m ${PATH_TO_MODEL} \
+          -c
+
+## Command line options
+
+- **Preprocessing:** *note: still reliant on MapReduce seq2sparse* 
+  
+  Only relevant parameters used for Bayes/CBayes as detailed above are shown. Several other transformations can be performed by `mahout seq2sparse` and used as input to Bayes/CBayes.  For a full list of `mahout seq2Sparse` options see the [Creating vectors from text](https://mahout.apache.org/users/basics/creating-vectors-from-text.html) page.
+
+        $ mahout seq2sparse
+          --output (-o) output             The directory pathname for output.
+          --input (-i) input               Path to job input directory.
+          --weight (-wt) weight            The kind of weight to use. Currently TF
+                                           or TFIDF. Default: TFIDF
+          --norm (-n) norm                 The norm to use, expressed as either a
+                                           float or "INF" if you want to use the
+                                           Infinite norm.  Must be greater or equal
+                                           to 0.  The default is not to normalize
+          --overwrite (-ow)                If set, overwrite the output directory
+          --sequentialAccessVector (-seq)  (Optional) Whether output vectors should
+                                           be SequentialAccessVectors. If set true
+                                           else false
+          --namedVector (-nv)              (Optional) Whether output vectors should
+                                           be NamedVectors. If set true else false
+
+- **Training:**
+
+        $ mahout spark-trainnb
+          --input (-i) input               Path to job input directory.
+          --output (-o) output             The directory pathname for output.
+          --trainComplementary (-c)        Train complementary? Default is false.
+          --master (-ma)                   Spark Master URL (optional). Default: "local".
+                                           Note that you can specify the number of
+                                           cores to get a performance improvement,
+                                           for example "local[4]"
+          --help (-h)                      Print out help
+
+- **Testing:**
+
+        $ mahout spark-testnb
+          --input (-i) input               Path to job input directory.
+          --model (-m) model               The path to the model built during training.
+          --testComplementary (-c)         Test complementary? Default is false.
+          --master (-ma)                   Spark Master URL (optional). Default: "local".
+                                           Note that you can specify the number of
+                                           cores to get a performance improvement,
+                                           for example "local[4]"
+          --help (-h)                      Print out help
+
+## Examples
+1. [20 Newsgroups classification](https://github.com/apache/mahout/blob/master/examples/bin/classify-20newsgroups.sh)
+2. [Document classification with Naive Bayes in the Mahout shell](https://github.com/apache/mahout/blob/master/examples/bin/spark-document-classifier.mscala)
+        
+ 
+## References
+
+[1]: Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, David Karger (2003). [Tackling the Poor Assumptions of Naive Bayes Text Classifiers](http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf). Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003).
+
+
+
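
The three preprocessing steps listed in spark-naive-bayes.md above can be traced on a tiny count matrix. The following is a minimal plain-Scala sketch that follows the formulas as written on the page; it is not the `seq2sparse` implementation, and `TfIdfSketch` and `transform` are hypothetical names. It assumes `\(\delta_{ik}\)` is 1 when word `\(i\)` occurs in document `\(k\)` and 0 otherwise.

```scala
// Plain-Scala sketch of the three preprocessing steps (no Mahout dependency).
// `TfIdfSketch` and `transform` are hypothetical names, not part of seq2sparse.
object TfIdfSketch {
  type Matrix = Array[Array[Double]]

  // counts(i)(j) is the raw count of word i in document j.
  def transform(counts: Matrix): Matrix = {
    val nDocs = if (counts.isEmpty) 0 else counts.head.length

    // Step 1: dampen raw counts with a square root.
    val tf: Matrix = counts.map(row => row.map(x => math.sqrt(x)))

    // Step 2: scale each word by log(nDocs / (docFreq + 1)) + 1,
    // where docFreq is the number of documents containing the word.
    val weighted: Matrix = tf.zip(counts).map { case (tfRow, countRow) =>
      val docFreq = countRow.count(_ > 0.0)
      val idf = math.log(nDocs.toDouble / (docFreq + 1)) + 1.0
      tfRow.map(_ * idf)
    }

    // Step 3: divide each document (column) by its L2 norm.
    val norms = Array.tabulate(nDocs) { j =>
      math.sqrt(weighted.map(row => row(j) * row(j)).sum)
    }
    weighted.map { row =>
      row.zipWithIndex.map { case (v, j) =>
        if (norms(j) == 0.0) 0.0 else v / norms(j)
      }
    }
  }
}
```

After the transformation every non-empty document column has unit L2 length, which matches the `-n 2` option passed to `mahout seq2sparse` on the page.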

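The difference between the Bayes and CBayes weight calculations described in spark-naive-bayes.md above is easiest to see side by side. The sketch below is plain Scala and only illustrates the listed formulas; it is not Mahout's `spark-trainnb` code. `NaiveBayesSketch`, `train` and `classify` are hypothetical names, and a single smoothing value `alphaI` is assumed for every word.

```scala
// Plain-Scala sketch of the Bayes / CBayes weight formulas (no Mahout dependency).
// All names here are hypothetical and chosen for illustration only.
object NaiveBayesSketch {
  type Matrix = Array[Array[Double]]

  // d(i)(j): weight of word i in document j; y(j): label of document j (0 until numClasses).
  // Returns w(c)(i), the weight of word i for class c.
  def train(d: Matrix, y: Array[Int], numClasses: Int,
            alphaI: Double, complementary: Boolean): Matrix = {
    val numWords = d.length
    val alpha = alphaI * numWords // alpha is the sum of the per-word smoothing values
    Array.tabulate(numClasses) { c =>
      // Bayes sums over documents of class c; CBayes sums over all other documents.
      val docs = y.indices.filter(j => if (complementary) y(j) != c else y(j) == c)
      val perWord = Array.tabulate(numWords)(i => docs.map(j => d(i)(j)).sum)
      val total = perWord.sum
      val theta = perWord.map(s => (s + alphaI) / (total + alpha))
      if (complementary) {
        // CBayes: negate the log and normalize by the sum of absolute weights.
        val w = theta.map(t => -math.log(t))
        val norm = w.map(x => math.abs(x)).sum
        w.map(_ / norm)
      } else {
        theta.map(t => math.log(t))
      }
    }
  }

  // Label assignment: the class with the largest sum of t(i) * w(c)(i).
  def classify(w: Matrix, t: Array[Double]): Int = {
    val scores = w.map(wc => wc.zip(t).map { case (a, b) => a * b }.sum)
    scores.indexOf(scores.max)
  }
}
```

With two words and four documents, `train(d, y, 2, 1.0, complementary = false)` and the same call with `complementary = true` produce different weight matrices but assign the same label to a document dominated by one word, which is the behaviour expected on balanced classes.
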
http://git-wip-us.apache.org/repos/asf/mahout/blob/c81fc8b7/website/old_site_migration/completed/sparkbindings/MahoutScalaAndSparkBindings.pptx
----------------------------------------------------------------------
diff --git a/website/old_site_migration/completed/sparkbindings/MahoutScalaAndSparkBindings.pptx b/website/old_site_migration/completed/sparkbindings/MahoutScalaAndSparkBindings.pptx
new file mode 100644
index 0000000..ec1de04
Binary files /dev/null and b/website/old_site_migration/completed/sparkbindings/MahoutScalaAndSparkBindings.pptx differ
