[21/29] mahout git commit: WEBSITE final cleanup before merge to master

rawkintrevo Thu, 04 May 2017 18:15:07 -0700

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md 
b/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
deleted file mode 100644
index 14dd276..0000000
--- 
a/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
+++ /dev/null
@@ -1,291 +0,0 @@
----
-layout: default
-title: Creating Vectors from Text
-theme:
-    name: retro-mahout
----
-
-
-# Creating vectors from text
-<a name="CreatingVectorsfromText-Introduction"></a>
-# Introduction
-
-For clustering and classifying documents it is usually necessary to convert the raw text
-into vectors that can then be consumed by the clustering [Algorithms](algorithms.html).
-The available approaches are described below.
-
-<a name="CreatingVectorsfromText-FromLucene"></a>
-# From Lucene
-
-*NOTE: Your Lucene index must be created with the same version of Lucene
-used in Mahout.  As of Mahout 0.9 this is Lucene 4.6.1. If these versions don't
-match, you will likely get an error such as 'Exception in thread "main"
-org.apache.lucene.index.CorruptIndexException: Unknown format version: -11'.*
-
-Mahout has utilities that allow one to easily produce Mahout Vector
-representations from a Lucene (and Solr, since they are the same) index.
-
-For this, we assume you know how to build a Lucene/Solr index. For those
-who don't, it is probably easiest to get up and running using [Solr](http://lucene.apache.org/solr),
-as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
-index. For those wanting to use just Lucene, see the [Lucene website](http://lucene.apache.org/core)
-or check out _Lucene In Action_ by Erik Hatcher, Otis Gospodnetic and Mike
-McCandless.
-
-To get started, make sure you get a fresh copy of Mahout from [GitHub](http://mahout.apache.org/developers/buildingmahout.html)
-and are comfortable building it. Mahout defines interfaces and implementations
-for efficiently iterating over a data source (only Lucene is supported
-currently, but this should be extensible to databases, Solr, etc.) and produces
-a Mahout Vector file and term dictionary which can then be used for
-clustering.   The main code for driving this is the driver program located
-in the org.apache.mahout.utils.vectors package.  The driver program offers
-several input options, which can be displayed by specifying the --help
-option.  Examples of running the driver are included below:
-
-<a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a>
-#### Generating an output file from a Lucene Index
-
-
-    $MAHOUT_HOME/bin/mahout lucene.vector 
-        --dir (-d) dir                     The Lucene directory      
-        --idField idField                  The field in the index    
-                                               containing the index.  If 
-                                               null, then the Lucene     
-                                               internal doc id is used   
-                                               which is prone to error   
-                                               if the underlying index   
-                                               changes                   
-        --output (-o) output               The output file           
-        --delimiter (-l) delimiter         The delimiter for         
-                                               outputting the dictionary 
-        --help (-h)                        Print out help            
-        --field (-f) field                 The field in the index    
-        --max (-m) max                     The maximum number of     
-                                               vectors to output.  If    
-                                               not specified, then it    
-                                               will loop over all docs   
-        --dictOut (-t) dictOut             The output of the         
-                                               dictionary                
-        --seqDictOut (-st) seqDictOut      The output of the         
-                                               dictionary as sequence    
-                                               file                      
-        --norm (-n) norm                   The norm to use,          
-                                               expressed as either a     
-                                               double or "INF" if you    
-                                               want to use the Infinite  
-                                               norm.  Must be greater or 
-                                               equal to 0.  The default  
-                                               is not to normalize       
-        --maxDFPercent (-x) maxDFPercent   The max percentage of     
-                                               docs for the DF.  Can be  
-                                               used to remove really     
-                                               high frequency terms.     
-                                               Expressed as an integer   
-                                               between 0 and 100.        
-                                               Default is 99.            
-        --weight (-w) weight               The kind of weight to     
-                                               use. Currently TF or      
-                                               TFIDF                     
-        --minDF (-md) minDF                The minimum document      
-                                               frequency.  Default is 1  
-        --maxPercentErrorDocs (-err) mErr  The max percentage of     
-                                               docs that can have a null 
-                                               term vector. These are    
-                                               noise document and can    
-                                               occur if the analyzer     
-                                               used strips out all terms 
-                                               in the target field. This 
-                                               percentage is expressed   
-                                               as a value between 0 and  
-                                               1. The default is 0.  
-  
-#### Example: Create 50 Vectors from an Index 
-
-    $MAHOUT_HOME/bin/mahout lucene.vector \
-        --dir $WORK_DIR/wikipedia/solr/data/index \
-        --field body \
-        --dictOut $WORK_DIR/solr/wikipedia/dict.txt \
-        --output $WORK_DIR/solr/wikipedia/out.txt \
-        --max 50
-
-
-This reads the index specified by --dir, takes the terms from its body field,
-and writes the vectors to the file given by --output and the dictionary to
-dict.txt. It only outputs 50 vectors.  If you don't specify --max, then all
-the documents in the index are output.
-
-<a name="CreatingVectorsfromText-50VectorsFromLuceneL2Norm"></a>
-#### Example: Creating 50 Normalized Vectors from a Lucene Index using the [L_2 Norm](http://en.wikipedia.org/wiki/Lp_space)
-
-    $MAHOUT_HOME/bin/mahout lucene.vector \
-        --dir $WORK_DIR/wikipedia/solr/data/index \
-        --field body \
-        --dictOut $WORK_DIR/solr/wikipedia/dict.txt \
-        --output $WORK_DIR/solr/wikipedia/out.txt \
-        --max 50 \
-        --norm 2
-
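As a rough illustration of what `--norm 2` asks for, the sketch below divides a vector by its L_p norm, treating the infinite norm as the largest absolute value. It is plain Java rather than Mahout's own vector code; the class and method names (`NormSketch`, `normalize`) are hypothetical, and the p = 0 case that the option listing permits is deliberately left out.

```java
public class NormSketch {
    // Returns v divided by its L_p norm; p = +infinity selects the max-abs norm.
    // Hypothetical helper for illustration only, not part of Mahout.
    public static double[] normalize(double[] v, double p) {
        if (p <= 0) {
            throw new IllegalArgumentException("this sketch only handles p > 0");
        }
        double norm;
        if (Double.isInfinite(p)) {
            norm = 0.0;
            for (double x : v) norm = Math.max(norm, Math.abs(x));
        } else {
            double sum = 0.0;
            for (double x : v) sum += Math.pow(Math.abs(x), p);
            norm = Math.pow(sum, 1.0 / p);
        }
        double[] out = new double[v.length];
        if (norm == 0.0) return out;  // an all-zero vector stays all-zero
        for (int i = 0; i < v.length; i++) out[i] = v[i] / norm;
        return out;
    }

    public static void main(String[] args) {
        double[] unit = normalize(new double[] {3.0, 4.0}, 2.0);
        System.out.println(unit[0] + " " + unit[1]);
    }
}
```

With the L_2 norm every output vector has unit Euclidean length, so documents of very different lengths become comparable.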
-
-<a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a>
-## From A Directory of Text documents
-Mahout has utilities to generate Vectors from a directory of text
-documents. Before creating the vectors, you need to convert the documents
-to SequenceFile format. SequenceFile is a Hadoop class which allows us to
-write arbitrary (key, value) pairs into it. The DocumentVectorizer requires the
-key to be a Text with a unique document id, and the value to be the Text
-content in UTF-8 format.
-
-You may find [Tika](http://tika.apache.org/) helpful in converting
-binary documents to text.
-
-<a name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a>
-#### Converting directory of documents to SequenceFile format
-Mahout has a nifty utility which reads a directory path including its
-sub-directories and creates the SequenceFile in a chunked manner for us.
-
-    $MAHOUT_HOME/bin/mahout seqdirectory 
-        --input (-i) input                       Path to job input directory.
-        --output (-o) output                     The directory pathname for
-                                                     output.
-        --overwrite (-ow)                        If present, overwrite the
-                                                     output directory before
-                                                     running job
-        --method (-xm) method                    The execution method to use:
-                                                     sequential or mapreduce.
-                                                     Default is mapreduce
-        --chunkSize (-chunk) chunkSize           The chunkSize in MegaBytes.
-                                                     Defaults to 64
-        --fileFilterClass (-filter) fFilterClass The name of the class to use
-                                                     for file parsing. Default:
-                                                     org.apache.mahout.text.PrefixAdditionFilter
-        --keyPrefix (-prefix) keyPrefix          The prefix to be prepended to
-                                                     the key
-        --charset (-c) charset                   The name of the character
-                                                     encoding of the input files.
-                                                     Default to UTF-8
-                                                     {accepts: cp1252|ascii...}
-        --help (-h)                              Print out help
-        --tempDir tempDir                        Intermediate output directory
-        --startPhase startPhase                  First phase to run
-        --endPhase endPhase                      Last phase to run
-
-The output of seqdirectory will be a SequenceFile < Text, Text > of all documents (/sub-directory-path/documentFileName, documentText).
-
-<a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a>
-#### Creating Vectors from SequenceFile
-
-Run the following on the sequence file generated in the previous step to
-generate vectors.
-
-
-    $MAHOUT_HOME/bin/mahout seq2sparse
-        --minSupport (-s) minSupport      (Optional) Minimum Support. Default
-                                              Value: 2
-        --analyzerName (-a) analyzerName  The class name of the analyzer
-        --chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes. Default
-                                              Value: 100MB
-        --output (-o) output              The directory pathname for output.
-        --input (-i) input                Path to job input directory.
-        --minDF (-md) minDF               The minimum document frequency.  Default
-                                              is 1
-        --maxDFSigma (-xs) maxDFSigma     What portion of the tf (tf-idf) vectors
-                                              to be used, expressed in times the
-                                              standard deviation (sigma) of the
-                                              document frequencies of these vectors.
-                                              Can be used to remove really high
-                                              frequency terms. Expressed as a double
-                                              value. Good value to be specified is 3.0.
-                                              In case the value is less than 0 no
-                                              vectors will be filtered out. Default is
-                                              -1.0.  Overrides maxDFPercent
-        --maxDFPercent (-x) maxDFPercent  The max percentage of docs for the DF.
-                                              Can be used to remove really high
-                                              frequency terms. Expressed as an integer
-                                              between 0 and 100. Default is 99.  If
-                                              maxDFSigma is also set, it will override
-                                              this value.
-        --weight (-wt) weight             The kind of weight to use. Currently TF
-                                              or TFIDF. Default: TFIDF
-        --norm (-n) norm                  The norm to use, expressed as either a
-                                              float or "INF" if you want to use the
-                                              Infinite norm.  Must be greater or equal
-                                              to 0.  The default is not to normalize
-        --minLLR (-ml) minLLR             (Optional)The minimum Log Likelihood
-                                              Ratio(Float)  Default is 1.0
-        --numReducers (-nr) numReducers   (Optional) Number of reduce tasks.
-                                              Default Value: 1
-        --maxNGramSize (-ng) ngramSize    (Optional) The maximum size of ngrams to
-                                              create (2 = bigrams, 3 = trigrams, etc)
-                                              Default Value:1
-        --overwrite (-ow)                 If set, overwrite the output directory
-        --help (-h)                       Print out help
-        --sequentialAccessVector (-seq)   (Optional) Whether output vectors should
-                                              be SequentialAccessVectors. Default is false;
-                                              true required for running some algorithms
-                                              (LDA,Lanczos)
-        --namedVector (-nv)               (Optional) Whether output vectors should
-                                              be NamedVectors. If set true else false
-        --logNormalize (-lnorm)           (Optional) Whether output vectors should
-                                              be logNormalize. If set true else false
-
-
-
-This will create SequenceFiles of tokenized documents < Text, StringTuple > (docID, tokenizedDoc)
-and vectorized documents < Text, VectorWritable > (docID, TF-IDF Vector).
-
-In addition, seq2sparse will create SequenceFiles in the output directory for a dictionary
-(wordIndex, word), a word frequency count (wordIndex, count) and a document frequency count
-(wordIndex, DFCount).
-
-The --minSupport option is the minimum frequency a word must have to be considered as a
-feature; --minDF is the minimum number of documents the word needs to appear in; and
-the --maxDFPercent option is the maximum value of the expression (document frequency of a
-word / total number of documents), given as a percentage, for the word to be kept as a
-feature. The minimum cutoffs remove rare terms, while --maxDFPercent is helpful in removing
-high frequency features like stop words.
-
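To make the three cutoffs concrete, the sketch below applies them to a tiny in-memory corpus. It illustrates the filtering rules as described here and is not Mahout's implementation: the class `DfFilterSketch` and its method are hypothetical, and --minSupport is interpreted (an assumption) as a minimum total term count across the corpus.

```java
import java.util.*;

public class DfFilterSketch {
    // Keeps terms whose total count is >= minSupport, whose document frequency
    // is >= minDF, and whose document frequency is at most maxDFPercent percent
    // of the number of documents. Hypothetical helper, not Mahout code.
    public static Set<String> keptTerms(List<List<String>> docs, int minSupport,
                                        int minDF, int maxDFPercent) {
        Map<String, Integer> termCount = new HashMap<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String t : doc) termCount.merge(t, 1, Integer::sum);
            // Count each term at most once per document for the DF.
            for (String t : new HashSet<>(doc)) docFreq.merge(t, 1, Integer::sum);
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq.entrySet()) {
            int df = e.getValue();
            boolean frequentEnough =
                termCount.get(e.getKey()) >= minSupport && df >= minDF;
            boolean notTooCommon = 100L * df <= (long) maxDFPercent * docs.size();
            if (frequentEnough && notTooCommon) kept.add(e.getKey());
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("the", "cat", "sat"),
            Arrays.asList("the", "dog", "sat", "sat"),
            Arrays.asList("the", "bird"),
            Arrays.asList("the", "cat"));
        // "the" is in 100% of documents and is cut by the 85% limit;
        // "dog" and "bird" fail the minimum cutoffs.
        System.out.println(keptTerms(docs, 2, 2, 85));  // prints [cat, sat]
    }
}
```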
-The vectorized documents can then be used as input to many of Mahout's classification and clustering algorithms.
-
-#### Example: Creating Normalized [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) Vectors from a directory of text documents using [trigrams](http://en.wikipedia.org/wiki/N-gram) and the [L_2 Norm](http://en.wikipedia.org/wiki/Lp_space)
-Create sequence files from the directory of text documents:
-    
-    $MAHOUT_HOME/bin/mahout seqdirectory \
-        -i $WORK_DIR/reuters \
-        -o $WORK_DIR/reuters-out-seqdir \
-        -c UTF-8 \
-        -chunk 64 \
-        -xm sequential
-
-Vectorize the documents using trigrams, L_2 length normalization and a maximum document frequency cutoff of 85%.
-
-    $MAHOUT_HOME/bin/mahout seq2sparse \
-        -i $WORK_DIR/reuters-out-seqdir/ \
-        -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans \
-        --namedVector \
-        -wt tfidf \
-        -ng 3 \
-        -n 2 \
-        --maxDFPercent 85
-
-The sequence file in the $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors directory can now be
-used as input to the Mahout [k-Means](http://mahout.apache.org/users/clustering/k-means-clustering.html)
-clustering algorithm.
-
-<a name="CreatingVectorsfromText-Background"></a>
-## Background
-
-* [Discussion on centroid calculations with sparse vectors](http://markmail.org/thread/l5zi3yk446goll3o)
-
-<a name="CreatingVectorsfromText-ConvertingexistingvectorstoMahout'sformat"></a>
-## Converting existing vectors to Mahout's format
-
-If you already have your own document processing pipeline (for texts,
-images or whatever items you wish to treat), the
-question arises of how to convert its vectors into the Mahout vector
-format. Probably the easiest way to go would be to implement your own
-Iterable<Vector> (called VectorIterable in the example below) and then
-reuse the existing VectorWriter classes:
-
-
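-    // SequenceFile.createWriter returns a Hadoop SequenceFile.Writer, not a
-    // VectorWriter. Values are assumed here to be stored as VectorWritable.
-    SequenceFile.Writer seqWriter = SequenceFile.createWriter(filesystem,
-                                                              configuration,
-                                                              outfile,
-                                                              LongWritable.class,
-                                                              VectorWritable.class);
-    // Wrap the Hadoop writer in Mahout's SequenceFileVectorWriter
-    VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
-
-    // VectorIterable is your own Iterable<Vector> implementation
-    long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
-    vectorWriter.close();
-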
-


http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/creating-vectors.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/creating-vectors.md 
b/website/old_site_migration/needs_work_priority/creating-vectors.md
deleted file mode 100644
index 10cbd8e..0000000
--- a/website/old_site_migration/needs_work_priority/creating-vectors.md
+++ /dev/null
@@ -1,16 +0,0 @@
----
-layout: default
-title: Creating Vectors
-theme:
-    name: retro-mahout
----
-
-
-<a name="CreatingVectors-UtilitiesforCreatingVectors"></a>
-# Utilities for Creating Vectors
-
-1. [Text](creating-vectors-from-text.html) ... utilities to turn plain text into Mahout vectors.
-
-1. Mahout also has rudimentary support for the ARFF file format. See [arff junit doc](https://builds.apache.org/job/Mahout-Quality/ws/trunk/integration/target/site/apidocs/org/apache/mahout/utils/vectors/arff/package-summary.html).
-
-1. There is also support for reading vectors from [csv files](https://builds.apache.org/job/Mahout-Quality/ws/trunk/integration/target/site/apidocs/org/apache/mahout/utils/vectors/csv/package-summary.html).

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
 
b/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
deleted file mode 100644
index 2a157f6..0000000
--- 
a/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
+++ /dev/null
@@ -1,446 +0,0 @@
----
-layout: default
-title: Dimensional Reduction
-theme:
-   name: retro-mahout
----
-
-# Support for dimensional reduction
-
-Matrix algebra underpins the way many Big Data algorithms and data
-structures are composed. Full-text search can be viewed as doing matrix
-multiplication of the term-document matrix by the query vector (giving a
-vector over documents where the components are the relevance scores).
-Computing co-occurrences in a collaborative filtering context (people who
-viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest)
-amounts to squaring the user-item interaction matrix. Users
-who are k degrees separated from each other in a social network or
-web-graph can be found by looking at the k-fold product of the graph
-adjacency matrix. The list goes on (and these are all cases where the
-linear structure of the matrix is preserved!).
-
-Each of these examples deals with matrices which tend to be
-tremendously large (often millions to tens of millions to hundreds of
-millions of rows or more, by sometimes a comparable number of columns), but
-also rather sparse. Sparse matrices are nice in some respects: dense
-matrices which are 10^7 on a side would have 100 trillion non-zero entries!
-But the sparsity is often problematic, because any given two rows (or
-columns) of the matrix may have zero overlap. Additionally, any
-machine-learning work done on the data which comprises the rows has to deal
-with what is known as "the curse of dimensionality": for example, there
-are too many columns to train most regression or classification problems on
-them independently.
-
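The size estimate for a dense matrix is simple arithmetic. The byte count below assumes, purely for illustration, that each entry is stored as an 8-byte double:

```latex
\[
(10^{7})^{2} = 10^{14}\ \text{entries (100 trillion)}, \qquad
10^{14} \times 8\ \text{bytes} = 8 \times 10^{14}\ \text{bytes} \approx 800\ \text{TB}.
\]
```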
-One of the more useful approaches to dealing with such huge sparse data
-sets is the concept of dimensionality reduction, where a lower dimensional
-subspace of the original column (feature) space of your data is found /
-constructed, and your rows are mapped into that subspace (or sub-manifold).
-In this reduced dimensional space, "important" components of the distance
-between points are exaggerated, and unimportant ones washed away, and
-additionally, sparsity of your rows is traded for much lower-dimensional,
-but dense, "signatures". While this loss of sparsity can lead
-to its own complications, a proper dimensionality reduction can help reveal
-the most important features of your data, expose correlations among your
-supposedly independent original variables, and smooth over the zeroes in
-your correlation matrix.
-
-One of the most straightforward techniques for dimensionality reduction is
-the matrix decomposition: singular value decomposition, eigen
-decomposition, non-negative matrix factorization, etc. In their truncated
-form these decompositions are an excellent first approach toward linearity
-preserving unsupervised feature selection and dimensional reduction. Of
-course, sparse matrices which don't fit in RAM need special treatment as
-far as decomposition is concerned. Parallelizable and/or stream-oriented
-algorithms are needed.
-
-<a name="DimensionalReduction-SingularValueDecomposition"></a>
-# Singular Value Decomposition
-
-Currently implemented in Mahout (as of 0.3, the first release with MAHOUT-180 applied)
-are two scalable implementations of SVD. The first is a stream-oriented implementation
-using the Asymmetric Generalized Hebbian Algorithm outlined in Genevieve Gorrell &
-Brandyn Webb's paper ([Gorrell and Webb 2005](http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf.html)).
-The second is a [Lanczos](http://en.wikipedia.org/wiki/Lanczos_algorithm)
-implementation, available both single-threaded, in the
-o.a.m.math.decomposer.lanczos package (math module), and as a Hadoop map-reduce
-(series of) job(s) in the o.a.m.math.hadoop.decomposer package (core module).
-Coming soon: stochastic decomposition.
-
-See also: [SVD - Singular Value Decomposition](https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition)
-
-<a name="DimensionalReduction-Lanczos"></a>
-## Lanczos
-
-The Lanczos algorithm is designed for eigen-decomposition, but like any
-such algorithm, getting singular vectors out of it is immediate (singular
-vectors of matrix A are just the eigenvectors of A^t * A or A * A^t).
-Lanczos works by taking a starting seed vector *v* (with cardinality equal
-to the number of columns of the matrix A), and repeatedly multiplying A by
-the result: *v'* = A.times(*v*) (and then subtracting off what is
-proportional to previous *v'*'s, and building up an auxiliary matrix of
-projections).  In the case where A is not square (in general: not
-symmetric), you actually want to repeatedly multiply A^t * A by *v*:
-*v'* = (A^t * A).times(*v*), or equivalently, in Mahout,
-A.timesSquared(*v*) (timesSquared is merely an optimization: by changing
-the order of summation in (A^t * A).times(*v*), you can do the same computation
-as one pass over the rows of A instead of two).
-
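The single-pass trick can be sketched in plain Java. Mahout's Javadoc describes timesSquared as producing this.transpose().times(this.times(v)); for each row a_i the sketch below adds (a_i . v) * a_i to the result, which gives the same vector without ever forming the transpose. The class `TimesSquaredSketch` and the dense double[][] representation are assumptions made for illustration, since Mahout's own code works on its Vector types.

```java
import java.util.Arrays;

public class TimesSquaredSketch {
    // One pass over the rows: result = sum_i (a_i . v) * a_i, i.e. A^t (A v).
    public static double[] timesSquared(double[][] a, double[] v) {
        double[] result = new double[v.length];
        for (double[] row : a) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) dot += row[j] * v[j];
            for (int j = 0; j < v.length; j++) result[j] += dot * row[j];
        }
        return result;
    }

    // Two-pass reference: first w = A v, then A^t w.
    public static double[] twoPass(double[][] a, double[] v) {
        double[] w = new double[a.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < v.length; j++) w[i] += a[i][j] * v[j];
        }
        double[] result = new double[v.length];
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < v.length; j++) result[j] += a[i][j] * w[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2, 0}, {0, 1, 3}};  // 2 rows, 3 columns
        double[] v = {1, 1, 1};                 // cardinality = number of columns
        System.out.println(Arrays.toString(timesSquared(a, v)));  // [3.0, 10.0, 12.0]
        System.out.println(Arrays.toString(twoPass(a, v)));       // [3.0, 10.0, 12.0]
    }
}
```

Because each row contributes independently, the per-row terms can be computed in parallel and summed afterwards.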
-After *k* iterations of *v_i* = A.timesSquared(*v_(i-1)*), a *k*-by-*k*
-tridiagonal matrix has been created (the auxiliary matrix mentioned above),
-out of which a good (often extremely good) approximation to *k* of the
-singular values of A may be efficiently extracted (and, with the basis
-spanned by the *v_i*, the *k* singular *vectors* as well).  Which
-*k*?  It's actually a spread across the entire spectrum: the first few will
-most certainly be the largest singular values, and the bottom few will be
-the smallest, but you have no guarantee that just because you have the n'th
-largest singular value of A, you also have the (n-1)'st.  A
-good rule of thumb is to try to extract the top 3k singular vectors
-via Lanczos, and then discard the bottom two thirds, if you want primarily
-the largest singular values (which is the case when using Lanczos for
-dimensional reduction).
-
-<a name="DimensionalReduction-ParallelizationStragegy"></a>
-### Parallelization Strategy
-
-Lanczos is "embarrassingly parallelizable": multiplication of a
-matrix by a vector may be carried out row-at-a-time without communication
-until, at the end, the results of the intermediate matrix-by-vector outputs
-are accumulated on one final vector.  When it's truly A.times(*v*), the
-final accumulation doesn't even have collision / synchronization issues
-(the outputs are individual separate entries on a single vector), and
-multicore approaches can be very fast, and there should also be tricks to
-speed things up on Hadoop.  In the asymmetric case, where the operation is
-A.timesSquared(*v*), the accumulation does require synchronization (the
-vectors to be summed have nonzero elements all across their range), but
-by delaying writing to disk until Mapper close(), and by using
-a Combiner that is the same as the Reducer, the bottleneck in accumulation is
-nowhere near a single point.
-
-<a name="DimensionalReduction-Mahoutusage"></a>
-### Mahout usage
-
-The Mahout DistributedLanczosSolver is invoked by the
-<MAHOUT_HOME>/bin/mahout svd command. This command takes the following
-arguments (which can be reproduced by just entering the command with no
-arguments):
-
-
-    Job-Specific Options:
-      --input (-i) input                         Path to job input directory.
-      --output (-o) output                       The directory pathname for output.
-      --numRows (-nr) numRows                    Number of rows of the input matrix
-      --numCols (-nc) numCols                    Number of columns of the input matrix
-      --rank (-r) rank                           Desired decomposition rank (note:
-                                                 only roughly 1/4 to 1/3 of these will
-                                                 have the top portion of the spectrum)
-      --symmetric (-sym) symmetric               Is the input matrix square and
-                                                 symmetric?
-      --cleansvd (-cl) cleansvd                  Run the EigenVerificationJob to clean
-                                                 the eigenvectors after SVD
-      --maxError (-err) maxError                 Maximum acceptable error
-      --minEigenvalue (-mev) minEigenvalue       Minimum eigenvalue to keep the vector for
-      --inMemory (-mem) inMemory                 Buffer eigen matrix into memory (if you have enough!)
-      --help (-h)                                Print out help
-      --tempDir tempDir                          Intermediate output directory
-      --startPhase startPhase                    First phase to run
-      --endPhase endPhase                        Last phase to run
-
-
-The short form invocation may be used to perform the SVD on the input data: 
-
-      <MAHOUT_HOME>/bin/mahout svd \
-      --input (-i) <Path to input matrix> \   
-      --output (-o) <The directory pathname for output> \      
-      --numRows (-nr) <Number of rows of the input matrix> \   
-      --numCols (-nc) <Number of columns of the input matrix> \
-      --rank (-r) <Desired decomposition rank> \
-      --symmetric (-sym) <Is the input matrix square and symmetric>    
-
-
-The --input argument is the location on HDFS of the
-SequenceFile<Writable,VectorWritable> (preferably containing
-SequentialAccessSparseVector instances) which you wish to decompose.
-Each of its vectors has --numCols entries.  --numRows is the number of
-input rows and is used to properly size the matrix data structures.
-
-After execution, the --output directory will have a file named
-"rawEigenvectors" containing the raw eigenvectors. As the
-DistributedLanczosSolver sometimes produces "extra" eigenvectors, whose
-eigenvalues aren't valid, and also scales all of the eigenvalues down by
-the max eigenvalue (to avoid floating point overflow), there is an
-additional step which spits out the nice correctly scaled (and
-non-spurious) eigenvector/value pairs. This is done by the "cleansvd" shell
-script step (cf. EigenVerificationJob).
-
-If you have run the short form svd invocation above and require this
-"cleaning" of the eigen/singular output you can run "cleansvd" as a
-separate command:
-
-      <MAHOUT_HOME>/bin/mahout cleansvd \
-      --eigenInput <path to raw eigenvectors> \
-      --corpusInput <path to corpus> \
-      --output <path to output directory> \
-      --maxError <maximum allowed error. Default is 0.5> \
-      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
-      --inMemory <true if the eigenvectors can all fit into memory. Default false>
-
-
-The --corpusInput is the input path from the previous step, --eigenInput is
-the output from the previous step (<output>/rawEigenvectors), and --output
-is the desired output path (same as svd argument). The two "cleaning"
-params are --maxError - the maximum allowed 1-cosAngle(v,
-A.timesSquared(v)), and --minEigenvalue.  Eigenvectors with too large an
-error or too small an eigenvalue are discarded.  Optional argument:
---inMemory. If you have enough memory on your local machine (not on the
-hadoop cluster nodes!) to load all eigenvectors into memory at once (at
-least 8 bytes/double * rank * numCols), then you will see some speedups on
-this cleaning process.
-
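As a rough check of the --inMemory requirement, the lower bound of 8 bytes/double * rank * numCols can be worked out directly. The sketch below is illustrative Python and not part of Mahout; the function name `in_memory_bytes` is hypothetical, and the rank and numCols values are the ones used in the ASF mail archive example on this page. Actual JVM heap usage will likely be higher because of object overhead.

```python
def in_memory_bytes(rank: int, num_cols: int, bytes_per_double: int = 8) -> int:
    """Lower bound on the memory needed to hold `rank` dense eigenvectors,
    each with `num_cols` double-precision entries."""
    return bytes_per_double * rank * num_cols


# Values from the ASF mail archive example: --rank 100, --numCols 20444.
needed = in_memory_bytes(rank=100, num_cols=20444)
print(f"{needed} bytes (~{needed / 2**20:.1f} MiB)")
```

If the figure comfortably fits on the machine that launches the job, passing --inMemory true may be worth trying.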
-After execution, the --output directory will have a file named
-"cleanEigenvectors" containing the clean eigenvectors. 
-
-These two steps can also be invoked together by the svd command by using
-the long form svd invocation:
-
-      <MAHOUT_HOME>/bin/mahout svd \
-      --input (-i) <Path to input matrix> \   
-      --output (-o) <The directory pathname for output> \      
-      --numRows (-nr) <Number of rows of the input matrix> \   
-      --numCols (-nc) <Number of columns of the input matrix> \
-      --rank (-r) <Desired decomposition rank> \
-      --symmetric (-sym) <Is the input matrix square and symmetric> \  
-      --cleansvd "true"   \
-      --maxError <maximum allowed error. Default is 0.5> \
-      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
-      --inMemory <true if the eigenvectors can all fit into memory. Default false>
-
-
-After execution, the --output directory will contain two files: the
-"rawEigenvectors" and the "cleanEigenvectors".
-
-TODO: also allow exclusion based on improper orthogonality (currently
-computed, but not checked against constraints).
-
-<a name="DimensionalReduction-Example:SVDofASFMailArchivesonAmazonElasticMapReduce"></a>
-#### Example: SVD of ASF Mail Archives on Amazon Elastic MapReduce
-
-This section walks you through a complete example of running the Mahout SVD
-job on an Amazon Elastic MapReduce cluster and then preparing the output to be
-used for clustering. This example was developed as part of the effort to
-benchmark Mahout's clustering algorithms using a large document set (see
-[MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588)).
-Specifically, we use the ASF mail archives located at
-http://aws.amazon.com/datasets/7791434387204566.  You will likely need to
-run seq2sparse on these first. See
-$MAHOUT_HOME/examples/bin/build-asf-email.sh (on trunk) for examples of
-processing this data.
-
-At a high level, the steps we're going to perform are:
-
-bin/mahout svd (original -> svdOut)
-bin/mahout cleansvd ...
-bin/mahout transpose svdOut -> svdT
-bin/mahout transpose original -> originalT
-bin/mahout matrixmult originalT svdT -> newMatrix
-bin/mahout kmeans newMatrix
-
-The bulk of the content for this section was extracted from the Mahout user
-mailing list, see: [Using SVD with 
Canopy/KMeans](http://search.lucidimagination.com/search/document/6e5889ee6f0f253b/using_svd_with_canopy_kmeans#66a50fe017cebbe8)
- and [Need a little help with using 
SVD](http://search.lucidimagination.com/search/document/748181681ae5238b/need_a_little_help_with_using_svd#134fb2771fd52928)
-
-Note: Some of this work is due in part to credits donated by the Amazon
-Elastic MapReduce team.
-
-<a name="DimensionalReduction-1.LaunchEMRCluster"></a>
-##### 1. Launch EMR Cluster
-
-For a detailed explanation of the steps involved in launching an Amazon
-Elastic MapReduce cluster for running Mahout jobs, please read the
-"Building Vectors for Large Document Sets" section of [Mahout on Elastic 
MapReduce](https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce)
-.
-
-In the remaining steps below, remember to replace JOB_ID with the Job ID of
-your EMR cluster.
-
-<a name="DimensionalReduction-2.LoadMahout0.5+JARintoS3"></a>
-##### 2. Load Mahout 0.5+ JAR into S3
-
-These steps were created with the mahout-0.5-SNAPSHOT because they rely on
-the patch for [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639)
-
-<a name="DimensionalReduction-3.CopyTFIDFVectorsintoHDFS"></a>
-##### 3. Copy TFIDF Vectors into HDFS
-
-Before running your SVD job on the vectors, you need to copy them from S3
-to your EMR cluster's HDFS.
-
-
-    elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \
-      --arg s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors \
-      --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
-      -j JOB_ID
-
-
-<a name="DimensionalReduction-4.RuntheSVDJob"></a>
-##### 4. Run the SVD Job
-
-Now you're ready to run the SVD job on the vectors stored in HDFS:
-
-
-    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
-      --main-class org.apache.mahout.driver.MahoutDriver \
-      --arg svd \
-      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
-      --arg -o --arg /asf-mail-archives/mahout/svd \
-      --arg --rank --arg 100 \
-      --arg --numCols --arg 20444 \
-      --arg --numRows --arg 6076937 \
-      --arg --cleansvd --arg "true" \
-      -j JOB_ID
-
-
-This will run 100 iterations of the LanczosSolver SVD job to produce 87
-eigenvectors in:
-
-
-    /asf-mail-archives/mahout/svd/cleanEigenvectors
-
-
-Only 87 eigenvectors were produced because of the cleanup step, which
-removes any duplicate eigenvectors caused by convergence issues and numeric
-overflow and any that don't appear to be "eigen" enough (i.e., they don't
-satisfy the eigenvector criterion with high enough fidelity). - Jake Mannix
-
-<a name="DimensionalReduction-5.TransformyourTFIDFVectorsintoMahoutMatrix"></a>
-##### 5. Transform your TFIDF Vectors into Mahout Matrix
-
-The tfidf vectors created by the seq2sparse job are
-SequenceFile<Text,VectorWritable>. The Mahout RowId job transforms these
-vectors into a matrix form that is a
-SequenceFile<IntWritable,VectorWritable> and a
-SequenceFile<IntWritable,Text> (where the original one is the join of these
-new ones, on the new int key).
-
-
-    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
-      --main-class org.apache.mahout.driver.MahoutDriver \
-      --arg rowid \
-      --arg -Dmapred.input.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \
-      --arg -Dmapred.output.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix \
-      -j JOB_ID
-
-
-This is not a distributed job and will only run on the master server in
-your EMR cluster. The job produces the following output:
-
-
-    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/docIndex
-    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix
-
-
-where docIndex is the SequenceFile<IntWritable,Text> and matrix is
-SequenceFile<IntWritable,VectorWritable>.
-
-<a name="DimensionalReduction-6.TransposetheMatrix"></a>
-##### 6. Transpose the Matrix
-
-Our ultimate goal is to multiply the TFIDF vector matrix by our SVD
-eigenvectors. For the mathematically inclined, from the rowid job, we now
-have an m x n matrix T (m=6076937, n=20444). The SVD eigenvector matrix E
-is p x n (p=87, n=20444). So to multiply these two matrices, we need to
-transpose E so that the number of columns in T equals the number of rows in
-E^T (i.e. E^T is n x p). The result of the matrixmult is then an m x p
-matrix (m=6076937, p=87).
-
-However, in practice, computing the matrix product of two matrices as a
-map-reduce job is efficiently done as a map-side join on two row-based
-matrices with the same number of rows, and the columns are the ones which
-are different. In particular, if you take a matrix X which is represented
-as a set of numRowsX rows, each of which has numColsX entries, and another
-matrix Y with numRowsY == numRowsX rows, each of which has numColsY
-(!= numColsX) entries, then by summing the outer-products of each of the
-numRowsX pairs of vectors, you get a matrix Z with numRowsZ == numColsX,
-and numColsZ == numColsY (if you
-instead take the reverse outer product of the vector pairs, you can end up
-with the transpose of this final result, with numRowsZ == numColsY, and
-numColsZ == numColsX). - Jake Mannix
-
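The outer-product formulation can be checked on a toy example. The following is a minimal single-machine sketch in plain Python, not Mahout's map-reduce implementation; the helper names (`outer_product_sum`, `transpose`, `matmul`) and the tiny 3 x 2 and 1 x 2 matrices are made up for illustration. It shows that summing the outer products of paired rows of T^T and E^T yields the same m x p result as the direct product T * E^T.

```python
def outer_product_sum(x_rows, y_rows):
    """Return Z = X^T * Y by summing the outer products of paired rows.

    X and Y must have the same number of rows; Z is numColsX x numColsY.
    """
    if len(x_rows) != len(y_rows):
        raise ValueError("X and Y must have the same number of rows")
    num_cols_x, num_cols_y = len(x_rows[0]), len(y_rows[0])
    z = [[0.0] * num_cols_y for _ in range(num_cols_x)]
    for x_row, y_row in zip(x_rows, y_rows):
        for i, x_val in enumerate(x_row):
            for j, y_val in enumerate(y_row):
                z[i][j] += x_val * y_val
    return z


def transpose(rows):
    """Return the transpose of a row-major matrix."""
    return [list(col) for col in zip(*rows)]


def matmul(a, b):
    """Plain row-by-column matrix product, used only as a reference."""
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_t] for row in a]


# Toy stand-ins: T is m x n (3 documents, 2 terms), E is p x n (1 eigenvector).
t = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
e = [[0.5, -1.0]]

# Both inputs to the multiplication have n rows: T^T is n x m, E^T is n x p.
product = outer_product_sum(transpose(t), transpose(e))
reference = matmul(t, transpose(e))  # direct T * E^T, m x p

print(product)
print(product == reference)
```

This is why both the TFIDF matrix and the eigenvector matrix are transposed before the multiplication step: the two inputs then share the same number of rows (n).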
-Thus, we need to transpose the matrix using Mahout's Transpose Job:
-
-
-    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
-      --main-class org.apache.mahout.driver.MahoutDriver \
-      --arg transpose \
-      --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix \
-      --arg --numRows --arg 6076937 \
-      --arg --numCols --arg 20444 \
-      --arg --tempDir --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
-      -j JOB_ID
-
-
-This job requires the patch to [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639).
-
-The job creates the following output:
-
-
-    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose
-
-
-<a name="DimensionalReduction-7.TransposeEigenvectors"></a>
-##### 7. Transpose Eigenvectors
-
-If you followed Jake's explanation in step 6 above, then you know that we
-also need to transpose the eigenvectors:
-
-
-    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
-      --main-class org.apache.mahout.driver.MahoutDriver \
-      --arg transpose \
-      --arg -i --arg /asf-mail-archives/mahout/svd/cleanEigenvectors \
-      --arg --numRows --arg 87 \
-      --arg --numCols --arg 20444 \
-      --arg --tempDir --arg /asf-mail-archives/mahout/svd/transpose \
-      -j JOB_ID
-
-
-Note: You need to use the same number of reducers that was used for
-transposing the matrix you are multiplying the vectors with.
-
-The job creates the following output:
-
-
-    /asf-mail-archives/mahout/svd/transpose
-
-
-<a name="DimensionalReduction-8.MatrixMultiplication"></a>
-##### 8. Matrix Multiplication
-
-Lastly, we need to multiply the transposed vectors using Mahout's
-matrixmult job:
-
-
-    elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \
-      --main-class org.apache.mahout.driver.MahoutDriver \
-      --arg matrixmult \
-      --arg --numRowsA --arg 20444 \
-      --arg --numColsA --arg 6076937 \
-      --arg --numRowsB --arg 20444 \
-      --arg --numColsB --arg 87 \
-      --arg --inputPathA --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \
-      --arg --inputPathB --arg /asf-mail-archives/mahout/svd/transpose \
-      -j JOB_ID
-
-
-This job produces output such as:
-
-
-    /user/hadoop/productWith-189
-
-
-<a name="DimensionalReduction-Resources"></a>
-# Resources
-
-* [LSA tutorial](http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm)
-* [SVD 
tutorial](http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html)

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md 
b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md
deleted file mode 100644
index 50ff7be..0000000
--- a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md
+++ /dev/null
@@ -1,127 +0,0 @@
----
-layout: default
-title:     Stochastic SVD
-theme:
-   name: retro-mahout
----
-
-# Stochastic Singular Value Decomposition #
-
-The Stochastic SVD method in Mahout produces a reduced-rank Singular Value Decomposition in its
-strict mathematical definition: ` \(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`.
-
-##The benefits over other methods are:
-
- - reduced flops required compared to Krylov subspace methods
-
- - In map-reduce world, a fixed number of MR iterations required regardless of 
rank requested
-
- - Tweak precision/speed balance with options.
-
- - A is a Distributed Row Matrix where rows may be identified by any Writable 
(such as a document path). As such, it would work directly on the output of 
seq2sparse.
-
- - As of 0.7 trunk, includes PCA and dimensionality reduction workflow 
(EXPERIMENTAL! Feedback on performance/other PCA related issues/ blogs is 
greatly appreciated.)
-
-### Map-Reduce characteristics: 
-SSVD uses at most 3 MR sequential steps (map-only + map-reduce + 2 optional 
parallel map-reduce jobs) to produce reduced rank approximation of U, V and S 
matrices. Additionally, two more map-reduce steps are added for each power 
iteration step if requested.
-
-##Potential drawbacks:
-
-potentially less precise (but adding even one power iteration seems to fix 
that quite a bit).
-
-##Documentation
-
-[Overview and Usage][3]
-
-Note: Please use 0.6 or later! For the PCA workflow, please use 0.7 or later.
-
-##Publications
-
-[Nathan Halko's dissertation][1] "Randomized methods for computing low-rank
-approximations of matrices" contains a comprehensive definition of the parallelization strategy taken in the Mahout SSVD implementation and also some precision/scalability benchmarks, esp. w.r.t. the Mahout Lanczos implementation on a typical corpus data set.
-
-The [Halko, Martinsson, Tropp] paper discusses a family of random projection-based algorithms and contains theoretical error estimates.
-
-**R simulation**
-
-[Non-parallel SSVD simulation in R][2] with power iterations and PCA options. Note that this implementation is not optimal for a sequential solver; it is for demonstration purposes only.
-
-However, try this R code to simulate a meaningful input:
-
-
-
-**tests.R**
-
-
-
-    n<-1000
-    m<-2000
-    k<-10
-     
-    qi<-1
-     
-    #simulated input
-    svalsim<-diag(k:1)
-     
-    usim<- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m,ncol=k)))
-    vsim<- qr.Q(qr( matrix(rnorm(n*k,mean=5), nrow=n,ncol=k)))
-     
-     
-    x<- usim %*% svalsim %*% t(vsim)
-
-
-and try to compare ssvd.svd(x, k) and stock svd(x) performance for the same rank k; notice the difference in the running time. Also play with power iterations (qiter) and compare accuracies of standard svd and SSVD.
-
-Note: numerical stability of R algorithms may differ from that of Mahout's 
distributed version. We haven't studied accuracy of the R simulation. For study 
of accuracy of Mahout's version, please refer to Nathan's dissertation as 
referenced above.
-
-
-  [1]: 
http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf
-  [2]: ssvd.page/ssvd.R
-  [3]: ssvd.page/SSVD-CLI.pdf
-
-
-#### Modified SSVD Algorithm.
-
-Given an `\(m\times n\)`
-matrix `\(\mathbf{A}\)`, a target rank `\(k\in\mathbb{N}_{1}\)`
-, an oversampling parameter `\(p\in\mathbb{N}_{1}\)`, 
-and the number of additional power iterations `\(q\in\mathbb{N}_{0}\)`, 
-this procedure computes an `\(m\times\left(k+p\right)\)`
-SVD `\(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`:
-
-  1. Create seed for random `\(n\times\left(k+p\right)\)`
-  matrix `\(\boldsymbol{\Omega}\)`. The seed defines matrix 
`\(\mathbf{\Omega}\)`
-  using Gaussian unit vectors per one of suggestions in [Halko, Martinsson, 
Tropp].
-
-  2. 
`\(\mathbf{Y=A\boldsymbol{\Omega}},\,\mathbf{Y}\in\mathbb{R}^{m\times\left(k+p\right)}\)`
- 
-
-  3. Column-orthonormalize `\(\mathbf{Y}\rightarrow\mathbf{Q}\)`
-  by computing thin decomposition `\(\mathbf{Y}=\mathbf{Q}\mathbf{R}\)`.
-  Also, 
`\(\mathbf{Q}\in\mathbb{R}^{m\times\left(k+p\right)},\,\mathbf{R}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`.
-  I denote this as `\(\mathbf{Q}=\mbox{qr}\left(\mathbf{Y}\right).\mathbf{Q}\)`
- 
-
-  4. `\(\mathbf{B}_{0}=\mathbf{Q}^{\top}\mathbf{A}:\,\,\mathbf{B}_{0}\in\mathbb{R}^{\left(k+p\right)\times n}\)`.
- 
-  5. If `\(q>0\)`
-  repeat: for `\(i=1..q\)`: 
-  
`\(\mathbf{B}_{i}^{\top}=\mathbf{A}^{\top}\mbox{qr}\left(\mathbf{A}\mathbf{B}_{i-1}^{\top}\right).\mathbf{Q}\)`
-  (power iterations step).
-
-  6. Compute Eigensolution of a small Hermitian 
`\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}=\mathbf{\hat{U}}\boldsymbol{\Lambda}\mathbf{\hat{U}}^{\top}\)`,
-  
`\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`.
- 
-
-  7. Singular values `\(\boldsymbol{\Sigma}=\boldsymbol{\Lambda}^{0.5}\)`,
-  or, in other words, `\(s_{i}=\sqrt{\lambda_{i}}\)`, where `\(\lambda_{i}\)` are the diagonal entries of `\(\boldsymbol{\Lambda}\)`.
- 
-
-  8. If needed, compute `\(\mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}\)`.
- 
-
-  9. If needed, compute 
`\(\mathbf{V}=\mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}\)`.
-Another way is 
`\(\mathbf{V}=\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\Sigma}^{-1}\)`.
-
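The last four steps of the procedure follow from writing down the thin SVD of the small matrix `\(\mathbf{B}_{q}\)`. The derivation below is a sketch added for clarity, using only the symbols already defined in the steps above; here `\(\mathbf{Q}\)` denotes the orthonormal factor from the final iteration, so that `\(\mathbf{B}_{q}=\mathbf{Q}^{\top}\mathbf{A}\)`.

```latex
\begin{aligned}
\mathbf{B}_{q} &= \hat{\mathbf{U}}\boldsymbol{\Sigma}\mathbf{V}^{\top}
  && \text{(thin SVD of } \mathbf{B}_{q}\text{)} \\
\mathbf{B}_{q}\mathbf{B}_{q}^{\top}
  &= \hat{\mathbf{U}}\boldsymbol{\Sigma}^{2}\hat{\mathbf{U}}^{\top}
   = \hat{\mathbf{U}}\boldsymbol{\Lambda}\hat{\mathbf{U}}^{\top}
  && \Rightarrow\quad \boldsymbol{\Sigma}=\boldsymbol{\Lambda}^{0.5} \\
\mathbf{B}_{q}^{\top}\hat{\mathbf{U}} &= \mathbf{V}\boldsymbol{\Sigma}
  && \Rightarrow\quad \mathbf{V}=\mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1} \\
\mathbf{A} &\approx \mathbf{Q}\mathbf{Q}^{\top}\mathbf{A}
   = \mathbf{Q}\mathbf{B}_{q}
   = \left(\mathbf{Q}\hat{\mathbf{U}}\right)\boldsymbol{\Sigma}\mathbf{V}^{\top}
  && \Rightarrow\quad \mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}
\end{aligned}
```

Only the `\(\left(k+p\right)\times\left(k+p\right)\)` matrix `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}\)` needs an eigendecomposition, which is why that step is cheap regardless of the size of `\(\mathbf{A}\)`.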
-[Halko, Martinsson, Tropp]: http://arxiv.org/abs/0909.4061
- 

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf
 
b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf
deleted file mode 100644
index ab5999d..0000000
Binary files 
a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf
 and /dev/null differ

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R 
b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R
deleted file mode 100644
index fa5fa84..0000000
--- 
a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R
+++ /dev/null
@@ -1,181 +0,0 @@
-
-# standard SSVD
-ssvd.svd <- function(x, k, p=25, qiter=0 ) { 
-
-a <- as.matrix(x)
-m <- nrow(a)
-n <- ncol(a)
-p <- min( min(m,n)-k,p)
-r <- k+p
-
-omega <- matrix ( rnorm(r*n), nrow=n, ncol=r)
-
-y <- a %*% omega
-
-q <- qr.Q(qr(y))
-
-b<- t(q) %*% a
-
-#power iterations (seq_len() yields an empty sequence when qiter=0, unlike 1:qiter)
-for ( i in seq_len(qiter) ) { 
-  y <- a %*% t(b)
-  q <- qr.Q(qr(y))
-  b <- t(q) %*% a
-}
-
-bbt <- b %*% t(b)
-
-e <- eigen(bbt, symmetric=T)
-
-res <- list()
-
-res$svalues <- sqrt(e$values)[1:k]
-uhat=e$vectors[1:k,1:k]
-
-res$u <- (q %*% e$vectors)[,1:k]
-# V = B' * Uhat * Sigma^-1, with Sigma = sqrt(Lambda)
-res$v <- (t(b) %*% e$vectors %*% diag(1/sqrt(e$values)))[,1:k]
-
-return(res)
-}
-
-#SSVD with Q=YR^-1 substitute.
-# this is just a simulation, because it is suboptimal to verify the actual result
-ssvd.svd1 <- function(x, k, p=25, qiter=0 ) { 
-
-a <- as.matrix(x)
-m <- nrow(a)
-n <- ncol(a)
-p <- min( min(m,n)-k,p)
-r <- k+p
-
-omega <- matrix ( rnorm(r*n), nrow=n, ncol=r)
-
-# in reality we of course don't need to form and persist y
-# but this is just verification
-y <- a %*% omega
-
-yty <- t(y) %*% y
-R <- chol(yty, pivot = T)
-q <- y %*% solve(R)
-
-b<- t( q ) %*% a   
-
-#power iterations (seq_len() yields an empty sequence when qiter=0, unlike 1:qiter)
-for ( i in seq_len(qiter) ) { 
-  y <- a %*% t(b)
-
-  yty <- t(y) %*% y
-  R <- chol(yty, pivot = T)
-  q <- y %*% solve(R)
-  b <- t(q) %*% a
-}
-
-bbt <- b %*% t(b)
-
-e <- eigen(bbt, symmetric=T)
-
-res <- list()
-
-res$svalues <- sqrt(e$values)[1:k]
-uhat=e$vectors[1:k,1:k]
-
-res$u <- (q %*% e$vectors)[,1:k]
-# V = B' * Uhat * Sigma^-1, with Sigma = sqrt(Lambda)
-res$v <- (t(b) %*% e$vectors %*% diag(1/sqrt(e$values)))[,1:k]
-
-return(res)
-}
-
-
-#############
-## ssvd with pca options
-ssvd.cpca <- function ( x, k, p=25, qiter=0, fixY=T ) { 
-
-a <- as.matrix(x)
-m <- nrow(a)
-n <- ncol(a)
-p <- min( min(m,n)-k,p)
-r <- k+p
-
-
-# compute column means xi
-xi<-colMeans(a)
-
-omega <- matrix ( rnorm(r*n), nrow=n, ncol=r)
-
-y <- a %*% omega
-
-#fix y
-if ( fixY ) { 
-  #debug
-  cat ("fixing Y...\n");
-
-  s_o = t(omega) %*% cbind(xi)
-  for (i in 1:r ) y[,i]<- y[,i]-s_o[i]
-}
-
-
-q <- qr.Q(qr(y))
-
-b<- t(q) %*% a
-
-# compute sum of q rows 
-s_q <- cbind(colSums(q))
-
-# compute B*xi
-# of course in MR implementation 
-# it will be collected as sums of ( B[,i] * xi[i] ) and reduced after.
-s_b <- b %*% cbind(xi)
-
-
-#power iterations (seq_len() yields an empty sequence when qiter=0, unlike 1:qiter)
-for ( i in seq_len(qiter) ) { 
-
-  # fix b 
-  b <- b - s_q %*% rbind(xi) 
-
-  y <- a %*% t(b)
-
-  # fix y 
-  if ( fixY )  
-    for (i in 1:r ) y[,i]<- y[,i]-s_b[i]
-  
-
-  q <- qr.Q(qr(y))
-  b <- t(q) %*% a
-
-  # recompute s_{q}
-  s_q <- cbind(colSums(q))
-
-  #recompute s_{b}
-  s_b <- b %*% cbind(xi)
-
-}
-
-
-
-#C is the outer product of S_q and S_b per doc
-C <- s_q %*% t(s_b)
-
-# fixing BB'
-bbt <- b %*% t(b) -C -t(C) + sum(xi * xi)* (s_q %*% t(s_q))
-
-e <- eigen(bbt, symmetric=T)
-
-res <- list()
-
-res$svalues <- sqrt(e$values)[1:k]
-uhat=e$vectors[1:k,1:k]
-
-res$u <- (q %*% e$vectors)[,1:k]
-
-# V = (B - s_q * xi')' * Uhat * Sigma^-1, with Sigma = sqrt(Lambda)
-res$v <- (t(b- s_q %*% rbind(xi) ) %*% e$vectors %*% diag(1/sqrt(e$values)))[,1:k]
-
-return(res)
-
-}
-
-
-
-
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/general/books-tutorials-and-talks.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/general/books-tutorials-and-talks.md 
b/website/old_site_migration/old_site/general/books-tutorials-and-talks.md
deleted file mode 100644
index bbbdeef..0000000
--- a/website/old_site_migration/old_site/general/books-tutorials-and-talks.md
+++ /dev/null
@@ -1,121 +0,0 @@
----
-layout: default
-title: Books Tutorials and Talks
-theme:
-    name: retro-mahout
----
-# Intro
-
-This page is a place for info about talks (past and upcoming), tutorials, 
articles, books, slides, PDFs, discussions, etc. about Mahout. No endorsements 
are implied or
-given.
-
-# Books
-
-## Mahout specific
-
-   * <a href="http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html">Apache Mahout: Beyond MapReduce</a> by Dmitriy Lyubimov and Andrew Palumbo published Feb 2016. Covers new features in Mahout "Samsara" releases (0.10, 0.11+).
-   * <a href="http://www.packtpub.com/apache-mahout-cookbook/book">Apache Mahout cookbook</a> - Book by Piero Giacomelli published Dec 2013 by Packtpub.
-   * <a href="http://www.manning.com/owen/">Mahout in Action</a> - Book by Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman published Oct 2011 by Manning Publications.
-   * <a href="http://www.manning.com/ingersoll/">Taming Text</a> - By Grant Ingersoll and Tom Morton, published by Manning Publications. Will have some Mahout coverage, but by no means as complete as Mahout in Action.
-
-## Engineering oriented machine learning books
-
-   * <a href="http://www.amazon.com/Collective-Intelligence-Action-Satnam-Alag/dp/1933988312/ref=pd_bbs_sr_3?ie=UTF8&s=books&qid=1214545249&sr=1-3">Collective Intelligence in Action</a>
-   * <a href="http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325/ref=pd_bbs_sr_1/104-1017533-9408723?ie=UTF8&s=books&qid=1214593516&sr=1-1">Programming Collective Intelligence</a>
-   * <a href="http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665/ref=sr_1_1?s=books&ie=UTF8&qid=1298005918&sr=1-1">Algorithms of the Intelligent Web</a>
-
-## Scientific background
-
-   * <a href="http://www.cs.waikato.ac.nz/~ml/weka/book.html">Data Mining: Practical Machine Learning Tools and Techniques</a>
-   * <a href="http://www-nlp.stanford.edu/IR-book/">Introduction to Information Retrieval</a>
-   * <a href="http://www.amazon.com/Machine-Learning-Mcgraw-Hill-International-Edit/dp/0071154671/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1214593709&sr=8-1">Machine Learning</a>
-   * <a href="http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738/ref=pd_bbs_sr_2?ie=UTF8&s=books&qid=1214593709&sr=8-2">Pattern Recognition and Machine Learning (Information Science and Statistics)</a>
-
-# News, Articles and Tutorials
-
-   * [Mahout 0.10.x: first Mahout release as a programming 
environment](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html)
   
-   * [Comparing Document Classification Functions of Lucene and 
Mahout](http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html)
-   * <a 
href="http://www.ibm.com/developerworks/java/library/j-mahout-scaling/";>Apache 
Mahout: Scalable Machine Learning for Everyone</a>
-   * <a 
href="http://emmaespina.wordpress.com/2011/04/26/ham-spam-and-elephants-or-how-to-build-a-spam-filter-server-with-mahout/";>How
 to build a spam filter server with Mahout</a> - Applying classification on a 
live server - April 2011
-   * <a 
href="http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/";>Deploying
 a massively scalable recommender system with Apache Mahout</a> - Blogpost of 
Sebastian Schelter in April 2011
-   * <a href="http://www.redmonk.com/cote/2010/11/04/makeall013/";>Apache 
Mahout & the commoditization of machine learning </a> - Podcast interview with 
Grant Ingersoll at ApacheCon 2010
-   * <a href="http://isabel-drost.de/hadoop/slides/devoxx.pdf";>Apache Mahout 
0.4 mit neuen Algorithmen</a> - published after the 0.4 release by heise Open/ 
Developer, November 2010
-   * <a href="http://www.infoq.com/news/2009/04/mahout";>Mahout on InfoQ</a> - 
Interview with Grant Ingersoll on InfoQ
-   * <a 
href="http://www.cloudera.com/blog/2009/04/21/hadoop-uk-user-group-meeting/";>Mahout
 in the Cloudera weblog</a> - published after the Hadoop user group UK.
-   * <a 
href="http://blog.athico.com/2008/08/machine-learning-and-apache-mahout.html";>Mahout
 in the Drools weblog</a> - Michael Neale published an article on Mahout in the 
drools weblog
-   * <a 
href="https://www.ibm.com/developerworks/java/library/j-mahout/index.html";>Introducing
 Apache Mahout</a> - Grant Ingersoll - Intro to Apache Mahout focused on 
clustering, classification and collaborative filtering. Japanese translation 
available at: 
[http://www.ibm.com/developerworks/jp/java/library/j-mahout/](http://www.ibm.com/developerworks/jp/java/library/j-mahout/)
-   * <a 
href="http://philippeadjiman.com/blog/2009/11/11/flexible-collaborative-filtering-in-java-with-mahout-taste/";>Flexible
 Collaborative Filtering In Java With Mahout Taste</a> - Philippe Adjiman - 
Quick starting guide on how to use the collaborative filtering package of 
Mahout (called Taste) to quickly and flexibly create, test and compare tailored 
recommendation engines.
-   * <a 
href="http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/";>Integrating
 Mahout with Lucene and Solr</a> Three part series on ways to integrate Mahout 
with Lucene and Solr
-   * <a href="https://www.youtube.com/watch?v=yD40rVKUwPI";>Mahout Item 
Recommender Tutorial using Java and Eclipse</a> - YouTube video tutorial by 
Steve Cook
-
-
-# Coursework/Lectures
-
-   * <a 
href="http://videolectures.net/mlss05us_chicago/";>http://videolectures.net/mlss05us_chicago/</a>
-   * <a 
href="http://videolectures.net/mlas06_pittsburgh/";>http://videolectures.net/mlas06_pittsburgh/</a>
-   * <a 
href="http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1";>Stanford
 Lectures on Machine Learning by Andrew Ng</a>
-   * <a 
href="https://docs.google.com/open?id=0ByhGL2_SCeitMDQ3OTczNjItM2ZjYi00ZDg5LWE0MzItZGQxODQ5NzkzYjNj";>CMU@Qatar
 Introduction to Mahout lecture</a>
-
-
-# Talks
-
-In reverse chronological order, so that most recent talks are at the top
-
-   * [Distributed Machine Learning with Apache Mahout] Suneel Marthi at Apache 
Big Data North America, Vancouver, Canada, May 11, 2016 and MapR Washington DC 
Big Data Everywhere, Tysons, VA, June 2 2016
-   * [Declarative Machine Learning with the Samsara 
DSL](http://www.slideshare.net/FlinkForward/sebastian-schelter-distributed-machine-learing-with-the-samsara-dsl)
 Sebastian Schelter at Flink Forward Conference, Berlin Germany, October 2015.
-   * [Bringing Algebraic Semantics to 
Mahout](http://www.slideshare.net/sscdotopen/bringing-algebraic-semantics-to-mahout)
 Sebastian Schelter at HPI Infolunch, Potsdam Germany, May 2014
-   * Mahout Spark and Scala bindings: Bringing Algebraic Semantics 
([slides](http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings)/[video](http://youtu.be/h9dpmvNW1Dw))
 - Dmitriy Lyubimov at Mahout Meetup, April 17, 2014. 
-   * Mahout Future Directions - Ted Dunning, Suneel Marthi, Sebastian Schelter 
at Hadoop Summit Europe 2014, Amsterdam, April 3, 2014
-   * Building Recommender Systems for Mere-Mortals - Sebastian Schelter at 
Researchgate Developer Day, Berlin, November 2013
-   * Recommendations with Apache Mahout - Sebastian Schelter at IBM Almaden 
Research Center, San Jose, September 2013
-   * <a href="http://de.slideshare.net/sscdotopen/next-directions-in-mahouts-recommenders">Next Directions in Mahout's Recommenders</a> - Sebastian Schelter at Bay Area Mahout Meetup, Redwood City, August 2013
-   * <a href="http://de.slideshare.net/sscdotopen/new-directions-in-mahouts-recommenders">New Directions in Mahout's Recommenders</a> - Sebastian Schelter at Recommender Systems Get Together Berlin, April 2013
-   * <a 
href="http://www.slideshare.net/VaradMeru/introduction-to-mahout-and-machine-learning";>Introduction
 to Mahout and Machine Learning</a> - Slides by Varad Meru, Software 
Development Engineer at Orzota. July 27th, 2013.
-   * <a 
href="http://de.slideshare.net/sscdotopen/introduction-to-collaborative-filtering-with-apache-mahout";>An
 Introduction to Collaborative Filtering with Apache Mahout</a> - Sebastian 
Schelter at Recommender Systems Challenge Workshop in conjunction with ACM 
RecSys 2012, Dublin, September 2012
-   * <a 
href="https://github.com/ManuelB/facebook-recommender-demo/raw/master/docs/Talk-BedCon-Berlin-2012.pdf";>How
 to build a recommender system based on Mahout and JavaEE</a> - Slides by 
Manuel Blechschmidt at Berlin Expert Days March, 2012.
-   * <a href="http://lanyrd.com/2011/apachecon-north-america/skdtb/";>Apache 
Mahout for intelligent data analysis</a> - Slides from Isabel Drost at Apache 
Con NA November, 2011.
-   * <a href="http://lanyrd.com/2011/apachecon-north-america/skdrk/";>Dr. 
Mahout: Analyzing clinical data using scalable and distributed computing</a> - 
Slides from Shannon Quinn at Apache Con NA November, 2011.
-   * Frank Scholten at Berlin Buzzwords on June 7, 2011.
-   * Introduction to Collaborative Filtering using Mahout (updated) - Talk by 
Sean Owen at the London Hadoop User Group on April 14, 2011.
-   *  <a 
href="http://www.meetup.com/LA-HUG/pages/Video_from_March_16th_LA-HUG_Ted_Dunning_Mahout";>Cool
 Tricks with Classifiers</a> - Talk by Ted Dunning at the Los Angeles HUG 
talking about Mahout classifiers on March 16, 2011.
-   * First Mahout Hackathon, Berlin, March 2011
-   * <a 
href="http://blog.jteam.nl/2011/01/13/announcement-lucene-nl-mahout-meetup-with-isabel-drost-feb-7/";>Mahout
 meetup</a> - there were two talks at the Apache Mahout meetup at JTeam in 
Amsterdam, February 2011. <a 
href="http://isabel-drost.de/hadoop/slides/jteam.pdf";>intro slides</a>
-   * <a 
href="http://www.fosdem.org/2011/schedule/event/mahoutclustering.html";>Mahout 
clustering </a> - Talk on Mahout clustering at data dev room FOSDEM, February 
2011.
-   * Scaling Data Analysis with Apache Mahout - talk on Mahout at O'Reilly 
Strata, February 2011. 
-   * <a 
href="http://www.slideshare.net/jaganadhg/mahout-tutorial-fossmeet-nitc";>Practical
 Machine Learning</a> - Slides from Biju B and Jaganadh G, FOSSMEET-NITC, 
Calicut, India, February 2011.
-   * <a href="http://www.javaedge.com/jedge/pdf/Mahout.pdf";>Mahout at 
AlphaCSPs The Edge 2010 (pdf)</a> - <a 
href="http://www.slideshare.net/arikogan/mahouts-presentation-at-alphacsps-the-edge-2010";>slideshare</a>
 - Slides from <a href="http://il.linkedin.com/in/arielkogan";>Ariel Kogan</a> 
AlphaCSP's The Edge, December 2010.
-   * <a href="http://isabel-drost.de/hadoop/slides/devoxx.pdf";>Intelligent 
data analysis with Apache Mahout</a> - Slides from Isabel Drost, Devoxx 
Antwerp, November 2010.
-   * <a href="http://isabel-drost.de/hadoop/slides/codebits.pdf";>Apache Mahout 
introduction</a> - Slides from Isabel Drost, codebits Lisbon, November 2010.
-   * <a href="http://isabel-drost.de/hadoop/slides/apachecon_2010.pdf";>Apache 
Mahout - Making Data Analysis Easy</a> - Slides from Isabel Drost, Apache Con 
US Atlanta, November 2010.
-   * <a href="http://www.slideshare.net/jaganadhg/bck9";>Practical Machine 
Learning</a> - Slides from Jaganadh G, BarCamp Kerala 9, November 2010.
-   * <a href="http://www.slideshare.net/tdunning/sdforum-11042010">Mahout and 
its new classification framework</a> - Slides from Ted Dunning, SDForum, 
November 2010.
-   * <a href="http://www.slideshare.net/sscdotopen/mahoutcf">Distributed 
Item-based Collaborative Filtering with Apache Mahout</a> - Slides from 
Sebastian Schelter, Hadoop Get Together Berlin, October 2010.
-   * <a href="http://isabel-drost.de/hadoop/slides/HMM.pdf">Hidden Markov 
Models for Mahout</a> - Slides from Max Heimel, Hadoop Get Together Berlin, 
October 2010.
-   * <a 
href="http://www.slideshare.net/robinanil/oscon-apache-mahout-mammoth-scale-machine-learning">Apache
 Mahout Mammoth Scale Machine Learning</a> - Slides from Robin Anil, OSCON 
2010.
-   * <a href="http://slidesha.re/9LxOIu">Intro to Apache Mahout</a> - Slides 
from Grant Ingersoll,  RTP Semantic Web Group.
-   * <a href="http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010">Case 
study: Biometric Databases and Hadoop</a> - Slides from Jason Trost, Hadoop 
Summit 2010.
-   * <a 
href="http://www.slideshare.net/hadoopusergroup/mail-antispam?from=ss_embed">Spam
 Fighting at Yahoo</a>
-   * <a 
href="http://www.slideshare.net/hadoopusergroup/bixo-hug-talk?from=ss_embed">Web
 Mining with Ken Krugler</a>
-   * <a 
href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/ingersoll_bbuzz2010.pdf">Keynote
 on intelligent search</a> - Slides from Grant Ingersoll, Berlin Buzzwords, 
June 2010.
-   * <a 
href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/owen_bbuzz2010.pdf">Simple
 co-occurrence-based recommendation on Hadoop</a> - Slides from Sean Owen, 
Berlin Buzzwords, June, 2010.
-   * <a 
href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/scholten_bbuzz2010.odp">Introduction
 to Collaborative Filtering using Mahout</a> - Slides from Frank Scholten, 
Berlin Buzzwords, June, 2010.
-   * <a 
href="http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/">Introduction
 to Scalable Machine Learning</a> - Slides and demos from Grant Ingersoll, 
March, 2010.
-   * Mahout @ India Hadoop Summit - Slides from a 1 hour talk on Mahout at the 
India Hadoop Summit by Robin Anil, February 2010.
-   * <a 
href="http://www.isabel-drost.de/hadoop/slides/opensourceexpo09.pdf">Mahout in 
10 minutes</a> - Slides from a 10 min intro to Mahout at the Map Reduce 
tutorial by David Z&uuml;lke at Open Source Expo in Karlsruhe, Isabel Drost, 
November 2009.
-   * <a 
href="http://www.isabel-drost.de/hadoop/slides/apacheconus2009.pdf">Mahout at 
Apache Con US</a> - Slides from a talk on "Going from raw data to information" 
(with Mahout) at Apache Con US in Oakland, Isabel Drost, November 2009.
-   * <a href="http://www.isabel-drost.de/hadoop/slides/froscon2009.pdf">Mahout 
at FrOSCon</a> - Slides from a talk on Mahout at FrOSCon in Sankt Augustin, 
Isabel Drost, August 2009.
-   * <a href="http://www.isabel-drost.de/hadoop/slides/dai.pdf">Mahout at DAI 
group TU Berlin</a> - Slides from a talk on Mahout at the DAI Laboratories TU 
Berlin, Isabel Drost, July 2009.
-   * <a href="http://www.isabel-drost.de/hadoop/slides/ulf.pdf">Mahout at 
Machine Learning Group TU Berlin</a> - Slides from a talk on Hadoop with some 
detour to Mahout at the Machine
Learning Group of Prof. Dr. Klaus-Robert M&uuml;ller at TU Berlin, Isabel 
Drost, June 2009.
-   * <a href="http://www.isabel-drost.de/hadoop/slides/google.pdf">Mahout at 
Google Z&uuml;rich</a> - Slides from a Google tech-talk on the past, present 
and future of Mahout, Isabel Drost, May 2009.
-   * <a 
href="http://static.last.fm/johan/huguk-20090414/isabel_drost-introducing_apache_mahout.pdf">Hadoop
 user group UK</a> - Slides from a talk on April 14, 2009 at the Hadoop User 
Group UK in London, Isabel Drost, April 2009.
-   * <a 
href="http://cwiki.apache.org/confluence/download/attachments/88410/SDForum.pdf">BI
 Over Petabytes: Meet Apache Mahout</a> - Slides from a talk by Jeff Eastman on 
April 21, 2009 at the Bay Area SD Forum Business Intelligence SIG meeting at 
SAP in Palo Alto, CA.
-   * Lucene Meetup and Apache Barcamp in Amsterdam, March 2009.
-   * BarCampRDU - (Raleigh) on Aug. 2, 2008
-   * Introducing Mahout: Apache Machine Learning - Committer Grant Ingersoll 
gave a gentle introduction to Mahout and Machine Learning at ApacheCon in 
November (3rd through 7th) in New Orleans, USA. 
-   * Mahout: Scaling Machine Learning - Introduction to Mahout and machine 
learning at FrOSCon in Sankt Augustin/Germany, Isabel Drost, August 2008.  (<a 
href="http://cwiki.apache.org/confluence/download/attachments/88410/froscon.pdf">slides</a>)
-   * Mahout: Scalable Machine Learning - An introduction to Mahout and machine 
learning at the first German Hadoop gathering in newthinking store/ Berlin, 
Isabel Drost, July 2008.
-   * Apache Mahout: Industrial Strength Machine Learning - Committer Jeff 
Eastman gave an introduction to Mahout at Yahoo\!, May 2008
-   * <a 
href="http://people.apache.org/~berndf/openexpode08-lucene-talk.pdf">Apache 
Lucene - Mach's wie Google</a> - Bernd Fondermann presented an overview of the 
Apache Lucene project,
including Mahout at Open Source Expo 2008 in Karlsruhe, May 2008.
-   * Apache Mahout: Bringing Machine Learning to Industrial Strength - 
Committer Isabel Drost gave a Fast Feather introduction to the new project 
Mahout at Apache Con EU, April 2008
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/general/mahout-wiki.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/old_site/general/mahout-wiki.md 
b/website/old_site_migration/old_site/general/mahout-wiki.md
deleted file mode 100644
index 2df16d4..0000000
--- a/website/old_site_migration/old_site/general/mahout-wiki.md
+++ /dev/null
@@ -1,202 +0,0 @@
----
-layout: default
-title: Mahout Wiki
-theme:
-    name: retro-mahout
----
-
-On the fence about including this in new site. lol at "new Apache TLP"
-
-Apache Mahout is a new Apache TLP project to create scalable, machine
-learning algorithms under the Apache license. 
-
-{toc:style=disc|minlevel=2}
-
-<a name="MahoutWiki-General"></a>
-## General
-[Overview](overview.html)
- -- Mahout? What's that supposed to be?
-
-[Quickstart](quickstart.html)
- -- learn how to quickly set up Apache Mahout for your project.
-
-[FAQ](faq.html)
- -- Frequent questions encountered on the mailing lists.
-
-[Developer Resources](developer-resources.html)
- -- overview of the Mahout development infrastructure.
-
-[How To Contribute](how-to-contribute.html)
- -- get involved with the Mahout community.
-
-[How To Become A Committer](how-to-become-a-committer.html)
- -- become a member of the Mahout development community.
-
-[Hadoop](http://hadoop.apache.org)
- -- several of our implementations depend on Hadoop.
-
-[Machine Learning Open Source Software](http://mloss.org/software/)
- -- other projects implementing Open Source Machine Learning libraries.
-
-[Mahout -- The name, history and its pronunciation](mahoutname.html)
-
-<a name="MahoutWiki-Community"></a>
-## Community
-
-[Who we are](who-we-are.html)
- -- who are the developers behind Apache Mahout?
-
-[Books, Tutorials, Talks, Articles, News, Background Reading, etc. on 
Mahout](books-tutorials-and-talks.html)
-
-[Issue Tracker](issue-tracker.html)
- -- see what features people are working on, submit patches and file bugs.
-
-[Source Code (SVN)](https://svn.apache.org/repos/asf/mahout/)
- -- [Fisheye](http://fisheye6.atlassian.com/browse/mahout)
- -- download the Mahout source code from svn.
-
-[Mailing lists and IRC](mailing-lists,-irc-and-archives.html)
- -- links to our mailing lists, IRC channel and archived design and
-algorithm discussions; maybe your question was answered there already?
-
-[Version Control](version-control.html)
- -- where we track our code.
-
-[Powered By Mahout](powered-by-mahout.html)
- -- who is using Mahout in production?
-
-[Professional Support](professional-support.html)
- -- who is offering professional support for Mahout?
-
-[Mahout and Google Summer of Code](gsoc.html)
-  -- All you need to know about Mahout and GSoC.
-
-
-[Glossary of commonly used terms and abbreviations](glossary.html)
-
-<a name="MahoutWiki-Installation/Setup"></a>
-## Installation/Setup
-
-[System Requirements](system-requirements.html)
- -- what do you need to run Mahout?
-
-[Quickstart](quickstart.html)
- -- get started with Mahout, run the examples and get pointers to further
-resources.
-
-[Downloads](downloads.html)
- -- a list of Mahout releases.
-
-[Download and installation](buildingmahout.html)
- -- build Mahout from the sources.
-
-[Mahout on Amazon's EC2 Service](mahout-on-amazon-ec2.html)
- -- run Mahout on Amazon's EC2.
-
-[Mahout on Amazon's EMR](mahout-on-elastic-mapreduce.html)
- -- Run Mahout on Amazon's Elastic Map Reduce
-
-[Integrating Mahout into an Application](mahoutintegration.html)
- -- integrate Mahout's capabilities in your application.
-
-<a name="MahoutWiki-Examples"></a>
-## Examples
-
-1. [ASF Email Examples](asfemail.html)
- -- Examples of recommenders, clustering and classification all using a
-public domain collection of 7 million emails.
-
-<a name="MahoutWiki-ImplementationBackground"></a>
-## Implementation Background
-
-<a name="MahoutWiki-RequirementsandDesign"></a>
-### Requirements and Design
-
-[Matrix and Vector Needs](matrix-and-vector-needs.html)
- -- requirements for Mahout vectors.
-
-[Collection(De-)Serialization](collection(de-)serialization.html)
-
-<a name="MahoutWiki-CollectionsandAlgorithms"></a>
-### Collections and Algorithms
-
-Learn more about [mahout-collections](mahout-collections.html)
-, containers for efficient storage of primitive-type data and open hash
-tables.
-
-Learn more about the [Algorithms](algorithms.html)
- discussed and employed by Mahout.
-
-Learn more about the [Mahout recommender 
implementation](recommender-documentation.html)
-.
-
-<a name="MahoutWiki-Utilities"></a>
-### Utilities
-
-This section describes tools that might be useful for working with Mahout.
-
-[Converting Content](converting-content.html)
- -- Mahout has some utilities for converting content such as logs to
-formats more amenable for consumption by Mahout.
-[Creating Vectors](creating-vectors.html)
- -- Mahout's algorithms operate on vectors. Learn more on how to generate
-these from raw data.
-[Viewing Result](viewing-result.html)
- -- How to visualize the result of your trained algorithms.
-
-<a name="MahoutWiki-Data"></a>
-### Data
-
-[Collections](collections.html)
- -- To try out and test Mahout's algorithms you need training data. We are
-always looking for new training data collections.
-
-<a name="MahoutWiki-Benchmarks"></a>
-### Benchmarks
-
-[Mahout Benchmarks](mahout-benchmarks.html)
-
-<a name="MahoutWiki-Committer'sResources"></a>
-## Committer's Resources
-
-* [Testing](testing.html)
- -- Information on test plans and ideas for testing
-
-<a name="MahoutWiki-ProjectResources"></a>
-### Project Resources
-
-* [Dealing with Third Party Dependencies not in 
Maven](thirdparty-dependencies.html)
-* [How To Update The Website](how-to-update-the-website.html)
-* [Patch Check List](patch-check-list.html)
-* [How To 
Release](http://cwiki.apache.org/confluence/display/MAHOUT/How+to+release)
-* [Release Planning](release-planning.html)
-* [Sonar Code Quality 
Analysis](https://analysis.apache.org/dashboard/index/63921)
-
-<a name="MahoutWiki-AdditionalResources"></a>
-### Additional Resources
-
-* [Apache Machine Status](http://monitoring.apache.org/status/)
- \- Check to see if SVN, other resources are available.
-* [Committer's FAQ](http://www.apache.org/dev/committers.html)
-* [Apache Dev](http://www.apache.org/dev/)
-
-
-<a name="MahoutWiki-HowToEditThisWiki"></a>
-## How To Edit This Wiki
-
-How to edit this Wiki
-
-This Wiki is a collaborative site, anyone can contribute and share:
-
-* Create an account by clicking the "Login" link at the top of any page,
-and picking a username and password.
-* Edit any page by pressing Edit at the top of the page
-
-There are some conventions used on the Mahout wiki:
-
-    * {noformat}+*TODO:*+{noformat} (+*TODO:*+ ) is used to denote sections
-that definitely need to be cleaned up.
-    * {noformat}+*Mahout_(version)*+{noformat} (+*Mahout_0.2*+) is used to
-draw attention to which version of Mahout a feature was (or will be) added
-to Mahout.
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/general/professional-support.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/general/professional-support.md 
b/website/old_site_migration/old_site/general/professional-support.md
deleted file mode 100644
index 45d798c..0000000
--- a/website/old_site_migration/old_site/general/professional-support.md
+++ /dev/null
@@ -1,41 +0,0 @@
----
-layout: default
-title: Professional Support
-theme:
-    name: retro-mahout
----
-
-NOTE: on the fence about including this in new site.
-
-<a name="ProfessionalSupport-ProfessionalsupportforMahout"></a>
-# Professional support for Mahout
-
-Add yourself or your company if you are offering support for Mahout
-users. Please keep lists in alphabetical order. An entry here
-is not an endorsement by the Apache Software Foundation nor any of its
-committers.
-
-
-<a name="ProfessionalSupport-Peopleandcompaniesforhire"></a>
-## People and companies for hire
-
-| Name | Contact details | Notes |
-|------|-----------------|-------|
-| Accenture | [email protected] | [Consulting services in big 
data analytics](http://accenture.com) |
-| Boston Predictive Analytics | [email protected] | 
[http://tutorteddy.com/site/free_statistics_help.php](http://tutorteddy.com/site/free_statistics_help.php)
 |
-| Frank Scholten | [email protected] | |
-| GridLine | [http://www.gridline.nl/contact](http://www.gridline.nl/contact) 
| Specialised in search and thesauri |
-| Jagdish Nomula | [email protected] | ML, Search, Algorithms, Java 
[http://www.kosmex.com](http://www.kosmex.com) |
-| LucidWorks | [http://www.lucidworks.com](http://www.lucidworks.com) | Big 
data platform including Mahout as a service for clustering, classification and 
more |
-| Sematext International | [http://sematext.com/](http://sematext.com/) | |
-| Ted Dunning | [email protected] | Full commercial support |
-| Winterwell | [email protected] | Business/maths concept development & 
algorithms [http://winterwell.com](http://winterwell.com) |
-
-<a name="ProfessionalSupport-Talksandpresentations"></a>
-## Talks and presentations
-
-| Name | Contact details | Notes |
-|------|-----------------|-------|
-| Andrew Musselman | [email protected] | ["Building a Recommender with Apache 
Mahout on Amazon 
Elastic-MapReduce"](https://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR)
 |
-| Frank Scholten | [email protected] | Mahout/Taste 
[http://blog.jteam.nl/author/frank/](http://blog.jteam.nl/author/frank/) |
-| Isabel Drost-Fromm | [email protected] | If travel and accommodation costs 
are covered, scheduling a talk is a lot easier. |

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/general/reference-reading.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/old_site/general/reference-reading.md 
b/website/old_site_migration/old_site/general/reference-reading.md
deleted file mode 100644
index ba969ac..0000000
--- a/website/old_site_migration/old_site/general/reference-reading.md
+++ /dev/null
@@ -1,71 +0,0 @@
----
-layout: default
-title: Reference Reading
-theme:
-    name: retro-mahout
----
-
-# Reference Reading
-
-Here we provide references to books and courses about data analysis in 
general, which might also be helpful in the context of Mahout.
-
-<a name="ReferenceReading-GeneralBackgroundMaterials"></a>
-## General Background Materials
-
-Don't be overwhelmed by all the maths; you can do a lot in Mahout with some
-basic knowledge. The books will help you understand your
-data better, and ask better questions both of Mahout's APIs, and also of
-the Mahout community. And unlike learning some particular software tool,
-these are skills that will remain useful decades later.
-
- * [Gilbert Strang](http://www-math.mit.edu/~gs)
-'s [Introduction to Linear Algebra](http://math.mit.edu/linearalgebra/). His 
[lectures](http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/)
 are also [available online](http://web.mit.edu/18.06/www/)
- and are strongly recommended. 
- * [Mathematical Tools for Applied Multivariate 
Analysis](http://www.amazon.com/Mathematical-Tools-Applied-Multivariate-Analysis/dp/0121609553/ref=sr_1_1?ie=UTF8&qid=1299602805&sr=8-1)
 by J. Douglas
-Carroll.
- * [Stanford Machine Learning online 
courseware](http://www.stanford.edu/class/cs229/)
- * [MIT Machine Learning online 
courseware](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/)
  has [lecture 
notes](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/)
 online.
- * As a pre-requisite to probability and statistics, you'll need [basic 
calculus](http://en.wikipedia.org/wiki/Calculus). A maths for scientists text 
might be useful here such as 'Mathematics for Engineers and Scientists', Alan 
Jeffrey, Chapman & Hall/CRC. 
([openlibrary](http://openlibrary.org/books/OL3305993M/Mathematics_for_engineers_and_scientists))
- * One of the best writers in the probability/statistics world is Sheldon 
Ross. Try [A First Course in Probability (8th 
Edition)](http://www.pearsonhighered.com/educator/product/First-Course-in-Probability-A/9780136033134.page)
 and then move on to his [Introduction to Probability 
Models](http://www.amazon.com/Introduction-Probability-Models-Sixth-Sheldon/dp/0125984707)
-
-Some good introductory alternatives here are:
-
- * [Khan Academy](http://www.khanacademy.org/) -- videos on stats, 
probability, linear algebra
- * [Probability and Statistics (7th 
Edition)](http://www.amazon.com/Probability-Statistics-Engineering-Sciences-InfoTrac/dp/0534399339),
 Jay L. Devore, Chapman.
- * [Probability and Statistical Inference (7th 
Edition)](http://www.amazon.com/Probability-Statistical-Inference-Robert-Hogg/dp/0132546086),
 Hogg and Tanis, Pearson.
-
-Once you have a grasp of the basics then there are a slew of great texts that 
you might consult:
-
- * [Statistical 
Inference](http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126),
 Casella and Berger, Duxbury/Thomson Learning.
- * [Introduction to Bayesian 
Statistics](http://www.amazon.com/Introduction-Bayesian-Statistics-William-Bolstad/dp/0471270202),
 William H. Bolstad, Wiley. 
- * [Understanding Computational Bayesian 
Statistics](http://www.amazon.com/Understanding-Computational-Bayesian-Statistics-Wiley/dp/0470046090),
 Bolstad
- * [Bayesian Data Analysis, Gelman et 
al.](http://www.stat.columbia.edu/~gelman/book/)
-
-
-## For statistics related to machine learning, these are particularly helpful:
-
- * [Pattern Recognition and Machine Learning by Chris 
Bishop](http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm)
- * [Elements of Statistical 
Learning](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) by Trevor Hastie, 
Robert Tibshirani, Jerome Friedman 
- 
-
-## For matrix computations/decomposition/factorization etc.:
-
- * Peter V. O'Neil [Introduction to Linear 
Algebra](http://www.amazon.com/Introduction-Linear-Algebra-Theory-Applications/dp/053400606X),
 great book for beginners (with some knowledge in calculus). It is not 
comprehensive, but, it will be a good place to start and the author starts by 
explaining the concepts with regards to vector spaces which I found to be a 
more natural way of explaining.
- * David S. Watkins [Fundamentals of Matrix 
Computations](http://www.amazon.com/Fundamentals-Matrix-Computations-Applied-Mathematics/dp/0470528338/)
- * [Matrix 
Computations](http://www.amazon.com/Computations-Hopkins-Studies-Mathematical-Sciences/dp/0801854148/ref=sr_1_2?s=books&ie=UTF8&qid=1394307676&sr=1-2&keywords=golub+van+loan)
 is the classic text for numerical linear algebra. Can't go wrong with it - 
great for researchers.  
- * Nick Trefethen's [Numerical Linear 
Algebra](http://people.maths.ox.ac.uk/trefethen/books.html).  It's a bit more 
approachable for practitioners. Many chapters on SVD, there are even chapters 
on Lanczos.
-
-
-## Books specifically on R:
-
-* Learning about R is a difficult thing. The best introduction is in MASS 
[http://www.stats.ox.ac.uk/pub/MASS4/](http://www.stats.ox.ac.uk/pub/MASS4/)
-* [R Tutor](http://www.r-tutor.com/r-introduction)
-* [Manual](http://cran.r-project.org/doc/manuals/R-intro.pdf)
-* [R Course](http://faculty.washington.edu/tlumley/Rcourse/)
-
-In addition, you should see how to plot data well:
-
-* [Trellis plotting](http://www.statmethods.net/advgraphs/trellis.html)
-* [ggplot2](http://had.co.nz/ggplot2/)
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md 
b/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md
deleted file mode 100644
index 39f4bfd..0000000
--- 
a/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md
+++ /dev/null
@@ -1,88 +0,0 @@
----
-layout: default
-title: Matrix and Vector Needs
-theme:
-    name: retro-mahout
----
-
-<a name="MatrixandVectorNeeds-Intro"></a>
-# Intro
-
-Most ML algorithms require the ability to represent multidimensional data
-concisely and to be able to easily perform common operations on that data.
-MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality,
-along with a set of common operations on their instances. Vectors and
-matrices are provided with sparse and dense implementations that are memory
-resident and are suitable for manipulating intermediate results within
-mapper, combiner and reducer implementations. They are not intended for
-applications requiring vectors or matrices that exceed the size of a single
-JVM, though such applications might be able to utilize them within a larger
-organizing framework.
-
-<a name="MatrixandVectorNeeds-Background"></a>
-## Background
-
-See 
[http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser](http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser)
-
-<a name="MatrixandVectorNeeds-Vectors"></a>
-## Vectors
-
-Mahout supports a Vector interface that defines the following operations over 
all implementation classes: assign, cardinality, copy, divide, dot, get, 
haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, 
viewPart, zSum and cross. The class DenseVector implements vectors as a 
double[]
- that is storage and access efficient. The class SparseVector implements
-vectors as a HashMap<Integer, Double> that is surprisingly fast and
-efficient. For sparse vectors, the size() method returns the current number
-of elements whereas the cardinality() method returns the number of
-dimensions it holds. An additional VectorView class allows views of an
-underlying vector to be specified by the viewPart() method. See the
-JavaDocs for more complete definitions.
-
-<a name="MatrixandVectorNeeds-Matrices"></a>
-## Matrices
-
-Mahout also supports a Matrix interface that defines a similar set of 
operations over all implementation classes: assign, assignColumn, assignRow, 
cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, 
times, transpose, toArray, viewPart and zSum. The class DenseMatrix implements 
matrices as a double[][]
-that is storage and access efficient. The class SparseRowMatrix
-implements matrices as a Vector[] holding the rows of the matrix in a
-SparseVector, and the symmetric class SparseColumnMatrix implements
-matrices as a Vector[] holding the columns in a SparseVector. Each of these
-classes can quickly produce a given row or column, respectively. A fourth
-class, SparseMatrix, uses a HashMap<Integer, Vector> which is also a
-SparseVector. For sparse matrices, the size() method returns an int\[2\]
-containing the actual row and column sizes whereas the cardinality() method
-returns an int\[2\] with the number of dimensions of each. An additional
-MatrixView class allows views of an underlying matrix to be specified by
-the viewPart() method. See the JavaDocs for more complete definitions.
-
-The Matrix interface does not currently provide invert or determinant
-methods, though these are desirable. It is arguable that the
-implementations of SparseRowMatrix and SparseColumnMatrix ought to use the
-HashMap<Integer, Vector> implementations and that SparseMatrix should
-instead use a HashMap<Integer, HashMap<Integer, Double>>. Other forms of
-sparse matrices can also be envisioned that support different storage and
-access characteristics. Because the arguments of assignColumn and assignRow
-operations accept all forms of Vector, it is possible to construct
-instances of sparse matrices containing dense rows or columns. See the
-JavaDocs for more complete definitions.
-
-For applications like PageRank/TextRank, iterative approaches to calculate
-eigenvectors would also be useful. Batching of row/column operations would
-also be useful, such as perhaps assignRow or assignColumn accepting
-UnaryFunction and BinaryFunction arguments.
-
-
-<a name="MatrixandVectorNeeds-Ideas"></a>
-## Ideas
-
-As Vector and Matrix implementations are currently memory-resident, very
-large instances greater than available memory are not supported. An
-extended set of implementations that use HBase (BigTable) in Hadoop to
-represent their instances would facilitate applications requiring such
-large collections.  
-See [MAHOUT-6](https://issues.apache.org/jira/browse/MAHOUT-6)
-See [Hama](http://wiki.apache.org/hadoop/Hama)
-
-
-<a name="MatrixandVectorNeeds-References"></a>
-## References
-
-Have a look at the old parallel computing libraries like 
[ScaLAPACK](http://www.netlib.org/scalapack/)
-, others
