http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md b/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
deleted file mode 100644
index 14dd276..0000000
--- a/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
+++ /dev/null
@@ -1,291 +0,0 @@
----
-layout: default
-title: Creating Vectors from Text
-theme:
-    name: retro-mahout
----
-
-
-# Creating vectors from text
-<a name="CreatingVectorsfromText-Introduction"></a>
-# Introduction
-
-For clustering and classifying documents it is usually necessary to convert the raw text
-into vectors that can then be consumed by the clustering [Algorithms](algorithms.html). Several ways of doing this are described below.
-
-<a name="CreatingVectorsfromText-FromLucene"></a>
-# From Lucene
-
-*NOTE: Your Lucene index must be created with the same version of Lucene
-used in Mahout. As of Mahout 0.9 this is Lucene 4.6.1. If these versions don't match you will likely get "Exception in thread "main"
-org.apache.lucene.index.CorruptIndexException: Unknown format version: -11"
-as an error.*
-
-Mahout has utilities that allow one to easily produce Mahout Vector
-representations from a Lucene (and Solr, since they are the same) index.
-
-For this, we assume you know how to build a Lucene/Solr index. For those
-who don't, it is probably easiest to get up and running using [Solr](http://lucene.apache.org/solr)
- as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
-index. For those wanting to use just Lucene, see the [Lucene website](http://lucene.apache.org/core)
- or check out _Lucene In Action_ by Erik Hatcher, Otis Gospodnetic and Mike
-McCandless.
- -To get started, make sure you get a fresh copy of Mahout from [GitHub](http://mahout.apache.org/developers/buildingmahout.html) - and are comfortable building it. It defines interfaces and implementations -for efficiently iterating over a data source (it only supports Lucene -currently, but should be extensible to databases, Solr, etc.) and produces -a Mahout Vector file and term dictionary which can then be used for -clustering. The main code for driving this is the driver program located -in the org.apache.mahout.utils.vectors package. The driver program offers -several input options, which can be displayed by specifying the --help -option. Examples of running the driver are included below: - -<a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a> -#### Generating an output file from a Lucene Index - - - $MAHOUT_HOME/bin/mahout lucene.vector - --dir (-d) dir The Lucene directory - --idField idField The field in the index - containing the index. If - null, then the Lucene - internal doc id is used - which is prone to error - if the underlying index - changes - --output (-o) output The output file - --delimiter (-l) delimiter The delimiter for - outputting the dictionary - --help (-h) Print out help - --field (-f) field The field in the index - --max (-m) max The maximum number of - vectors to output. If - not specified, then it - will loop over all docs - --dictOut (-t) dictOut The output of the - dictionary - --seqDictOut (-st) seqDictOut The output of the - dictionary as sequence - file - --norm (-n) norm The norm to use, - expressed as either a - double or "INF" if you - want to use the Infinite - norm. Must be greater or - equal to 0. The default - is not to normalize - --maxDFPercent (-x) maxDFPercent The max percentage of - docs for the DF. Can be - used to remove really - high frequency terms. - Expressed as an integer - between 0 and 100. - Default is 99. - --weight (-w) weight The kind of weight to - use. 
Currently TF or
-                                               TFIDF
-        --minDF (-md) minDF                The minimum document
-                                               frequency. Default is 1
-        --maxPercentErrorDocs (-err) mErr  The max percentage of
-                                               docs that can have a null
-                                               term vector. These are
-                                               noise documents and can
-                                               occur if the analyzer
-                                               used strips out all terms
-                                               in the target field. This
-                                               percentage is expressed
-                                               as a value between 0 and
-                                               1. The default is 0.
-
-#### Example: Create 50 Vectors from an Index
-
-    $MAHOUT_HOME/bin/mahout lucene.vector \
-        --dir $WORK_DIR/wikipedia/solr/data/index \
-        --field body \
-        --dictOut $WORK_DIR/solr/wikipedia/dict.txt \
-        --output $WORK_DIR/solr/wikipedia/out.txt \
-        --max 50
-
-
-This uses the index specified by --dir and the body field in it, and writes
-the vectors to the file given by --output and the dictionary to dict.txt. It only
-outputs 50 vectors. If you don't specify --max, then all the documents in
-the index are output.
-
-<a name="CreatingVectorsfromText-50VectorsFromLuceneL2Norm"></a>
-#### Example: Creating 50 Normalized Vectors from a Lucene Index using the [L_2 Norm](http://en.wikipedia.org/wiki/Lp_space)
-
-    $MAHOUT_HOME/bin/mahout lucene.vector \
-        --dir $WORK_DIR/wikipedia/solr/data/index \
-        --field body \
-        --dictOut $WORK_DIR/solr/wikipedia/dict.txt \
-        --output $WORK_DIR/solr/wikipedia/out.txt \
-        --max 50 \
-        --norm 2
-
-
-<a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a>
-## From A Directory of Text documents
-Mahout has utilities to generate Vectors from a directory of text
-documents. Before creating the vectors, you need to convert the documents
-to SequenceFile format. SequenceFile is a Hadoop class which allows us to
-write arbitrary (key, value) pairs into it. The DocumentVectorizer requires the
-key to be a Text with a unique document id, and the value to be the Text
-content in UTF-8 format.
-
-You may find [Tika](http://tika.apache.org/) helpful in converting
-binary documents to text.
- -<a name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a> -#### Converting directory of documents to SequenceFile format -Mahout has a nifty utility which reads a directory path including its -sub-directories and creates the SequenceFile in a chunked manner for us. - - $MAHOUT_HOME/bin/mahout seqdirectory - --input (-i) input Path to job input directory. - --output (-o) output The directory pathname for - output. - --overwrite (-ow) If present, overwrite the - output directory before - running job - --method (-xm) method The execution method to use: - sequential or mapreduce. - Default is mapreduce - --chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. - Defaults to 64 - --fileFilterClass (-filter) fFilterClass The name of the class to use - for file parsing. Default: - org.apache.mahout.text.PrefixAdditionFilter - --keyPrefix (-prefix) keyPrefix The prefix to be prepended to - the key - --charset (-c) charset The name of the character - encoding of the input files. - Default to UTF-8 {accepts: cp1252|ascii...} - --method (-xm) method The execution method to use: - sequential or mapreduce. - Default is mapreduce - --overwrite (-ow) If present, overwrite the - output directory before - running job - --help (-h) Print out help - --tempDir tempDir Intermediate output directory - --startPhase startPhase First phase to run - --endPhase endPhase Last phase to run - -The output of seqDirectory will be a Sequence file < Text, Text > of all documents (/sub-directory-path/documentFileName, documentText). - -<a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a> -#### Creating Vectors from SequenceFile - -From the sequence file generated from the above step run the following to -generate vectors. - - - $MAHOUT_HOME/bin/mahout seq2sparse - --minSupport (-s) minSupport (Optional) Minimum Support. 
Default - Value: 2 - --analyzerName (-a) analyzerName The class name of the analyzer - --chunkSize (-chunk) chunkSize The chunkSize in MegaBytes. Default - Value: 100MB - --output (-o) output The directory pathname for output. - --input (-i) input Path to job input directory. - --minDF (-md) minDF The minimum document frequency. Default - is 1 - --maxDFSigma (-xs) maxDFSigma What portion of the tf (tf-idf) vectors - to be used, expressed in times the - standard deviation (sigma) of the - document frequencies of these vectors. - Can be used to remove really high - frequency terms. Expressed as a double - value. Good value to be specified is 3.0. - In case the value is less than 0 no - vectors will be filtered out. Default is - -1.0. Overrides maxDFPercent - --maxDFPercent (-x) maxDFPercent The max percentage of docs for the DF. - Can be used to remove really high - frequency terms. Expressed as an integer - between 0 and 100. Default is 99. If - maxDFSigma is also set, it will override - this value. - --weight (-wt) weight The kind of weight to use. Currently TF - or TFIDF. Default: TFIDF - --norm (-n) norm The norm to use, expressed as either a - float or "INF" if you want to use the - Infinite norm. Must be greater or equal - to 0. The default is not to normalize - --minLLR (-ml) minLLR (Optional)The minimum Log Likelihood - Ratio(Float) Default is 1.0 - --numReducers (-nr) numReducers (Optional) Number of reduce tasks. - Default Value: 1 - --maxNGramSize (-ng) ngramSize (Optional) The maximum size of ngrams to - create (2 = bigrams, 3 = trigrams, etc) - Default Value:1 - --overwrite (-ow) If set, overwrite the output directory - --help (-h) Print out help - --sequentialAccessVector (-seq) (Optional) Whether output vectors should - be SequentialAccessVectors. Default is false; - true required for running some algorithms - (LDA,Lanczos) - --namedVector (-nv) (Optional) Whether output vectors should - be NamedVectors. 
If set true else false - --logNormalize (-lnorm) (Optional) Whether output vectors should - be logNormalize. If set true else false - - - -This will create SequenceFiles of tokenized documents < Text, StringTuple > (docID, tokenizedDoc) and vectorized documents < Text, VectorWritable > (docID, TF-IDF Vector). - -As well, seq2sparse will create SequenceFiles for: a dictionary (wordIndex, word), a word frequency count (wordIndex, count) and a document frequency count (wordIndex, DFCount) in the output directory. - -The --minSupport option is the min frequency for the word to be considered as a feature; --minDF is the min number of documents the word needs to be in; --maxDFPercent is the max value of the expression (document frequency of a word/total number of document) to be considered as good feature to be in the document. These options are helpful in removing high frequency features like stop words. - -The vectorized documents can then be used as input to many of Mahout's classification and clustering algorithms. - -#### Example: Creating Normalized [TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) Vectors from a directory of text documents using [trigrams](http://en.wikipedia.org/wiki/N-gram) and the [L_2 Norm](http://en.wikipedia.org/wiki/Lp_space) -Create sequence files from the directory of text documents: - - $MAHOUT_HOME/bin/mahout seqdirectory - -i $WORK_DIR/reuters - -o $WORK_DIR/reuters-seqdir - -c UTF-8 - -chunk 64 - -xm sequential - -Vectorize the documents using trigrams, L_2 length normalization and a maximum document frequency cutoff of 85%. 
-
-    $MAHOUT_HOME/bin/mahout seq2sparse \
-        -i $WORK_DIR/reuters-seqdir/ \
-        -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans \
-        --namedVector \
-        -wt tfidf \
-        -ng 3 \
-        -n 2 \
-        --maxDFPercent 85
-
-The sequence file in the $WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors directory can now be used as input to the Mahout [k-Means](http://mahout.apache.org/users/clustering/k-means-clustering.html) clustering algorithm.
-
-<a name="CreatingVectorsfromText-Background"></a>
-## Background
-
-* [Discussion on centroid calculations with sparse vectors](http://markmail.org/thread/l5zi3yk446goll3o)
-
-<a name="CreatingVectorsfromText-ConvertingexistingvectorstoMahout'sformat"></a>
-## Converting existing vectors to Mahout's format
-
-If you already have a processing pipeline for your documents (texts,
-images or whatever items you wish to treat) that produces vectors, the
-question arises of how to convert those vectors into the Mahout vector
-format. Probably the easiest way to go would be to implement your own
-Iterable<Vector> (called VectorIterable in the example below) and then
-reuse the existing VectorWriter classes:
-
-
-    SequenceFile.Writer seqWriter = SequenceFile.createWriter(filesystem,
-                                                              configuration,
-                                                              outfile,
-                                                              LongWritable.class,
-                                                              VectorWritable.class);
-    VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
-    long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
-
http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/creating-vectors.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_priority/creating-vectors.md b/website/old_site_migration/needs_work_priority/creating-vectors.md deleted file mode 100644 index 10cbd8e..0000000 --- a/website/old_site_migration/needs_work_priority/creating-vectors.md +++ /dev/null @@ -1,16 +0,0 @@ ---- -layout: default -title: Creating Vectors -theme: - name: retro-mahout ---- - - -<a name="CreatingVectors-UtilitiesforCreatingVectors"></a> -# Utilities for Creating Vectors - -1. [Text](creating-vectors-from-text.html) ... utilities to turn plain text into Mahout vectors. - -1. Mahout also has rudimentary support for the arff file format. See [arff junit doc](https://builds.apache.org/job/Mahout-Quality/ws/trunk/integration/target/site/apidocs/org/apache/mahout/utils/vectors/arff/package-summary.html). - -1. There is also support for reading vectors from [csv files](https://builds.apache.org/job/Mahout-Quality/ws/trunk/integration/target/site/apidocs/org/apache/mahout/utils/vectors/csv/package-summary.html). 
http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md b/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
deleted file mode 100644
index 2a157f6..0000000
--- a/website/old_site_migration/needs_work_priority/dim-reduction/dimensional-reduction.md
+++ /dev/null
@@ -1,446 +0,0 @@
----
-layout: default
-title: Dimensional Reduction
-theme:
-    name: retro-mahout
----
-
-# Support for dimensional reduction
-
-Matrix algebra underpins the way many Big Data algorithms and data
-structures are composed: full-text search can be viewed as doing matrix
-multiplication of the term-document matrix by the query vector (giving a
-vector over documents where the components are the relevance scores),
-computing co-occurrences in a collaborative filtering context (people who
-viewed X also viewed Y, or ratings-based CF like the Netflix Prize contest)
-amounts to squaring the user-item interaction matrix, users
-who are k degrees separated from each other in a social network or
-web-graph can be found by looking at the k-fold product of the graph
-adjacency matrix, and the list goes on (and these are all cases where the
-linear structure of the matrix is preserved!).
-
-Each of these examples deals with matrices which tend to be
-tremendously large (often millions to tens of millions to hundreds of
-millions of rows or more, by sometimes a comparable number of columns), but
-also rather sparse. Sparse matrices are nice in some respects: dense
-matrices which are 10^7 on a side would have 100 trillion entries!
-But the sparsity is often problematic, because any given two rows (or
-columns) of the matrix may have zero overlap.
Additionally, any -machine-learning work done on the data which comprises the rows has to deal -with what is known as "the curse of dimensionality", and for example, there -are too many columns to train most regression or classification problems on -them independently. - -One of the more useful approaches to dealing with such huge sparse data -sets is the concept of dimensionality reduction, where a lower dimensional -space of the original column (feature) space of your data is found / -constructed, and your rows are mapped into that subspace (or sub-manifold). - In this reduced dimensional space, "important" components to distance -between points are exaggerated, and unimportant ones washed away, and -additionally, sparsity of your rows is traded for drastically reduced -dimensional, but dense "signatures". While this loss of sparsity can lead -to its own complications, a proper dimensionality reduction can help reveal -the most important features of your data, expose correlations among your -supposedly independent original variables, and smooth over the zeroes in -your correlation matrix. - -One of the most straightforward techniques for dimensionality reduction is -the matrix decomposition: singular value decomposition, eigen -decomposition, non-negative matrix factorization, etc. In their truncated -form these decompositions are an excellent first approach toward linearity -preserving unsupervised feature selection and dimensional reduction. Of -course, sparse matrices which don't fit in RAM need special treatment as -far as decomposition is concerned. Parallelizable and/or stream-oriented -algorithms are needed. 
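As a worked statement of the truncated decomposition mentioned above, here is the rank-k SVD in standard notation. The symbols m, n, k, U_k, Sigma_k and V_k are introduced here for illustration only; they are not names used elsewhere on this page.

```latex
A \in \mathbb{R}^{m \times n}, \qquad
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k = \operatorname{diag}(\sigma_1 \ge \dots \ge \sigma_k),\;
V_k \in \mathbb{R}^{n \times k}

\text{reduced row: } \tilde{a}_i = a_i V_k \in \mathbb{R}^{1 \times k}
\quad \text{for each sparse row } a_i \in \mathbb{R}^{1 \times n}
```

Each sparse n-dimensional row becomes a dense k-dimensional "signature", which is the trade of sparsity for reduced dimension described above. Among all rank-k matrices, A_k is the closest to A in the Frobenius norm (Eckart-Young theorem), which is why the truncated form is a reasonable first approach.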
-
-<a name="DimensionalReduction-SingularValueDecomposition"></a>
-# Singular Value Decomposition
-
-Currently implemented in Mahout (as of 0.3, the first release with MAHOUT-180 applied) are two scalable implementations of SVD: a stream-oriented implementation using the Asymmetric Generalized Hebbian Algorithm outlined in Genevieve Gorrell & Brandyn Webb's paper ([Gorrell and Webb 2005](http://www.dcs.shef.ac.uk/~genevieve/gorrell_webb.pdf)),
-and a [Lanczos](http://en.wikipedia.org/wiki/Lanczos_algorithm)
-implementation, available both single-threaded, in the
-o.a.m.math.decomposer.lanczos package (math module), and as a Hadoop map-reduce
-(series of) job(s) in the o.a.m.math.hadoop.decomposer package (core module).
-Coming soon: stochastic decomposition.
-
-See also: [SVD - Singular Value Decomposition](https://cwiki.apache.org/confluence/display/MAHOUT/SVD+-+Singular+Value+Decomposition)
-
-<a name="DimensionalReduction-Lanczos"></a>
-## Lanczos
-
-The Lanczos algorithm is designed for eigen-decomposition, but like any
-such algorithm, getting singular vectors out of it is immediate (singular
-vectors of matrix A are just the eigenvectors of A^t * A or A * A^t).
-Lanczos works by taking a starting seed vector *v* (with cardinality equal
-to the number of columns of the matrix A), and repeatedly multiplying A by
-the result: *v'* = A.times(*v*) (and then subtracting off what is
-proportional to previous *v'*'s, and building up an auxiliary matrix of
-projections). In the case where A is not square (in general: not
-symmetric), you actually want to repeatedly multiply A^t * A by *v*:
-*v'* = (A^t * A).times(*v*), or equivalently, in Mahout,
-A.timesSquared(*v*) (timesSquared is merely an optimization: by changing
-the order of summation in (A^t * A).times(*v*), you can do the same computation
-as one pass over the rows of A instead of two).
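The "one pass instead of two" remark can be illustrated with a small plain-Java sketch. It assumes timesSquared(v) means A^t * (A * v), which is the reading consistent with v having one entry per column of A. The class and method names below are hypothetical and use plain arrays rather than Mahout's Matrix and Vector types.

```java
import java.util.Arrays;

/** Plain-Java illustration only; this is not Mahout's implementation. */
final class TimesSquaredSketch {

    /** Two passes over the rows of A: first w = A v, then A^t w. */
    static double[] twoPass(double[][] a, double[] v) {
        int rows = a.length;
        int cols = v.length;
        double[] w = new double[rows];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                w[i] += a[i][j] * v[j];
            }
        }
        double[] out = new double[cols];
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                out[j] += a[i][j] * w[i];
            }
        }
        return out;
    }

    /** One pass: for each row, accumulate (row . v) * row into the result. */
    static double[] onePass(double[][] a, double[] v) {
        double[] out = new double[v.length];
        for (double[] row : a) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += row[j] * v[j];
            }
            for (int j = 0; j < v.length; j++) {
                out[j] += dot * row[j];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}, {5, 6}};
        double[] v = {1, 1};
        System.out.println(Arrays.toString(twoPass(a, v))); // [79.0, 100.0]
        System.out.println(Arrays.toString(onePass(a, v))); // [79.0, 100.0]
    }
}
```

Both methods compute the same vector; the one-pass form only ever needs a single row of A at a time, which is what makes it a good fit for row-oriented storage.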
-
-After *k* iterations of *v_i* = A.timesSquared(*v_(i-1)*), a *k*-by-*k*
-tridiagonal matrix has been created (the auxiliary matrix mentioned above),
-out of which a good (often extremely good) approximation to *k* of the
-singular values of A may be efficiently extracted (and, with the basis
-spanned by the *v_i*, the corresponding *k* singular *vectors* as well). Which
-*k*? It's actually a spread across the entire spectrum: the first few will
-almost certainly be the largest singular values, and the bottom few will be
-the smallest, but you have no guarantee that just because you have the n'th
-largest singular value of A, you also have the (n-1)'st as well. A
-good rule of thumb is to try to extract the top 3k singular vectors
-via Lanczos, and then discard the bottom two thirds, if you want primarily
-the k largest singular values (which is the case when using Lanczos for
-dimensional reduction).
-
-<a name="DimensionalReduction-ParallelizationStragegy"></a>
-### Parallelization Strategy
-
-Lanczos is "embarrassingly parallelizable": multiplication of a
-matrix by a vector may be carried out row-at-a-time without communication
-until the end, when the results of the intermediate matrix-by-vector outputs
-are accumulated on one final vector. When it's truly A.times(*v*), the
-final accumulation doesn't even have collision / synchronization issues
-(the outputs are individual separate entries on a single vector), and
-multicore approaches can be very fast, and there should also be tricks to
-speed things up on Hadoop. In the asymmetric case, where the operation is
-A.timesSquared(*v*), the accumulation does require synchronization (the
-vectors to be summed have nonzero elements all across their range), but
-by delaying writing to disk until Mapper close(), and by using the
-Reducer as the Combiner, the bottleneck in accumulation is
-nowhere near a single point.
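The two accumulation patterns described above can be sketched with Java parallel streams. This is a minimal single-machine illustration, not the Hadoop job: the class name, the method names and the chunkSize parameter are hypothetical, a "chunk" stands in for the rows seen by one Mapper, and the pairwise sum stands in for the Combiner/Reducer.

```java
import java.util.stream.IntStream;

/** Single-machine sketch of the row-at-a-time strategy; not Mahout code. */
final class ParallelTimesSquaredSketch {

    /** A.times(v): every row writes only its own output slot, so no synchronization is needed. */
    static double[] times(double[][] a, double[] v) {
        double[] out = new double[a.length];
        IntStream.range(0, a.length).parallel().forEach(i -> {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += a[i][j] * v[j];
            }
            out[i] = dot;
        });
        return out;
    }

    /** timesSquared: each chunk builds one partial vector, then the partials are summed pairwise. */
    static double[] timesSquared(double[][] a, double[] v, int chunkSize) {
        int chunks = (a.length + chunkSize - 1) / chunkSize;
        return IntStream.range(0, chunks).parallel()
            .mapToObj(c -> {
                // "Mapper": hold the partial sum in memory until the chunk is finished.
                double[] partial = new double[v.length];
                int end = Math.min(a.length, (c + 1) * chunkSize);
                for (int i = c * chunkSize; i < end; i++) {
                    double dot = 0.0;
                    for (int j = 0; j < v.length; j++) {
                        dot += a[i][j] * v[j];
                    }
                    for (int j = 0; j < v.length; j++) {
                        partial[j] += dot * a[i][j];
                    }
                }
                return partial;
            })
            // "Combiner/Reducer": the same associative vector sum at every level.
            .reduce(new double[v.length], (x, y) -> {
                double[] sum = new double[x.length];
                for (int j = 0; j < sum.length; j++) {
                    sum[j] = x[j] + y[j];
                }
                return sum;
            });
    }
}
```

Because vector addition is associative, the same summing function can be applied at every level of the reduction, which is why using the Reducer as the Combiner works in the map-reduce setting.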
-
-<a name="DimensionalReduction-Mahoutusage"></a>
-### Mahout usage
-
-The Mahout DistributedLanczosSolver is invoked by the
-<MAHOUT_HOME>/bin/mahout svd command. This command takes the following
-arguments (which can be reproduced by just entering the command with no
-arguments):
-
-
-    Job-Specific Options:
-      --input (-i) input                      Path to job input directory.
-      --output (-o) output                    The directory pathname for output.
-      --numRows (-nr) numRows                 Number of rows of the input matrix
-      --numCols (-nc) numCols                 Number of columns of the input matrix
-      --rank (-r) rank                        Desired decomposition rank (note:
-                                              only roughly 1/4 to 1/3 of these will
-                                              have the top portion of the spectrum)
-      --symmetric (-sym) symmetric            Is the input matrix square and
-                                              symmetric?
-      --cleansvd (-cl) cleansvd               Run the EigenVerificationJob to clean
-                                              the eigenvectors after SVD
-      --maxError (-err) maxError              Maximum acceptable error
-      --minEigenvalue (-mev) minEigenvalue    Minimum eigenvalue to keep the vector for
-      --inMemory (-mem) inMemory              Buffer eigen matrix into memory (if you have enough!)
-      --help (-h)                             Print out help
-      --tempDir tempDir                       Intermediate output directory
-      --startPhase startPhase                 First phase to run
-      --endPhase endPhase                     Last phase to run
-
-
-The short form invocation may be used to perform the SVD on the input data:
-
-    <MAHOUT_HOME>/bin/mahout svd \
-      --input (-i) <Path to input matrix> \
-      --output (-o) <The directory pathname for output> \
-      --numRows (-nr) <Number of rows of the input matrix> \
-      --numCols (-nc) <Number of columns of the input matrix> \
-      --rank (-r) <Desired decomposition rank> \
-      --symmetric (-sym) <Is the input matrix square and symmetric>
-
-
-The --input argument is the location on HDFS of the
-SequenceFile<Writable,VectorWritable> (preferably holding
-SequentialAccessSparseVector instances) which you wish to decompose.
-Each of its vectors has --numCols entries. --numRows is the number of
-input rows and is used to properly size the matrix data structures.
-
-After execution, the --output directory will have a file named
-"rawEigenvectors" containing the raw eigenvectors. The
-DistributedLanczosSolver sometimes produces "extra" eigenvectors whose
-eigenvalues aren't valid, and it also scales all of the eigenvalues down by
-the max eigenvalue (to avoid floating point overflow), so there is an
-additional step which outputs the correctly scaled (and
-non-spurious) eigenvector/value pairs. This is done by the "cleansvd" shell
-script step (cf. EigenVerificationJob).
-
-If you have run the short form svd invocation above and require this
-"cleaning" of the eigen/singular output you can run "cleansvd" as a
-separate command:
-
-    <MAHOUT_HOME>/bin/mahout cleansvd \
-      --eigenInput <path to raw eigenvectors> \
-      --corpusInput <path to corpus> \
-      --output <path to output directory> \
-      --maxError <maximum allowed error. Default is 0.5> \
-      --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \
-      --inMemory <true if the eigenvectors can all fit into memory. Default false>
-
-
-The --corpusInput is the input path from the previous step, --eigenInput is
-the output from the previous step (<output>/rawEigenvectors), and --output
-is the desired output path (same as svd argument). The two "cleaning"
-params are --maxError, the maximum allowed 1-cosAngle(v,
-A.timesSquared(v)), and --minEigenvalue. Eigenvectors which have too large
-an error or too small an eigenvalue are discarded. Optional argument:
---inMemory. If you have enough memory on your local machine (not on the
-hadoop cluster nodes!) to load all eigenvectors into memory at once (at
-least 8 bytes/double * rank * numCols), then you will see some speedups on
-this cleaning process.
-
-After execution, the --output directory will have a file named
-"cleanEigenvectors" containing the clean eigenvectors.
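The cleaning criterion described above, 1-cosAngle(v, A.timesSquared(v)) compared against --maxError, can be sketched in plain Java as follows. This is only an illustration of the test as stated in the text, not the EigenVerificationJob itself: the class and method names are hypothetical, timesSquared is assumed to mean A^t * (A * v), and the eigenvalue estimate |A^t A v| / |v| is an assumption of this sketch.

```java
/** Illustration of the cleansvd acceptance test; not Mahout's EigenVerificationJob. */
final class EigenCheckSketch {

    /** Computes A^t (A v) in one pass over the rows of A. */
    static double[] timesSquared(double[][] a, double[] v) {
        double[] out = new double[v.length];
        for (double[] row : a) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += row[j] * v[j];
            }
            for (int j = 0; j < v.length; j++) {
                out[j] += dot * row[j];
            }
        }
        return out;
    }

    /** Euclidean (L_2) norm. */
    static double norm(double[] x) {
        double sum = 0.0;
        for (double xi : x) {
            sum += xi * xi;
        }
        return Math.sqrt(sum);
    }

    /** 1 - cos(angle between v and A^t A v); 0 for an exact eigenvector of A^t A. */
    static double error(double[][] a, double[] v) {
        double[] w = timesSquared(a, v);
        double dot = 0.0;
        for (int j = 0; j < v.length; j++) {
            dot += v[j] * w[j];
        }
        return 1.0 - dot / (norm(v) * norm(w));
    }

    /** Eigenvalue estimate |A^t A v| / |v| (an assumption of this sketch). */
    static double eigenvalue(double[][] a, double[] v) {
        return norm(timesSquared(a, v)) / norm(v);
    }

    /** Keep the vector only if both thresholds pass; a NaN error (zero vector) is rejected. */
    static boolean keep(double[][] a, double[] v, double maxError, double minEigenvalue) {
        return error(a, v) < maxError && eigenvalue(a, v) > minEigenvalue;
    }
}
```

For example, with A = diag(2, 1) the vector (1, 0) is an exact eigenvector of A^t A and has error 0, while (1, 1) has an error of about 0.14 and would be discarded for any --maxError below that value.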
- -These two steps can also be invoked together by the svd command by using -the long form svd invocation: - - <MAHOUT_HOME>/bin/mahout svd \ - --input (-i) <Path to input matrix> \ - --output (-o) <The directory pathname for output> \ - --numRows (-nr) <Number of rows of the input matrix> \ - --numCols (-nc) <Number of columns of the input matrix> \ - --rank (-r) <Desired decomposition rank> \ - --symmetric (-sym) <Is the input matrix square and symmetric> \ - --cleansvd "true" \ - --maxError <maximum allowed error. Default is 0.5> \ - --minEigenvalue <minimum allowed eigenvalue. Default is 0.0> \ - --inMemory <true if the eigenvectors can all fit into memory. Default false> - - -After execution, the --output directory will contain two files: the -"rawEigenvectors" and the "cleanEigenvectors". - -TODO: also allow exclusion based on improper orthogonality (currently -computed, but not checked against constraints). - -<a name="DimensionalReduction-Example:SVDofASFMailArchivesonAmazonElasticMapReduce"></a> -#### Example: SVD of ASF Mail Archives on Amazon Elastic MapReduce - -This section walks you through a complete example of running the Mahout SVD -job on Amazon Elastic MapReduce cluster and then preparing the output to be -used for clustering. This example was developed as part of the effort to -benchmark Mahout's clustering algorithms using a large document set (see [MAHOUT-588](https://issues.apache.org/jira/browse/MAHOUT-588) -). Specifically, we use the ASF mail archives located at -http://aws.amazon.com/datasets/7791434387204566. You will need to likely -run seq2sparse on these first. See -$MAHOUT_HOME/examples/bin/build-asf-email.sh (on trunk) for examples of -processing this data. - -At a high level, the steps we're going to perform are: - -bin/mahout svd (original -> svdOut) -bin/mahout cleansvd ... 
-bin/mahout transpose svdOut -> svdT -bin/mahout transpose original -> originalT -bin/mahout matrixmult originalT svdT -> newMatrix -bin/mahout kmeans newMatrix - -The bulk of the content for this section was extracted from the Mahout user -mailing list, see: [Using SVD with Canopy/KMeans](http://search.lucidimagination.com/search/document/6e5889ee6f0f253b/using_svd_with_canopy_kmeans#66a50fe017cebbe8) - and [Need a little help with using SVD](http://search.lucidimagination.com/search/document/748181681ae5238b/need_a_little_help_with_using_svd#134fb2771fd52928) - -Note: Some of this work is due in part to credits donated by the Amazon -Elastic MapReduce team. - -<a name="DimensionalReduction-1.LaunchEMRCluster"></a> -##### 1. Launch EMR Cluster - -For a detailed explanation of the steps involved in launching an Amazon -Elastic MapReduce cluster for running Mahout jobs, please read the -"Building Vectors for Large Document Sets" section of [Mahout on Elastic MapReduce](https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+on+Elastic+MapReduce) -. - -In the remaining steps below, remember to replace JOB_ID with the Job ID of -your EMR cluster. - -<a name="DimensionalReduction-2.LoadMahout0.5+JARintoS3"></a> -##### 2. Load Mahout 0.5+ JAR into S3 - -These steps were created with the mahout-0.5-SNAPSHOT because they rely on -the patch for [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639) - -<a name="DimensionalReduction-3.CopyTFIDFVectorsintoHDFS"></a> -##### 3. Copy TFIDF Vectors into HDFS - -Before running your SVD job on the vectors, you need to copy them from S3 -to your EMR cluster's HDFS. - - - elastic-mapreduce --jar s3://elasticmapreduce/samples/distcp/distcp.jar \ - --arg s3n://ACCESS_KEY:SECRET_KEY@asf-mail-archives/mahout-0.4/sparse-1-gram-stem/tfidf-vectors\ - --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors \ - -j JOB_ID - - -<a name="DimensionalReduction-4.RuntheSVDJob"></a> -##### 4. 
Run the SVD Job - -Now you're ready to run the SVD job on the vectors stored in HDFS: - - - elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \ - --main-class org.apache.mahout.driver.MahoutDriver \ - --arg svd \ - --arg -i --arg /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors\ - --arg -o --arg /asf-mail-archives/mahout/svd \ - --arg --rank --arg 100 \ - --arg --numCols --arg 20444 \ - --arg --numRows --arg 6076937 \ - --arg --cleansvd --arg "true" \ - -j JOB_ID - - -This will run 100 iterations of the LanczosSolver SVD job to produce 87 -eigenvectors in: - - - /asf-mail-archives/mahout/svd/cleanEigenvectors - - -Only 87 eigenvectors were produced because of the cleanup step, which -removes any duplicate eigenvectors caused by convergence issues and numeric -overflow and any that don't appear to be "eigen" enough (ie, they don't -satisfy the eigenvector criterion with high enough fidelity). - Jake Mannix - -<a name="DimensionalReduction-5.TransformyourTFIDFVectorsintoMahoutMatrix"></a> -##### 5. Transform your TFIDF Vectors into Mahout Matrix - -The tfidf vectors created by the seq2sparse job are -SequenceFile<Text,VectorWritable>. The Mahout RowId job transforms these -vectors into a matrix form that is a -SequenceFile<IntWritable,VectorWritable> and a -SequenceFile<IntWritable,Text> (where the original one is the join of these -new ones, on the new int key). - - - elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \ - --main-class org.apache.mahout.driver.MahoutDriver \ - --arg rowid \ - --arg --Dmapred.input.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-vectors -\ - --arg --Dmapred.output.dir=/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix -\ - -j JOB_ID - - -This is not a distributed job and will only run on the master server in -your EMR cluster. 
The job produces the following output:
-
-
-    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/docIndex
-    /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix
-
-
-where docIndex is the SequenceFile<IntWritable,Text> and matrix is the
-SequenceFile<IntWritable,VectorWritable>.
-
-<a name="DimensionalReduction-6.TransposetheMatrix"></a>
-##### 6. Transpose the Matrix
-
-Our ultimate goal is to multiply the TFIDF vector matrix times our SVD
-eigenvectors. For the mathematically inclined, from the rowid job, we now
-have an m x n matrix T (m=6076937, n=20444). The SVD eigenvector matrix E
-is p x n (p=87, n=20444). So to multiply these two matrices, we need to
-transpose E so that the number of columns in T equals the number of rows in
-the transposed matrix (i.e. E^T is n x p); the result of the matrixmult is then an m x p
-matrix (m=6076937, p=87).
-
-However, in practice, computing the matrix product of two matrices as a
-map-reduce job is efficiently done as a map-side join on two row-based
-matrices with the same number of rows, where only the numbers of columns
-differ. In particular, if you take a matrix X which is represented
-as a set of numRowsX rows, each of which has numColsX entries, and another matrix
-Y with numRowsY == numRowsX rows, each of which has numColsY (!= numColsX) entries, then
-by summing the outer-products of each of the numRowsX pairs of vectors, you
-get a matrix Z with numRowsZ == numColsX, and numColsZ == numColsY (if you
-instead take the reverse outer product of the vector pairs, you can end up
-with the transpose of this final result, with numRowsZ == numColsY, and
-numColsZ == numColsX).
- Jake Mannix - -Thus, we need to transpose the matrix using Mahout's Transpose Job: - - - elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \ - --main-class org.apache.mahout.driver.MahoutDriver \ - --arg transpose \ - --arg -i --arg -/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/matrix \ - --arg --numRows --arg 6076937 \ - --arg --numCols --arg 20444 \ - --arg --tempDir --arg -/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \ - -j JOB_ID - - -This job requires the patch to [MAHOUT-639](https://issues.apache.org/jira/browse/MAHOUT-639) - -The job creates the following output: - - - /asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose - - -<a name="DimensionalReduction-7.TransposeEigenvectors"></a> -##### 7. Transpose Eigenvectors - -If you followed Jake's explanation in step 6 above, then you know that we -also need to transpose the eigenvectors: - - - elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \ - --main-class org.apache.mahout.driver.MahoutDriver \ - --arg transpose \ - --arg -i --arg /asf-mail-archives/mahout/svd/cleanEigenvectors \ - --arg --numRows --arg 87 \ - --arg --numCols --arg 20444 \ - --arg --tempDir --arg /asf-mail-archives/mahout/svd/transpose \ - -j JOB_ID - - -Note: You need to use the same number of reducers that was used for -transposing the matrix you are multiplying the vectors with. - -The job creates the following output: - - - /asf-mail-archives/mahout/svd/transpose - - -<a name="DimensionalReduction-8.MatrixMultiplication"></a> -##### 8. 
Matrix Multiplication - -Lastly, we need to multiply the transposed vectors using Mahout's -matrixmult job: - - - elastic-mapreduce --jar s3://BUCKET/mahout-examples-0.5-SNAPSHOT-job.jar \ - --main-class org.apache.mahout.driver.MahoutDriver \ - --arg matrixmult \ - --arg --numRowsA --arg 20444 \ - --arg --numColsA --arg 6076937 \ - --arg --numRowsB --arg 20444 \ - --arg --numColsB --arg 87 \ - --arg --inputPathA --arg -/asf-mail-archives/mahout/sparse-1-gram-stem/tfidf-matrix/transpose \ - --arg --inputPathB --arg /asf-mail-archives/mahout/svd/transpose \ - -j JOB_ID - - -This job produces output such as: - - - /user/hadoop/productWith-189 - - -<a name="DimensionalReduction-Resources"></a> -# Resources - -* [LSA tutorial](http://www.dcs.shef.ac.uk/~genevieve/lsa_tutorial.htm) -* [SVD tutorial](http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html) http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md deleted file mode 100644 index 50ff7be..0000000 --- a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.md +++ /dev/null @@ -1,127 +0,0 @@ ---- -layout: default -title: Stochastic SVD -theme: - name: retro-mahout ---- - -# Stochastic Singular Value Decomposition # - -Stochastic SVD method in Mahout produces reduced rank Singular Value Decomposition output in its -strict mathematical definition: ` \(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`. 
- -## The benefits over other methods are: - - - Fewer flops required than Krylov subspace methods. - - - In the map-reduce world, a fixed number of MR iterations is required regardless of the rank requested. - - - The precision/speed balance can be tuned with options. - - - A is a Distributed Row Matrix where rows may be identified by any Writable (such as a document path). As such, it would work directly on the output of seq2sparse. - - - As of 0.7 trunk, includes a PCA and dimensionality reduction workflow (EXPERIMENTAL! Feedback on performance, other PCA-related issues, and blogs is greatly appreciated.) - -### Map-Reduce characteristics: -SSVD uses at most 3 sequential MR steps (map-only + map-reduce + 2 optional parallel map-reduce jobs) to produce a reduced-rank approximation of the U, V and S matrices. Additionally, two more map-reduce steps are added for each power iteration requested. - -## Potential drawbacks: - -Potentially less precise (but adding even one power iteration seems to improve that quite a bit). - -## Documentation - -[Overview and Usage][3] - -Note: Please use 0.6 or later! For the PCA workflow, please use 0.7 or later. - -## Publications - -[Nathan Halko's dissertation][1] "Randomized methods for computing low-rank -approximations of matrices" contains a comprehensive definition of the parallelization strategy taken in the Mahout SSVD implementation and also some precision/scalability benchmarks, esp. w.r.t. the Mahout Lanczos implementation on a typical corpus data set. - -The [Halko, Martinsson, Tropp] paper discusses a family of random projection-based algorithms and contains theoretical error estimates. - -**R simulation** - -[Non-parallel SSVD simulation in R][2] with power iterations and PCA options. Note that this implementation is not optimal for a sequential flow solver; it is for demonstration purposes only.
- -However, try this R code to simulate a meaningful input: - - - -**tests.R** - - - - n<-1000 - m<-2000 - k<-10 - - qi<-1 - - #simulated input - svalsim<-diag(k:1) - - usim<- qr.Q(qr(matrix(rnorm(m*k, mean=3), nrow=m,ncol=k))) - vsim<- qr.Q(qr( matrix(rnorm(n*k,mean=5), nrow=n,ncol=k))) - - - x<- usim %*% svalsim %*% t(vsim) - - -Then compare the performance of ssvd.svd(x, k) and stock svd(x) for the same rank k, and notice the difference in running time. Also experiment with power iterations (the qiter argument) and compare the accuracy of standard svd and SSVD. - -Note: the numerical stability of the R algorithms may differ from that of Mahout's distributed version. We haven't studied the accuracy of the R simulation. For a study of the accuracy of Mahout's version, please refer to Nathan Halko's dissertation referenced above. - - - [1]: http://amath.colorado.edu/faculty/martinss/Pubs/2012_halko_dissertation.pdf - [2]: ssvd.page/ssvd.R - [3]: ssvd.page/SSVD-CLI.pdf - - -#### Modified SSVD Algorithm. - -Given an `\(m\times n\)` -matrix `\(\mathbf{A}\)`, a target rank `\(k\in\mathbb{N}_{1}\)` -, an oversampling parameter `\(p\in\mathbb{N}_{1}\)`, -and the number of additional power iterations `\(q\in\mathbb{N}_{0}\)`, -this procedure computes an `\(m\times\left(k+p\right)\)` -SVD `\(\mathbf{A\approx U}\boldsymbol{\Sigma}\mathbf{V}^{\top}\)`: - - 1. Create a seed for the random `\(n\times\left(k+p\right)\)` - matrix `\(\boldsymbol{\Omega}\)`. The seed defines matrix `\(\boldsymbol{\Omega}\)` - using Gaussian unit vectors, following one of the suggestions in [Halko, Martinsson, Tropp]. - - 2. `\(\mathbf{Y=A\boldsymbol{\Omega}},\,\mathbf{Y}\in\mathbb{R}^{m\times\left(k+p\right)}\)` - - - 3. Column-orthonormalize `\(\mathbf{Y}\rightarrow\mathbf{Q}\)` - by computing the thin decomposition `\(\mathbf{Y}=\mathbf{Q}\mathbf{R}\)`. - Also, `\(\mathbf{Q}\in\mathbb{R}^{m\times\left(k+p\right)},\,\mathbf{R}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`.
- I denote this as `\(\mathbf{Q}=\mbox{qr}\left(\mathbf{Y}\right).\mathbf{Q}\)` - - - 4. `\(\mathbf{B}_{0}=\mathbf{Q}^{\top}\mathbf{A}:\,\,\mathbf{B}_{0}\in\mathbb{R}^{\left(k+p\right)\times n}\)`. - - 5. If `\(q>0\)` - repeat: for `\(i=1..q\)`: - `\(\mathbf{B}_{i}^{\top}=\mathbf{A}^{\top}\mbox{qr}\left(\mathbf{A}\mathbf{B}_{i-1}^{\top}\right).\mathbf{Q}\)` - (power iterations step). - - 6. Compute the eigensolution of the small Hermitian matrix `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}=\mathbf{\hat{U}}\boldsymbol{\Lambda}\mathbf{\hat{U}}^{\top}\)`, - `\(\mathbf{B}_{q}\mathbf{B}_{q}^{\top}\in\mathbb{R}^{\left(k+p\right)\times\left(k+p\right)}\)`. - - - 7. Singular values `\(\mathbf{\boldsymbol{\Sigma}}=\boldsymbol{\Lambda}^{0.5}\)`, - or, in other words, `\(s_{i}=\sqrt{\lambda_{i}}\)`, where `\(\lambda_{i}\)` are the diagonal entries of `\(\boldsymbol{\Lambda}\)`. - - - 8. If needed, compute `\(\mathbf{U}=\mathbf{Q}\hat{\mathbf{U}}\)`. - - - 9. If needed, compute `\(\mathbf{V}=\mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}\)`. -Another way is `\(\mathbf{V}=\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\Sigma}^{-1}\)`.
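
The reasoning behind steps 6 through 9 can be sketched as a short derivation. It is only a sketch under stated assumptions: `\(\mathbf{B}_{q}\)` is assumed to have full row rank so that `\(\boldsymbol{\Sigma}\)` is invertible, and two symbols are introduced here purely for illustration: `\(\lambda_{i}\)` for the diagonal entries of `\(\boldsymbol{\Lambda}\)`, and `\(\mathbf{Q}_{q}\)` for the orthonormal basis produced by the last QR step, so that `\(\mathbf{B}_{q}=\mathbf{Q}_{q}^{\top}\mathbf{A}\)` (for `\(q=0\)` this is simply the `\(\mathbf{Q}\)` of step 3).

```latex
\begin{align*}
% Thin SVD of the small matrix B_q (assumed full row rank)
\mathbf{B}_{q} &= \hat{\mathbf{U}}\boldsymbol{\Sigma}\mathbf{V}^{\top} \\
% Steps 6 and 7: the Gram matrix shares the left singular vectors of B_q
\mathbf{B}_{q}\mathbf{B}_{q}^{\top}
  &= \hat{\mathbf{U}}\boldsymbol{\Sigma}\mathbf{V}^{\top}\mathbf{V}\boldsymbol{\Sigma}\hat{\mathbf{U}}^{\top}
   = \hat{\mathbf{U}}\boldsymbol{\Sigma}^{2}\hat{\mathbf{U}}^{\top}
  \;\Longrightarrow\;
  \boldsymbol{\Lambda}=\boldsymbol{\Sigma}^{2},\quad s_{i}=\sqrt{\lambda_{i}} \\
% Step 9: recover V from B_q
\mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}
  &= \mathbf{V}\boldsymbol{\Sigma}\hat{\mathbf{U}}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}
   = \mathbf{V} \\
% Step 8: project back through Q_q to approximate A
\mathbf{A} \approx \mathbf{Q}_{q}\mathbf{Q}_{q}^{\top}\mathbf{A}
  &= \mathbf{Q}_{q}\mathbf{B}_{q}
   = \left(\mathbf{Q}_{q}\hat{\mathbf{U}}\right)\boldsymbol{\Sigma}\mathbf{V}^{\top}
   = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^{\top} \\
% Step 9, alternative form: identical because A^T Q_q = B_q^T
\mathbf{A}^{\top}\mathbf{U}\boldsymbol{\Sigma}^{-1}
  &= \mathbf{A}^{\top}\mathbf{Q}_{q}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}
   = \mathbf{B}_{q}^{\top}\hat{\mathbf{U}}\boldsymbol{\Sigma}^{-1}
   = \mathbf{V}
\end{align*}
```

Under these assumptions the only approximation is the projection `\(\mathbf{A}\approx\mathbf{Q}_{q}\mathbf{Q}_{q}^{\top}\mathbf{A}\)`; everything after it is exact, which is also why the two expressions for `\(\mathbf{V}\)` in step 9 agree.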
- -[Halko, Martinsson, Tropp]: http://arxiv.org/abs/0909.4061 - http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf deleted file mode 100644 index ab5999d..0000000 Binary files a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/SSVD-CLI.pdf and /dev/null differ http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R ---------------------------------------------------------------------- diff --git a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R b/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R deleted file mode 100644 index fa5fa84..0000000 --- a/website/old_site_migration/needs_work_priority/dim-reduction/ssvd.page/ssvd.R +++ /dev/null @@ -1,181 +0,0 @@ - -# standard SSVD -ssvd.svd <- function(x, k, p=25, qiter=0 ) { - -a <- as.matrix(x) -m <- nrow(a) -n <- ncol(a) -p <- min( min(m,n)-k,p) -r <- k+p - -omega <- matrix ( rnorm(r*n), nrow=n, ncol=r) - -y <- a %*% omega - -q <- qr.Q(qr(y)) - -b<- t(q) %*% a - -#power iterations (seq_len skips the loop when qiter=0; 1:0 would run it twice) -for ( i in seq_len(qiter) ) { - y <- a %*% t(b) - q <- qr.Q(qr(y)) - b <- t(q) %*% a -} - -bbt <- b %*% t(b) - -e <- eigen(bbt, symmetric=T) - -res <- list() - -res$svalues <- sqrt(e$values)[1:k] -uhat=e$vectors[1:k,1:k] - -res$u <- (q %*% e$vectors)[,1:k] -res$v <- (t(b) %*% e$vectors %*% diag(1/sqrt(e$values)))[,1:k] # V = B' Uhat Sigma^-1, Sigma = sqrt(Lambda) - -return(res) -} - -#SSVD with Q=YR^-1 substitute.
-# this is just a simulation, because it is suboptimal to verify the actual result -ssvd.svd1 <- function(x, k, p=25, qiter=0 ) { - -a <- as.matrix(x) -m <- nrow(a) -n <- ncol(a) -p <- min( min(m,n)-k,p) -r <- k+p - -omega <- matrix ( rnorm(r*n), nrow=n, ncol=r) - -# in reality we of course don't need to form and persist y -# but this is just verification -y <- a %*% omega - -yty <- t(y) %*% y -R <- chol(yty, pivot = T) -q <- y %*% solve(R) - -b<- t( q ) %*% a - -#power iterations (seq_len skips the loop when qiter=0; 1:0 would run it twice) -for ( i in seq_len(qiter) ) { - y <- a %*% t(b) - - yty <- t(y) %*% y - R <- chol(yty, pivot = T) - q <- y %*% solve(R) - b <- t(q) %*% a -} - -bbt <- b %*% t(b) - -e <- eigen(bbt, symmetric=T) - -res <- list() - -res$svalues <- sqrt(e$values)[1:k] -uhat=e$vectors[1:k,1:k] - -res$u <- (q %*% e$vectors)[,1:k] -res$v <- (t(b) %*% e$vectors %*% diag(1/sqrt(e$values)))[,1:k] # V = B' Uhat Sigma^-1, Sigma = sqrt(Lambda) - -return(res) -} - - -############# -## ssvd with pca options -ssvd.cpca <- function ( x, k, p=25, qiter=0, fixY=T ) { - -a <- as.matrix(x) -m <- nrow(a) -n <- ncol(a) -p <- min( min(m,n)-k,p) -r <- k+p - - -# compute column means xi -xi<-colMeans(a) - -omega <- matrix ( rnorm(r*n), nrow=n, ncol=r) - -y <- a %*% omega - -#fix y -if ( fixY ) { - #debug - cat ("fixing Y...\n"); - - s_o = t(omega) %*% cbind(xi) - for (i in 1:r ) y[,i]<- y[,i]-s_o[i] -} - - -q <- qr.Q(qr(y)) - -b<- t(q) %*% a - -# compute sum of q rows -s_q <- cbind(colSums(q)) - -# compute B*xi -# of course in MR implementation -# it will be collected as sums of ( B[,i] * xi[i] ) and reduced after.
-s_b <- b %*% cbind(xi) - - -#power iterations (seq_len skips the loop when qiter=0; 1:0 would run it twice) -for ( i in seq_len(qiter) ) { - - # fix b - b <- b - s_q %*% rbind(xi) - - y <- a %*% t(b) - - # fix y - if ( fixY ) - for (j in 1:r ) y[,j]<- y[,j]-s_b[j] - - - q <- qr.Q(qr(y)) - b <- t(q) %*% a - - # recompute s_{q} - s_q <- cbind(colSums(q)) - - #recompute s_{b} - s_b <- b %*% cbind(xi) - -} - - - -#C is the outer product of S_q and S_b per doc -C <- s_q %*% t(s_b) - -# fixing BB' -bbt <- b %*% t(b) -C -t(C) + sum(xi * xi)* (s_q %*% t(s_q)) - -e <- eigen(bbt, symmetric=T) - -res <- list() - -res$svalues <- sqrt(e$values)[1:k] -uhat=e$vectors[1:k,1:k] - -res$u <- (q %*% e$vectors)[,1:k] - -res$v <- (t(b- s_q %*% rbind(xi) ) %*% e$vectors %*% diag(1/sqrt(e$values)))[,1:k] # scale by Sigma^-1 = 1/sqrt(Lambda) - -return(res) - -} - - - - - - http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/general/books-tutorials-and-talks.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/general/books-tutorials-and-talks.md b/website/old_site_migration/old_site/general/books-tutorials-and-talks.md deleted file mode 100644 index bbbdeef..0000000 --- a/website/old_site_migration/old_site/general/books-tutorials-and-talks.md +++ /dev/null @@ -1,121 +0,0 @@ ---- -layout: default -title: Books Tutorials and Talks -theme: - name: retro-mahout ---- -# Intro - -This page is a place for info about talks (past and upcoming), tutorials, articles, books, slides, PDFs, discussions, etc. about Mahout. No endorsements are implied or -given. - -# Books - -## Mahout specific - - * <a href="http://www.weatheringthroughtechdays.com/2016/02/mahout-samsara-book-is-out.html">Apache Mahout: Beyond MapReduce</a> by Dmitriy Lyubimov and Andrew Palumbo published Feb 2016. Covers new features in Mahout "Samsara" releases (0.10, 0.11+). - * <a href="http://www.packtpub.com/apache-mahout-cookbook/book">Apache Mahout cookbook</a> - Book by Piero Giacomelli published Dec 2013 by Packtpub.
- * <a href="http://www.manning.com/owen/">Mahout in Action</a> - Book by Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman published Oct 2011 by Manning Publications. - * <a href="http://www.manning.com/ingersoll/">Taming Text</a> - By Grant Ingersoll and Tom Morton, published by Manning Publications. Will have some Mahout coverage, but by no means as complete as Mahout in Action. - -## Engineering oriented machine learning books - - * <a href="http://www.amazon.com/Collective-Intelligence-Action-Satnam-Alag/dp/1933988312/ref=pd_bbs_sr_3?ie=UTF8&s=books&qid=1214545249&sr=1-3">Collective Intelligence in Action</a> - * <a href="http://www.amazon.com/Programming-Collective-Intelligence-Building-Applications/dp/0596529325/ref=pd_bbs_sr_1/104-1017533-9408723?ie=UTF8&s=books&qid=1214593516&sr=1-1">Programming Collective Intelligence</a> - * <a href="http://www.amazon.com/Algorithms-Intelligent-Web-Haralambos-Marmanis/dp/1933988665/ref=sr_1_1?s=books&ie=UTF8&qid=1298005918&sr=1-1">Algorithms of the Intelligent Web</a> - -## Scientific background - - * <a href="http://www.cs.waikato.ac.nz/~ml/weka/book.html">Data Mining: Practical Machine Learning Tools and Techniques</a> - * <a href="http://www-nlp.stanford.edu/IR-book/">Introduction to Information Retrieval</a> - * <a href="http://www.amazon.com/Machine-Learning-Mcgraw-Hill-International-Edit/dp/0071154671/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1214593709&sr=8-1">Machine Learning</a> - * <a href="http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738/ref=pd_bbs_sr_2?ie=UTF8&s=books&qid=1214593709&sr=8-2">Pattern Recognition and Machine Learning (Information Science and Statistics) </a> - -# News, Articles and Tutorials - - * [Mahout 0.10.x: first Mahout release as a programming environment](http://www.weatheringthroughtechdays.com/2015/04/mahout-010x-first-mahout-release-as.html) - * [Comparing Document Classification Functions of Lucene and 
Mahout](http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html) - * <a href="http://www.ibm.com/developerworks/java/library/j-mahout-scaling/">Apache Mahout: Scalable Machine Learning for Everyone</a> - * <a href="http://emmaespina.wordpress.com/2011/04/26/ham-spam-and-elephants-or-how-to-build-a-spam-filter-server-with-mahout/">How to build a spam filter server with Mahout</a> - Applying classification on a live server - April 2011 - * <a href="http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/">Deploying a massively scalable recommender system with Apache Mahout</a> - Blogpost of Sebastian Schelter in April 2011 - * <a href="http://www.redmonk.com/cote/2010/11/04/makeall013/">Apache Mahout & the commoditization of machine learning </a> - Podcast interview with Grant Ingersoll at ApacheCon 2010 - * <a href="http://isabel-drost.de/hadoop/slides/devoxx.pdf">Apache Mahout 0.4 mit neuen Algorithmen</a> - published after the 0.4 release by heise Open/ Developer, November 2010 - * <a href="http://www.infoq.com/news/2009/04/mahout">Mahout on InfoQ</a> - Interview with Grant Ingersoll on InfoQ - * <a href="http://www.cloudera.com/blog/2009/04/21/hadoop-uk-user-group-meeting/">Mahout in the Cloudera weblog</a> - published after the Hadoop user group UK. - * <a href="http://blog.athico.com/2008/08/machine-learning-and-apache-mahout.html">Mahout in the Drools weblog</a> - Michael Neale published an article on Mahout in the drools weblog - * <a href="https://www.ibm.com/developerworks/java/library/j-mahout/index.html">Introducing Apache Mahout</a> - Grant Ingersoll - Intro to Apache Mahout focused on clustering, classification and collaborative filtering. 
Japanese translation available at: [http://www.ibm.com/developerworks/jp/java/library/j-mahout/](http://www.ibm.com/developerworks/jp/java/library/j-mahout/) - * <a href="http://philippeadjiman.com/blog/2009/11/11/flexible-collaborative-filtering-in-java-with-mahout-taste/">Flexible Collaborative Filtering In Java With Mahout Taste</a> - Philippe Adjiman - Quick starting guide on how to use the collaborative filtering package of Mahout (called Taste) to quickly and flexibly create, test and compare tailored recommendation engines. - * <a href="http://www.lucidimagination.com/blog/2010/03/16/integrating-apache-mahout-with-apache-lucene-and-solr-part-i-of-3/">Integrating Mahout with Lucene and Solr</a> Three part series on ways to integrate Mahout with Lucene and Solr - * <a href="https://www.youtube.com/watch?v=yD40rVKUwPI">Mahout Item Recommender Tutorial using Java and Eclipse</a> - YouTube video tutorial by Steve Cook - - -# Coursework/Lectures - - * <a href="http://videolectures.net/mlss05us_chicago/">http://videolectures.net/mlss05us_chicago/</a> - * <a href="http://videolectures.net/mlas06_pittsburgh/">http://videolectures.net/mlas06_pittsburgh/</a> - * <a href="http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1">Stanford Lectures on Machine Learning by Andrew Ng</a> - * <a href="https://docs.google.com/open?id=0ByhGL2_SCeitMDQ3OTczNjItM2ZjYi00ZDg5LWE0MzItZGQxODQ5NzkzYjNj">CMU@Qatar Introduction to Mahout lecture</a> - - -# Talks - -In reverse chronological order, so that most recent talks are at the top - - * [Distributed Machine Learning with Apache Mahout] Suneel Marthi at Apache Big Data North America, Vancouver, Canada, May 11, 2016 and MapR Washington DC Big Data Everywhere, Tysons, VA, June 2 2016 - * [Declarative Machine Learning with the Samsara DSL](http://www.slideshare.net/FlinkForward/sebastian-schelter-distributed-machine-learing-with-the-samsara-dsl) Sebastian Schelter at Flink Forward Conference, Berlin 
Germany, October 2015. - * [Bringing Algebraic Semantics to Mahout](http://www.slideshare.net/sscdotopen/bringing-algebraic-semantics-to-mahout) Sebastian Schelter at HPI Infolunch, Potsdam Germany, May 2014 - * Mahout Spark and Scala bindings: Bringing Algebraic Semantics ([slides](http://www.slideshare.net/DmitriyLyubimov/mahout-scala-and-spark-bindings)/[video](http://youtu.be/h9dpmvNW1Dw)) - Dmitriy Lyubimov at Mahout Meetup, April 17, 2014. - * Mahout Future Directions - Ted Dunning, Suneel Marthi, Sebastian Schelter at Hadoop Summit Europe 2014, Amsterdam, April 3, 2014 - * Building Recommender Systems for Mere-Mortals - Sebastian Schelter at Researchgate Developer Day, Berlin, November 2013 - * Recommendations with Apache Mahout - Sebastian Schelter at IBM Almaden Research Center, San Jose, September 2013 - * <a href="http://de.slideshare.net/sscdotopen/next-directions-in-mahouts-recommenders">Next Directions in Mahout's Recommenders</a> - Sebastian Schelter at Bay Area Mahout Meetup, Redwood City, August 2013 - * <a href="http://de.slideshare.net/sscdotopen/new-directions-in-mahouts-recommenders">New Directions in Mahout's Recommenders</a> - Sebastian Schelter at Recommender Systems Get Together Berlin, April 2013 - * <a href="http://www.slideshare.net/VaradMeru/introduction-to-mahout-and-machine-learning">Introduction to Mahout and Machine Learning</a> - Slides by Varad Meru, Software Development Engineer at Orzota. July 27th, 2013.
- * <a href="http://de.slideshare.net/sscdotopen/introduction-to-collaborative-filtering-with-apache-mahout">An Introduction to Collaborative Filtering with Apache Mahout</a> - Sebastian Schelter at Recommender Systems Challenge Workshop in conjunction with ACM RecSys 2012, Dublin, September 2012 - * <a href="https://github.com/ManuelB/facebook-recommender-demo/raw/master/docs/Talk-BedCon-Berlin-2012.pdf">How to build a recommender system based on Mahout and JavaEE</a> - Slides by Manuel Blechschmidt at Berlin Expert Days March, 2012. - * <a href="http://lanyrd.com/2011/apachecon-north-america/skdtb/">Apache Mahout for intelligent data analysis</a> - Slides from Isabel Drost at Apache Con NA November, 2011. - * <a href="http://lanyrd.com/2011/apachecon-north-america/skdrk/">Dr. Mahout: Analyzing clinical data using scalable and distributed computing</a> - Slides from Shannon Quinn at Apache Con NA November, 2011. - * Frank Scholten at Berlin Buzzwords on June 7, 2011. - * Introduction to Collaborative Filtering using Mahout (updated) - Talk by Sean Owen at the London Hadoop User Group on April 14, 2011. - * <a href="http://www.meetup.com/LA-HUG/pages/Video_from_March_16th_LA-HUG_Ted_Dunning_Mahout">Cool Tricks with Classifiers</a> - Talk by Ted Dunning at the Los Angeles HUG talking about Mahout classifiers on March 16, 2011. - * First Mahout Hackathon, Berlin, March 2011 - * <a href="http://blog.jteam.nl/2011/01/13/announcement-lucene-nl-mahout-meetup-with-isabel-drost-feb-7/">Mahout meetup</a> - there were two talks at the Apache Mahout meetup at JTeam in Amsterdam, February 2011. <a href="http://isabel-drost.de/hadoop/slides/jteam.pdf">intro slides</a> - * <a href="http://www.fosdem.org/2011/schedule/event/mahoutclustering.html">Mahout clustering </a> - Talk on Mahout clustering at data dev room FOSDEM, February 2011. - * Scaling Data Analysis with Apache Mahout - talk on Mahout at O'Reilly Strata, February 2011. 
- * <a href="http://www.slideshare.net/jaganadhg/mahout-tutorial-fossmeet-nitc">Practical Machine Learning</a> - Slides from Biju B and Jaganadh G, FOSSMEET-NITC, Calicut, India, February 2011. - * <a href="http://www.javaedge.com/jedge/pdf/Mahout.pdf">Mahout at AlphaCSPs The Edge 2010 (pdf)</a> - <a href="http://www.slideshare.net/arikogan/mahouts-presentation-at-alphacsps-the-edge-2010">slideshare</a> - Slides from <a href="http://il.linkedin.com/in/arielkogan">Ariel Kogan</a> AlphaCSP's The Edge, December 2010. - * <a href="http://isabel-drost.de/hadoop/slides/devoxx.pdf">Intelligent data analysis with Apache Mahout</a> - Slides from Isabel Drost, Devoxx Antwerp, November 2010. - * <a href="http://isabel-drost.de/hadoop/slides/codebits.pdf">Apache Mahout introduction</a> - Slides from Isabel Drost, codebits Lisbon, November 2010. - * <a href="http://isabel-drost.de/hadoop/slides/apachecon_2010.pdf">Apache Mahout - Making Data Analysis Easy</a> - Slides from Isabel Drost, Apache Con US Atlanta, November 2010. - * <a href="http://www.slideshare.net/jaganadhg/bck9">Practical Machine Learning</a> - Slides from Jaganadh G, BarCamp Kerala 9, November 2010. - * <a href="http://www.slideshare.net/tdunning/sdforum-11042010">Mahout and its new classification framework</a> - Slides from Ted Dunning, SDForum, November 2010. - * <a href="http://www.slideshare.net/sscdotopen/mahoutcf">Distributed Item-based Collaborative Filtering with Apache Mahout</a> - Slides from Sebastian Schelter, Hadoop Get Together Berlin, October 2010. - * <a href="http://isabel-drost.de/hadoop/slides/HMM.pdf">Hidden Markov Models for Mahout</a> - Slides from Max Heimel, Hadoop Get Together Berlin, October 2010. - * <a href="http://www.slideshare.net/robinanil/oscon-apache-mahout-mammoth-scale-machine-learning">Apache Mahout Mammoth Scale Machine Learning </a> - Slides from Robin Anil, OSCON 2010. 
- * <a href="http://slidesha.re/9LxOIu">Intro to Apache Mahout</a> - Slides from Grant Ingersoll, RTP Semantic Web Group. - * <a href="http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010">Case study: Biometric Databases and Hadoop </a> - Slides from Jason Trost, Hadoop Summit 2010. - * <a href="http://www.slideshare.net/hadoopusergroup/mail-antispam?from=ss_embed">Spam Fighting at Yahoo</a> - * <a href="http://www.slideshare.net/hadoopusergroup/bixo-hug-talk?from=ss_embed">Web Mining with Ken Krugler</a> - * <a href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/ingersoll_bbuzz2010.pdf">Keynote on intelligent search</a> - Slides from Grant Ingersoll, Berlin Buzzwords, June 2010. - * <a href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/owen_bbuzz2010.pdf">Simple co-occurrence-based recommendation on Hadoop</a> - Slides from Sean Owen, Berlin Buzzwords, June, 2010. - * <a href="http://berlinbuzzwords.wikidot.com/local--files/links-to-slides/scholten_bbuzz2010.odp">Introduction to Collaborative Filtering using Mahout</a> - Slides from Frank Scholten, Berlin Buzzwords, June, 2010. - * <a href="http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/">Introduction to Scalable Machine Learning</a> - Slides and demos from Grant Ingersoll, March, 2010. - * Mahout @ India Hadoop Summit - Slides from a 1 hour talk on Mahout at the India Hadoop Summit by Robin Anil, February 2010. - * <a href="http://www.isabel-drost.de/hadoop/slides/opensourceexpo09.pdf">Mahout in 10 minutes</a> - Slides from a 10 min intro to Mahout at the Map Reduce tutorial by David Zülke at Open Source Expo in Karlsruhe, Isabel Drost, November 2009. - * <a href="http://www.isabel-drost.de/hadoop/slides/apacheconus2009.pdf">Mahout at Apache Con US </a> - Slides from a talk on "Going from raw data to information" (with Mahout) at Apache Con US in Oakland, Isabel Drost, November 2009. 
- * <a href="http://www.isabel-drost.de/hadoop/slides/froscon2009.pdf">Mahout at FrOSCon</a> - Slides from a talk on Mahout at FrOSCon in Sankt Augustin, Isabel Drost, August 2009. - * <a href="http://www.isabel-drost.de/hadoop/slides/dai.pdf">Mahout at DAI group TU Berlin</a> - Slides from a talk on Mahout at the DAI Laboratories TU Berlin, Isabel Drost, July 2009. - * <a href="http://www.isabel-drost.de/hadoop/slides/ulf.pdf">Mahout at Machine Learning Group TU Berlin</a> - Slides from a talk on Hadoop with some detour to Mahout at the Machine Learning Group of Prof. Dr. Klaus-Robert Müller at TU Berlin, Isabel Drost, June 2009. - * <a href="http://www.isabel-drost.de/hadoop/slides/google.pdf">Mahout at Google Zürich</a> - Slides from a Google tech-talk on the past, present and future of Mahout, Isabel Drost, May 2009. - * <a href="http://static.last.fm/johan/huguk-20090414/isabel_drost-introducing_apache_mahout.pdf">Hadoop user group UK</a> - Slides from a talk on April 14, 2009 at the Hadoop User Group UK in London, Isabel Drost, April 2009. - * <a href="http://cwiki.apache.org/confluence/download/attachments/88410/SDForum.pdf">BI Over Petabytes: Meet Apache Mahout</a> - Slides from a talk by Jeff Eastman on April 21, 2009 at the Bay Area SD Forum Business Intelligence SIG meeting at SAP in Palo Alto, CA. - * Lucene Meetup and Apache Barcamp in Amsterdam, March 2009. - * BarCampRDU - (Raleigh) on Aug. 2, 2008 - * Introducing Mahout: Apache Machine Learning - Committer Grant Ingersoll gave a gentle introduction to Mahout and Machine Learning at ApacheCon in November (3rd through 7th) in New Orleans, USA. - * Mahout: Scaling Machine Learning - Introduction to Mahout and machine learning at FrOSCon in Sankt Augustin/Germany, Isabel Drost, August 2008.
(<a href="http://cwiki.apache.org/confluence/download/attachments/88410/froscon.pdf">slides</a>) - * Mahout: Scalable Machine Learning - An introduction to Mahout and machine learning at the first German Hadoop gathering in newthinking store/ Berlin, Isabel Drost, July 2008. - * Apache Mahout: Industrial Strength Machine Learning - Committer Jeff Eastman gave an introduction to Mahout at Yahoo!, May 2008 - * <a href="http://people.apache.org/~berndf/openexpode08-lucene-talk.pdf">Apache Lucene - Mach's wie Google</a> - Bernd Fondermann presented an overview of the Apache Lucene project, including Mahout, at Open Source Expo 2008 in Karlsruhe, May 2008. - * Apache Mahout: Bringing Machine Learning to Industrial Strength - Committer Isabel Drost gave a Fast Feather introduction to the new project Mahout at Apache Con EU April, 2008 \ No newline at end of file http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/general/mahout-wiki.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/general/mahout-wiki.md b/website/old_site_migration/old_site/general/mahout-wiki.md deleted file mode 100644 index 2df16d4..0000000 --- a/website/old_site_migration/old_site/general/mahout-wiki.md +++ /dev/null @@ -1,202 +0,0 @@ ---- -layout: default -title: Mahout Wiki -theme: - name: retro-mahout ---- - -On the fence about including this in new site. lol at "new Apache TLP" - -Apache Mahout is a new Apache TLP project to create scalable machine -learning algorithms under the Apache license. - -{toc:style=disc|minlevel=2} - -<a name="MahoutWiki-General"></a> -## General -[Overview](overview.html) - -- Mahout? What's that supposed to be? - -[Quickstart](quickstart.html) - -- learn how to quickly set up Apache Mahout for your project. - -[FAQ](faq.html) - -- Frequent questions encountered on the mailing lists.
- -[Developer Resources](developer-resources.html) - -- overview of the Mahout development infrastructure. - -[How To Contribute](how-to-contribute.html) - -- get involved with the Mahout community. - -[How To Become A Committer](how-to-become-a-committer.html) - -- become a member of the Mahout development community. - -[Hadoop](http://hadoop.apache.org) - -- several of our implementations depend on Hadoop. - -[Machine Learning Open Source Software](http://mloss.org/software/) - -- other projects implementing Open Source Machine Learning libraries. - -[Mahout -- The name, history and its pronunciation](mahoutname.html) - -<a name="MahoutWiki-Community"></a> -## Community - -[Who we are](who-we-are.html) - -- who are the developers behind Apache Mahout? - -[Books, Tutorials, Talks, Articles, News, Background Reading, etc. on Mahout](books-tutorials-and-talks.html) - -[Issue Tracker](issue-tracker.html) - -- see what features people are working on, submit patches and file bugs. - -[Source Code (SVN)](https://svn.apache.org/repos/asf/mahout/) - -- [Fisheye](http://fisheye6.atlassian.com/browse/mahout) - -- download the Mahout source code from svn. - -[Mailing lists and IRC](mailing-lists,-irc-and-archives.html) - -- links to our mailing lists, IRC channel and archived design and -algorithm discussions; maybe your question was already answered there? - -[Version Control](version-control.html) - -- where we track our code. - -[Powered By Mahout](powered-by-mahout.html) - -- who is using Mahout in production? - -[Professional Support](professional-support.html) - -- who is offering professional support for Mahout? - -[Mahout and Google Summer of Code](gsoc.html) - -- All you need to know about Mahout and GSoC. - - -[Glossary of commonly used terms and abbreviations](glossary.html) - -<a name="MahoutWiki-Installation/Setup"></a> -## Installation/Setup - -[System Requirements](system-requirements.html) - -- what do you need to run Mahout?
- -[Quickstart](quickstart.html) - -- get started with Mahout, run the examples and get pointers to further -resources. - -[Downloads](downloads.html) - -- a list of Mahout releases. - -[Download and installation](buildingmahout.html) - -- build Mahout from the sources. - -[Mahout on Amazon's EC2 Service](mahout-on-amazon-ec2.html) - -- run Mahout on Amazon's EC2. - -[Mahout on Amazon's EMR](mahout-on-elastic-mapreduce.html) - -- Run Mahout on Amazon's Elastic Map Reduce - -[Integrating Mahout into an Application](mahoutintegration.html) - -- integrate Mahout's capabilities in your application. - -<a name="MahoutWiki-Examples"></a> -## Examples - -1. [ASF Email Examples](asfemail.html) - -- Examples of recommenders, clustering and classification all using a -public domain collection of 7 million emails. - -<a name="MahoutWiki-ImplementationBackground"></a> -## Implementation Background - -<a name="MahoutWiki-RequirementsandDesign"></a> -### Requirements and Design - -[Matrix and Vector Needs](matrix-and-vector-needs.html) - -- requirements for Mahout vectors. - -[Collection(De-)Serialization](collection(de-)serialization.html) - -<a name="MahoutWiki-CollectionsandAlgorithms"></a> -### Collections and Algorithms - -Learn more about [mahout-collections](mahout-collections.html) -, containers for efficient storage of primitive-type data and open hash -tables. - -Learn more about the [Algorithms](algorithms.html) - discussed and employed by Mahout. - -Learn more about the [Mahout recommender implementation](recommender-documentation.html) -. - -<a name="MahoutWiki-Utilities"></a> -### Utilities - -This section describes tools that might be useful for working with Mahout. - -[Converting Content](converting-content.html) - -- Mahout has some utilities for converting content such as logs to -formats more amenable for consumption by Mahout. -[Creating Vectors](creating-vectors.html) - -- Mahout's algorithms operate on vectors. 
Learn more on how to generate -these from raw data. -[Viewing Result](viewing-result.html) - -- How to visualize the result of your trained algorithms. - -<a name="MahoutWiki-Data"></a> -### Data - -[Collections](collections.html) - -- To try out and test Mahout's algorithms you need training data. We are -always looking for new training data collections. - -<a name="MahoutWiki-Benchmarks"></a> -### Benchmarks - -[Mahout Benchmarks](mahout-benchmarks.html) - -<a name="MahoutWiki-Committer'sResources"></a> -## Committer's Resources - -* [Testing](testing.html) - -- Information on test plans and ideas for testing - -<a name="MahoutWiki-ProjectResources"></a> -### Project Resources - -* [Dealing with Third Party Dependencies not in Maven](thirdparty-dependencies.html) -* [How To Update The Website](how-to-update-the-website.html) -* [Patch Check List](patch-check-list.html) -* [How To Release](http://cwiki.apache.org/confluence/display/MAHOUT/How+to+release) -* [Release Planning](release-planning.html) -* [Sonar Code Quality Analysis](https://analysis.apache.org/dashboard/index/63921) - -<a name="MahoutWiki-AdditionalResources"></a> -### Additional Resources - -* [Apache Machine Status](http://monitoring.apache.org/status/) - \- Check to see if SVN, other resources are available. -* [Committer's FAQ](http://www.apache.org/dev/committers.html) -* [Apache Dev](http://www.apache.org/dev/) - - -<a name="MahoutWiki-HowToEditThisWiki"></a> -## How To Edit This Wiki - -How to edit this Wiki - -This Wiki is a collaborative site, anyone can contribute and share: - -* Create an account by clicking the "Login" link at the top of any page, -and picking a username and password. -* Edit any page by pressing Edit at the top of the page - -There are some conventions used on the Mahout wiki: - - * {noformat}+*TODO:*+{noformat} (+*TODO:*+ ) is used to denote sections -that definitely need to be cleaned up. 
- * {noformat}+*Mahout_(version)*+{noformat} (+*Mahout_0.2*+) is used to -draw attention to which version of Mahout a feature was (or will be) added -to Mahout. - http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/general/professional-support.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/general/professional-support.md b/website/old_site_migration/old_site/general/professional-support.md deleted file mode 100644 index 45d798c..0000000 --- a/website/old_site_migration/old_site/general/professional-support.md +++ /dev/null @@ -1,41 +0,0 @@ ---- -layout: default -title: Professional Support -theme: - name: retro-mahout ---- - -NOTE: on the fence about including this in new site. - -<a name="ProfessionalSupport-ProfessionalsupportforMahout"></a> -# Professional support for Mahout - -Add yourself or your company if you are offering support for Mahout -users. Please keep lists in alphabetical order. An entry here -is not an endorsement by the Apache Software Foundation nor any of its -committers. 
- - -<a name="ProfessionalSupport-Peopleandcompaniesforhire"></a> -## People and companies for hire - -| Name | Contact details | Notes | -|------|-----------------|-------| -| Accenture | [email protected] | [Consulting services in big data analytics](http://accenture.com) | -| Boston Predictive Analytics | [email protected] | [http://tutorteddy.com/site/free_statistics_help.php](http://tutorteddy.com/site/free_statistics_help.php) | -| Frank Scholten | [email protected] | | -| GridLine | [http://www.gridline.nl/contact](http://www.gridline.nl/contact) | Specialised in search and thesauri | -| Jagdish Nomula | [email protected] | ML, Search, Algorithms, Java [http://www.kosmex.com](http://www.kosmex.com) | -| LucidWorks | [http://www.lucidworks.com](http://www.lucidworks.com) | Big data platform including Mahout as a service for clustering, classification and more | -| Sematext International | [http://sematext.com/](http://sematext.com/) | | -| Ted Dunning | [email protected] | Full commercial support | -| Winterwell | [email protected] | Business/maths concept development & algorithms [http://winterwell.com](http://winterwell.com) | - -<a name="ProfessionalSupport-Talksandpresentations"></a> -## Talks and presentations - -| Name | Contact details | Notes | -|------|-----------------|-------| -| Andrew Musselman | [email protected] | ["Building a Recommender with Apache Mahout on Amazon Elastic-MapReduce"](https://blogs.aws.amazon.com/bigdata/post/Tx1TDK3HHBD4EZL/Building-a-Recommender-with-Apache-Mahout-on-Amazon-Elastic-MapReduce-EMR) | -| Frank Scholten | [email protected] | Mahout/Taste [http://blog.jteam.nl/author/frank/](http://blog.jteam.nl/author/frank/) | -| Isabel Drost-Fromm | [email protected] | If travel and accommodation costs are covered scheduling a talk is a lot easier. 
| http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/general/reference-reading.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/general/reference-reading.md b/website/old_site_migration/old_site/general/reference-reading.md deleted file mode 100644 index ba969ac..0000000 --- a/website/old_site_migration/old_site/general/reference-reading.md +++ /dev/null @@ -1,71 +0,0 @@ ---- -layout: default -title: Reference Reading -theme: - name: retro-mahout ---- - -# Reference Reading - -Here we provide references to books and courses about data analysis in general, which might also be helpful in the context of Mahout. - -<a name="ReferenceReading-GeneralBackgroundMaterials"></a> -## General Background Materials - -Don't be overwhelmed by all the maths; you can do a lot in Mahout with some -basic knowledge. The books will help you understand your -data better, and ask better questions both of Mahout's APIs, and also of -the Mahout community. And unlike learning some particular software tool, -these are skills that will remain useful decades later. - - * [Gilbert Strang](http://www-math.mit.edu/~gs) -'s [Introduction to Linear Algebra](http://math.mit.edu/linearalgebra/). His [lectures](http://ocw.mit.edu/courses/mathematics/18-06-linear-algebra-spring-2010/) are also [available online](http://web.mit.edu/18.06/www/) - and are strongly recommended. - * [Mathematical Tools for Applied Multivariate Analysis](http://www.amazon.com/Mathematical-Tools-Applied-Multivariate-Analysis/dp/0121609553/ref=sr_1_1?ie=UTF8&qid=1299602805&sr=8-1) by J. Douglas -Carroll.
- * [Stanford Machine Learning online courseware](http://www.stanford.edu/class/cs229/) - * [MIT Machine Learning online courseware](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/) has [lecture notes](http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/) online. - * As a pre-requisite to probability and statistics, you'll need [basic calculus](http://en.wikipedia.org/wiki/Calculus). A maths-for-scientists text might be useful here, such as 'Mathematics for Engineers and Scientists', Alan Jeffrey, Chapman & Hall/CRC. ([openlibrary](http://openlibrary.org/books/OL3305993M/Mathematics_for_engineers_and_scientists)) - * One of the best writers in the probability/statistics world is Sheldon Ross. Try [A First Course in Probability (8th Edition)](http://www.pearsonhighered.com/educator/product/First-Course-in-Probability-A/9780136033134.page) and then move on to his [Introduction to Probability Models](http://www.amazon.com/Introduction-Probability-Models-Sixth-Sheldon/dp/0125984707) - -Some good introductory alternatives here are: - - * [Khan Academy](http://www.khanacademy.org/) -- videos on stats, probability, linear algebra - * [Probability and Statistics (7th Edition)](http://www.amazon.com/Probability-Statistics-Engineering-Sciences-InfoTrac/dp/0534399339), Jay L. Devore, Chapman. - * [Probability and Statistical Inference (7th Edition)](http://www.amazon.com/Probability-Statistical-Inference-Robert-Hogg/dp/0132546086), Hogg and Tanis, Pearson. - -Once you have a grasp of the basics, there are a slew of great texts that you might consult: - - * [Statistical Inference](http://www.amazon.com/Statistical-Inference-George-Casella/dp/0534243126), Casella and Berger, Duxbury/Thomson Learning. - * [Introduction to Bayesian Statistics](http://www.amazon.com/Introduction-Bayesian-Statistics-William-Bolstad/dp/0471270202), William H. Bolstad, Wiley.
- * [Understanding Computational Bayesian Statistics](http://www.amazon.com/Understanding-Computational-Bayesian-Statistics-Wiley/dp/0470046090), Bolstad - * [Bayesian Data Analysis, Gelman et al.](http://www.stat.columbia.edu/~gelman/book/) - - -## For statistics related to machine learning, these are particularly helpful: - - * [Pattern Recognition and Machine Learning by Chris Bishop](http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm) - * [Elements of Statistical Learning](http://www-stat.stanford.edu/~tibs/ElemStatLearn/) by Trevor Hastie, Robert Tibshirani, Jerome Friedman - * [http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm](http://research.microsoft.com/en-us/um/people/cmbishop/PRML/index.htm) - - -## For matrix computations/decomposition/factorization etc.: - - * Peter V. O'Neil [Introduction to Linear Algebra](http://www.amazon.com/Introduction-Linear-Algebra-Theory-Applications/dp/053400606X), great book for beginners (with some knowledge of calculus). It is not comprehensive, but it will be a good place to start, and the author starts by explaining the concepts with regard to vector spaces, which I found to be a more natural way of explaining. - * David S. Watkins [Fundamentals of Matrix Computations](http://www.amazon.com/Fundamentals-Matrix-Computations-Applied-Mathematics/dp/0470528338/) - * [Matrix Computations](http://www.amazon.com/Computations-Hopkins-Studies-Mathematical-Sciences/dp/0801854148/ref=sr_1_2?s=books&ie=UTF8&qid=1394307676&sr=1-2&keywords=golub+van+loan) is the classic text for numerical linear algebra. Can't go wrong with it - great for researchers. - * Nick Trefethen's [Numerical Linear Algebra](http://people.maths.ox.ac.uk/trefethen/books.html). It's a bit more approachable for practitioners. Many chapters on SVD; there are even chapters on Lanczos. - - -## Books specifically on R: - -* Learning about R is a difficult thing.
The best introduction is in MASS [http://www.stats.ox.ac.uk/pub/MASS4/](http://www.stats.ox.ac.uk/pub/MASS4/) -* [R Tutor](http://www.r-tutor.com/r-introduction) -* [Manual](http://cran.r-project.org/doc/manuals/R-intro.pdf) -* [R Course](http://faculty.washington.edu/tlumley/Rcourse/) - -In addition, you should see how to plot data well: - -* [Trellis plotting](http://www.statmethods.net/advgraphs/trellis.html) -* [ggplot2](http://had.co.nz/ggplot2/) - http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md ---------------------------------------------------------------------- diff --git a/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md b/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md deleted file mode 100644 index 39f4bfd..0000000 --- a/website/old_site_migration/old_site/users/basics/matrix-and-vector-needs.md +++ /dev/null @@ -1,88 +0,0 @@ ---- -layout: default -title: Matrix and Vector Needs -theme: - name: retro-mahout ---- - -<a name="MatrixandVectorNeeds-Intro"></a> -# Intro - -Most ML algorithms require the ability to represent multidimensional data -concisely and to be able to easily perform common operations on that data. -MAHOUT-6 introduced Vector and Matrix datatypes of arbitrary cardinality, -along with a set of common operations on their instances. Vectors and -matrices are provided with sparse and dense implementations that are memory -resident and are suitable for manipulating intermediate results within -mapper, combiner and reducer implementations. They are not intended for -applications requiring vectors or matrices that exceed the size of a single -JVM, though such applications might be able to utilize them within a larger -organizing framework. 
- -<a name="MatrixandVectorNeeds-Background"></a> -## Background - -See [http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser](http://mail-archives.apache.org/mod_mbox/lucene-mahout-dev/200802.mbox/browser) - -<a name="MatrixandVectorNeeds-Vectors"></a> -## Vectors - -Mahout supports a Vector interface that defines the following operations over all implementation classes: assign, cardinality, copy, divide, dot, get, haveSharedCells, like, minus, normalize, plus, set, size, times, toArray, viewPart, zSum and cross. The class DenseVector implements vectors as a double[] -that is storage and access efficient. The class SparseVector implements -vectors as a HashMap<Integer, Double> that is surprisingly fast and -efficient. For sparse vectors, the size() method returns the current number -of elements whereas the cardinality() method returns the number of -dimensions it holds. An additional VectorView class allows views of an -underlying vector to be specified by the viewPart() method. See the -JavaDocs for more complete definitions. - -<a name="MatrixandVectorNeeds-Matrices"></a> -## Matrices - -Mahout also supports a Matrix interface that defines a similar set of operations over all implementation classes: assign, assignColumn, assignRow, cardinality, copy, divide, get, haveSharedCells, like, minus, plus, set, size, times, transpose, toArray, viewPart and zSum. The class DenseMatrix implements matrices as a double[][] -that is storage and access efficient. The class SparseRowMatrix -implements matrices as a Vector[] holding the rows of the matrix in a -SparseVector, and the symmetric class SparseColumnMatrix implements -matrices as a Vector[] holding the columns in a SparseVector. Each of these -classes can quickly produce a given row or column, respectively. A fourth -class, SparseMatrix, uses a HashMap<Integer, Vector> which is also a -SparseVector.
For sparse matrices, the size() method returns an int\[2\] -containing the actual row and column sizes whereas the cardinality() method -returns an int\[2\] with the number of dimensions of each. An additional -MatrixView class allows views of an underlying matrix to be specified by -the viewPart() method. See the JavaDocs for more complete definitions. - -The Matrix interface does not currently provide invert or determinant -methods, though these are desirable. It is arguable that the -implementations of SparseRowMatrix and SparseColumnMatrix ought to use the -HashMap<Integer, Vector> implementations and that SparseMatrix should -instead use a HashMap<Integer, HashMap<Integer, Double>>. Other forms of -sparse matrices can also be envisioned that support different storage and -access characteristics. Because the arguments of assignColumn and assignRow -operations accept all forms of Vector, it is possible to construct -instances of sparse matrices containing dense rows or columns. See the -JavaDocs for more complete definitions. - -For applications like PageRank/TextRank, iterative approaches to calculate -eigenvectors would also be useful. Batching of row/column operations would -also be useful, such as perhaps assignRow or assignColumn accepting -UnaryFunction and BinaryFunction arguments. - - -<a name="MatrixandVectorNeeds-Ideas"></a> -## Ideas - -As Vector and Matrix implementations are currently memory-resident, -instances larger than available memory are not supported. An -extended set of implementations that use HBase (BigTable) in Hadoop to -represent their instances would facilitate applications requiring such -large collections. -See [MAHOUT-6](https://issues.apache.org/jira/browse/MAHOUT-6) -See [Hama](http://wiki.apache.org/hadoop/Hama) - - -<a name="MatrixandVectorNeeds-References"></a> -## References - -Have a look at the old parallel computing libraries like [ScaLAPACK](http://www.netlib.org/scalapack/) -, others
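
The Vectors section of matrix-and-vector-needs.md distinguishes size() (the number of elements currently stored) from cardinality() (the number of dimensions) for a hash-map-backed sparse vector. The following is a minimal Python sketch of that distinction only. It is not Mahout's Java API: the class name `SparseVectorSketch` and its internals are hypothetical, chosen to mirror the method names mentioned in the text.

```python
class SparseVectorSketch:
    """Hypothetical dict-backed sparse vector (illustration only, not Mahout's SparseVector)."""

    def __init__(self, cardinality):
        self._cardinality = cardinality  # number of dimensions, fixed at construction
        self._cells = {}                 # index -> value, only for cells that were set

    def _check(self, index):
        if not 0 <= index < self._cardinality:
            raise IndexError(index)

    def set(self, index, value):
        self._check(index)
        self._cells[index] = value

    def get(self, index):
        self._check(index)
        return self._cells.get(index, 0.0)  # cells never set read as zero

    def size(self):
        # number of elements currently stored
        return len(self._cells)

    def cardinality(self):
        # number of dimensions the vector holds
        return self._cardinality

    def dot(self, other):
        if self.cardinality() != other.cardinality():
            raise ValueError("cardinality mismatch")
        # iterate over the vector with fewer stored cells
        small, large = (self, other) if self.size() <= other.size() else (other, self)
        return sum(value * large.get(index) for index, value in small._cells.items())


v = SparseVectorSketch(1000)
v.set(3, 2.0)
v.set(10, 4.0)

w = SparseVectorSketch(1000)
w.set(10, 0.5)

print(v.size(), v.cardinality(), v.dot(w))
```

In this sketch the dot product touches only stored cells, which is the main reason a map-backed representation can be efficient when most entries are zero.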
