http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/misc/parallel-frequent-pattern-mining.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/misc/parallel-frequent-pattern-mining.md b/website/old_site_migration/needs_work_convenience/map-reduce/misc/parallel-frequent-pattern-mining.md
deleted file mode 100644
index e2978a4..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/misc/parallel-frequent-pattern-mining.md
+++ /dev/null
@@ -1,185 +0,0 @@
----
-layout: default
-title: Parallel Frequent Pattern Mining
-theme:
-    name: retro-mahout
----
-Mahout has a Top K Parallel FPGrowth implementation. It is based on the paper [http://infolab.stanford.edu/~echang/recsys08-69.pdf](http://infolab.stanford.edu/~echang/recsys08-69.pdf)
-with some optimisations in mining the data.
-
-Given a huge transaction list, the algorithm finds all unique features (sets
-of field values) and eliminates those features whose frequency in the whole
-dataset is less than minSupport. Using the remaining N features, we find
-the top K closed patterns for each of them, generating a total of NxK
-patterns. The FPGrowth algorithm is a generic implementation: we can use any
-Object type to denote a feature. The current implementation requires you to use
-a String as the object type. You may implement a version for any object by
-creating Iterators, Convertors and TopKPatternWritable for that particular
-object. For more information please refer to the package
-org.apache.mahout.fpm.pfpgrowth.convertors.string
-
-    e.g.:
-     FPGrowth<String> fp = new FPGrowth<String>();
-     Set<String> features = new HashSet<String>();
-     fp.generateTopKStringFrequentPatterns(
-         new StringRecordIterator(
-             new FileLineIterable(new File(input), encoding, false), pattern),
-         fp.generateFList(
-             new StringRecordIterator(
-                 new FileLineIterable(new File(input), encoding, false), pattern),
-             minSupport),
-         minSupport,
-         maxHeapSize,
-         features,
-         new StringOutputConvertor(
-             new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer))
-     );
-
-* The first argument is the iterator of transactions, in this case an
-Iterator<List<String>>
-* The second argument is the output of the generateFList function, which
-returns the frequent items and their frequencies from the given database
-transaction iterator
-* The third argument is the minimum support of the patterns to be generated
-* The fourth argument is the maximum number of patterns to be mined for
-each feature
-* The fifth argument is the set of features for which the frequent patterns
-have to be mined
-* The last argument is an output collector which takes \[key, value\] pairs
-of Feature and TopK Patterns of the format \[String,
-List<Pair<List<String>, Long>>\] and writes them to the appropriate writer
-class which takes care of storing the object, in this case in a Sequence
-File Output format
-
-<a name="ParallelFrequentPatternMining-RunningFrequentPatternGrowthviacommandline"></a>
-## Running Frequent Pattern Growth via command line
-
-The command line launcher for string transaction data,
-org.apache.mahout.fpm.pfpgrowth.FPGrowthDriver, has other features including
-specifying the regex pattern for splitting a string line of a transaction
-into the constituent features.
-
-Input files have to be in the following format.
-
-<optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE....
-
-Instead of a tab you could use , or \| because the default tokenization is done using
-the Java regex pattern `[ ,\t]*[,|\t][ ,\t]*`.
-You can override this parameter to parse your log files or transaction
-files (each line is a transaction). The FPGrowth algorithm mines the top K
-frequently occurring sets of items and their counts from the given input
-data.
-
-$MAHOUT_HOME/core/src/test/resources/retail.dat is a sample dataset in this
-format.
-Another sample file is accidents.dat.gz from [http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/).
-As a quick test, try this:
-
-
-    bin/mahout fpg \
-         -i core/src/test/resources/retail.dat \
-         -o patterns \
-         -k 50 \
-         -method sequential \
-         -regex '[\ ]' \
-         -s 2
-
-
-The minimumSupport parameter \-s is the minimum number of times a pattern
-or a feature needs to occur in the dataset so that it is included in the
-patterns generated. You can speed up the process by using a large value of
-s. There are cases where you will have fewer than k patterns for a
-particular feature because the rest don't meet the minimum support
-criterion.
-
-Note that the input to the algorithm can be an uncompressed file, a compressed
-gz file, or even a directory containing any number of such files.
-We modified the regex to split the tokens on spaces. Note that the input
-regex string is escaped.
-
-<a name="ParallelFrequentPatternMining-RunningParallelFPGrowth"></a>
-## Running Parallel FPGrowth
-
-Running parallel FPGrowth is as easy as changing the flag to \-method
-mapreduce and adding the number of groups parameter, e.g. \-g 20 for 20
-groups. First, let's run the above sample test in map-reduce mode:
-
-    bin/mahout fpg \
-         -i core/src/test/resources/retail.dat \
-         -o patterns \
-         -k 50 \
-         -method mapreduce \
-         -regex '[\ ]' \
-         -s 2
-
-The above test took 102 seconds on a dual-core laptop, vs. 609 seconds in
-sequential mode (with 5 gigs of RAM allocated). In a separate test,
-the first 1000 lines of retail.dat took 20 seconds in map/reduce vs. 30
-seconds in sequential mode.
-
-Here is another dataset which, while several times larger, requires much
-less time to find frequent patterns, as there are very few. Get
-accidents.dat.gz from [http://fimi.cs.helsinki.fi/data/](http://fimi.cs.helsinki.fi/data/)
-and place it on your HDFS in a folder named accidents. Then, run the
-Hadoop version of the FPGrowth job:
-
-    bin/mahout fpg \
-         -i accidents \
-         -o patterns \
-         -k 50 \
-         -method mapreduce \
-         -regex '[\ ]' \
-         -s 2
-
-
-Or, to run a dataset of this size in sequential mode on a single machine,
-let's give Mahout a lot more memory and only keep features with more than
-300 members:
-
-    export MAHOUT_HEAPSIZE=-Xmx5000m
-    bin/mahout fpg \
-         -i accidents \
-         -o patterns \
-         -k 50 \
-         -method sequential \
-         -regex '[\ ]' \
-         -s 2
-
-
-
-The numGroups parameter \-g in FPGrowthJob specifies the number of groups
-into which transactions have to be decomposed. The default of 1000 works
-very well on a single-machine cluster; this may be very different on large
-clusters.
-
-Note that accidents.dat has 340 unique features. So we chose \-g 10 to
-split the transactions across 10 shards where 34 patterns are mined from
-each shard. (Note: the number of features doesn't need to be exactly
-divisible by g.) The algorithm
-takes care of calculating the split. For better performance on large
-datasets and clusters, try not to mine for more than 20-25 features per
-shard. Stick to the defaults on a small machine.
-
-The numTreeCacheEntries parameter \-tc specifies the number of generated
-conditional FP-Trees to be kept in memory so that subsequent operations do
-not need to regenerate them. Increasing this number increases the memory
-consumption but might improve speed up to a certain point. This depends
-entirely on the dataset in question. A value of 5-10 is recommended for
-mining up to the top 100 patterns for each feature.
-
-<a name="ParallelFrequentPatternMining-Viewingtheresults"></a>
-## Viewing the results
-The output will be dumped to a SequenceFile in the frequentpatterns
-directory in Text=>TopKStringPatterns format. Run this command to see a few
-of the Frequent Patterns:
-
-    bin/mahout seqdumper \
-         -i patterns/frequentpatterns/part-?-00000 \
-         -n 4
-
-or replace -n 4 with -c for the count of patterns.
- 
-Open questions: how does one experiment with and monitor these various
-parameters?
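
The first step described in the page above (split each transaction line into features, then drop every feature whose frequency is below minSupport) can be sketched in a few lines of Python. This is only an illustration of the idea, not Mahout's code: the function names `parse_transactions` and `frequent_features` are hypothetical, the splitter is the single-space pattern used in the sample commands, and counting each feature once per transaction is an assumption.

```python
import re
from collections import Counter


def parse_transactions(lines, splitter=r"[ ]"):
    """Split each input line into its list of non-empty features."""
    pattern = re.compile(splitter)
    return [[tok for tok in pattern.split(line.strip()) if tok] for line in lines]


def frequent_features(transactions, min_support):
    """Return (feature, count) pairs with count >= min_support, most frequent first."""
    counts = Counter()
    for transaction in transactions:
        # Assumption: a feature is counted once per transaction it appears in.
        counts.update(set(transaction))
    kept = [(feature, count) for feature, count in counts.items() if count >= min_support]
    # Sort by descending count, breaking ties alphabetically for a stable order.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


lines = ["milk bread", "milk eggs", "bread milk butter"]
transactions = parse_transactions(lines)
flist = frequent_features(transactions, min_support=2)
print(flist)
```

Raising `min_support` shrinks the frequent-feature list, which is why a larger \-s value speeds up the mining step.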

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/misc/perceptron-and-winnow.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/misc/perceptron-and-winnow.md b/website/old_site_migration/needs_work_convenience/map-reduce/misc/perceptron-and-winnow.md
deleted file mode 100644
index 308040c..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/misc/perceptron-and-winnow.md
+++ /dev/null
@@ -1,41 +0,0 @@
----
-layout: default
-title: Perceptron and Winnow
-theme:
-    name: retro-mahout
----
-<a name="PerceptronandWinnow-ClassificationwithPerceptronorWinnow"></a>
-# Classification with Perceptron or Winnow
-
-Both algorithms are comparatively simple linear classifiers. Given training
-data in some n-dimensional vector space that is annotated with binary
-labels, the algorithms are guaranteed to find a linear separating hyperplane
-if one exists. In contrast to the Perceptron, Winnow works only for binary
-feature vectors.
-
-For more information on the Perceptron see for instance:
-http://en.wikipedia.org/wiki/Perceptron
-
-Concise course notes on both algorithms:
-http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture24.pdf
-
-Although the algorithms are comparatively simple, they still work pretty well
-for text classification and are fast to train even for huge example sets.
-In contrast to Naive Bayes they are not based on the assumption that all
-features (in the domain of text classification: all terms in a document)
-are independent.
-
-<a name="PerceptronandWinnow-Strategyforparallelisation"></a>
-## Strategy for parallelisation
-
-Currently the strategy for parallelisation is simple: Given there is enough
-training data, split the training data. Train the classifier on each split.
-The resulting hyperplanes are then averaged.
-
-<a name="PerceptronandWinnow-Roadmap"></a>
-## Roadmap
-
-Currently the patch only contains the code for the classifier itself. It is
-planned to provide unit tests and at least one example based on the WebKB
-dataset by the end of November for the serial version. After that the
-parallelisation will be added.
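
The parallelisation strategy described in the page above (split the training data, train one classifier per split, average the resulting hyperplanes) can be sketched as follows. This is a minimal single-process illustration in plain Python, not the code from the patch. All function names are hypothetical, labels are assumed to be -1 or +1, and the bias is stored as the last weight.

```python
def train_perceptron(examples, epochs=10):
    """Train a perceptron on (features, label) pairs with labels in {-1, +1}.

    Returns the weight vector; the last entry is the bias term.
    """
    dim = len(examples[0][0])
    weights = [0.0] * (dim + 1)
    for _ in range(epochs):
        for features, label in examples:
            x = list(features) + [1.0]  # constant input for the bias
            activation = sum(w * xi for w, xi in zip(weights, x))
            if label * activation <= 0:
                # Misclassified: move the hyperplane toward the example.
                weights = [w + label * xi for w, xi in zip(weights, x)]
    return weights


def average_hyperplanes(weight_vectors):
    """Average the weight vectors trained on the individual splits."""
    n = len(weight_vectors)
    return [sum(column) / n for column in zip(*weight_vectors)]


def predict(weights, features):
    x = list(features) + [1.0]
    return 1 if sum(w * xi for w, xi in zip(weights, x)) > 0 else -1


data = [((2.0, 1.0), 1), ((1.0, 2.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1),
        ((3.0, 1.0), 1), ((1.0, 3.0), 1), ((-3.0, -1.0), -1), ((-1.0, -3.0), -1)]
splits = [data[:4], data[4:]]  # in a real job each split would go to its own worker
averaged = average_hyperplanes([train_perceptron(split) for split in splits])
predictions = [predict(averaged, features) for features, _ in data]
print(averaged, predictions)
```

Averaging is cheap because each split only has to ship one weight vector back, but it is not guaranteed to separate the full data set even when every split is separable on its own.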

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/misc/testing.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/misc/testing.md b/website/old_site_migration/needs_work_convenience/map-reduce/misc/testing.md
deleted file mode 100644
index dc3fd43..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/misc/testing.md
+++ /dev/null
@@ -1,46 +0,0 @@
----
-layout: default
-title: Testing
-theme:
-    name: retro-mahout
----
-<a name="Testing-Intro"></a>
-# Intro
-
-As Mahout matures, solid testing procedures are needed.  This page and its
-children capture test plans along with ideas for improving our testing.
-
-<a name="Testing-TestPlans"></a>
-# Test Plans
-
-* [0.6](0.6.html)
- - Test Plans for the 0.6 release
-There are no special plans except for unit tests, and user testing of the
-Hadoop jobs.
-
-<a name="Testing-TestIdeas"></a>
-# Test Ideas
-
-<a name="Testing-Regressions/Benchmarks/Integrations"></a>
-## Regressions/Benchmarks/Integrations
-* Algorithmic quality and speed are not tested, except in a few instances.
-Such tests often require much longer run times (minutes to hours), a
-running Hadoop cluster, and downloads of large datasets (in the megabytes). 
-* Standardized speed tests are difficult on different hardware. 
-* Unit tests of external integrations require access to externals: HDFS,
-S3, JDBC, Cassandra, etc. 
-
-Apache Jenkins is not able to support these environments. Commercial
-donations would help. 
-
-<a name="Testing-UnitTests"></a>
-## Unit Tests
-Mahout's current tests are almost entirely unit tests. Algorithm tests
-generally supply a few numbers to code paths and verify that expected
-numbers come out. 'mvn test' runs these tests. There is "positive" coverage
-of a great many utilities and algorithms. A much smaller percentage includes
-"negative" coverage (bogus setups, inputs, combinations).
-
-<a name="Testing-Other"></a>
-## Other
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md b/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
deleted file mode 100644
index 57378ba..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
+++ /dev/null
@@ -1,222 +0,0 @@
----
-layout: default
-title: Using Mahout with Python via JPype
-theme:
-    name: retro-mahout
----
-
-<a name="UsingMahoutwithPythonviaJPype-overview"></a>
-# Mahout over Jython - some examples
-This tutorial provides some sample code illustrating how we can read and
-write sequence files containing Mahout vectors from Python using JPype.
-This tutorial is intended for people who want to use Python for analyzing
-and plotting Mahout data. Using Mahout from Python turns out to be quite
-easy.
-
-This tutorial concerns the use of CPython as opposed to Jython.
-Jython wasn't an option for me, because (to the best of my knowledge)
-Jython doesn't work with the Python extensions numpy, matplotlib, or h5py
-which I rely on heavily.
-
-The instructions below explain how to set up a Python script to read and
-write the output of Mahout clustering.
-
-You will first need to download and install the JPype package for python.
-
-The first step in setting up JPype is determining the path to the dynamic
-library for the JVM; on Linux this will be a .so file and on Windows it
-will be a .dll.
-
-In your Python script, create a global variable with the path to this library
-
-
-
-Next we need to figure out how we need to set the classpath for mahout. The
-easiest way to do this is to edit the script in "bin/mahout" to print out
-the classpath. Add the line "echo $CLASSPATH" to the script somewhere after
-the comment "run it" (this is line 195 or so). Execute the script to print
-out the classpath.  Copy this output and paste it into a variable in your
-python script. The result for me looks like the following
-
-
-
-
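-Now we can create a function to start the JVM in Python using JPype
-
-    import numpy as np
-    from jpype import startJVM, JPackage, JArray, JDouble, JObject
-
-    jvm=None
-    def start_jpype():
-        #jvmlib and classpath are the global variables described above
-        global jvm
-        if (jvm is None):
-            cpopt="-Djava.class.path={cp}".format(cp=classpath)
-            startJVM(jvmlib,"-ea",cpopt)
-            jvm="started"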
-
-
-
-<a name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a>
-# Writing Named Vectors to Sequence Files from Python
-We can now use JPype to create sequence files which will contain vectors to
-be used by Mahout for kmeans. The example below is a function which creates
-vectors from two Gaussian distributions with unit variance.
-
-
-    def create_inputs(ifile,*args,**param):
-     """Create a sequence file containing some normally distributed vectors
-       ifile - path to the sequence file to create
-     """
-     
-     #matrix of the cluster means
-     cmeans=np.array([[1,1] ,[-1,-1]],int)
-     
-     nperc=30  #number of points per cluster
-     
-     vecs=[]
-     
-     vnames=[]
-     for cind in range(cmeans.shape[0]):
-      pts=np.random.randn(nperc,2)
-      pts=pts+cmeans[cind,:].reshape([1,cmeans.shape[1]])
-      vecs.append(pts)
-     
-      #names for the vectors
-      #names are just the points with an index
-      #we do this so we can validate by cross-referencing the name with the vector
-      vn=np.empty(nperc,dtype=(str,30))
-      for row in range(nperc):
-       vn[row]="c"+str(cind)+"_"+pts[row,0].astype((str,4))+"_"+pts[row,1].astype((str,4))
-      vnames.append(vn)
-      
-     vecs=np.vstack(vecs)
-     vnames=np.hstack(vnames)
-     
-    
-     #start the jvm
-     start_jpype()
-     
-     #create the sequence file that we will write to
-     io=JPackage("org").apache.hadoop.io 
-     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
-     
-     PathCls=JPackage("org").apache.hadoop.fs.Path
-     path=PathCls(ifile)
-    
-     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
-     conf=ConfCls()
-     
-     fs=FileSystemCls.get(conf)
-     
-     #vector classes
-     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
-     DenseVectorCls=JPackage("org").apache.mahout.math.DenseVector
-     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
-     writer=io.SequenceFile.createWriter(fs, conf, path,io.Text,VectorWritableCls)
-     
-     
-     vecwritable=VectorWritableCls()
-     for row in range(vecs.shape[0]):
-      nvector=NamedVectorCls(DenseVectorCls(JArray(JDouble,1)(vecs[row,:])),vnames[row])
-      #need to wrap key and value because of overloading
-      wrapkey=JObject(io.Text("key "+str(row)),io.Writable)
-      wrapval=JObject(vecwritable,io.Writable)
-      
-      vecwritable.set(nvector)
-      writer.append(wrapkey,wrapval)
-      
-     writer.close()
-
-
-<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a>
-# Reading the KMeans Clustered Points from Python
-Similarly we can use JPype to easily read the clustered points outputted by
-mahout.
-
-    def read_clustered_pts(ifile,*args,**param):
-     """Read the clustered points
-     ifile - path to the sequence file containing the clustered points
-     """ 
-    
-     #start the jvm
-     start_jpype()
-     
-     #open the sequence file that we will read from
-     io=JPackage("org").apache.hadoop.io 
-     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
-     
-     PathCls=JPackage("org").apache.hadoop.fs.Path
-     path=PathCls(ifile)
-    
-     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
-     conf=ConfCls()
-     
-     fs=FileSystemCls.get(conf)
-     
-     #vector classes
-     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
-     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
-     
-     
-     ReaderCls=io.__getattribute__("SequenceFile$Reader") 
-     reader=ReaderCls(fs, path,conf)
-     
-    
-     key=reader.getKeyClass()()
-     
-    
-     valcls=reader.getValueClass()
-     vecwritable=valcls()
-     while (reader.next(key,vecwritable)):     
-      weight=vecwritable.getWeight()
-      nvec=vecwritable.getVector()
-      
-      cname=nvec.__class__.__name__
-      if (cname.rsplit('.',1)[1]=="NamedVector"):  
-       print("cluster={key} Name={name} x={x} y={y}".format(key=key.toString(),name=nvec.getName(),x=nvec.get(0),y=nvec.get(1)))
-      else:
-       raise NotImplementedError("Vector isn't a NamedVector. Need to modify/test the code to handle this case.")
-
-
-<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a>
-# Reading the KMeans Centroids
-Finally we can create a function to print out the actual cluster centers
-found by mahout,
-
-    def getClusters(ifile,*args,**param):
-     """Read the centroids from the clusters output by kmeans
-          ifile - Path to the sequence file containing the centroids
-     """ 
-    
-     #start the jvm
-     start_jpype()
-     
-     #open the sequence file that we will read from
-     io=JPackage("org").apache.hadoop.io 
-     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
-     
-     PathCls=JPackage("org").apache.hadoop.fs.Path
-     path=PathCls(ifile)
-    
-     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
-     conf=ConfCls()
-     
-     fs=FileSystemCls.get(conf)
-     
-     #vector classes
-     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
-     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
-     ReaderCls=io.__getattribute__("SequenceFile$Reader")
-     reader=ReaderCls(fs, path,conf)
-     
-    
-     key=io.Text()
-     
-    
-     valcls=reader.getValueClass()
-    
-     vecwritable=valcls()
-     
-     while (reader.next(key,vecwritable)):     
-      center=vecwritable.getCenter()
-      
-      print("id={cid} center={center}".format(cid=vecwritable.getId(),center=center.values))
-      pass
-
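
The page above asks for two global variables, `jvmlib` and `classpath`, but the snippets that defined them did not survive. A minimal sketch of what they might look like follows. Every path here is a hypothetical placeholder, not a value from the original tutorial: use the location of the JVM library on your own system and the classpath printed by your own `bin/mahout`.

```python
import os

# Hypothetical path to the JVM shared library (.so on Linux, .dll on Windows).
jvmlib = "/usr/lib/jvm/default-java/lib/server/libjvm.so"

# Hypothetical classpath entries; replace them with the entries printed by
# the "echo $CLASSPATH" line added to bin/mahout.
classpath_entries = [
    "/opt/mahout/conf",
    "/opt/mahout/mahout-core-job.jar",
]

# Java expects the entries joined by the platform separator (":" or ";").
classpath = os.pathsep.join(classpath_entries)

# This is the option string that start_jpype() passes to startJVM.
cpopt = "-Djava.class.path={cp}".format(cp=classpath)
print(cpopt)
```

Keeping the entries in a list makes it easier to spot a missing jar than a single pasted string does.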

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
deleted file mode 100644
index 2acacd0..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
+++ /dev/null
@@ -1,98 +0,0 @@
----
-layout: default
-title: Introduction to ALS Recommendations with Hadoop
-theme:
-    name: retro-mahout
----
-
-# Introduction to ALS Recommendations with Hadoop
-
-##Overview
-
-Mahout’s ALS recommender is a matrix factorization algorithm that uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It factors the user-to-item matrix *A* into the user-to-feature matrix *U* and the item-to-feature matrix *M*, and it runs the ALS algorithm in a parallel fashion. Details of the algorithm can be found in the following papers: 
-
-* [Large-scale Parallel Collaborative Filtering for
-the Netflix Prize](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
-* [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433) 
-
-This recommendation algorithm can be used on an eCommerce platform to recommend products to customers. Unlike the user- or item-based recommenders, which compute the similarity of users or items to make recommendations, the ALS algorithm uncovers the latent factors that explain the observed user-to-item ratings and tries to find optimal factor weights that minimize the squared error between predicted and actual ratings.
-
-Mahout's ALS recommendation algorithm takes user preferences by item as input and outputs recommended items for each user. The input preference can be either an explicit user rating or implicit feedback such as a user's click on a web page.
-
-One of the strengths of the ALS-based recommender, compared to the user- or item-based recommender, is its ability to handle large sparse data sets and its better prediction performance. It can also give an intuitive rationale for the factors that influence recommendations.
-
-##Implementation
-At present Mahout has a map-reduce implementation of ALS, which is composed of 2 jobs: a parallel matrix factorization job and a recommendation job.
-The matrix factorization job computes the user-to-feature matrix and item-to-feature matrix given the user-to-item ratings. Its input includes: 
-<pre>
-    --input: directory containing files of explicit user to item rating or implicit feedback;
-    --output: output path of the user-feature matrix and feature-item matrix;
-    --lambda: regularization parameter to avoid overfitting;
-    --alpha: confidence parameter only used on implicit feedback
-    --implicitFeedback: boolean flag to indicate whether the input dataset contains implicit feedback;
-    --numFeatures: dimensions of feature space;
-    --numThreadsPerSolver: number of threads per solver mapper for concurrent execution;
-    --numIterations: number of iterations
-    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated
-</pre>
-and it outputs the matrices in sequence file format. 
-
-The recommendation job uses the user feature matrix and item feature matrix calculated by the factorization job to compute the top-N recommendations per user. Its input includes:
-<pre>
-    --input: directory containing files of user ids;
-    --output: output path of the recommended items for each input user id;
-    --userFeatures: path to the user feature matrix;
-    --itemFeatures: path to the item feature matrix;
-    --numRecommendations: maximum number of recommendations per user, default is 10;
-    --maxRating: maximum rating available;
-    --numThreads: number of threads per mapper;
-    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated;
-    --userIDIndex: index for user long IDs (necessary if usesLongIDs is true);
-    --itemIDIndex: index for item long IDs (necessary if usesLongIDs is true) 
-</pre>
-and it outputs a list of recommended item ids for each user. The predicted rating between user and item is a dot product of the user's feature vector and the item's feature vector.  
-
-##Example
-
-Let’s look at a simple example of how we could use Mahout’s ALS recommender to recommend items for users. First, you’ll need to get Mahout up and running; the instructions can be found [here](https://mahout.apache.org/users/basics/quickstart.html). After you've ensured Mahout is properly installed, we’re ready to run the example.
-
-**Step 1: Prepare test data**
-
-Similar to Mahout's item based recommender, the ALS recommender relies on the user to item preference data: *userID*, *itemID* and *preference*. The preference could be an explicit numeric rating or a count of actions such as clicks (implicit feedback). Each line of the test data file is a tab-delimited string: the 1st field is the user id, which must be numeric, the 2nd field is the item id, which must be numeric, and the 3rd field is the preference, which should also be a number.
-
-**Note:** You must create IDs that are ordinal positive integers for all user and item IDs. Often this will require you to keep a dictionary
-to map into and out of Mahout IDs. For instance, if the first user has ID "xyz" in your application, this would get a Mahout ID of the integer 1 and so on. The same applies
-to item IDs. Then after recommendations are calculated you will have to translate the Mahout user and item IDs back into your application IDs.
-
-To quickly start, you could specify a text file like the following as the input:
-<pre>
-1      100     1
-1      200     5
-1      400     1
-2      200     2
-2      300     1
-</pre>
-
-**Step 2: Determine parameters**
-
-In addition, users need to determine the dimension of the feature space and the number of iterations to run the alternating least squares algorithm. Using 10 features and 15 iterations is a reasonable default to try first. Optionally a confidence parameter can be set if the input preference is implicit user feedback.  
-
-**Step 3: Run ALS**
-
-Assuming your *JAVA_HOME* is appropriately set and Mahout was installed properly, we’re ready to configure the command. Enter the following command:
-
-    $ mahout parallelALS --input $als_input --output $als_output --lambda 0.1 --implicitFeedback true --alpha 0.8 --numFeatures 2 --numIterations 5  --numThreadsPerSolver 1 --tempDir tmp 
-
-Running the command will execute a series of jobs, the final product of which is written to the output directory specified in the command. The output directory contains 3 sub-directories: *M* stores the item-to-feature matrix, *U* stores the user-to-feature matrix and userRatings stores the users' ratings of the items. The *tempDir* parameter specifies the directory in which to store the intermediate output of the job, such as the matrix output of each iteration and each item's average rating. Using *tempDir* helps with debugging.
-
-**Step 4: Make Recommendations**
-
-Based on the output feature matrices from step 3, we could make recommendations for users. Enter the following command:
-
-     $ mahout recommendfactorized --input $als_recommender_input --userFeatures $als_output/U/ --itemFeatures $als_output/M/ --numRecommendations 1 --output recommendations --maxRating 1
-
-The input user file is a sequence file: the sequence record key is the user id and the value is the user's rated item ids, which will be removed from the recommendations. The output file generated in our simple example will be a text file giving the recommended item ids for each user. 
-Remember to translate the Mahout ids back into your application specific ids. 
-
-There are a variety of parameters for Mahout’s ALS recommender to accommodate custom business requirements; exploring and testing various configurations to suit your needs will doubtless lead to additional questions. Feel free to ask such questions on the [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html).
-
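
Two points from the ALS page above lend themselves to a short sketch: the dictionary that maps application IDs to positive integer Mahout IDs and back, and the predicted rating as a dot product of a user's and an item's feature vectors. The code below is an illustration only. The helper names, the sample events and the factor values are all hypothetical, and the real feature vectors would come from the *U* and *M* output directories.

```python
def build_id_index(app_ids):
    """Assign consecutive positive integers to application IDs in first-seen order."""
    to_mahout = {}
    for app_id in app_ids:
        if app_id not in to_mahout:
            to_mahout[app_id] = len(to_mahout) + 1
    from_mahout = {mahout_id: app_id for app_id, mahout_id in to_mahout.items()}
    return to_mahout, from_mahout


def predict_rating(user_features, item_features):
    """Predicted rating: dot product of the two feature vectors."""
    return sum(u * m for u, m in zip(user_features, item_features))


# Hypothetical application events: (user, item, preference).
events = [("xyz", "book", 1), ("xyz", "lamp", 5), ("abc", "lamp", 2)]
user_index, user_lookup = build_id_index(user for user, _, _ in events)
item_index, item_lookup = build_id_index(item for _, item, _ in events)

# Tab-delimited input lines: userID, itemID, preference.
input_lines = ["{}\t{}\t{}".format(user_index[u], item_index[i], p) for u, i, p in events]

# Hypothetical 2-dimensional factor vectors for one user and one item.
rating = predict_rating([0.5, 2.0], [1.0, 0.25])
print(input_lines, rating, user_lookup[1])
```

The reverse dictionaries (`user_lookup`, `item_lookup`) are what you would use to translate the recommended Mahout IDs back into application IDs.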

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
deleted file mode 100644
index ee2c3e8..0000000
--- a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
+++ /dev/null
@@ -1,54 +0,0 @@
----
-layout: default
-title: Introduction to Item-Based Recommendations with Hadoop
-theme:
-    name: retro-mahout
----
-# Introduction to Item-Based Recommendations with Hadoop
-
-##Overview
-
-Mahout’s item based recommender is a flexible and easily implemented algorithm with a diverse range of applications. The minimalism of the primary input file’s structure and availability of ancillary filtering controls can make sourcing required data and shaping a desired output both efficient and straightforward.
-
-Typical use cases include:
-
-* Recommend products to customers via an eCommerce platform (think: Amazon, Netflix, Overstock)
-* Identify organic sales opportunities
-* Segment users/customers based on similar item preferences
-
-Broadly speaking, Mahout's item-based recommendation algorithm takes as input customer preferences by item and generates an output recommending similar items with a score indicating whether a customer will "like" the recommended item.
-
-One of the strengths of the item based recommender is its adaptability to your business conditions or research interests. For example, there are many available approaches for providing product preference. One such method is to calculate the total orders for a given product for each customer (e.g. Acme Corp has ordered Widget-A 5,678 times) while others rely on user preference captured via the web (e.g. Jane Doe rated a movie as five stars, or gave a product two thumbs up).
-
-Additionally, a variety of methodologies can be implemented to narrow the focus of Mahout's recommendations, such as:
-
-* Exclude low volume or low profitability products from consideration
-* Group customers by segment or market rather than using user/customer level data
-* Exclude zero-dollar transactions, returns or other order types
-* Map product substitutions into the Mahout input (e.g. if WidgetA is a recommended item replace it with WidgetX)
-
-The item based recommender output can be easily consumed by downstream applications (e.g. websites, ERP systems or salesforce automation tools) and is configurable so users can determine the number of item recommendations generated by the algorithm.
-
-## Example
-
-Testing the item based recommender can be a simple and potentially quite 
rewarding endeavor. Whereas the typical sample use case for collaborative 
filtering focuses on utilization of, and integration with, eCommerce platforms 
we can instead look at a potential use case applicable to most businesses (even 
those without a web presence). Let’s look at how a company might use 
Mahout’s item based recommender to identify new sales opportunities for an 
existing customer base. First, you’ll need to get Mahout up and running; the 
instructions can be found 
[here](https://mahout.apache.org/users/basics/quickstart.html). Once Mahout is 
properly installed, we’re ready to run a quick example.
-
-**Step 1: Gather some test data**
-
-Mahout’s item based recommender relies on three key pieces of data: 
*userID*, *itemID* and *preference*. The “users” could be website visitors 
or simply customers that purchase products from your business. Similarly, items 
could be products, product groups or even pages on your website – really 
anything you would want to recommend to a group of users or customers. For our 
example let’s use customer orders as a proxy for preference. A simple count 
of distinct orders by customer and product will work for this example. As you 
explore the item based recommender, you’ll find that the 
preference value can be many things (page clicks, explicit ratings, order 
counts, etc.). Once your test data is gathered, put it in a comma-separated 
*.txt* file with no column headers.
-
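As a rough illustration of Step 1, the sketch below turns raw order records into `userID,itemID,preference` lines, using the order count as the preference. It assumes orders are available as (customerId, productId) pairs; the class and method names are hypothetical and are not part of Mahout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: counts orders per customer and
// product and formats them as "userID,itemID,preference" lines.
final class OrderCountExport {

    /** Each order is a two-element array: {customerId, productId}. */
    static List<String> toPreferenceLines(List<long[]> orders) {
        // TreeMap keeps the output sorted by customer, then by product.
        Map<Long, Map<Long, Integer>> counts = new TreeMap<>();
        for (long[] order : orders) {
            counts.computeIfAbsent(order[0], k -> new TreeMap<>())
                  .merge(order[1], 1, Integer::sum);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Long, Map<Long, Integer>> byUser : counts.entrySet()) {
            for (Map.Entry<Long, Integer> byItem : byUser.getValue().entrySet()) {
                lines.add(byUser.getKey() + "," + byItem.getKey() + "," + byItem.getValue());
            }
        }
        return lines;
    }
}
```

Writing the returned lines to a file, one per line and without a header row, gives input in the shape described above.
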
-**Step 2: Pick a similarity measure**
-
-Choosing a similarity measure for use in a production environment is something 
that requires careful testing, evaluation and research. For our example 
purposes, we’ll just go with a Mahout similarity classname called 
*SIMILARITY_LOGLIKELIHOOD*.
-
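To give a feel for what a log-likelihood based measure looks at, here is a minimal sketch of a log-likelihood ratio over a 2x2 co-occurrence table for two items. It is an illustration under stated assumptions, not Mahout's own code: the class name is hypothetical, and any mapping from the ratio to a final similarity score is left out.

```java
// Illustrative sketch of a log-likelihood ratio over a 2x2 co-occurrence table.
// The class name is hypothetical; this is not Mahout's own code.
final class LlrSketch {

    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized entropy of a set of counts. */
    static double entropy(long... counts) {
        long sum = 0;
        double result = 0.0;
        for (long c : counts) {
            result += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - result;
    }

    /**
     * k11: users who interacted with both items, k12: only with item A,
     * k21: only with item B, k22: with neither.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Guard against tiny negative values caused by rounding.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }
}
```

A ratio near zero means the two items co-occur about as often as chance would predict; larger values indicate a stronger association.
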
-**Step 3: Configure the Mahout command**
-
-Assuming your *JAVA_HOME* is appropriately set and Mahout was installed 
properly, we’re ready to configure the command. Enter the following:
-
-    $ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output --numRecommendations 25
-
-Running the command executes a series of jobs, the final product of which 
is an output file written to the directory specified in the command. 
The output file contains two columns: the *userID* and an array of 
*itemIDs* and scores.
-
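A downstream consumer has to split each output line back into its two columns. The sketch below assumes each line looks like `userID<TAB>[itemID:score,itemID:score,...]`; that exact layout is an assumption and should be checked against your own output files. The class is hypothetical and not part of Mahout.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser, not part of Mahout. It assumes each output line looks
// like "userID<TAB>[itemID:score,itemID:score,...]"; verify this on real output.
final class RecommendationLine {

    static long parseUserId(String line) {
        return Long.parseLong(line.split("\t", 2)[0].trim());
    }

    /** Returns itemID -> score in the order the items appear on the line. */
    static Map<Long, Double> parseItems(String line) {
        String array = line.split("\t", 2)[1].trim();
        // Strip the surrounding square brackets.
        String body = array.substring(1, array.length() - 1);
        Map<Long, Double> scores = new LinkedHashMap<>();
        if (body.isEmpty()) {
            return scores;
        }
        for (String pair : body.split(",")) {
            String[] parts = pair.split(":");
            scores.put(Long.parseLong(parts[0].trim()), Double.parseDouble(parts[1].trim()));
        }
        return scores;
    }
}
```
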
-**Step 4: Making use of the output and doing more with Mahout**
-
-The output file generated in our simple example can be transformed using your 
tool of choice and consumed by downstream applications. There exist a variety 
of configuration options for Mahout’s item based recommender to accommodate 
custom business requirements; exploring and testing various configurations to 
suit your needs will doubtless lead to additional questions. Our user community 
is accessible via our [mailing 
list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html) 
and the book *Mahout In Action* is a fantastic (but slightly outdated) starting 
point. 

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
deleted file mode 100644
index 63de4fd..0000000
--- 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
+++ /dev/null
@@ -1,187 +0,0 @@
----
-layout: default
-title: Matrix Factorization
-theme:
-    name: retro-mahout
----
-<a name="MatrixFactorization-Intro"></a>
-# Introduction to Matrix Factorization for Recommendation Mining
-
-In the mathematical discipline of linear algebra, a matrix decomposition 
-or matrix factorization is a dimensionality reduction technique that 
factorizes a matrix into a product of matrices, usually two. 
-There are many different matrix decompositions, each finds use among a 
particular class of problems.
-
-In Mahout, the SVDRecommender provides an interface for building a recommender 
based on matrix factorization.
-The idea is to project the users and items onto a feature space and to 
optimize U and M so that U \* (M^t) is as close to R as possible:
-
-     U is n * p user feature matrix, 
-     M is m * p item feature matrix, M^t is the transpose of M,
-     R is n * m rating matrix,
-     n is the number of users,
-     m is the number of items,
-     p is the number of features
-
-We usually use RMSE to represent the deviations between predictions and actual 
ratings.
-RMSE is defined as the square root of the mean of the squared errors over all 
known user-item ratings.
-So our matrix factorization target can be defined mathematically as:
-
-     find U and M, (U, M) = argmin(RMSE) = argmin(pow(SSE / K, 0.5))
-     
-     SSE = sum(e(u,i)^2)
-     e(u,i) = r(u, i) - U[u,] * (M[i,]^t) = r(u,i) - sum(U[u,f] * M[i,f]), f = 0, 1, .. p - 1
-     K is the number of known user item ratings.
-
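The RMSE definition above can be sketched in a few lines of plain Java. The class name and the array layout for ratings are assumptions made for illustration; this is not Mahout code.

```java
// Illustrative only: computes RMSE = sqrt(SSE / K) over the known ratings.
final class Rmse {

    /**
     * ratings[k] = {userIndex, itemIndex, rating};
     * U is the n x p user feature matrix, M is the m x p item feature matrix.
     */
    static double rmse(double[][] ratings, double[][] U, double[][] M) {
        double sse = 0.0;
        for (double[] r : ratings) {
            int u = (int) r[0];
            int i = (int) r[1];
            // Predicted rating: dot product of the user and item feature vectors.
            double predicted = 0.0;
            for (int f = 0; f < U[u].length; f++) {
                predicted += U[u][f] * M[i][f];
            }
            double e = r[2] - predicted;
            sse += e * e;
        }
        return Math.sqrt(sse / ratings.length);
    }
}
```
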
-<a name="MatrixFactorization-Factorizers"></a>
-
-Mahout has implemented matrix factorization based on 
-
-    (1) SGD(Stochastic Gradient Descent)
-    (2) ALSWR(Alternating-Least-Squares with Weighted-λ-Regularization).
-
-## SGD
-
-Stochastic gradient descent is a gradient descent optimization method for 
minimizing an objective function that is written as a sum of differentiable 
functions.
-
-       Q(w) = sum(Q_i(w)), 
-
-where w is the parameters to be estimated,
-      Q(w) is the objective function that could be expressed as sum of 
differentiable functions,
-      Q_i(w) is associated with the i-th observation in the data set 
-
-In practice, w is estimated using an iterative method at each single sample 
until an approximate minimum is obtained,
-
-      w = w - alpha * (d(Q_i(w))/dw),
-where alpha is the learning rate,
-      (d(Q_i(w))/dw) is the first derivative of Q_i(w) on w.
-
-In matrix factorization, the RatingSGDFactorizer class implements the SGD with 
w = (U, M) and objective function Q(w) = sum(Q(u,i)),
-
-       Q(u,i) =  sum(e(u,i) * e(u,i)) / 2 + lambda * [(U[u,] * (U[u,]^t)) + (M[i,] * (M[i,]^t))] / 2
-
-where Q(u, i) is the objective function for user u and item i,
-      e(u, i) is the error between predicted rating and actual rating,
-      U[u,] is the feature vector of user u,
-      M[i,] is the feature vector of item i,
-      lambda is the regularization parameter to prevent overfitting.
-
-The algorithm is sketched as follows:
-  
-      init U and M with randomized value between 0.0 and 1.0 with standard Gaussian distribution
-      
-      for(iter = 0; iter < numIterations; iter++)
-      {
-          for(user u and item i with rating R[u,i])
-          {
-              // dot product of feature vectors between user u and item i
-              predicted_rating = U[u,] *  M[i,]^t
-              err = R[u, i] - predicted_rating   // err is e(u,i)
-              //adjust U[u,] and M[i,]
-              // p is the number of features
-              for(f = 0; f < p; f++) {
-                 // optimize U[u,f]; NU holds the new values so that M is updated from the old U
-                 NU[u,f] = U[u,f] - alpha * d(Q(u,i))/d(U[u,f])
-                         = U[u, f] + alpha * (e(u,i) * M[i,f] - lambda * U[u,f])
-              }
-              for(f = 0; f < p; f++) {
-                 // optimize M[i,f]
-                 M[i,f] = M[i,f] - alpha * d(Q(u,i))/d(M[i,f])
-                        = M[i,f] + alpha * (e(u,i) * U[u,f] - lambda * M[i,f])
-              }
-              U[u,] = NU[u,]
-          }
-      }
-
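The loop above can be sketched in plain Java as follows. This is a sketch under stated assumptions, not Mahout's RatingSGDFactorizer: the class name, the ratings layout and the small uniform starting values are all choices made here for illustration.

```java
import java.util.Random;

// A sketch of the SGD loop in plain Java; not Mahout's RatingSGDFactorizer.
final class SgdSketch {

    /** ratings[k] = {userIndex, itemIndex, rating}. Returns {U, M}. */
    static double[][][] factorize(double[][] ratings, int n, int m, int p,
                                  double alpha, double lambda,
                                  int numIterations, long seed) {
        Random random = new Random(seed);
        double[][] U = new double[n][p];
        double[][] M = new double[m][p];
        // Small random starting values (an assumption made for this sketch).
        for (double[] row : U) for (int f = 0; f < p; f++) row[f] = 0.1 * random.nextDouble();
        for (double[] row : M) for (int f = 0; f < p; f++) row[f] = 0.1 * random.nextDouble();

        for (int iter = 0; iter < numIterations; iter++) {
            for (double[] r : ratings) {
                int u = (int) r[0];
                int i = (int) r[1];
                double err = r[2] - dot(U[u], M[i]);
                for (int f = 0; f < p; f++) {
                    double oldU = U[u][f]; // the item update uses the old user value
                    U[u][f] += alpha * (err * M[i][f] - lambda * oldU);
                    M[i][f] += alpha * (err * oldU - lambda * M[i][f]);
                }
            }
        }
        return new double[][][] {U, M};
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int f = 0; f < a.length; f++) sum += a[f] * b[f];
        return sum;
    }
}
```

Because the error is computed once per rating and each feature update reads the other side's old value, the per-feature loop here is equivalent to computing NU first and copying it back afterwards.
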
-## SVD++
-
-SVD++ is an enhancement of the SGD matrix factorization. 
-
-It could be considered as an integration of latent factor model and 
neighborhood based model, considering not only how users rate, but also who has 
rated what. 
-
-The complete model is a sum of 3 sub-models with complete prediction formula 
as follows: 
-    
-    pr(u,i) = b[u,i] + fm + nm   //user u and item i
-    
-    pr(u,i) is the predicted rating of user u on item i,
-    b[u,i] = U + b(u) + b(i)
-    fm = (q[i,]) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])),  j is an item in 
N(u)
-    nm = pow(|R(i;u;k)|, -0.5) * sum((r[u,j0] - b[u,j0]) * w[i,j0]) + 
pow(|N(i;u;k)|, -0.5) * sum(c[i,j1]), j0 is an item in R(i;u;k), j1 is an item 
in N(i;u;k)
-
-The associated regularized squared error function to be minimized is:
-
-    {sum((r[u,i] - pr[u,i]) * (r[u,i] - pr[u,i])) + lambda * (b(u) * b(u) + b(i) * b(i) + ||q[i,]||^2 + ||p[u,]||^2 + sum(||y[j,]||^2) + sum(w[i,j0] * w[i,j0]) + sum(c[i,j1] * c[i,j1]))}
-
-b[u,i] is the baseline estimate of user u's predicted rating on item i. Here U 
denotes the overall average rating (not the user feature matrix), and b(u) and 
b(i) indicate the observed deviations of user u and item i's ratings from average. 
-
-The baseline estimate adjusts for user and item effects, i.e. 
systematic tendencies for some users to give higher ratings than others and 
tendencies
-for some items to receive higher ratings than other items.
-
-fm is the latent factor model, which captures the interactions between user and 
item via a feature layer. q[i,] is the feature vector of item i, and the rest 
of the formula represents user u with a user feature vector and a sum of 
features of items in N(u).
-N(u) is the set of items for which user u has expressed a preference, and 
y[j,] is the feature vector of an item in N(u).
-
-nm is an extension of the classic item-based neighborhood model. 
-It captures not only the user's explicit ratings but also the user's implicit 
preferences. R(i;u;k) is the set of items explicitly rated by user u, 
restricted to the k items most similar to item i. r[u,j0] is the actual rating 
of user u on item j0, and 
-b[u,j0] is the corresponding baseline estimate.
-
-The difference between r[u,j0] and b[u,j0] is weighted by a parameter w[i,j0], 
which can be thought of as the similarity between item i and j0. 
-
-N(i;u;k) is the set of the k items most similar to item i for which the user has expressed a preference.
-c[i,j1] is the parameter to be estimated. 
-
-The value of w[i,j0] and c[i,j1] could be treated as the significance of the 
-user's explicit rating and implicit preference respectively.
-
-The parameters b, p, q, y, w, c are determined by minimizing the 
associated regularized squared error function through gradient descent. We loop 
over all known ratings and, for a given training case r[u,i], apply gradient 
descent on the error function and modify the parameters by moving in the 
opposite direction of the gradient.
-
-For a complete analysis of the SVD++ algorithm,
-please refer to the paper [Yehuda Koren: Factorization Meets the Neighborhood: 
a Multifaceted Collaborative Filtering Model, KDD 
2008](http://research.yahoo.com/files/kdd08koren.pdf).
- 
-In Mahout, the SVDPlusPlusFactorizer class is a simplified implementation of the 
SVD++ algorithm. It mainly uses the latent factor model with item feature 
vector, user feature vector and user's preference, with pr(u,i) = fm = (q[i,]) 
\* (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])) and the parameters to be determined 
are q, p, y. 
-
-The update to q, p, y in each gradient descent step is:
-
-      err(u,i) = r[u,i] - pr[u,i]
-      q[i,] = q[i,] + alpha * (err(u,i) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])) - lambda * q[i,])
-      p[u,] = p[u,] + alpha * (err(u,i) * q[i,] - lambda * p[u,])
-      for j that is an item in N(u):
-         y[j,] = y[j,] + alpha * (err(u,i) * pow(|N(u)|, -0.5) * q[i,] - lambda * y[j,])
-
-where alpha is the learning rate of gradient descent, and N(u) is the set of 
items for which user u has expressed a preference.
-
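The simplified prediction and one gradient step can be sketched in plain Java as below. This is not Mahout's SVDPlusPlusFactorizer; the class and parameter names are hypothetical, and the sketch assumes that all three updates read the values from before the step and that the user has at least one item in N(u).

```java
// Sketch of the simplified SVD++ model; not Mahout's SVDPlusPlusFactorizer.
final class SvdPlusPlusSketch {

    /** pr(u,i) = q[i,] * (p[u,] + |N(u)|^-0.5 * sum of y[j,] for j in N(u)). */
    static double predict(double[] qi, double[] pu, double[][] y, int[] itemsOfUser) {
        double norm = Math.pow(itemsOfUser.length, -0.5);
        double result = 0.0;
        for (int f = 0; f < qi.length; f++) {
            double implicit = 0.0;
            for (int j : itemsOfUser) implicit += y[j][f];
            result += qi[f] * (pu[f] + norm * implicit);
        }
        return result;
    }

    /** One gradient step for a single known rating; updates qi, pu and y in place. */
    static void step(double rating, double[] qi, double[] pu, double[][] y,
                     int[] itemsOfUser, double alpha, double lambda) {
        double err = rating - predict(qi, pu, y, itemsOfUser);
        double norm = Math.pow(itemsOfUser.length, -0.5);
        for (int f = 0; f < qi.length; f++) {
            double implicit = 0.0;
            for (int j : itemsOfUser) implicit += y[j][f];
            double oldQ = qi[f]; // assumption: updates use the pre-step values
            qi[f] += alpha * (err * (pu[f] + norm * implicit) - lambda * oldQ);
            pu[f] += alpha * (err * oldQ - lambda * pu[f]);
            for (int j : itemsOfUser) {
                y[j][f] += alpha * (err * norm * oldQ - lambda * y[j][f]);
            }
        }
    }
}
```
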
-## Parallel SGD
-
-Mahout has a parallel SGD implementation in ParallelSGDFactorizer class. It 
shuffles the user ratings in every iteration and 
-generates splits on the shuffled ratings. Each split is handled by a thread to 
update the user features and item features using 
-vanilla SGD. 
-
-The implementation is based on the lock-free version of SGD described in the 
paper 
-[Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient 
Descent](http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf).
-
-## ALSWR
-
-ALSWR is an iterative algorithm to solve the low rank factorization of user 
feature matrix U and item feature matrix M.  
-The loss function to be minimized is formulated as the sum of squared errors 
plus [Tikhonov 
regularization](http://en.wikipedia.org/wiki/Tikhonov_regularization):
-
-     L(R, U, M) = sum(pow((R[u,i] - U[u,]* (M[i,]^t)), 2)) + lambda * (sum(n(u) * ||U[u,]||^2) + sum(n(i) * ||M[i,]||^2))
- 
-At the beginning of the algorithm, M is initialized with the average item 
ratings as its first row and random numbers for the remaining rows.  
-
-In every iteration, we fix M and solve for U by minimizing the cost function 
L(R, U, M), then we fix U and solve for M by minimizing 
-the cost function similarly. The iteration continues until a stopping 
criterion is met.
-
-To solve for the matrix U when M is given, each user's feature vector is 
calculated by solving a regularized linear least squares problem 
-using the items the user has rated and their feature vectors:
-
-      1/2 * d(L(R,U,M)) / d(U[u,f]) = 0 
-
-Similarly, when M is updated, we solve a regularized linear least squares 
problem for each item using the feature vectors of the users that have rated the 
-item:
-
-      1/2 * d(L(R,U,M)) / d(M[i,f]) = 0
-
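Setting the derivative with respect to one user's feature vector to zero gives a small linear system, (sum of M[i,]^t M[i,] over rated items + lambda * n(u) * I) x = sum of R[u,i] * M[i,]^t. The sketch below solves that system for a single user with Gauss-Jordan elimination. It is an illustration only, not Mahout's ALSWRFactorizer; the class name and argument layout are hypothetical.

```java
// Sketch of the per-user least squares solve in ALS-WR; not Mahout code.
final class AlsUserSolve {

    /**
     * itemFeatures[k] is the feature vector of the k-th item the user rated,
     * ratings[k] is the user's rating of that item.
     * Returns the user's feature vector.
     */
    static double[] solveUser(double[][] itemFeatures, double[] ratings, double lambda) {
        int p = itemFeatures[0].length;
        int nu = ratings.length; // n(u): number of items rated by this user
        double[][] a = new double[p][p + 1]; // augmented matrix [A | b]
        for (int k = 0; k < nu; k++) {
            for (int r = 0; r < p; r++) {
                for (int c = 0; c < p; c++) {
                    a[r][c] += itemFeatures[k][r] * itemFeatures[k][c];
                }
                a[r][p] += itemFeatures[k][r] * ratings[k];
            }
        }
        for (int d = 0; d < p; d++) a[d][d] += lambda * nu;

        // Gauss-Jordan elimination with partial pivoting. With lambda = 0 the
        // system can be singular; this sketch does not guard against that.
        for (int col = 0; col < p; col++) {
            int pivot = col;
            for (int r = col + 1; r < p; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = 0; r < p; r++) {
                if (r == col) continue;
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= p; c++) a[r][c] -= factor * a[col][c];
            }
        }
        double[] x = new double[p];
        for (int d = 0; d < p; d++) x[d] = a[d][p] / a[d][d];
        return x;
    }
}
```

The item-side update has the same shape with the roles of users and items swapped.
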
-The ALSWRFactorizer class is a non-distributed implementation of ALSWR using 
multi-threading to dispatch the computation among several threads.
-Mahout also offers a [parallel map-reduce 
implementation](https://mahout.apache.org/users/recommender/intro-als-hadoop.html).
-
-<a name="MatrixFactorization-Reference"></a>
-# Reference:
-
-[Stochastic gradient 
descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
-    
-[ALSWR](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
deleted file mode 100644
index 8ba5b28..0000000
--- 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
+++ /dev/null
@@ -1,277 +0,0 @@
----
-layout: default
-title: Recommender Documentation
-theme:
-    name: retro-mahout
----
-
-<a name="RecommenderDocumentation-Overview"></a>
-## Overview
-
-_This documentation concerns the non-distributed, non-Hadoop-based
-recommender engine / collaborative filtering code inside Mahout. It was
-formerly a separate project called "Taste" and has continued development
-inside Mahout alongside other Hadoop-based code. It may be viewed as a
-somewhat separate, more comprehensive and more mature aspect of this
-code, compared to current development efforts focusing on Hadoop-based
-distributed recommenders. This remains the best entry point into Mahout
-recommender engines of all kinds._
-
-A Mahout-based collaborative filtering engine takes users' preferences for
-items ("tastes") and returns estimated preferences for other items. For
-example, a site that sells books or CDs could easily use Mahout to figure
-out, from past purchase data, which CDs a customer might be interested in
-listening to.
-
-Mahout provides a rich set of components from which you can construct a
-customized recommender system from a selection of algorithms. Mahout is
-designed to be enterprise-ready; it's designed for performance, scalability
-and flexibility.
-
-Top-level packages define the Mahout interfaces to these key abstractions:
-
-* **DataModel**
-* **UserSimilarity**
-* **ItemSimilarity**
-* **UserNeighborhood**
-* **Recommender**
-
-Subpackages of *org.apache.mahout.cf.taste.impl* hold implementations of
-these interfaces. These are the pieces from which you will build your own
-recommendation engine. That's it! 
-
-<a name="RecommenderDocumentation-Architecture"></a>
-## Architecture
-
-![doc](../../images/taste-architecture.png)
-
-This diagram shows the relationship between various Mahout components in a
-user-based recommender. An item-based recommender system is similar except
-that there are no Neighborhood algorithms involved.
-
-<a name="RecommenderDocumentation-Recommender"></a>
-### Recommender
-A Recommender is the core abstraction in Mahout. Given a DataModel, it can
-produce recommendations. Applications will most likely use the
-**GenericUserBasedRecommender** or **GenericItemBasedRecommender**,
-possibly decorated by **CachingRecommender**.
-
-<a name="RecommenderDocumentation-DataModel"></a>
-### DataModel
-A **DataModel** is the interface to information about user preferences. An
-implementation might draw this data from any source, but a database is the
-most likely source. Be sure to wrap this with a **ReloadFromJDBCDataModel** to 
get good performance! Mahout provides **MySQLJDBCDataModel**, for example, to 
access preference data from a database via JDBC and MySQL. Another exists for 
PostgreSQL. Mahout also provides a **FileDataModel**, which is fine for small 
applications.
-
-Users and items are identified solely by an ID value in the
-framework. Further, this ID value must be numeric; it is a Java long type
-through the APIs. A **Preference** object or **PreferenceArray** object
-encapsulates the relation between user and preferred items (or items and
-users preferring them).
-
-Finally, Mahout supports, in various ways, a so-called "boolean" data model
-in which users do not express preferences of varying strengths for items,
-but simply express an association or none at all. For example, while users
-might express a preference from 1 to 5 in the context of a movie
-recommender site, there may be no notion of a preference value between
-users and pages in the context of recommending pages on a web site: there
-is only a notion of an association, or none, between a user and pages that
-have been visited.
-
-<a name="RecommenderDocumentation-UserSimilarity"></a>
-### UserSimilarity
-A **UserSimilarity** defines a notion of similarity between two users. This is
-a crucial part of a recommendation engine. These are attached to a
-**Neighborhood** implementation. **ItemSimilarity** is analogous, but finds
-similarity between items.
-
-<a name="RecommenderDocumentation-UserNeighborhood"></a>
-### UserNeighborhood
-In a user-based recommender, recommendations are produced by finding a
-"neighborhood" of similar users near a given user. A **UserNeighborhood**
-defines a means of determining that neighborhood &mdash; for example,
-nearest 10 users. Implementations typically need a **UserSimilarity** to
-operate.
-
-<a name="RecommenderDocumentation-Examples"></a>
-## Examples
-<a name="RecommenderDocumentation-User-basedRecommender"></a>
-### User-based Recommender
-User-based recommenders are the "original", conventional style of
-recommender systems. They can produce good recommendations when tweaked
-properly; they are not necessarily the fastest recommender systems and are
-thus suitable for small data sets (roughly, less than ten million ratings).
-We'll start with an example of this.
-
-First, create a **DataModel** of some kind. Here, we'll use a simple one based
-on data in a file. The file should be in CSV format, with lines of the form
-"userID,itemID,prefValue" (e.g. "39505,290002,3.5"):
-
-
-    DataModel model = new FileDataModel(new File("data.txt"));
-
-
-We'll use the **PearsonCorrelationSimilarity** implementation of 
**UserSimilarity**
-as our user correlation algorithm:
-
-
-    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
-
-
-Now we create a **UserNeighborhood** algorithm. Here we use nearest-3:
-
-
-    UserNeighborhood neighborhood =
-         new NearestNUserNeighborhood(3, userSimilarity, model);
-    
-Now we can create our **Recommender**, and add a caching decorator:
-    
-
-    Recommender recommender =
-         new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
-    Recommender cachingRecommender = new CachingRecommender(recommender);
-
-    
-Now we can get 10 recommendations for user ID "1234" &mdash; done!
-
-    List<RecommendedItem> recommendations =
-         cachingRecommender.recommend(1234, 10);
-
-    
-### Item-based Recommender
-    
-We could have created an item-based recommender instead. Item-based
-recommenders base recommendation not on user similarity, but on item
-similarity. In theory these are about the same approach to the problem,
-just from different angles. However, the similarity of two items is
-relatively fixed, more so than the similarity of two users. So, item-based
-recommenders can use pre-computed similarity values in the computations,
-which makes them much faster. For large data sets, item-based recommenders
-are more appropriate.
-    
-Let's start over, again with a **FileDataModel** to start:
-    
-
-    DataModel model = new FileDataModel(new File("data.txt"));
-
-    
-We'll also need an **ItemSimilarity**. We could use
-**PearsonCorrelationSimilarity**, which computes item similarity in real time,
-but this is generally too slow to be useful. Instead, in a real
-application, you would feed a list of pre-computed correlations to a
-**GenericItemSimilarity**: 
-    
-
-    // Construct the list of pre-computed correlations
-    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
-         ...;
-    ItemSimilarity itemSimilarity =
-         new GenericItemSimilarity(correlations);
-
-
-    
-Then we can finish as before to produce recommendations:
-    
-
-    Recommender recommender =
-         new GenericItemBasedRecommender(model, itemSimilarity);
-    Recommender cachingRecommender = new CachingRecommender(recommender);
-    ...
-    List<RecommendedItem> recommendations =
-         cachingRecommender.recommend(1234, 10);
-
-
-<a name="RecommenderDocumentation-Integrationwithyourapplication"></a>
-## Integration with your application
-
-You can create a Recommender, as shown above, wherever you like in your
-Java application, and use it. This includes simple Java applications or GUI
-applications, server applications, and J2EE web applications.
-
-<a name="RecommenderDocumentation-Performance"></a>
-## Performance
-<a name="RecommenderDocumentation-RuntimePerformance"></a>
-### Runtime Performance
-The more data you give, the better. Though Mahout is designed for
-performance, you will undoubtedly run into performance issues at some
-point. For best results, consider using the following command-line flags to
-your JVM:
-
-* -server: Enables the server VM, which is generally appropriate for
-long-running, computation-intensive applications.
-* -Xms1024m -Xmx1024m: Make the heap as big as possible -- a gigabyte
-doesn't hurt when dealing with tens of millions of preferences. Mahout
-recommenders will generally use as much memory as you give them for caching,
-which helps performance. Set the initial and max size to the same value to
-avoid wasting time growing the heap, and to avoid having the JVM run minor
-collections to avoid growing the heap, which will clear cached values.
-* -da -dsa: Disable all assertions.
-* -XX:NewRatio=9: Increase heap allocated to 'old' objects, which is most
-of them in this framework
-* -XX:+UseParallelGC -XX:+UseParallelOldGC (multi-processor machines only):
-Use a GC algorithm designed to take advantage of multiple processors, and
-designed for throughput. This is a default in J2SE 5.0.
-* -XX:+DisableExplicitGC: Disable calls to System.gc(). These calls can
-only hurt in the presence of modern GC algorithms; they may force Mahout to
-remove cached data needlessly. This flag isn't needed if you're sure your
-code and third-party code you use doesn't call this method.
-
-Also consider the following tips:
-
-* Use **CachingRecommender** on top of your custom **Recommender** 
implementation.
-* When using **JDBCDataModel**, make sure you wrap it with the 
**ReloadFromJDBCDataModel** to load data into memory!
-
-<a name="RecommenderDocumentation-AlgorithmPerformance:WhichOneIsBest?"></a>
-### Algorithm Performance: Which One Is Best?
-There is no right answer; it depends on your data, your application,
-environment, and performance needs. Mahout provides the building blocks
-from which you can construct the best Recommender for your application. The
-links below provide research on this topic. You will probably need a bit of
-trial-and-error to find a setup that works best. The code sample above
-provides a good starting point.
-
-Fortunately, Mahout provides a way to evaluate the accuracy of your
-Recommender on your own data, in the org.apache.mahout.cf.taste.eval package:
-
-
-    DataModel myModel = ...;
-    RecommenderBuilder builder = new RecommenderBuilder() {
-      public Recommender buildRecommender(DataModel model) {
-        // build and return the Recommender to evaluate here
-      }
-    };
-    RecommenderEvaluator evaluator =
-         new AverageAbsoluteDifferenceRecommenderEvaluator();
-    double evaluation = evaluator.evaluate(builder, null, myModel, 0.9, 1.0);
-
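To make the resulting score easier to interpret, here is a minimal sketch of an average absolute difference between actual and estimated preferences. It only illustrates the metric; the class name is hypothetical, and skipping estimates that could not be computed (NaN) is an assumption of this sketch rather than a statement about Mahout's evaluator.

```java
// Illustrates the metric only; not Mahout's evaluator implementation.
final class AverageAbsoluteDifference {

    /** Mean of |actual - estimated| over the pairs that have an estimate. */
    static double score(double[] actual, double[] estimated) {
        double total = 0.0;
        int counted = 0;
        for (int k = 0; k < actual.length; k++) {
            if (Double.isNaN(estimated[k])) {
                continue; // no estimate could be produced for this preference
            }
            total += Math.abs(actual[k] - estimated[k]);
            counted++;
        }
        return counted == 0 ? Double.NaN : total / counted;
    }
}
```

Lower is better: a score of 0.0 would mean every held-out preference was estimated exactly.
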
-
-For "boolean" data model situations, where there are no notions of
-preference value, the above evaluation based on estimated preference does
-not make sense. In this case, try a *RecommenderIRStatsEvaluator*, which 
presents
-traditional information retrieval figures like precision and recall, which
-are more meaningful.
-
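For reference, precision and recall over a list of recommended items can be sketched as follows. The class is hypothetical and only shows the two figures; how the set of relevant items is chosen for each user is left open here.

```java
import java.util.List;
import java.util.Set;

// Illustrates precision and recall only; not Mahout's IR stats evaluator.
final class IrStatsSketch {

    static int hits(List<Long> recommended, Set<Long> relevant) {
        int hits = 0;
        for (Long item : recommended) {
            if (relevant.contains(item)) hits++;
        }
        return hits;
    }

    /** Fraction of recommended items that are relevant. */
    static double precision(List<Long> recommended, Set<Long> relevant) {
        return recommended.isEmpty()
            ? Double.NaN : (double) hits(recommended, relevant) / recommended.size();
    }

    /** Fraction of relevant items that were recommended. */
    static double recall(List<Long> recommended, Set<Long> relevant) {
        return relevant.isEmpty()
            ? Double.NaN : (double) hits(recommended, relevant) / relevant.size();
    }
}
```
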
-
-<a name="RecommenderDocumentation-UsefulLinks"></a>
-## Useful Links
-
-
-Here's a handful of research papers that I've read and found particularly
-useful:
-
-J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive 
Algorithms for Collaborative 
Filtering](http://research.microsoft.com/research/pubs/view.aspx?tr_id=166)
-," in Proceedings of the Fourteenth Conference on Uncertainty in
-Artificial Intelligence (UAI 1998), 1998.
-
-B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative 
filtering recommendation algorithms](http://www10.org/cdrom/papers/519/)
-" in Proceedings of the Tenth International Conference on the World Wide
-Web (WWW 10), pp. 285-295, 2001.
-
-P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an 
open architecture for collaborative filtering of 
netnews](http://doi.acm.org/10.1145/192844.192905)
-" in Proceedings of the 1994 ACM conference on Computer Supported
-Cooperative Work (CSCW 1994), pp. 175-186, 1994.
-
-J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic 
framework for performing collaborative 
filtering](http://www.grouplens.org/papers/pdf/algs.pdf)
-" in Proceedings of the 22nd annual international ACM SIGIR Conference on
-Research and Development in Information Retrieval (SIGIR 99), pp. 230-237,
-1999.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
deleted file mode 100644
index 2b090e6..0000000
--- 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
+++ /dev/null
@@ -1,54 +0,0 @@
----
-layout: default
-title: Recommender First-Timer FAQ
-theme:
-    name: retro-mahout
----
-
-# Recommender First Timer Dos and Don'ts
-
-Many people with an interest in recommenders arrive at Mahout since they're
-building a first recommender system. Some starting questions have been
-asked enough times to warrant a FAQ collecting advice and rules-of-thumb to
-newcomers.
-
-For the interested, these topics are treated in detail in the book [Mahout in 
Action](http://manning.com/owen/).
-
-Don't start with a distributed, Hadoop-based recommender; take on that
-complexity only if necessary. Start with non-distributed recommenders. It
-is simpler, has fewer requirements, and is more flexible. 
-
-As a crude rule of thumb, a system with up to 100M user-item associations
-(ratings, preferences) should "fit" onto one modern server machine with 4GB
-of heap available and run acceptably as a real-time recommender. The system
-is invariably memory-bound since keeping data in memory is essential to
-performance.
-
-Beyond this point it gets expensive to deploy a machine with enough RAM,
-so designing for a distributed system makes sense when nearing this scale.
-However, most applications don't "really" have 100M associations to process.
-Data can be sampled; noisy and old data can often be aggressively pruned
-without significant impact on the result.
-
-The next question is whether or not your system has preference values, or
-ratings. Do users and items merely have an association or not, such as the
-existence or lack of a click? Or is behavior translated into some scalar
-value representing the user's degree of preference for the item?
-
-If you have ratings, then a good place to start is a
-GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity
-metric. If you don't have ratings, then a good place to start is
-GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
-
-If you want to do content-based item-item similarity, you need to implement
-your own ItemSimilarity.
-
-If your data can be simply exported to a CSV file, use FileDataModel and
-push new files periodically.
-If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref"
-counterpart if appropriate, or its PostgreSQL counterpart, etc.) and wrap it
-in a ReloadFromJDBCDataModel.
-
-This should give a reasonable starter system which responds fast. The
-nature of the system is that new data comes in from the file or database
-only periodically -- perhaps on the order of minutes. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
deleted file mode 100644
index da17b38..0000000
--- 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
+++ /dev/null
@@ -1,133 +0,0 @@
----
-layout: default
-title: User Based Recommender in 5 Minutes
-theme:
-    name: retro-mahout
----
-
-# Creating a User-Based Recommender in 5 minutes
-
-## Prerequisites
-
-Create a Java project in your favorite IDE and make sure Mahout is on the 
classpath. The easiest way to accomplish this is by importing it via Maven as 
described on the [Quickstart](/users/basics/quickstart.html) page.
-
-
-## Dataset
-
-Mahout's recommenders expect interactions between users and items as input. 
The easiest way to supply such data to Mahout is in the form of a textfile, 
where every line has the format *userID,itemID,value*. Here *userID* and 
*itemID* refer to a particular user and a particular item, and *value* denotes 
the strength of the interaction (e.g. the rating given to a movie).
-
-In this example, we'll use some made-up data for simplicity. Create a file 
called "dataset.csv" and copy the following example interactions into the file. 
-
-<pre>
-1,10,1.0
-1,11,2.0
-1,12,5.0
-1,13,5.0
-1,14,5.0
-1,15,4.0
-1,16,5.0
-1,17,1.0
-1,18,5.0
-2,10,1.0
-2,11,2.0
-2,15,5.0
-2,16,4.5
-2,17,1.0
-2,18,5.0
-3,11,2.5
-3,12,4.5
-3,13,4.0
-3,14,3.0
-3,15,3.5
-3,16,4.5
-3,17,4.0
-3,18,5.0
-4,10,5.0
-4,11,5.0
-4,12,5.0
-4,13,0.0
-4,14,2.0
-4,15,3.0
-4,16,1.0
-4,17,4.0
-4,18,1.0
-</pre>
-
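To make the *userID,itemID,value* format concrete, here is a minimal sketch in plain Java (no Mahout dependency) that parses such lines. The class `InteractionParser` and its nested `Interaction` type are hypothetical names introduced for illustration only; Mahout's *FileDataModel* does its own parsing and does not need this code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InteractionParser {

    /** Hypothetical holder for one line of the file; not a Mahout class. */
    public static final class Interaction {
        public final long userId;
        public final long itemId;
        public final float value;

        Interaction(long userId, long itemId, float value) {
            this.userId = userId;
            this.itemId = itemId;
            this.value = value;
        }

        @Override
        public String toString() {
            return "user=" + userId + " item=" + itemId + " value=" + value;
        }
    }

    /** Parses a single "userID,itemID,value" line. */
    public static Interaction parseLine(String line) {
        String[] parts = line.trim().split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected userID,itemID,value but got: " + line);
        }
        return new Interaction(
                Long.parseLong(parts[0].trim()),
                Long.parseLong(parts[1].trim()),
                Float.parseFloat(parts[2].trim()));
    }

    /** Parses many lines, skipping blank ones. */
    public static List<Interaction> parseAll(List<String> lines) {
        List<Interaction> result = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                result.add(parseLine(line));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Three lines taken from the example dataset above.
        List<String> lines = Arrays.asList("1,10,1.0", "2,16,4.5", "4,13,0.0");
        for (Interaction interaction : parseAll(lines)) {
            System.out.println(interaction);
        }
    }
}
```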
-## Creating a user-based recommender
-
-Create a class called *SampleRecommender* with a main method.
-
-The first thing we have to do is load the data from the file. Mahout's 
recommenders use an interface called *DataModel* to handle interaction data. 
You can load our made-up interactions like this:
-
-<pre>
-DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
-</pre>
-
-In this example, we want to create a user-based recommender. The idea behind 
this approach is that when we want to compute recommendations for a particular 
user, we look for other users with similar taste and pick the 
recommendations from their items. For finding similar users, we have to compare 
their interactions. There are several methods for doing this. One popular 
method is to compute the [correlation 
coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)
 between their interactions. In Mahout, you use this method as follows:
-
-<pre>
-UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
-</pre>
-
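As a rough illustration of what a Pearson-based user similarity measures, the following plain-Java sketch computes the correlation coefficient over the items that two users have both rated. It is not Mahout's implementation; the class name `PearsonSketch` and the map-based representation (itemID to value) are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;

public class PearsonSketch {

    /**
     * Pearson correlation over the items present in both maps.
     * Returns NaN when fewer than two items overlap or when either
     * user's co-rated values have zero variance.
     */
    public static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other == null) {
                continue; // item not rated by both users
            }
            double x = entry.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN;
        }
        double covariance = sumAB - sumA * sumB / n;
        double varianceA = sumAA - sumA * sumA / n;
        double varianceB = sumBB - sumB * sumB / n;
        double denominator = Math.sqrt(varianceA * varianceB);
        return denominator == 0.0 ? Double.NaN : covariance / denominator;
    }

    public static void main(String[] args) {
        // Ratings of users 1 and 2 from the example dataset (a subset of items).
        Map<Long, Double> user1 = new HashMap<>();
        user1.put(10L, 1.0);
        user1.put(11L, 2.0);
        user1.put(15L, 4.0);
        user1.put(16L, 5.0);
        Map<Long, Double> user2 = new HashMap<>();
        user2.put(10L, 1.0);
        user2.put(11L, 2.0);
        user2.put(15L, 5.0);
        user2.put(16L, 4.5);
        System.out.println(pearson(user1, user2));
    }
}
```

Values close to 1 indicate that the two users rate their shared items in a similar pattern, while values close to -1 indicate opposite tastes.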
-The next thing we have to do is to define which similar users we want to 
leverage for the recommender. For the sake of simplicity, we'll use all that 
have a similarity greater than *0.1*. This is implemented via a 
*ThresholdUserNeighborhood*:
-
-<pre>
-UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, 
similarity, model);
-</pre>
-
-Now we have all the pieces to create our recommender:
-
-<pre>
-UserBasedRecommender recommender = new GenericUserBasedRecommender(model, 
neighborhood, similarity);
-</pre>
-        
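The idea of combining a similarity threshold with the neighbours' ratings can be sketched in plain Java as a similarity-weighted average. This is an illustration under stated assumptions, not Mahout's code: the class `WeightedEstimateSketch`, the array-based inputs, and the use of NaN to mark "neighbour has not rated this item" are all hypothetical, and the threshold is assumed to be positive.

```java
public class WeightedEstimateSketch {

    /**
     * Estimates a user's value for one item from other users.
     *
     * @param similarities similarity of each other user to the target user
     * @param ratings      each other user's value for the item, NaN if unrated
     * @param threshold    only users with similarity greater than this count
     * @return similarity-weighted average, or NaN if no neighbour qualifies
     */
    public static double estimate(double[] similarities, double[] ratings, double threshold) {
        if (similarities.length != ratings.length) {
            throw new IllegalArgumentException("Arrays must have the same length");
        }
        double weightedSum = 0.0;
        double totalWeight = 0.0;
        for (int i = 0; i < similarities.length; i++) {
            boolean usable = !Double.isNaN(similarities[i])
                    && !Double.isNaN(ratings[i])
                    && similarities[i] > threshold;
            if (usable) {
                weightedSum += similarities[i] * ratings[i];
                totalWeight += similarities[i];
            }
        }
        return totalWeight == 0.0 ? Double.NaN : weightedSum / totalWeight;
    }

    public static void main(String[] args) {
        double[] similarities = {0.5, 0.05, 1.0};
        double[] ratings = {4.0, 1.0, 2.0};
        // The second user falls below the 0.1 threshold and is ignored.
        System.out.println(estimate(similarities, ratings, 0.1));
    }
}
```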
-We can easily ask the recommender for recommendations now. If we wanted to get 
three items recommended for the user with *userID* 2, we would do it like this:
-       
-
-<pre>
-List<RecommendedItem> recommendations = recommender.recommend(2, 3);
-for (RecommendedItem recommendation : recommendations) {
-  System.out.println(recommendation);
-}
-</pre>
-
-
-Congratulations, you have built your first recommender!
-
-
-## Evaluation
-
-You might ask yourself how to make sure that your recommender returns good 
results. Unfortunately, the only way to be really sure about the quality is by 
doing an A/B test with real users in a live system.
-
-We can, however, try to get a feel for the quality through statistical offline 
evaluation. Just keep in mind that this does not replace a test with real users!
-
-One way to check whether the recommender returns good results is by doing a 
**hold-out** test. We partition our dataset into two sets: a training set 
consisting of 90% of the data and a test set consisting of 10%. Then we train 
our recommender using the training set and see how well it predicts the 
unknown interactions in the test set.
-
-To test our recommender, we create a class called *EvaluateRecommender* with a 
main method and add an inner class called *MyRecommenderBuilder* that 
implements the *RecommenderBuilder* interface. We implement the 
*buildRecommender* method and make it set up our user-based recommender:
-
-<pre>
-UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
-UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, 
dataModel);
-return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
-</pre>
-
-Now we have to create the code for the test. We'll check how much the 
recommender misses the real interaction strength on average. We employ an 
*AverageAbsoluteDifferenceRecommenderEvaluator* for this. The following code 
shows how to put the pieces together and run a hold-out test: 
-
-<pre>
-DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
-RecommenderEvaluator evaluator = new 
AverageAbsoluteDifferenceRecommenderEvaluator();
-RecommenderBuilder builder = new MyRecommenderBuilder();
-double result = evaluator.evaluate(builder, null, model, 0.9, 1.0);
-System.out.println(result);
-</pre>
-
-Note: if you run this test multiple times, you will get different results, 
because the split into training set and test set is done randomly. 
-
-
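The hold-out idea itself can be sketched without Mahout: shuffle the data, cut it at the training fraction, and average the absolute differences between predicted and actual values on the held-out part. The class `HoldOutSketch` and its method names are hypothetical, and Mahout's evaluator performs its own splitting internally, so this only illustrates the principle. The fixed seed is an assumption added here to make the split repeatable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HoldOutSketch {

    /** Returns two lists: index 0 is the training part, index 1 the test part. */
    public static <T> List<List<T>> split(List<T> data, double trainingFraction, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int cut = (int) Math.round(shuffled.size() * trainingFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(shuffled.subList(0, cut)));
        parts.add(new ArrayList<>(shuffled.subList(cut, shuffled.size())));
        return parts;
    }

    /** Mean of |predicted - actual| over all held-out interactions. */
    public static double averageAbsoluteDifference(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("Need two non-empty arrays of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            sum += Math.abs(predicted[i] - actual[i]);
        }
        return sum / predicted.length;
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<List<Integer>> parts = split(rows, 0.9, 42L);
        System.out.println("training rows: " + parts.get(0).size());
        System.out.println("test rows: " + parts.get(1).size());
        // Made-up predictions compared against made-up actual values.
        System.out.println(averageAbsoluteDifference(
                new double[] {4.0, 3.0}, new double[] {5.0, 1.0}));
    }
}
```

A lower average absolute difference means the predictions are, on average, closer to the held-out values.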
-
-
-
-
-
-
-
-
-

http://git-wip-us.apache.org/repos/asf/mahout/blob/3c53a6dc/website/old_site_migration/needs_work_convenience/powered-by-mahout.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/powered-by-mahout.md 
b/website/old_site_migration/needs_work_convenience/powered-by-mahout.md
deleted file mode 100644
index cb7c039..0000000
--- a/website/old_site_migration/needs_work_convenience/powered-by-mahout.md
+++ /dev/null
@@ -1,129 +0,0 @@
----
-layout: default
-title: Powered By Mahout
-theme:
-    name: retro-mahout
----
-
-# Powered by Mahout
-
-Are you using Mahout to do Machine Learning? <a 
href="https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html">Care
 to share</a>? Developers of the project are always happy to learn about new 
happy users with interesting use cases.
-
-*Links here do NOT imply
-endorsement by Mahout, its committers or the Apache Software Foundation and
-are for informational purposes only.*
-
-<a name="PoweredByMahout-CommercialUse"></a>
-## Commercial Use
-
-* <a 
href="http://nosql.mypopescu.com/post/2082712431/hbase-and-hadoop-at-adobe">Adobe
 AMP</a> uses Mahout's clustering algorithms to increase video
-consumption by better user targeting. 
-* Accenture uses Mahout as a typical example for their [Hadoop Deployment 
Comparison 
Study](http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Hadoop-Deployment-Comparison-Study.pdf)
-* [AOL](http://www.aol.com)
- uses Mahout for shopping recommendations. See [slide 
deck](http://www.slideshare.net/kryton/the-data-layer)
-* [Booz Allen Hamilton](http://www.boozallen.com/)
- uses Mahout's clustering algorithms. See [slide 
deck](http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010)
-* [Buzzlogic](http://www.buzzlogic.com)
- uses Mahout's clustering algorithms to improve ad targeting
-* [Cull.tv](http://cull.tv/)
- uses modified Mahout algorithms for content recommendations
-* ![DatamineLab](http://cdn.dataminelab.com/favicon.ico) [DataMine 
Lab](http://dataminelab.com)
- uses Mahout's recommendation and clustering algorithms to improve our
-clients' ad targeting.
-* [Drupal](http://drupal.org/project/recommender)
- uses Mahout to provide open source content recommendation solutions.
-* [Evolv](http://www.evolvondemand.com)
- uses Mahout for its Workforce Predictive Analytics platform.
-* [Foursquare](http://www.foursquare.com)
- uses Mahout for its [recommendation 
engine](http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/).
-* [Idealo](http://www.idealo.de)
- uses Mahout's recommendation engine.
-* [InfoGlutton](http://www.infoglutton.com)
- uses Mahout's clustering and classification for various consulting
-projects.
-* 
[Intel](http://mark.chmarny.com/2013/07/thinking-big-about-data-at-intel.html)
- ships Mahout as part of their Distribution for Apache Hadoop Software.
-* [Intela](http://www.intela.com/)
- has implementations of Mahout's recommendation algorithms to select new
-offers to send to customers, as well as to recommend potential customers to
-current offers. We are also working on enhancing our offer categories by
-using the clustering algorithms.
-* ![iOffer](http://ioffer.com/favicon.ico) [iOffer](http://www.ioffer.com)
- uses Mahout's Frequent Pattern Mining and Collaborative Filtering to
-recommend items to users.
-* ![kau.li](http://kau.li/favicon.ico) [Kauli](http://kau.li/en)
-, a Japanese ad network, uses Mahout's clustering to handle clickstream
-data for predicting audience's interests and intents.
-* [Linked.In](http://linkedin.com)
- Historically, we have used R for model training. We have recently started
-experimenting with Mahout for model training and are excited about it - also 
see
- <a 
href="https://www.quora.com/LinkedIn-Recommendations/How-does-LinkedIns-recommendation-system-work?srid=XoeG&share=1">Hadoop
 World slides</a>
-.
-* [LucidWorks Big Data](http://www.lucidworks.com/products/lucidworks-big-data)
- uses Mahout for clustering, duplicate document detection, phrase
-extraction and classification.
-* ![Mendeley](http://mendeley.com/favicon.ico) [Mendeley](http://mendeley.com)
- uses Mahout to power Mendeley Suggest, a research article recommendation
-service.
-* ![Mippin](http://mippin.com/web/favicon.ico) [Mippin](http://mippin.com)
- uses Mahout's collaborative filtering engine to recommend news feeds
-* 
[Mobage](http://www.slideshare.net/hamadakoichi/mobage-prmu-2011-mahout-hadoop)
- uses Mahout in their analysis pipeline
-* ![Myrrix](http://myrrix.com/wp-content/uploads/2012/03/favicon.ico) 
[Myrrix](http://myrrix.com)
- is a recommender system product built on Mahout.
-* ![Newscred](http://www.newscred.com/static/img/website/favicon.ico) 
[NewsCred](http://platform.newscred.com)
- uses Mahout to generate clusters of news articles and to surface the
-important stories of the day
-* [Next Glass](http://nextglass.co/)
- uses Mahout
-* [Predixion Software](http://predixionsoftware.com/)
- uses Mahout’s algorithms to build predictive models on big data
-* <img src="http://www.radoop.eu/wp-content/uploads/favicon.png" width=15> 
[Radoop](http://radoop.eu)
- provides a drag-n-drop interface for big data analytics, including Mahout
-clustering and classification algorithms
-* ![Researchgate](https://www.researchgate.net/favicon.ico) 
[ResearchGate](http://www.researchgate.net/), the professional network for 
scientists and researchers, uses Mahout's
-recommendation algorithms.
-* [Sematext](http://www.sematext.com/)
- uses Mahout for its recommendation engine
-* [SpeedDate.com](http://www.speeddate.com)
- uses Mahout's collaborative filtering engine to recommend member profiles
-* [Twitter](http://twitter.com)
- uses Mahout's LDA implementation for user interest modeling
-* [Yahoo\!](http://www.yahoo.com)
- Mail uses Mahout's Frequent Pattern Set Mining.  See 
[slides](http://www.slideshare.net/hadoopusergroup/mail-antispam)
-* [365Media](http://365media.com/)
- uses *Mahout's* Classification and Collaborative Filtering algorithms in
-its Real-time system named [UPTIME](http://uptime.365media.com/)
- and 365Media/Social
-
-<a name="PoweredByMahout-AcademicUse"></a>
-## Academic Use
-
-* [Dicode](https://www.dicode-project.eu/)
- project uses Mahout's clustering and classification algorithms on top of
-HBase.
-* The course [Large Scale Data Analysis and Data 
Mining](http://www.dima.tu-berlin.de/menue/teaching/masterstudium/aim-3/)
- at TU Berlin uses Mahout to teach students about the parallelization of data
-mining problems with Hadoop and Map/Reduce
-* Mahout is used at Carnegie Mellon University, as a comparable platform to 
[GraphLab](http://www.graphlab.ml.cmu.edu/)
-
-* The [ROBUST project](http://www.robust-project.eu/)
-, co-funded by the European Commission, employs Mahout in the large scale
-analysis of online community data.
-* Mahout is used for research and data processing at [Nagoya Institute of 
Technology](http://www.nitech.ac.jp/eng/schools/grad/cse.html)
-, in the context of a large-scale citizen participation platform project,
-funded by the Ministry of Interior of Japan.
-* Several researchers at the [Digital Enterprise Research 
Institute](http://www.deri.ie)
- [NUI Galway](http://www.nuigalway.ie)
- use Mahout for e.g. topic mining and modelling of large corpora.
-* Mahout is used in the NoTube EU project.
-
-<a name="PoweredByMahout-PoweredByLogos"></a>
-## Powered By Logos
-
-Feel free to use our **Powered By** logos on your site:
-
-![powered by 
logo](https://mahout.apache.org/images/mahout-logo-poweredby-55.png)
-
-
-![powered by 
logo](https://mahout.apache.org/images/mahout-logo-poweredby-100.png)
\ No newline at end of file
