[4/9] mahout git commit: WEBSITE Triage of Old Site Migration

rawkintrevo Sat, 29 Apr 2017 16:25:24 -0700

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
new file mode 100644
index 0000000..57378ba
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/misc/using-mahout-with-python-via-jpype.md
@@ -0,0 +1,222 @@
+---
+layout: default
+title: Using Mahout with Python via JPype
+theme:
+    name: retro-mahout
+---
+
+<a name="UsingMahoutwithPythonviaJPype-overview"></a>
+# Mahout with Python via JPype - some examples
+This tutorial provides some sample code illustrating how we can read and
+write sequence files containing Mahout vectors from Python using JPype.
+This tutorial is intended for people who want to use Python for analyzing
+and plotting Mahout data. Using Mahout from Python turns out to be quite
+easy.
+
+This tutorial concerns the use of CPython (the standard Python interpreter)
+as opposed to Jython. Jython wasn't an option for me, because (to the best
+of my knowledge) Jython doesn't work with the Python extensions numpy,
+matplotlib, or h5py, which I rely on heavily.
+
+The instructions below explain how to set up a Python script to read and
+write the output of Mahout clustering.
+
+You will first need to download and install the JPype package for Python.
+
+The first step in setting up JPype is determining the path to the dynamic
+library for the JVM; on Linux this will be a .so file and on Windows it
+will be a .dll.
+
+In your Python script, create a global variable (called jvmlib in the code
+further down) with the path to this library.
+
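
One way to fill in that variable is to search the Java installation for the library. This is a minimal sketch, not part of the original tutorial: the helper name `find_jvm_library` and the use of the `JAVA_HOME` environment variable are assumptions made for illustration.

```python
import os

# File names the JVM shared library usually has on Linux, Windows and macOS.
JVM_LIBRARY_NAMES = ("libjvm.so", "jvm.dll", "libjvm.dylib")

def find_jvm_library(java_home):
    """Return the path of the first JVM library found under java_home, or None."""
    for dirpath, _dirnames, filenames in os.walk(java_home):
        for name in JVM_LIBRARY_NAMES:
            if name in filenames:
                return os.path.join(dirpath, name)
    return None

# Assumption: JAVA_HOME points at the JDK/JRE installation.
java_home = os.environ.get("JAVA_HOME")
jvmlib = find_jvm_library(java_home) if java_home else None
```

JPype also ships a `jpype.getDefaultJVMPath()` helper that may make the search unnecessary on a standard installation.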
+
+
+Next we need to figure out how to set the classpath for Mahout. The
+easiest way to do this is to edit the script in "bin/mahout" to print out
+the classpath. Add the line "echo $CLASSPATH" to the script somewhere after
+the comment "run it" (this is line 195 or so). Execute the script to print
+out the classpath. Copy this output and paste it into a variable named
+classpath in your Python script.
+
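
If you would rather not paste a long string, the classpath can also be assembled from the jar files on disk. This is a sketch of that alternative, not what the tutorial originally did: the helper `build_classpath` and the directory names are hypothetical, and it only picks up jars, so any configuration directories printed by bin/mahout still have to be added by hand.

```python
import glob
import os

def build_classpath(directories):
    """Join every .jar found directly inside the given directories into one classpath string."""
    jars = []
    for directory in directories:
        jars.extend(sorted(glob.glob(os.path.join(directory, "*.jar"))))
    # os.pathsep is ":" on Linux and ";" on Windows, matching what the JVM expects.
    return os.pathsep.join(jars)

# Hypothetical locations; replace them with the directories printed by bin/mahout.
classpath = build_classpath(["/path/to/mahout", "/path/to/mahout/lib"])
```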
+
+
+
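+Now we can create a function to start the JVM in Python using JPype:
+
+    from jpype import startJVM, JPackage, JArray, JDouble, JObject
+    import numpy as np
+
+    jvm=None
+    def start_jpype():
+        #start the JVM only once; later calls do nothing
+        global jvm
+        if (jvm is None):
+            cpopt="-Djava.class.path={cp}".format(cp=classpath)
+            startJVM(jvmlib,"-ea",cpopt)
+            jvm="started"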
+
+
+
+<a name="UsingMahoutwithPythonviaJPype-WritingNamedVectorstoSequenceFilesfromPython"></a>
+# Writing Named Vectors to Sequence Files from Python
+We can now use JPype to create sequence files which will contain vectors to
+be used by Mahout for kmeans. The example below is a function which creates
+vectors from two Gaussian distributions with unit variance.
+
+
+    def create_inputs(ifile,*args,**param):
+     """Create a sequence file containing some normally distributed vectors
+       ifile - path to the sequence file to create
+     """
+     
+     #matrix of the cluster means
+     cmeans=np.array([[1,1] ,[-1,-1]],int)
+     
+     nperc=30  #number of points per cluster
+     
+     vecs=[]
+     
+     vnames=[]
+     for cind in range(cmeans.shape[0]):
+      pts=np.random.randn(nperc,2)
+      pts=pts+cmeans[cind,:].reshape([1,cmeans.shape[1]])
+      vecs.append(pts)
+     
+      #names for the vectors
+      #names are just the points with an index
+      #we do this so we can validate by cross-referencing the name with the vector
+      vn=np.empty(nperc,dtype=(str,30))
+      for row in range(nperc):
+       vn[row]="c"+str(cind)+"_"+str(pts[row,0])[:4]+"_"+str(pts[row,1])[:4]
+      vnames.append(vn)
+      
+     vecs=np.vstack(vecs)
+     vnames=np.hstack(vnames)
+     
+    
+     #start the jvm
+     start_jpype()
+     
+     #create the sequence file that we will write to
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     DenseVectorCls=JPackage("org").apache.mahout.math.DenseVector
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     writer=io.SequenceFile.createWriter(fs, conf, path,io.Text,VectorWritableCls)
+     
+     
+     vecwritable=VectorWritableCls()
+     for row in range(vecs.shape[0]):
+      nvector=NamedVectorCls(DenseVectorCls(JArray(JDouble,1)(vecs[row,:])),vnames[row])
+      #need to wrap key and value because of overloading
+      wrapkey=JObject(io.Text("key "+str(row)),io.Writable)
+      wrapval=JObject(vecwritable,io.Writable)
+      
+      vecwritable.set(nvector)
+      writer.append(wrapkey,wrapval)
+      
+     writer.close()
+
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansClusteredPointsfromPython"></a>
+# Reading the KMeans Clustered Points from Python
+Similarly, we can use JPype to easily read the clustered points output by
+Mahout.
+
+    def read_clustered_pts(ifile,*args,**param):
+     """Read the clustered points
+     ifile - path to the sequence file containing the clustered points
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file that we will read from
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     
+     
+     ReaderCls=io.__getattribute__("SequenceFile$Reader") 
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=reader.getKeyClass()()
+     
+    
+     valcls=reader.getValueClass()
+     vecwritable=valcls()
+     while (reader.next(key,vecwritable)):     
+      weight=vecwritable.getWeight()
+      nvec=vecwritable.getVector()
+      
+      cname=nvec.__class__.__name__
+      if (cname.rsplit('.',1)[-1]=="NamedVector"):  
+       print("cluster={key} Name={name} x={x} y={y}".format(key=key.toString(),name=nvec.getName(),x=nvec.get(0),y=nvec.get(1)))
+      else:
+       raise NotImplementedError("Vector isn't a NamedVector. Need to modify/test the code to handle this case.")
+
+
+<a name="UsingMahoutwithPythonviaJPype-ReadingtheKMeansCentroids"></a>
+# Reading the KMeans Centroids
+Finally we can create a function to print out the actual cluster centers
+found by mahout,
+
+    def getClusters(ifile,*args,**param):
+     """Read the centroids from the clusters output by kmeans
+          ifile - Path to the sequence file containing the centroids
+     """ 
+    
+     #start the jvm
+     start_jpype()
+     
+     #open the sequence file that we will read from
+     io=JPackage("org").apache.hadoop.io 
+     FileSystemCls=JPackage("org").apache.hadoop.fs.FileSystem
+     
+     PathCls=JPackage("org").apache.hadoop.fs.Path
+     path=PathCls(ifile)
+    
+     ConfCls=JPackage("org").apache.hadoop.conf.Configuration 
+     conf=ConfCls()
+     
+     fs=FileSystemCls.get(conf)
+     
+     #vector classes
+     VectorWritableCls=JPackage("org").apache.mahout.math.VectorWritable
+     NamedVectorCls=JPackage("org").apache.mahout.math.NamedVector
+     ReaderCls=io.__getattribute__("SequenceFile$Reader")
+     reader=ReaderCls(fs, path,conf)
+     
+    
+     key=io.Text()
+     
+    
+     valcls=reader.getValueClass()
+    
+     vecwritable=valcls()
+     
+     while (reader.next(key,vecwritable)):     
+      center=vecwritable.getCenter()
+      
+      print("id={cid} center={center}".format(cid=vecwritable.getId(),center=center.values))
+


http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
new file mode 100644
index 0000000..2acacd0
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-als-hadoop.md
@@ -0,0 +1,98 @@
+---
+layout: default
+title: Introduction to ALS Recommendations with Hadoop
+theme:
+    name: retro-mahout
+---
+
+# Introduction to ALS Recommendations with Hadoop
+
+## Overview
+
+Mahout's ALS recommender is a matrix factorization algorithm that uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR). It factors the user-to-item matrix *A* into the user-to-feature matrix *U* and the item-to-feature matrix *M*, and it runs the ALS algorithm in parallel. The details of the algorithm are described in the following papers:
+
+* [Large-scale Parallel Collaborative Filtering for the Netflix Prize](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
+* [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433)
+
+This recommendation algorithm can be used in an eCommerce platform to recommend products to customers. Unlike the user- or item-based recommenders, which compute the similarity of users or items to make recommendations, the ALS algorithm uncovers the latent factors that explain the observed user-to-item ratings and tries to find the factor weights that minimize the squared error between predicted and actual ratings.
+
+Mahout's ALS recommendation algorithm takes user preferences by item as input and outputs recommended items for each user. The input preference can be either an explicit user rating or implicit feedback, such as a user's click on a web page.
+
+One of the strengths of the ALS-based recommender, compared to the user- or item-based recommenders, is its ability to handle large sparse data sets and its better prediction performance. It can also give an intuitive rationale for the factors that influence recommendations.
+
+## Implementation
+At present Mahout has a map-reduce implementation of ALS, which is composed of 2 jobs: a parallel matrix factorization job and a recommendation job.
+The matrix factorization job computes the user-to-feature matrix and item-to-feature matrix given the user-to-item ratings. Its input includes:
+<pre>
+    --input: directory containing files of explicit user-to-item ratings or implicit feedback;
+    --output: output path of the user-feature matrix and feature-item matrix;
+    --lambda: regularization parameter to avoid overfitting;
+    --alpha: confidence parameter, only used on implicit feedback;
+    --implicitFeedback: boolean flag to indicate whether the input dataset contains implicit feedback;
+    --numFeatures: dimensions of feature space;
+    --numThreadsPerSolver: number of threads per solver mapper for concurrent execution;
+    --numIterations: number of iterations;
+    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated
+</pre>
+and it outputs the matrices in sequence file format.
+
+The recommendation job uses the user feature matrix and item feature matrix calculated by the factorization job to compute the top-N recommendations per user. Its input includes:
+<pre>
+    --input: directory containing files of user ids;
+    --output: output path of the recommended items for each input user id;
+    --userFeatures: path to the user feature matrix;
+    --itemFeatures: path to the item feature matrix;
+    --numRecommendations: maximum number of recommendations per user, default is 10;
+    --maxRating: maximum rating available;
+    --numThreads: number of threads per mapper;
+    --usesLongIDs: boolean flag to indicate whether the input contains long IDs that need to be translated;
+    --userIDIndex: index for user long IDs (necessary if usesLongIDs is true);
+    --itemIDIndex: index for item long IDs (necessary if usesLongIDs is true)
+</pre>
+and it outputs a list of recommended item ids for each user. The predicted rating between a user and an item is the dot product of the user's feature vector and the item's feature vector.
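
The dot-product scoring can be illustrated outside Hadoop with a few lines of plain Python. This is only a sketch of the idea, not the recommendation job's code: the function names and the made-up feature values are assumptions for illustration.

```python
def predict_rating(user_features, item_features):
    """Predicted rating: dot product of the user's and the item's feature vectors."""
    return sum(u * m for u, m in zip(user_features, item_features))

def recommend_top_n(user_features, item_matrix, n=10, already_rated=()):
    """Return the n highest-scoring (item_id, score) pairs for one user.

    item_matrix maps item id -> feature vector; items in already_rated are skipped.
    """
    skipped = set(already_rated)
    scores = [(item_id, predict_rating(user_features, features))
              for item_id, features in item_matrix.items()
              if item_id not in skipped]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]

# Made-up factors: one user and three items in a 2-feature space.
user = [1.0, 0.5]
items = {100: [0.2, 0.4], 200: [1.0, 1.0], 300: [0.5, 0.0]}
print(recommend_top_n(user, items, n=2))
```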
+
+## Example
+
+Let's look at a simple example of how we could use Mahout's ALS recommender to recommend items for users. First, you'll need to get Mahout up and running, the instructions for which can be found [here](https://mahout.apache.org/users/basics/quickstart.html). After you've ensured Mahout is properly installed, we're ready to run the example.
+
+**Step 1: Prepare test data**
+
+Similar to Mahout's item-based recommender, the ALS recommender relies on user-to-item preference data: *userID*, *itemID* and *preference*. The preference can be an explicit numeric rating or a count of actions such as clicks (implicit feedback). Each line of the test data file is a tab-delimited string: the 1st field is the user id, which must be numeric; the 2nd field is the item id, which must be numeric; and the 3rd field is the preference, which should also be a number.
+
+**Note:** You must create IDs that are ordinal positive integers for all user and item IDs. Often this will require you to keep a dictionary to map into and out of Mahout IDs. For instance, if the first user has ID "xyz" in your application, this would get a Mahout ID of the integer 1, and so on. The same applies to item IDs. After recommendations are calculated, you will have to translate the Mahout user and item IDs back into your application IDs.
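
A dictionary of this kind can be kept in a few lines of Python. The sketch below is one possible shape for it; the class name `IdDictionary` and its methods are hypothetical and not part of Mahout.

```python
class IdDictionary:
    """Assigns consecutive positive integers to application IDs and maps them back."""

    def __init__(self):
        self._to_mahout = {}   # application ID -> integer
        self._to_app = []      # position k holds the application ID for integer k + 1

    def to_mahout(self, app_id):
        """Return the integer for app_id, assigning the next free one on first use."""
        if app_id not in self._to_mahout:
            self._to_app.append(app_id)
            self._to_mahout[app_id] = len(self._to_app)  # IDs start at 1
        return self._to_mahout[app_id]

    def to_app(self, mahout_id):
        """Translate an integer ID back to the original application ID."""
        return self._to_app[mahout_id - 1]

users = IdDictionary()
print(users.to_mahout("xyz"), users.to_mahout("abc"), users.to_mahout("xyz"))
print(users.to_app(2))
```

In practice you would keep one such dictionary for users and one for items, and persist both so the IDs stay stable between runs.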
+
+To get started quickly, you could specify a text file like the following as the input:
+<pre>
+1      100     1
+1      200     5
+1      400     1
+2      200     2
+2      300     1
+</pre>
+
+**Step 2: Determine parameters**
+
+In addition, users need to determine the dimension of the feature space and the number of iterations of the alternating least squares algorithm to run. Using 10 features and 15 iterations is a reasonable default to try first. Optionally, a confidence parameter can be set if the input preference is implicit user feedback.
+
+**Step 3: Run ALS**
+
+Assuming your *JAVA_HOME* is appropriately set and Mahout was installed properly, we're ready to configure our syntax. Enter the following command:
+
+    $ mahout parallelALS --input $als_input --output $als_output --lambda 0.1 --implicitFeedback true --alpha 0.8 --numFeatures 2 --numIterations 5 --numThreadsPerSolver 1 --tempDir tmp
+
+Running the command will execute a series of jobs, the final product of which is written to the output directory specified in the command. The output directory contains 3 sub-directories: *M* stores the item-to-feature matrix, *U* stores the user-to-feature matrix and *userRatings* stores the users' ratings on the items. The *tempDir* parameter specifies the directory in which to store the intermediate output of the job, such as the matrix output of each iteration and each item's average rating. Using *tempDir* helps with debugging.
+
+**Step 4: Make Recommendations**
+
+Based on the output feature matrices from step 3, we can make recommendations for users. Enter the following command:
+
+     $ mahout recommendfactorized --input $als_recommender_input --userFeatures $als_output/U/ --itemFeatures $als_output/M/ --numRecommendations 1 --output recommendations --maxRating 1
+
+The input user file is a sequence file in which each record's key is a user id and its value is the ids of the items that user has rated, which will be removed from the recommendations. The output file generated in our simple example will be a text file giving the recommended item ids for each user.
+Remember to translate the Mahout ids back into your application-specific ids.
+
+There are a variety of parameters for Mahout's ALS recommender to accommodate custom business requirements; exploring and testing various configurations to suit your needs will doubtless lead to additional questions. Feel free to ask such questions on the [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html).
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
new file mode 100644
index 0000000..ee2c3e8
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/intro-itembased-hadoop.md
@@ -0,0 +1,54 @@
+---
+layout: default
+title: Introduction to Item-Based Recommendations with Hadoop
+theme:
+    name: retro-mahout
+---
+# Introduction to Item-Based Recommendations with Hadoop
+
+## Overview
+
+Mahout's item-based recommender is a flexible and easily implemented algorithm with a diverse range of applications. The minimalism of the primary input file's structure and the availability of ancillary filtering controls can make sourcing the required data and shaping the desired output both efficient and straightforward.
+
+Typical use cases include:
+
+* Recommend products to customers via an eCommerce platform (think: Amazon, Netflix, Overstock)
+* Identify organic sales opportunities
+* Segment users/customers based on similar item preferences
+
+Broadly speaking, Mahout's item-based recommendation algorithm takes as input customer preferences by item and generates an output recommending similar items with a score indicating whether a customer will "like" the recommended item.
+
+One of the strengths of the item-based recommender is its adaptability to your business conditions or research interests. For example, there are many available approaches for providing product preference. One such method is to calculate the total orders for a given product for each customer (e.g. Acme Corp has ordered Widget-A 5,678 times) while others rely on user preference captured via the web (e.g. Jane Doe rated a movie as five stars, or gave a product two thumbs up).
+
+Additionally, a variety of methodologies can be implemented to narrow the focus of Mahout's recommendations, such as:
+
+* Exclude low volume or low profitability products from consideration
+* Group customers by segment or market rather than using user/customer level data
+* Exclude zero-dollar transactions, returns or other order types
+* Map product substitutions into the Mahout input (e.g. if WidgetA is a recommended item replace it with WidgetX)
+
+The item-based recommender output can be easily consumed by downstream applications (e.g. websites, ERP systems or salesforce automation tools) and is configurable so users can determine the number of item recommendations generated by the algorithm.
+
+## Example
+
+Testing the item-based recommender can be a simple and potentially quite rewarding endeavor. Whereas the typical sample use case for collaborative filtering focuses on utilization of, and integration with, eCommerce platforms, we can instead look at a potential use case applicable to most businesses (even those without a web presence). Let's look at how a company might use Mahout's item-based recommender to identify new sales opportunities for an existing customer base. First, you'll need to get Mahout up and running, the instructions for which can be found [here](https://mahout.apache.org/users/basics/quickstart.html). After you've ensured Mahout is properly installed, we're ready to run a quick example.
+
+**Step 1: Gather some test data**
+
+Mahout's item-based recommender relies on three key pieces of data: *userID*, *itemID* and *preference*. The "users" could be website visitors or simply customers that purchase products from your business. Similarly, items could be products, product groups or even pages on your website; really anything you would want to recommend to a group of users or customers. For our example let's use customer orders as a proxy for preference. A simple count of distinct orders by customer, by product will work for this example. As you explore ways to manipulate the item-based recommender, you'll find that the preference value can be many things (page clicks, explicit ratings, order counts, etc.). Once your test data is gathered, put it in a *.txt* file separated by commas with no column headers included.
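
As an illustration of the "count of distinct orders by customer, by product" idea, the following sketch turns a list of order lines into comma-separated rows. The layout of the order tuples and the function name `write_preferences` are assumptions made for this example; your own order data will look different.

```python
import csv
import sys
from collections import defaultdict

def write_preferences(orders, handle):
    """Write userID,itemID,preference rows, where preference = number of distinct orders."""
    order_ids = defaultdict(set)
    for order_id, customer_id, product_id in orders:
        order_ids[(customer_id, product_id)].add(order_id)
    writer = csv.writer(handle)  # no header row is written
    for (customer_id, product_id), ids in sorted(order_ids.items()):
        writer.writerow([customer_id, product_id, len(ids)])

# Hypothetical order lines: (order ID, customer ID, product ID).
orders = [(9001, 1, 100), (9001, 1, 200), (9002, 1, 100),
          (9003, 2, 200), (9003, 2, 200)]
write_preferences(orders, sys.stdout)
```

To produce the input file, pass an open file handle instead of `sys.stdout`, for example `open("input.txt", "w", newline="")`.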
+
+**Step 2: Pick a similarity measure**
+
+Choosing a similarity measure for use in a production environment is something that requires careful testing, evaluation and research. For our example purposes, we'll just go with a Mahout similarity classname called *SIMILARITY_LOGLIKELIHOOD*.
+
+**Step 3: Configure the Mahout command**
+
+Assuming your *JAVA_HOME* is appropriately set and Mahout was installed properly, we're ready to configure our syntax. Enter the following command:
+
+    $ mahout recommenditembased -s SIMILARITY_LOGLIKELIHOOD -i /path/to/input/file -o /path/to/desired/output --numRecommendations 25
+
+Running the command will execute a series of jobs, the final product of which will be an output file written to the directory specified in the command. The output file will contain two columns: the *userID* and an array of *itemIDs* and scores.
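
A downstream script might read that file along the following lines. This sketch assumes each line has the form `userID`, a tab, then `[itemID:score,itemID:score,...]`; that layout is an assumption about the job's text output, so check it against your own file before relying on it. The function name is hypothetical.

```python
def parse_recommendations(line):
    """Split one output line into (user_id, [(item_id, score), ...])."""
    user_id, items = line.strip().split("\t", 1)
    pairs = []
    for entry in items.strip("[]").split(","):
        if entry:  # an empty array yields a single empty entry
            item_id, score = entry.split(":")
            pairs.append((int(item_id), float(score)))
    return int(user_id), pairs

print(parse_recommendations("3\t[103:4.5,102:4.0]"))
```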
+
+**Step 4: Making use of the output and doing more with Mahout**
+
+The output file generated in our simple example can be transformed using your tool of choice and consumed by downstream applications. There are a variety of configuration options for Mahout's item-based recommender to accommodate custom business requirements; exploring and testing various configurations to suit your needs will doubtless lead to additional questions. Our user community is accessible via our [mailing list](https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html) and the book *Mahout In Action* is a fantastic (but slightly outdated) starting point.

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
new file mode 100644
index 0000000..63de4fd
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/matrix-factorization.md
@@ -0,0 +1,187 @@
+---
+layout: default
+title: Introduction to Matrix Factorization for Recommendation Mining
+theme:
+    name: retro-mahout
+---
+<a name="MatrixFactorization-Intro"></a>
+# Introduction to Matrix Factorization for Recommendation Mining
+
+In the mathematical discipline of linear algebra, a matrix decomposition or matrix factorization is a dimensionality reduction technique that factorizes a matrix into a product of matrices, usually two.
+There are many different matrix decompositions; each finds use among a particular class of problems.
+
+In Mahout, the SVDRecommender provides an interface to build a recommender based on matrix factorization.
+The idea is to project the users and items onto a feature space and to optimize U and M so that U \* (M^t) is as close to R as possible:
+
+     U is n * p user feature matrix, 
+     M is m * p item feature matrix, M^t is the transpose of M,
+     R is n * m rating matrix,
+     n is the number of users,
+     m is the number of items,
+     p is the number of features
+
+We usually use RMSE to represent the deviations between predictions and actual ratings.
+RMSE is defined as the square root of the mean of the squared errors over all known user-item ratings.
+So our matrix factorization target can be mathematically defined as:
+
+     find U and M, (U, M) = argmin(RMSE) = argmin(pow(SSE / K, 0.5))
+     
+     SSE = sum(e(u,i)^2)
+     e(u,i) = r(u, i) - U[u,] * (M[i,]^t) = r(u,i) - sum(U[u,f] * M[i,f]), f = 0, 1, .. p - 1
+     K is the number of known user item ratings.
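
The definition above can be checked on a tiny example. This is a sketch in plain Python using made-up matrices; the function name `rmse` and the representation of ratings as `(u, i, r)` triples are choices made for illustration.

```python
import math

def rmse(ratings, U, M):
    """Root mean squared error over the known ratings.

    ratings: list of (u, i, r) triples; U and M: lists of feature rows.
    """
    sse = 0.0
    for u, i, r in ratings:
        predicted = sum(uf * mf for uf, mf in zip(U[u], M[i]))  # U[u,] * M[i,]^t
        sse += (r - predicted) ** 2                             # e(u,i)^2
    return math.sqrt(sse / len(ratings))                        # pow(SSE / K, 0.5)

# Two users, two items, two features; the errors are 2 and 0.
U = [[1.0, 0.0], [0.0, 1.0]]
M = [[2.0, 0.0], [0.0, 3.0]]
ratings = [(0, 0, 4.0), (1, 1, 3.0)]
print(rmse(ratings, U, M))
```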
+
+<a name="MatrixFactorization-Factorizers"></a>
+
+Mahout has implemented matrix factorization based on 
+
+    (1) SGD(Stochastic Gradient Descent)
+    (2) ALSWR(Alternating-Least-Squares with Weighted-λ-Regularization).
+
+## SGD
+
+Stochastic gradient descent is a gradient descent optimization method for minimizing an objective function that is written as a sum of differentiable functions.
+
+       Q(w) = sum(Q_i(w)), 
+
+where w is the parameters to be estimated,
+      Q(w) is the objective function that can be expressed as a sum of differentiable functions,
+      Q_i(w) is associated with the i-th observation in the data set 
+
+In practice, w is estimated iteratively, one sample at a time, until an approximate minimum is obtained,
+
+      w = w - alpha * (d(Q_i(w))/dw),
+where alpha is the learning rate,
+      (d(Q_i(w))/dw) is the first derivative of Q_i(w) on w.
+
+In matrix factorization, the RatingSGDFactorizer class implements SGD with w = (U, M) and objective function Q(w) = sum(Q(u,i)),
+
+       Q(u,i) =  sum(e(u,i) * e(u,i)) / 2 + lambda * [(U[u,] * (U[u,]^t)) + (M[i,] * (M[i,]^t))] / 2
+
+where Q(u, i) is the objective function for user u and item i,
+      e(u, i) is the error between predicted rating and actual rating,
+      U[u,] is the feature vector of user u,
+      M[i,] is the feature vector of item i,
+      lambda is the regularization parameter to prevent overfitting.
+
+The algorithm is sketched as follows:
+  
+      init U and M with randomized value between 0.0 and 1.0 with standard Gaussian distribution   
+      
+      for(iter = 0; iter < numIterations; iter++)
+      {
+          for(user u and item i with rating R[u,i])
+          {
+              predicted_rating = U[u,] *  M[i,]^t //dot product of feature vectors between user u and item i
+              err = R[u, i] - predicted_rating
+              //adjust U[u,] and M[i,]
+              // p is the number of features
+              for(f = 0; f < p; f++) {
+                 NU[u,f] = U[u,f] - alpha * d(Q(u,i))/d(U[u,f]) //optimize U[u,f]
+                         = U[u, f] + alpha * (e(u,i) * M[i,f] - lambda * U[u,f]) 
+              }
+              for(f = 0; f < p; f++) {
+                 M[i,f] = M[i,f] - alpha * d(Q(u,i))/d(M[i,f])  //optimize M[i,f] 
+                        = M[i,f] + alpha * (e(u,i) * U[u,f] - lambda * M[i,f]) 
+              }
+              U[u,] = NU[u,]
+          }
+      }
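
The sketch above translates almost line for line into Python. The version below is an illustration of the update rule only, not the RatingSGDFactorizer code: the function name, the default hyperparameters and the small Gaussian initialization (scale 0.1) are assumptions chosen for the example.

```python
import random

def sgd_factorize(ratings, n_users, n_items, p=2, alpha=0.05, lam=0.02,
                  num_iterations=200, seed=0):
    """Factorize known ratings with the per-rating SGD update.

    ratings: list of (u, i, r) triples with zero-based user and item indices.
    Returns (U, M) as lists of feature rows.
    """
    rng = random.Random(seed)
    # Small random starting values; the scale 0.1 is an arbitrary choice.
    U = [[rng.gauss(0.0, 0.1) for _ in range(p)] for _ in range(n_users)]
    M = [[rng.gauss(0.0, 0.1) for _ in range(p)] for _ in range(n_items)]
    for _ in range(num_iterations):
        for u, i, r in ratings:
            err = r - sum(U[u][f] * M[i][f] for f in range(p))
            # Compute the new user row from the old item row before M[i] changes.
            new_u = [U[u][f] + alpha * (err * M[i][f] - lam * U[u][f]) for f in range(p)]
            M[i] = [M[i][f] + alpha * (err * U[u][f] - lam * M[i][f]) for f in range(p)]
            U[u] = new_u
    return U, M

# Made-up ratings for 2 users and 2 items.
ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
U, M = sgd_factorize(ratings, n_users=2, n_items=2)
```

Keeping the new user row in a temporary variable mirrors the `NU` array in the sketch: both updates for a rating are computed from the values as they were before that rating was processed.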
+
+## SVD++
+
+SVD++ is an enhancement of the SGD matrix factorization.
+
+It can be considered an integration of the latent factor model and the neighborhood-based model, considering not only how users rate, but also who has rated what.
+
+The complete model is a sum of 3 sub-models, with the complete prediction formula as follows:
+    
+    pr(u,i) = b[u,i] + fm + nm   //user u and item i
+    
+    pr(u,i) is the predicted rating of user u on item i,
+    b[u,i] = U + b(u) + b(i)
+    fm = (q[i,]) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])),  j is an item in N(u)
+    nm = pow(|R(i;u;k)|, -0.5) * sum((r[u,j0] - b[u,j0]) * w[i,j0]) + pow(|N(i;u;k)|, -0.5) * sum(c[i,j1]), j0 is an item in R(i;u;k), j1 is an item in N(i;u;k)
+
+The associated regularized squared error function to be minimized is:
+
+    {sum((r[u,i] - pr(u,i)) * (r[u,i] - pr(u,i)))  + lambda * (b(u) * b(u) + b(i) * b(i) + ||q[i,]||^2 + ||p[u,]||^2 + sum(||y[j,]||^2) + sum(w[i,j0] * w[i,j0]) + sum(c[i,j1] * c[i,j1]))}
+
+b[u,i] is the baseline estimate of user u's predicted rating on item i. U is the users' overall average rating, and b(u) and b(i) indicate the observed deviations of user u's and item i's ratings from the average.
+
+The baseline estimate adjusts for the user and item effects, i.e. systematic tendencies for some users to give higher ratings than others and tendencies for some items to receive higher ratings than other items.
+
+fm is the latent factor model that captures the interactions between user and item via a feature layer. q[i,] is the feature vector of item i, and the rest of the formula represents user u with a user feature vector and a sum of features of items in N(u).
+N(u) is the set of items for which user u has expressed a preference, and y[j,] is the feature vector of an item in N(u).
+
+nm is an extension of the classic item-based neighborhood model.
+It captures not only the user's explicit ratings but also the user's implicit preferences. R(i;u;k) is the set of items that have an explicit rating from user u, restricted to the top k items most similar to item i. r[u,j0] is the actual rating of user u on item j0, and
+b[u,j0] is the corresponding baseline estimate.
+
+The difference between r[u,j0] and b[u,j0] is weighted by a parameter w[i,j0], which can be thought of as the similarity between item i and j0.
+
+N(i;u;k) is the set of the top k most similar items for which the user has expressed a preference.
+c[i,j1] is the parameter to be estimated.
+
+The values of w[i,j0] and c[i,j1] can be treated as the significance of the user's explicit rating and implicit preference respectively.
+
+The parameters b, p, q, y, w, c are determined by minimizing the associated regularized squared error function through gradient descent. We loop over all known ratings and, for a given training case r[u,i], apply gradient descent on the error function and modify the parameters by moving in the opposite direction of the gradient.
+
+For a complete analysis of the SVD++ algorithm,
+please refer to the paper [Yehuda Koren: Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model, KDD 2008](http://research.yahoo.com/files/kdd08koren.pdf).
+ 
+In Mahout, the SVDPlusPlusFactorizer class is a simplified implementation of the SVD++ algorithm. It mainly uses the latent factor model with item feature vector, user feature vector and user's preference, with pr(u,i) = fm = (q[i,]) \* (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])), and the parameters to be determined are q, p, y.
+
+The update to q, p, y in each gradient descent step is:
+
+      err(u,i) = r[u,i] - pr(u,i)
+      q[i,] = q[i,] + alpha * (err(u,i) * (p[u,] + pow(|N(u)|, -0.5) * sum(y[j,])) - lambda * q[i,]) 
+      p[u,] = p[u,] + alpha * (err(u,i) * q[i,] - lambda * p[u,])
+      for j that is an item in N(u):
+         y[j,] = y[j,] + alpha * (err(u,i) * pow(|N(u)|, -0.5) * q[i,] - lambda * y[j,])
+
+where alpha is the learning rate of gradient descent and N(u) is the set of items for which user u has expressed a preference.
+
+## Parallel SGD
+
+Mahout has a parallel SGD implementation in the ParallelSGDFactorizer class. It shuffles the user ratings in every iteration and generates splits on the shuffled ratings. Each split is handled by a thread that updates the user features and item features using vanilla SGD.
+
+The implementation can be traced back to a lock-free version of SGD based on the paper [Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent](http://www.eecs.berkeley.edu/~brecht/papers/hogwildTR.pdf).
+
+## ALSWR
+
+ALSWR is an iterative algorithm to solve the low rank factorization of user 
feature matrix U and item feature matrix M.  
+The loss function to be minimized is formulated as the sum of squared errors 
plus [Tikhonov 
regularization](http://en.wikipedia.org/wiki/Tikhonov_regularization):
+
+     L(R, U, M) = sum(pow((R[u,i] - U[u,]* (M[i,]^t)), 2)) + lambda * 
(sum(n(u) * ||U[u,]||^2) + sum(n(i) * ||M[i,]||^2))
+ 
+At the beginning of the algorithm, M is initialized with the average item 
ratings as its first row and random numbers in the remaining rows.  
+
+In every iteration, we fix M and solve for U by minimizing the cost function 
L(R, U, M), then we fix U and solve for M by minimizing 
+the cost function in the same way. The iterations continue until a stopping 
criterion is met.
+
+To solve for the matrix U when M is given, each user's feature vector is 
calculated by solving a regularized linear least-squares problem 
+using the items the user has rated and their feature vectors:
+
+      1/2 * d(L(R,U,M)) / d(U[u,f]) = 0 
+
+Similarly, when M is updated, we solve a regularized linear least-squares 
problem for each item, using the users that have rated the 
+item and their feature vectors:
+
+      1/2 * d(L(R,U,M)) / d(M[i,f]) = 0
+
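+Setting the derivative to zero for a single user u yields a small linear system:
+(sum(M[i,]^t * M[i,]) + lambda * n(u) * I) * U[u,]^t = sum(R[u,i] * M[i,]^t), where the
+sums run over the items rated by u. The following is a minimal sketch of that per-user
+solve in plain Java. The class name `AlsUserSolve` and the use of Gaussian elimination
+are assumptions for illustration; this is not the solver used by ALSWRFactorizer.
+
```java
class AlsUserSolve {

    /**
     * Solves the regularized least-squares problem for one user.
     * itemFeatures[i] is the feature vector M[i,] of the i-th item the user rated,
     * ratings[i] is the matching rating R[u,i]. Assumes lambda > 0 or linearly
     * independent item vectors, so that the system is non-singular.
     */
    static double[] solveUser(double[][] itemFeatures, double[] ratings, double lambda) {
        int k = itemFeatures[0].length;
        int n = ratings.length; // n(u): number of items rated by the user
        double[][] a = new double[k][k + 1]; // augmented matrix [A | b]
        for (int i = 0; i < n; i++) {
            for (int r = 0; r < k; r++) {
                for (int c = 0; c < k; c++) {
                    a[r][c] += itemFeatures[i][r] * itemFeatures[i][c];
                }
                a[r][k] += ratings[i] * itemFeatures[i][r];
            }
        }
        for (int r = 0; r < k; r++) {
            a[r][r] += lambda * n; // weighted-lambda regularization term
        }
        // Forward elimination with partial pivoting.
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int r = col + 1; r < k; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < k; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= k; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }
        // Back substitution.
        double[] u = new double[k];
        for (int r = k - 1; r >= 0; r--) {
            double s = a[r][k];
            for (int c = r + 1; c < k; c++) {
                s -= a[r][c] * u[c];
            }
            u[r] = s / a[r][r];
        }
        return u;
    }
}
```
+
+The item update is symmetric: swap the roles of users and items and use n(i), the
+number of users that rated the item, in the regularization term.
+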
+The ALSWRFactorizer class is a non-distributed implementation of ALSWR using 
multi-threading to dispatch the computation among several threads.
+Mahout also offers a [parallel map-reduce 
implementation](https://mahout.apache.org/users/recommender/intro-als-hadoop.html).
+
+<a name="MatrixFactorization-Reference"></a>
+# Reference:
+
+[Stochastic gradient 
descent](http://en.wikipedia.org/wiki/Stochastic_gradient_descent)
+    
+[ALSWR](http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08%28submitted%29.pdf)
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
new file mode 100644
index 0000000..8ba5b28
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-documentation.md
@@ -0,0 +1,277 @@
+---
+layout: default
+title: Recommender Documentation
+theme:
+    name: retro-mahout
+---
+
+<a name="RecommenderDocumentation-Overview"></a>
+## Overview
+
+_This documentation concerns the non-distributed, non-Hadoop-based
+recommender engine / collaborative filtering code inside Mahout. It was
+formerly a separate project called "Taste" and has continued development
+inside Mahout alongside other Hadoop-based code. It may be viewed as a
+somewhat separate, more comprehensive and more mature aspect of this
+code, compared to current development efforts focusing on Hadoop-based
+distributed recommenders. This remains the best entry point into Mahout
+recommender engines of all kinds._
+
+A Mahout-based collaborative filtering engine takes users' preferences for
+items ("tastes") and returns estimated preferences for other items. For
+example, a site that sells books or CDs could easily use Mahout to figure
+out, from past purchase data, which CDs a customer might be interested in
+listening to.
+
+Mahout provides a rich set of components from which you can construct a
+customized recommender system from a selection of algorithms. Mahout is
+designed to be enterprise-ready; it's designed for performance, scalability
+and flexibility.
+
+Top-level packages define the Mahout interfaces to these key abstractions:
+
+* **DataModel**
+* **UserSimilarity**
+* **ItemSimilarity**
+* **UserNeighborhood**
+* **Recommender**
+
+Subpackages of *org.apache.mahout.cf.taste.impl* hold implementations of
+these interfaces. These are the pieces from which you will build your own
+recommendation engine. That's it! 
+
+<a name="RecommenderDocumentation-Architecture"></a>
+## Architecture
+
+![doc](../../images/taste-architecture.png)
+
+This diagram shows the relationship between various Mahout components in a
+user-based recommender. An item-based recommender system is similar except
+that there are no Neighborhood algorithms involved.
+
+<a name="RecommenderDocumentation-Recommender"></a>
+### Recommender
+A Recommender is the core abstraction in Mahout. Given a DataModel, it can
+produce recommendations. Applications will most likely use the
+**GenericUserBasedRecommender** or **GenericItemBasedRecommender**,
+possibly decorated by **CachingRecommender**.
+
+<a name="RecommenderDocumentation-DataModel"></a>
+### DataModel
+A **DataModel** is the interface to information about user preferences. An
+implementation might draw this data from any source, but a database is the
+most likely source. Be sure to wrap this with a **ReloadFromJDBCDataModel** to 
get good performance! Mahout provides **MySQLJDBCDataModel**, for example, to 
access preference data from a database via JDBC and MySQL. Another exists for 
PostgreSQL. Mahout also provides a **FileDataModel**, which is fine for small 
applications.
+
+Users and items are identified solely by an ID value in the
+framework. Further, this ID value must be numeric; it is a Java long type
+through the APIs. A **Preference** object or **PreferenceArray** object
+encapsulates the relation between user and preferred items (or items and
+users preferring them).
+
+Finally, Mahout supports, in various ways, a so-called "boolean" data model
+in which users do not express preferences of varying strengths for items,
+but simply express an association or none at all. For example, while users
+might express a preference from 1 to 5 in the context of a movie
+recommender site, there may be no notion of a preference value between
+users and pages in the context of recommending pages on a web site: there
+is only a notion of an association, or none, between a user and pages that
+have been visited.
+
+<a name="RecommenderDocumentation-UserSimilarity"></a>
+### UserSimilarity
+A **UserSimilarity** defines a notion of similarity between two users. This is
+a crucial part of a recommendation engine. These are attached to a
+**Neighborhood** implementation. **ItemSimilarity** is analogous, but finds
+similarity between items.
+
+<a name="RecommenderDocumentation-UserNeighborhood"></a>
+### UserNeighborhood
+In a user-based recommender, recommendations are produced by finding a
+"neighborhood" of similar users near a given user. A **UserNeighborhood**
+defines a means of determining that neighborhood &mdash; for example,
+nearest 10 users. Implementations typically need a **UserSimilarity** to
+operate.
+
+<a name="RecommenderDocumentation-Examples"></a>
+## Examples
+<a name="RecommenderDocumentation-User-basedRecommender"></a>
+### User-based Recommender
+User-based recommenders are the "original", conventional style of
+recommender systems. They can produce good recommendations when tweaked
+properly; they are not necessarily the fastest recommender systems and are
+thus suitable for small data sets (roughly, less than ten million ratings).
+We'll start with an example of this.
+
+First, create a **DataModel** of some kind. Here, we'll use a simple one based
+on data in a file. The file should be in CSV format, with lines of the form
+"userID,itemID,prefValue" (e.g. "39505,290002,3.5"):
+
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+
+We'll use the **PearsonCorrelationSimilarity** implementation of 
**UserSimilarity**
+as our user correlation algorithm:
+
+
+    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
+
+
+Now we create a **UserNeighborhood** algorithm. Here we use nearest-3:
+
+
+    UserNeighborhood neighborhood =
+         new NearestNUserNeighborhood(3, userSimilarity, model);
+    
+Now we can create our **Recommender**, and add a caching decorator:
+    
+
+    Recommender recommender =
+         new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+
+    
+Now we can get 10 recommendations for user ID "1234" &mdash; done!
+
+    List<RecommendedItem> recommendations =
+         cachingRecommender.recommend(1234, 10);
+
+    
+### Item-based Recommender
+    
+We could have created an item-based recommender instead. Item-based
+recommenders base recommendation not on user similarity, but on item
+similarity. In theory these are about the same approach to the problem,
+just from different angles. However the similarity of two items is
+relatively fixed, more so than the similarity of two users. So, item-based
+recommenders can use pre-computed similarity values in the computations,
+which make them much faster. For large data sets, item-based recommenders
+are more appropriate.
+    
+Let's start over, again with a **FileDataModel** to start:
+    
+
+    DataModel model = new FileDataModel(new File("data.txt"));
+
+    
+We'll also need an **ItemSimilarity**. We could use
+**PearsonCorrelationSimilarity**, which computes item similarity in real time,
+but this is generally too slow to be useful. Instead, in a real
+application, you would feed a list of pre-computed correlations to a
+**GenericItemSimilarity**: 
+    
+
+    // Construct the list of pre-computed correlations
+    Collection<GenericItemSimilarity.ItemItemSimilarity> correlations =
+         ...;
+    ItemSimilarity itemSimilarity =
+         new GenericItemSimilarity(correlations);
+
+
+    
+Then we can finish as before to produce recommendations:
+    
+
+    Recommender recommender =
+         new GenericItemBasedRecommender(model, itemSimilarity);
+    Recommender cachingRecommender = new CachingRecommender(recommender);
+    ...
+    List<RecommendedItem> recommendations =
+         cachingRecommender.recommend(1234, 10);
+
+
+<a name="RecommenderDocumentation-Integrationwithyourapplication"></a>
+## Integration with your application
+
+You can create a Recommender, as shown above, wherever you like in your
+Java application, and use it. This includes simple Java applications or GUI
+applications, server applications, and J2EE web applications.
+
+<a name="RecommenderDocumentation-Performance"></a>
+## Performance
+<a name="RecommenderDocumentation-RuntimePerformance"></a>
+### Runtime Performance
+The more data you give, the better. Though Mahout is designed for
+performance, you will undoubtedly run into performance issues at some
+point. For best results, consider using the following command-line flags to
+your JVM:
+
+* -server: Enables the server VM, which is generally appropriate for
+long-running, computation-intensive applications.
+* -Xms1024m -Xmx1024m: Make the heap as big as possible -- a gigabyte
+doesn't hurt when dealing with tens of millions of preferences. Mahout
+recommenders will generally use as much memory as you give them for caching,
+which helps performance. Set the initial and max size to the same value to
+avoid wasting time growing the heap, and to avoid having the JVM run minor
+collections to avoid growing the heap, which will clear cached values.
+* -da -dsa: Disable all assertions.
+* -XX:NewRatio=9: Increase heap allocated to 'old' objects, which is most
+of them in this framework
+* -XX:+UseParallelGC -XX:+UseParallelOldGC (multi-processor machines only):
+Use a GC algorithm designed to take advantage of multiple processors, and
+designed for throughput. This is a default in J2SE 5.0.
+* -XX:+DisableExplicitGC: Disable calls to System.gc(). These calls can
+only hurt in the presence of modern GC algorithms; they may force Mahout to
+remove cached data needlessly. This flag isn't needed if you're sure your
+code and third-party code you use doesn't call this method.
+
+Also consider the following tips:
+
+* Use **CachingRecommender** on top of your custom **Recommender** 
implementation.
+* When using **JDBCDataModel**, make sure you wrap it with a 
**ReloadFromJDBCDataModel** to load data into memory!
+
+<a name="RecommenderDocumentation-AlgorithmPerformance:WhichOneIsBest?"></a>
+### Algorithm Performance: Which One Is Best?
+There is no right answer; it depends on your data, your application,
+environment, and performance needs. Mahout provides the building blocks
+from which you can construct the best Recommender for your application. The
+links below provide research on this topic. You will probably need a bit of
+trial-and-error to find a setup that works best. The code sample above
+provides a good starting point.
+
+Fortunately, Mahout provides a way to evaluate the accuracy of your
+Recommender on your own data, in org.apache.mahout.cf.taste.eval:
+
+
+    DataModel myModel = ...;
+    RecommenderBuilder builder = new RecommenderBuilder() {
+      public Recommender buildRecommender(DataModel model) {
+        // build and return the Recommender to evaluate here
+      }
+    };
+    RecommenderEvaluator evaluator =
+         new AverageAbsoluteDifferenceRecommenderEvaluator();
+    double evaluation = evaluator.evaluate(builder, null, myModel, 0.9, 1.0);
+
+
+For "boolean" data model situations, where there are no notions of
+preference value, the above evaluation based on estimated preference does
+not make sense. In this case, try a *RecommenderIRStatsEvaluator*, which 
presents
+traditional information retrieval figures like precision and recall, which
+are more meaningful.
+
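+To make the two figures concrete, here is a minimal sketch of precision and recall for
+a single user's top-N list in plain Java. It is not the code of
+RecommenderIRStatsEvaluator; the class name `IrStatsSketch` and the assumption that the
+recommended list contains no duplicate item IDs are made for this example only.
+
```java
import java.util.Set;

class IrStatsSketch {

    /** Counts how many recommended item IDs appear in the relevant set. */
    private static int hits(long[] recommended, Set<Long> relevant) {
        int h = 0;
        for (long id : recommended) {
            if (relevant.contains(id)) {
                h++;
            }
        }
        return h;
    }

    /** Fraction of the recommended items that are relevant. */
    static double precision(long[] recommended, Set<Long> relevant) {
        if (recommended.length == 0) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / recommended.length;
    }

    /** Fraction of the relevant items that were recommended. */
    static double recall(long[] recommended, Set<Long> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        return (double) hits(recommended, relevant) / relevant.size();
    }
}
```
+
+Here "relevant" would be the held-out items the user is actually associated with, and
+"recommended" the item IDs returned by the recommender trained without them.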
+
+<a name="RecommenderDocumentation-UsefulLinks"></a>
+## Useful Links
+
+
+Here's a handful of research papers that I've read and found particularly
+useful:
+
+J.S. Breese, D. Heckerman and C. Kadie, "[Empirical Analysis of Predictive 
Algorithms for Collaborative 
Filtering](http://research.microsoft.com/research/pubs/view.aspx?tr_id=166)
+," in Proceedings of the Fourteenth Conference on Uncertainty in
+Artificial Intelligence (UAI 1998), 1998.
+
+B. Sarwar, G. Karypis, J. Konstan and J. Riedl, "[Item-based collaborative 
filtering recommendation algorithms](http://www10.org/cdrom/papers/519/)
+" in Proceedings of the Tenth International Conference on the World Wide
+Web (WWW 10), pp. 285-295, 2001.
+
+P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom and J. Riedl, "[GroupLens: an 
open architecture for collaborative filtering of 
netnews](http://doi.acm.org/10.1145/192844.192905)
+" in Proceedings of the 1994 ACM conference on Computer Supported
+Cooperative Work (CSCW 1994), pp. 175-186, 1994.
+
+J.L. Herlocker, J.A. Konstan, A. Borchers and J. Riedl, "[An algorithmic 
framework for performing collaborative 
filtering](http://www.grouplens.org/papers/pdf/algs.pdf)
+" in Proceedings of the 22nd annual international ACM SIGIR Conference on
+Research and Development in Information Retrieval (SIGIR 99), pp. 230-237,
+1999.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
new file mode 100644
index 0000000..2b090e6
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/recommender-first-timer-faq.md
@@ -0,0 +1,54 @@
+---
+layout: default
+title: Recommender First-Timer FAQ
+theme:
+    name: retro-mahout
+---
+
+# Recommender First Timer Dos and Don'ts
+
+Many people with an interest in recommenders arrive at Mahout since they're
+building a first recommender system. Some starting questions have been
+asked enough times to warrant a FAQ collecting advice and rules-of-thumb to
+newcomers.
+
+For the interested, these topics are treated in detail in the book [Mahout in 
Action](http://manning.com/owen/).
+
+Don't start with a distributed, Hadoop-based recommender; take on that
+complexity only if necessary. Start with non-distributed recommenders. It
+is simpler, has fewer requirements, and is more flexible. 
+
+As a crude rule of thumb, a system with up to 100M user-item associations
+(ratings, preferences) should "fit" onto one modern server machine with 4GB
+of heap available and run acceptably as a real-time recommender. The system
+is invariably memory-bound since keeping data in memory is essential to
+performance.
+
+Beyond this point it gets expensive to deploy a machine with enough RAM,
+so designing for a distributed system makes sense when nearing this scale.
+However most applications don't "really" have 100M associations to process.
+Data can be sampled; noisy and old data can often be aggressively pruned
+without significant impact on the result.
+
+The next question is whether or not your system has preference values, or
+ratings. Do users and items merely have an association or not, such as the
+existence or lack of a click? Or is behavior translated into some scalar
+value representing the user's degree of preference for the item?
+
+If you have ratings, then a good place to start is a
+GenericItemBasedRecommender, plus a PearsonCorrelationSimilarity similarity
+metric. If you don't have ratings, then a good place to start is
+GenericBooleanPrefItemBasedRecommender and LogLikelihoodSimilarity.
+
+If you want to do content-based item-item similarity, you need to implement
+your own ItemSimilarity.
+
+If your data can be simply exported to a CSV file, use FileDataModel and
+push new files periodically.
+If your data is in a database, use MySQLJDBCDataModel (or its "BooleanPref"
+counterpart if appropriate, or its PostgreSQL counterpart, etc.) and put on
+top a ReloadFromJDBCDataModel.
+
+This should give a reasonable starter system which responds fast. The
+nature of the system is that new data comes in from the file or database
+only periodically -- perhaps on the order of minutes. 
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
new file mode 100644
index 0000000..da17b38
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_convenience/map-reduce/recommender/userbased-5-minutes.md
@@ -0,0 +1,133 @@
+---
+layout: default
+title: User Based Recommender in 5 Minutes
+theme:
+    name: retro-mahout
+---
+
+# Creating a User-Based Recommender in 5 minutes
+
+## Prerequisites
+
+Create a Java project in your favorite IDE and make sure Mahout is on the 
classpath. The easiest way to accomplish this is by importing it via Maven as 
described on the [Quickstart](/users/basics/quickstart.html) page.
+
+
+## Dataset
+
+Mahout's recommenders expect interactions between users and items as input. 
The easiest way to supply such data to Mahout is in the form of a text file, 
where every line has the format *userID,itemID,value*. Here *userID* and 
*itemID* refer to a particular user and a particular item, and *value* denotes 
the strength of the interaction (e.g. the rating given to a movie).
+
+In this example, we'll use some made up data for simplicity. Create a file 
called "dataset.csv" and copy the following example interactions into the file. 
+
+<pre>
+1,10,1.0
+1,11,2.0
+1,12,5.0
+1,13,5.0
+1,14,5.0
+1,15,4.0
+1,16,5.0
+1,17,1.0
+1,18,5.0
+2,10,1.0
+2,11,2.0
+2,15,5.0
+2,16,4.5
+2,17,1.0
+2,18,5.0
+3,11,2.5
+3,12,4.5
+3,13,4.0
+3,14,3.0
+3,15,3.5
+3,16,4.5
+3,17,4.0
+3,18,5.0
+4,10,5.0
+4,11,5.0
+4,12,5.0
+4,13,0.0
+4,14,2.0
+4,15,3.0
+4,16,1.0
+4,17,4.0
+4,18,1.0
+</pre>
+
+## Creating a user-based recommender
+
+Create a class called *SampleRecommender* with a main method.
+
+The first thing we have to do is load the data from the file. Mahout's 
recommenders use an interface called *DataModel* to handle interaction data. 
You can load our made up interactions like this:
+
+<pre>
+DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
+</pre>
+
+In this example, we want to create a user-based recommender. The idea behind 
this approach is that when we want to compute recommendations for a particular 
user, we look for other users with similar taste and pick the 
recommendations from their items. For finding similar users, we have to compare 
their interactions. There are several methods for doing this. One popular 
method is to compute the [correlation 
coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient)
 between their interactions. In Mahout, you use this method as follows:
+
+<pre>
+UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
+</pre>
+
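+To show what the correlation coefficient measures, here is a minimal sketch in plain
+Java that computes it for two users over the items both have rated. It is not the
+implementation of PearsonCorrelationSimilarity; the class name `PearsonSketch` and the
+convention of returning NaN when one user's ratings have no variance are assumptions
+for this example.
+
```java
class PearsonSketch {

    /**
     * Pearson correlation of two rating vectors. x[k] and y[k] must be the
     * ratings that the two users gave to the same item.
     */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int k = 0; k < n; k++) {
            meanX += x[k];
            meanY += y[k];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int k = 0; k < n; k++) {
            double dx = x[k] - meanX;
            double dy = y[k] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        double denominator = Math.sqrt(sxx * syy);
        // Assumption: the correlation is undefined if either user rates everything the same.
        return denominator == 0.0 ? Double.NaN : sxy / denominator;
    }
}
```
+
+For users 1 and 2 in dataset.csv, the items rated by both are 10, 11, 15, 16, 17 and
+18, so the inputs would be {1, 2, 4, 5, 1, 5} and {1, 2, 5, 4.5, 1, 5}. Values close
+to 1 indicate similar taste, values close to -1 opposite taste.
+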
+The next thing we have to do is to define which similar users we want to 
leverage for the recommender. For the sake of simplicity, we'll use all that 
have a similarity greater than *0.1*. This is implemented via a 
*ThresholdUserNeighborhood*:
+
+<pre>UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, 
similarity, model);</pre>
+
+Now we have all the pieces to create our recommender:
+
+<pre>
+UserBasedRecommender recommender = new GenericUserBasedRecommender(model, 
neighborhood, similarity);
+</pre>
+        
+We can easily ask the recommender for recommendations now. If we wanted to get 
three items recommended for the user with *userID* 2, we would do it like this:
+       
+
+<pre>
+List<RecommendedItem> recommendations = recommender.recommend(2, 3);
+for (RecommendedItem recommendation : recommendations) {
+  System.out.println(recommendation);
+}
+</pre>
+
+
+Congratulations, you have built your first recommender!
+
+
+## Evaluation
+
+You might ask yourself how to make sure that your recommender returns good 
results. Unfortunately, the only way to be really sure about the quality is by 
doing an A/B test with real users in a live system.
+
+We can, however, try to get a feel for the quality through statistical offline 
evaluation. Just keep in mind that this does not replace a test with real users!
+
+One way to check whether the recommender returns good results is by doing a 
**hold-out** test. We partition our dataset into two sets: a training set 
consisting of 90% of the data and a test set consisting of 10%. Then we train 
our recommender using the training set and see how well it predicts the 
unknown interactions in the test set.
+
+To test our recommender, we create a class called *EvaluateRecommender* with a 
main method and add an inner class called *MyRecommenderBuilder* that 
implements the *RecommenderBuilder* interface. We implement the 
*buildRecommender* method and make it set up our user-based recommender:
+
+<pre>
+UserSimilarity similarity = new PearsonCorrelationSimilarity(dataModel);
+UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, 
dataModel);
+return new GenericUserBasedRecommender(dataModel, neighborhood, similarity);
+</pre>
+
+Now we have to create the code for the test. We'll check how much the 
recommender misses the real interaction strength on average. We employ an 
*AverageAbsoluteDifferenceRecommenderEvaluator* for this. The following code 
shows how to put the pieces together and run a hold-out test: 
+
+<pre>
+DataModel model = new FileDataModel(new File("/path/to/dataset.csv"));
+RecommenderEvaluator evaluator = new 
AverageAbsoluteDifferenceRecommenderEvaluator();
+RecommenderBuilder builder = new MyRecommenderBuilder();
+double result = evaluator.evaluate(builder, null, model, 0.9, 1.0);
+System.out.println(result);
+</pre>
+
+Note: if you run this test multiple times, you will get different results, 
because the split into training set and test set is done randomly. 
+
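+The figure printed by the hold-out test is an average absolute difference between
+predicted and actual interaction strengths. A minimal sketch of that metric in plain
+Java follows. It is not the code of AverageAbsoluteDifferenceRecommenderEvaluator; the
+class name `HoldOutError` and the choice to skip NaN predictions (cases where no
+estimate could be made) are assumptions for this example.
+
```java
class HoldOutError {

    /**
     * Mean of |predicted - actual| over all test interactions for which a
     * prediction is available. Returns NaN if there is nothing to average.
     */
    static double averageAbsoluteDifference(double[] predicted, double[] actual) {
        if (predicted.length != actual.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        double sum = 0.0;
        int counted = 0;
        for (int k = 0; k < predicted.length; k++) {
            if (Double.isNaN(predicted[k])) {
                continue; // assumption: interactions without an estimate are ignored
            }
            sum += Math.abs(predicted[k] - actual[k]);
            counted++;
        }
        return counted == 0 ? Double.NaN : sum / counted;
    }
}
```
+
+A result of 0.75, for example, would mean that the recommender misses the held-out
+values by three quarters of a rating point on average; lower is better.
+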
+
+
+
+
+
+
+
+
+
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_convenience/powered-by-mahout.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_convenience/powered-by-mahout.md 
b/website/old_site_migration/needs_work_convenience/powered-by-mahout.md
new file mode 100644
index 0000000..cb7c039
--- /dev/null
+++ b/website/old_site_migration/needs_work_convenience/powered-by-mahout.md
@@ -0,0 +1,129 @@
+---
+layout: default
+title: Powered By Mahout
+theme:
+    name: retro-mahout
+---
+
+# Powered by Mahout
+
+Are you using Mahout to do Machine Learning? <a 
href="https://mahout.apache.org/general/mailing-lists,-irc-and-archives.html">Care
 to share</a>? Developers of the project are always happy to learn about new 
happy users with interesting use cases.
+
+*Links here do NOT imply
+endorsement by Mahout, its committers or the Apache Software Foundation and
+are for informational purposes only.*
+
+<a name="PoweredByMahout-CommercialUse"></a>
+## Commercial Use
+
+* <a 
href="http://nosql.mypopescu.com/post/2082712431/hbase-and-hadoop-at-adobe">Adobe
 AMP</a> uses Mahout's clustering algorithms to increase video
+consumption by better user targeting. 
+* Accenture uses Mahout as typical example for their [Hadoop Deployment 
Comparison 
Study](http://www.accenture.com/SiteCollectionDocuments/PDF/Accenture-Hadoop-Deployment-Comparison-Study.pdf)
+* [AOL](http://www.aol.com)
+ uses Mahout for shopping recommendations. See [slide 
deck](http://www.slideshare.net/kryton/the-data-layer)
+* [Booz Allen Hamilton](http://www.boozallen.com/)
+ uses Mahout's clustering algorithms. See [slide 
deck](http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010)
+* [Buzzlogic](http://www.buzzlogic.com)
+ uses Mahout's clustering algorithms to improve ad targeting
+* [Cull.tv](http://cull.tv/)
+ uses modified Mahout algorithms for content recommendations
+* ![DatamineLab](http://cdn.dataminelab.com/favicon.ico) [DataMine 
Lab](http://dataminelab.com)
+ uses Mahout's recommendation and clustering algorithms to improve our
+clients' ad targeting.
+* [Drupal](http://drupal.org/project/recommender)
+ uses Mahout to provide open source content recommendation solutions.
+* [Evolv ](http://www.evolvondemand.com)
+ uses Mahout for its Workforce Predictive Analytics platform.
+* [Foursquare](http://www.foursquare.com)
+ uses Mahout for its [recommendation 
engine](http://engineering.foursquare.com/2011/03/22/building-a-recommendation-engine-foursquare-style/).
+* [Idealo](http://www.idealo.de)
+ uses Mahout's recommendation engine.
+* [InfoGlutton](http://www.infoglutton.com)
+ uses Mahout's clustering and classification for various consulting
+projects.
+* 
[Intel](http://mark.chmarny.com/2013/07/thinking-big-about-data-at-intel.html)
+ ships Mahout as part of their Distribution for Apache Hadoop Software.
+* [Intela](http://www.intela.com/)
+ has implementations of Mahout's recommendation algorithms to select new
+offers to send to customers, as well as to recommend potential customers to
+current offers. We are also working on enhancing our offer categories by
+using the clustering algorithms.
+* ![iOffer](http://ioffer.com/favicon.ico) [iOffer](http://www.ioffer.com)
+ uses Mahout's Frequent Pattern Mining and Collaborative Filtering to
+recommend items to users.
+* ![kau.li](http://kau.li/favicon.ico) [Kauli](http://kau.li/en)
+, a Japanese ad network, uses Mahout's clustering to handle clickstream
+data for predicting audience's interests and intents.
+* [Linked.In](http://linkedin.com)
+ Historically, we have used R for model training. We have recently started
+experimenting with Mahout for model training and are excited about it - also 
see
+ <a 
href="https://www.quora.com/LinkedIn-Recommendations/How-does-LinkedIns-recommendation-system-work?srid=XoeG&share=1">Hadoop
 World slides</a>
+.
+* [LucidWorks Big Data](http://www.lucidworks.com/products/lucidworks-big-data)
+ uses Mahout for clustering, duplicate document detection, phrase
+extraction and classification.
+* ![Mendeley](http://mendeley.com/favicon.ico) [Mendeley](http://mendeley.com)
+ uses Mahout to power Mendeley Suggest, a research article recommendation
+service.
+* ![Mippin](http://mippin.com/web/favicon.ico) [Mippin](http://mippin.com)
+ uses Mahout's collaborative filtering engine to recommend news feeds
+* 
[Mobage](http://www.slideshare.net/hamadakoichi/mobage-prmu-2011-mahout-hadoop)
+ uses Mahout in their analysis pipeline
+* ![Myrrix](http://myrrix.com/wp-content/uploads/2012/03/favicon.ico) 
[Myrrix](http://myrrix.com)
+ is a recommender system product built on Mahout.
+* ![Newscred](http://www.newscred.com/static/img/website/favicon.ico) 
[NewsCred](http://platform.newscred.com)
+ uses Mahout to generate clusters of news articles and to surface the
+important stories of the day
+* [Next Glass](http://nextglass.co/)
+ uses Mahout
+* [Predixion Software](http://predixionsoftware.com/)
+ uses Mahout's algorithms to build predictive models on big data
+* <img src="http://www.radoop.eu/wp-content/uploads/favicon.png" width=15> 
[Radoop](http://radoop.eu)
+ provides a drag-n-drop interface for big data analytics, including Mahout
+clustering and classification algorithms
+* ![Researchgate](https://www.researchgate.net/favicon.ico) 
[ResearchGate](http://www.researchgate.net/), the professional network for 
scientists and researchers, uses Mahout's
+recommendation algorithms.
+* [Sematext](http://www.sematext.com/)
+ uses Mahout for its recommendation engine
+* [SpeedDate.com](http://www.speeddate.com)
+ uses Mahout's collaborative filtering engine to recommend member profiles
+* [Twitter](http://twitter.com)
+ uses Mahout's LDA implementation for user interest modeling
+* [Yahoo\!](http://www.yahoo.com)
+ Mail uses Mahout's Frequent Pattern Set Mining.  See 
[slides](http://www.slideshare.net/hadoopusergroup/mail-antispam)
+* [365Media ](http://365media.com/)
+ uses *Mahout's* Classification and Collaborative Filtering algorithms in
+its Real-time system named [UPTIME](http://uptime.365media.com/)
+ and 365Media/Social
+
+<a name="PoweredByMahout-AcademicUse"></a>
+## Academic Use
+
+* [Dicode](https://www.dicode-project.eu/)
+ project uses Mahout's clustering and classification algorithms on top of
+HBase.
+* The course [Large Scale Data Analysis and Data 
Mining](http://www.dima.tu-berlin.de/menue/teaching/masterstudium/aim-3/)
+ at TU Berlin uses Mahout to teach students about the parallelization of data
+mining problems with Hadoop and Map/Reduce
+* Mahout is used at Carnegie Mellon University, as a comparable platform to 
[GraphLab](http://www.graphlab.ml.cmu.edu/)
+
+* The [ROBUST project](http://www.robust-project.eu/)
+, co-funded by the European Commission, employs Mahout in the large scale
+analysis of online community data.
+* Mahout is used for research and data processing at [Nagoya Institute of 
Technology](http://www.nitech.ac.jp/eng/schools/grad/cse.html)
+, in the context of a large-scale citizen participation platform project,
+funded by the Ministry of Interior of Japan.
+* Several researchers within [Digital Enterprise Research 
Institute](http://www.deri.ie)
+ [NUI Galway](http://www.nuigalway.ie)
+ use Mahout for e.g. topic mining and modelling of large corpora.
+* Mahout is used in the NoTube EU project.
+
+<a name="PoweredByMahout-PoweredByLogos"></a>
+## Powered By Logos
+
+Feel free to use our **Powered By** logos on your site:
+
+![powered by 
logo](https://mahout.apache.org/images/mahout-logo-poweredby-55.png)
+
+
+![powered by 
logo](https://mahout.apache.org/images/mahout-logo-poweredby-100.png)
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
----------------------------------------------------------------------
diff --git 
a/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md 
b/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
new file mode 100644
index 0000000..14dd276
--- /dev/null
+++ 
b/website/old_site_migration/needs_work_priority/creating-vectors-from-text.md
@@ -0,0 +1,291 @@
+---
+layout: default
+title: Creating Vectors from Text
+theme:
+    name: retro-mahout
+---
+
+
+# Creating vectors from text
+<a name="CreatingVectorsfromText-Introduction"></a>
+# Introduction
+
+For clustering and classifying documents it is usually necessary to convert the raw text
+into vectors that can then be consumed by the clustering [Algorithms](algorithms.html).
+Two approaches are described below: creating vectors from a Lucene index and
+creating vectors from a directory of text documents.
+
+<a name="CreatingVectorsfromText-FromLucene"></a>
+# From Lucene
+
+*NOTE: Your Lucene index must be created with the same version of Lucene
+used in Mahout.  As of Mahout 0.9 this is Lucene 4.6.1. If these versions don't
+match you will likely get an error such as "Exception in thread "main"
+org.apache.lucene.index.CorruptIndexException: Unknown format version: -11".*
+
+Mahout has utilities that allow one to easily produce Mahout Vector
+representations from a Lucene (and Solr, since they are the same) index.
+
+For this, we assume you know how to build a Lucene/Solr index. For those
+who don't, it is probably easiest to get up and running using 
[Solr](http://lucene.apache.org/solr)
+ as it can ingest things like PDFs, XML, Office, etc. and create a Lucene
+index. For those wanting to use just Lucene, see the [Lucene 
website](http://lucene.apache.org/core)
+ or check out _Lucene In Action_ by Erik Hatcher, Otis Gospodnetic and Mike
+McCandless.
+
+To get started, make sure you get a fresh copy of Mahout from [GitHub](http://mahout.apache.org/developers/buildingmahout.html)
+and are comfortable building it. Mahout defines interfaces and implementations
+for efficiently iterating over a data source (only Lucene is supported
+currently, but this should be extensible to databases, Solr, etc.) and produces
+a Mahout Vector file and term dictionary which can then be used for
+clustering.  The main code for driving this is the driver program located
+in the org.apache.mahout.utils.vectors package.  The driver program offers
+several input options, which can be displayed by specifying the --help
+option.  Examples of running the driver are included below:
+
+<a name="CreatingVectorsfromText-GeneratinganoutputfilefromaLuceneIndex"></a>
+#### Generating an output file from a Lucene Index
+
+
+    $MAHOUT_HOME/bin/mahout lucene.vector 
+        --dir (-d) dir                     The Lucene directory      
+        --idField idField                  The field in the index    
+                                               containing the index.  If 
+                                               null, then the Lucene     
+                                               internal doc id is used   
+                                               which is prone to error   
+                                               if the underlying index   
+                                               changes                   
+        --output (-o) output               The output file           
+        --delimiter (-l) delimiter         The delimiter for         
+                                               outputting the dictionary 
+        --help (-h)                        Print out help            
+        --field (-f) field                 The field in the index    
+        --max (-m) max                         The maximum number of     
+                                               vectors to output.  If    
+                                               not specified, then it    
+                                               will loop over all docs   
+        --dictOut (-t) dictOut             The output of the         
+                                               dictionary                
+        --seqDictOut (-st) seqDictOut      The output of the         
+                                               dictionary as sequence    
+                                               file                      
+        --norm (-n) norm                   The norm to use,          
+                                               expressed as either a     
+                                               double or "INF" if you    
+                                               want to use the Infinite  
+                                               norm.  Must be greater or 
+                                               equal to 0.  The default  
+                                               is not to normalize       
+        --maxDFPercent (-x) maxDFPercent   The max percentage of     
+                                               docs for the DF.  Can be  
+                                               used to remove really     
+                                               high frequency terms.     
+                                               Expressed as an integer   
+                                               between 0 and 100.        
+                                               Default is 99.            
+        --weight (-w) weight               The kind of weight to     
+                                               use. Currently TF or      
+                                               TFIDF                     
+        --minDF (-md) minDF                The minimum document      
+                                               frequency.  Default is 1  
+        --maxPercentErrorDocs (-err) mErr  The max percentage of     
+                                               docs that can have a null 
+                                               term vector. These are    
+                                               noise document and can    
+                                               occur if the analyzer     
+                                               used strips out all terms 
+                                               in the target field. This 
+                                               percentage is expressed   
+                                               as a value between 0 and  
+                                               1. The default is 0.  
+  
+#### Example: Create 50 Vectors from an Index 
+
+    $MAHOUT_HOME/bin/mahout lucene.vector \
+        --dir $WORK_DIR/wikipedia/solr/data/index \
+        --field body \
+        --dictOut $WORK_DIR/solr/wikipedia/dict.txt \
+        --output $WORK_DIR/solr/wikipedia/out.txt \
+        --max 50
+
+
+This reads the body field of the index specified by --dir, writes the
+vectors to the file given by --output and the dictionary to dict.txt. It only
+outputs 50 vectors.  If you don't specify --max, then all the documents in
+the index are output.
+
+<a name="CreatingVectorsfromText-50VectorsFromLuceneL2Norm"></a>
+#### Example: Creating 50 Normalized Vectors from a Lucene Index using the 
[L_2 Norm](http://en.wikipedia.org/wiki/Lp_space)
+
+    $MAHOUT_HOME/bin/mahout lucene.vector \
+        --dir $WORK_DIR/wikipedia/solr/data/index \
+        --field body \
+        --dictOut $WORK_DIR/solr/wikipedia/dict.txt \
+        --output $WORK_DIR/solr/wikipedia/out.txt \
+        --max 50 \
+        --norm 2
+
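+As a rough illustration of what the --norm option asks for, the sketch below
+scales a sparse vector to unit length under an L_p norm or the infinity norm.
+It is plain Python, not Mahout code: the function name `normalize` and the
+dict-based vector are assumptions made for this example, and it only handles
+p > 0.

```python
def normalize(vector, norm):
    """Scale a sparse vector (dict of index -> weight) to unit length.

    `norm` is a positive number p for the L_p norm, or the string "INF"
    for the infinity norm. Hypothetical helper, not part of Mahout.
    """
    if not vector:
        return {}
    if norm == "INF":
        # Infinity norm: the largest absolute weight.
        length = max(abs(v) for v in vector.values())
    else:
        p = float(norm)
        # L_p norm: (sum of |v|^p) ^ (1/p).
        length = sum(abs(v) ** p for v in vector.values()) ** (1.0 / p)
    if length == 0:
        return dict(vector)
    return {i: v / length for i, v in vector.items()}


doc = {0: 3.0, 7: 4.0}
print(normalize(doc, 2))      # divides by sqrt(3^2 + 4^2) = 5
print(normalize(doc, "INF"))  # divides by the largest weight, 4
```

+With the L_2 norm every document vector ends up with Euclidean length 1, so
+long and short documents become comparable.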
+
+<a name="CreatingVectorsfromText-FromDirectoryofTextdocuments"></a>
+## From A Directory of Text documents
+Mahout has utilities to generate Vectors from a directory of text
+documents. Before creating the vectors, you need to convert the documents
+to SequenceFile format. SequenceFile is a Hadoop class which allows us to
+write arbitrary (key, value) pairs into it. The DocumentVectorizer requires the
+key to be a Text with a unique document id, and value to be the Text
+content in UTF-8 format.
+
+You may find [Tika](http://tika.apache.org/) helpful in converting
+binary documents to text.
+
+<a 
name="CreatingVectorsfromText-ConvertingdirectoryofdocumentstoSequenceFileformat"></a>
+#### Converting directory of documents to SequenceFile format
+Mahout has a nifty utility which reads a directory path including its
+sub-directories and creates the SequenceFile in a chunked manner for us.
+
+    $MAHOUT_HOME/bin/mahout seqdirectory 
+        --input (-i) input                       Path to job input directory.
+        --output (-o) output                     The directory pathname for
+                                                     output.
+        --overwrite (-ow)                        If present, overwrite the
+                                                     output directory before
+                                                     running job
+        --method (-xm) method                    The execution method to use:
+                                                     sequential or mapreduce.
+                                                     Default is mapreduce
+        --chunkSize (-chunk) chunkSize           The chunkSize in MegaBytes.
+                                                     Defaults to 64
+        --fileFilterClass (-filter) fFilterClass The name of the class to use
+                                                     for file parsing. Default:
+                                                     org.apache.mahout.text.PrefixAdditionFilter
+        --keyPrefix (-prefix) keyPrefix          The prefix to be prepended to
+                                                     the key
+        --charset (-c) charset                   The name of the character
+                                                     encoding of the input files.
+                                                     Default to UTF-8
+                                                     {accepts: cp1252|ascii...}
+        --help (-h)                              Print out help
+        --tempDir tempDir                        Intermediate output directory
+        --startPhase startPhase                  First phase to run
+        --endPhase endPhase                      Last phase to run
+
+The output of seqDirectory will be a Sequence file < Text, Text > of all 
documents (/sub-directory-path/documentFileName, documentText).
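+
+To make the shape of that output concrete, here is a small Python sketch that
+walks a directory tree and yields (key, text) pairs in the same spirit. It does
+not write a Hadoop SequenceFile and is not how seqdirectory is implemented; the
+function name `read_directory` and its parameters are assumptions made for this
+illustration.

```python
import os
import tempfile


def read_directory(root, key_prefix="", charset="utf-8"):
    """Yield (key, text) for every file under `root`, including sub-directories.

    The key is the file path relative to `root`, with a leading slash and an
    optional prefix. Hypothetical helper, not part of Mahout.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit sub-directories in a stable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            with open(path, encoding=charset) as handle:
                yield key_prefix + "/" + rel, handle.read()


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "sub"))
    with open(os.path.join(tmp, "a.txt"), "w", encoding="utf-8") as f:
        f.write("first document")
    with open(os.path.join(tmp, "sub", "b.txt"), "w", encoding="utf-8") as f:
        f.write("second document")
    for key, text in read_directory(tmp):
        print(key, "->", text)
```

+Each key doubles as the document id that later vectorization steps carry along.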
+
+<a name="CreatingVectorsfromText-CreatingVectorsfromSequenceFile"></a>
+#### Creating Vectors from SequenceFile
+
+From the sequence file generated from the above step run the following to
+generate vectors. 
+
+
+    $MAHOUT_HOME/bin/mahout seq2sparse
+        --minSupport (-s) minSupport      (Optional) Minimum Support. Default
+                                              Value: 2
+        --analyzerName (-a) analyzerName  The class name of the analyzer
+        --chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes. Default
+                                              Value: 100MB
+        --output (-o) output              The directory pathname for output.
+        --input (-i) input                Path to job input directory.
+        --minDF (-md) minDF               The minimum document frequency.  Default
+                                              is 1
+        --maxDFSigma (-xs) maxDFSigma     What portion of the tf (tf-idf) vectors
+                                              to be used, expressed in times the
+                                              standard deviation (sigma) of the
+                                              document frequencies of these vectors.
+                                              Can be used to remove really high
+                                              frequency terms. Expressed as a double
+                                              value. Good value to be specified is 3.0.
+                                              In case the value is less than 0 no
+                                              vectors will be filtered out. Default is
+                                              -1.0.  Overrides maxDFPercent
+        --maxDFPercent (-x) maxDFPercent  The max percentage of docs for the DF.
+                                              Can be used to remove really high
+                                              frequency terms. Expressed as an integer
+                                              between 0 and 100. Default is 99.  If
+                                              maxDFSigma is also set, it will override
+                                              this value.
+        --weight (-wt) weight             The kind of weight to use. Currently TF
+                                              or TFIDF. Default: TFIDF
+        --norm (-n) norm                  The norm to use, expressed as either a
+                                              float or "INF" if you want to use the
+                                              Infinite norm.  Must be greater or equal
+                                              to 0.  The default is not to normalize
+        --minLLR (-ml) minLLR             (Optional)The minimum Log Likelihood
+                                              Ratio(Float)  Default is 1.0
+        --numReducers (-nr) numReducers   (Optional) Number of reduce tasks.
+                                              Default Value: 1
+        --maxNGramSize (-ng) ngramSize    (Optional) The maximum size of ngrams to
+                                              create (2 = bigrams, 3 = trigrams, etc)
+                                              Default Value:1
+        --overwrite (-ow)                 If set, overwrite the output directory
+        --help (-h)                       Print out help
+        --sequentialAccessVector (-seq)   (Optional) Whether output vectors should
+                                              be SequentialAccessVectors. Default is false;
+                                              true required for running some algorithms
+                                              (LDA,Lanczos)
+        --namedVector (-nv)               (Optional) Whether output vectors should
+                                              be NamedVectors. If set true else false
+        --logNormalize (-lnorm)           (Optional) Whether output vectors should
+                                              be logNormalize. If set true else false
+
+
+
+This will create SequenceFiles of tokenized documents < Text, StringTuple >  
(docID, tokenizedDoc) and vectorized documents < Text, VectorWritable > (docID, 
TF-IDF Vector).  
+
+As well, seq2sparse will create SequenceFiles for: a dictionary (wordIndex, 
word), a word frequency count (wordIndex, count) and a document frequency count 
(wordIndex, DFCount) in the output directory. 
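+
+The relationship between these outputs can be sketched in a few lines of
+Python. This is an illustration only: the function name `vectorize` is
+hypothetical, and the weighting used here is the textbook tf * log(N / df),
+which is an assumption and may differ from the exact formula Mahout applies.

```python
import math
from collections import Counter


def vectorize(tokenized_docs):
    """Build a dictionary, document-frequency counts and TF-IDF vectors.

    `tokenized_docs` maps docID -> list of tokens. Hypothetical helper that
    mirrors the kinds of outputs described above, not Mahout's implementation.
    """
    # Dictionary: word -> wordIndex, assigned in sorted order.
    vocab = sorted({t for tokens in tokenized_docs.values() for t in tokens})
    dictionary = {word: index for index, word in enumerate(vocab)}

    # Document frequency: wordIndex -> number of documents containing the word.
    doc_freq = Counter()
    for tokens in tokenized_docs.values():
        for word in set(tokens):
            doc_freq[dictionary[word]] += 1

    # Sparse TF-IDF vector per document: wordIndex -> weight.
    n_docs = len(tokenized_docs)
    vectors = {}
    for doc_id, tokens in tokenized_docs.items():
        term_freq = Counter(dictionary[t] for t in tokens)
        vectors[doc_id] = {
            index: count * math.log(n_docs / doc_freq[index])
            for index, count in term_freq.items()
        }
    return dictionary, doc_freq, vectors


docs = {"d1": ["apple", "banana", "apple"], "d2": ["banana", "cherry"]}
dictionary, doc_freq, vectors = vectorize(docs)
print(dictionary)
print(dict(doc_freq))
print(vectors)
```

+Note how a word that occurs in every document ("banana" here) receives a
+weight of zero under this weighting, which is the intuition behind TF-IDF.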
+
+The --minSupport option is the minimum frequency a word must have to be
+considered as a feature; --minDF is the minimum number of documents the word
+must appear in; --maxDFPercent is the maximum allowed value of the ratio
+(document frequency of a word / total number of documents), expressed as a
+percentage, for the word to be kept as a feature. --maxDFPercent is helpful in
+removing high frequency features like stop words, while --minSupport and
+--minDF remove very rare terms.
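+
+A minimal Python sketch of how the three thresholds interact is shown below.
+The function name `select_features` is hypothetical, and the comparison
+operators (inclusive bounds) are an assumption; check the Mahout source if the
+exact boundary behaviour matters.

```python
from collections import Counter


def select_features(tokenized_docs, min_support=2, min_df=1, max_df_percent=99):
    """Return the set of words that pass all three thresholds.

    `tokenized_docs` is a list of token lists. Illustrative only; the
    inclusive comparisons are an assumption, not taken from Mahout.
    """
    term_count = Counter()  # total occurrences of each word
    doc_freq = Counter()    # number of documents containing each word
    for tokens in tokenized_docs:
        term_count.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(tokenized_docs)
    return {
        word
        for word in term_count
        if term_count[word] >= min_support
        and doc_freq[word] >= min_df
        and 100.0 * doc_freq[word] / n_docs <= max_df_percent
    }


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
# "the" appears in 100% of the documents and is dropped at an 85% cutoff;
# words seen only once fall below the default minimum support of 2.
print(sorted(select_features(docs, max_df_percent=85)))
```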
+
+The vectorized documents can then be used as input to many of Mahout's 
classification and clustering algorithms.
+
+#### Example: Creating Normalized 
[TF-IDF](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) Vectors from a directory 
of text documents using [trigrams](http://en.wikipedia.org/wiki/N-gram) and the 
[L_2 Norm](http://en.wikipedia.org/wiki/Lp_space)
+Create sequence files from the directory of text documents:
+    
+    $MAHOUT_HOME/bin/mahout seqdirectory \
+        -i $WORK_DIR/reuters \
+        -o $WORK_DIR/reuters-out-seqdir \
+        -c UTF-8 \
+        -chunk 64 \
+        -xm sequential
+
+Vectorize the documents using trigrams, L_2 length normalization and a maximum 
document frequency cutoff of 85%.
+
+    $MAHOUT_HOME/bin/mahout seq2sparse \
+        -i $WORK_DIR/reuters-out-seqdir/ \
+        -o $WORK_DIR/reuters-out-seqdir-sparse-kmeans \
+        --namedVector \
+        -wt tfidf \
+        -ng 3 \
+        -n 2 \
+        --maxDFPercent 85
+
+The sequence file in the 
$WORK_DIR/reuters-out-seqdir-sparse-kmeans/tfidf-vectors directory can now be 
used as input to the Mahout 
[k-Means](http://mahout.apache.org/users/clustering/k-means-clustering.html) 
clustering algorithm.
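+
+For readers unfamiliar with n-grams, the following Python sketch lists every
+n-gram up to a maximum size for one tokenized document. It is a simplification
+under stated assumptions: the function name `ngrams` is hypothetical, and no
+log-likelihood-ratio filtering (the --minLLR option) is applied here.

```python
def ngrams(tokens, max_size):
    """Return all n-grams of size 1..max_size as space-joined strings.

    Hypothetical helper for illustration; it keeps every n-gram and does
    no statistical filtering.
    """
    result = []
    for size in range(1, max_size + 1):
        for start in range(len(tokens) - size + 1):
            result.append(" ".join(tokens[start:start + size]))
    return result


# With a maximum size of 3 this yields unigrams, bigrams and trigrams.
print(ngrams(["it", "is", "sunny", "today"], 3))
```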
+
+<a name="CreatingVectorsfromText-Background"></a>
+## Background
+
+* [Discussion on centroid calculations with sparse 
vectors](http://markmail.org/thread/l5zi3yk446goll3o)
+
+<a 
name="CreatingVectorsfromText-ConvertingexistingvectorstoMahout'sformat"></a>
+## Converting existing vectors to Mahout's format
+
+If you already have a processing pipeline for your documents (texts,
+images or whatever items you wish to treat), the question arises of how
+to convert the vectors it produces into the Mahout vector
+format. Probably the easiest way to go would be to implement your own
+Iterable<Vector> (called VectorIterable in the example below) and then
+reuse the existing VectorWriter classes:
+
+
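+    // Assumes Hadoop's SequenceFile and LongWritable and Mahout's VectorWritable,
+    // VectorWriter and SequenceFileVectorWriter are imported.
+    SequenceFile.Writer seqWriter = SequenceFile.createWriter(filesystem,
+                                                              configuration,
+                                                              outfile,
+                                                              LongWritable.class,
+                                                              VectorWritable.class);
+
+    // Wrap the Hadoop writer so that it accepts an Iterable<Vector>.
+    VectorWriter vectorWriter = new SequenceFileVectorWriter(seqWriter);
+    long numDocs = vectorWriter.write(new VectorIterable(), Long.MAX_VALUE);
+    vectorWriter.close();
+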
+

http://git-wip-us.apache.org/repos/asf/mahout/blob/9c031452/website/old_site_migration/needs_work_priority/creating-vectors.md
----------------------------------------------------------------------
diff --git a/website/old_site_migration/needs_work_priority/creating-vectors.md 
b/website/old_site_migration/needs_work_priority/creating-vectors.md
new file mode 100644
index 0000000..10cbd8e
--- /dev/null
+++ b/website/old_site_migration/needs_work_priority/creating-vectors.md
@@ -0,0 +1,16 @@
+---
+layout: default
+title: Creating Vectors
+theme:
+    name: retro-mahout
+---
+
+
+<a name="CreatingVectors-UtilitiesforCreatingVectors"></a>
+# Utilities for Creating Vectors
+
+1. [Text](creating-vectors-from-text.html) ... utilities to turn plain text 
into Mahout vectors.
+
+1. Mahout also has rudimentary support for the arff file format. See [arff 
junit 
doc](https://builds.apache.org/job/Mahout-Quality/ws/trunk/integration/target/site/apidocs/org/apache/mahout/utils/vectors/arff/package-summary.html).
+
+1. There is also support for reading vectors from [csv 
files](https://builds.apache.org/job/Mahout-Quality/ws/trunk/integration/target/site/apidocs/org/apache/mahout/utils/vectors/csv/package-summary.html).
