[
https://issues.apache.org/jira/browse/FLINK-1695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14363522#comment-14363522
]
ASF GitHub Bot commented on FLINK-1695:
---------------------------------------
Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/479#discussion_r26506880
--- Diff: docs/ml/alternating_least_squares.md ---
@@ -0,0 +1,157 @@
+---
+mathjax: include
+title: Alternating Least Squares
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+The alternating least squares (ALS) algorithm factorizes a given matrix
$R$ into two factors $U$ and $V$ such that $R \approx U^TV$.
+The unknown row dimension is given as a parameter to the algorithm and is
called latent factors.
+Since matrix factorization can be used in the context of recommendation,
the matrices $U$ and $V$ can be called user and item matrix, respectively.
+The $i$th column of the user matrix is denoted by $u_i$ and the $i$th
column of the item matrix is $v_i$.
+The matrix $R$ can be called the ratings matrix with $$(R)_{i,j} =
r_{i,j}$$.
+
+In order to find the user and item matrix, the following problem is solved:
+
+$$\arg\min_{U,V} \sum_{\{i,j\mid r_{i,j} \not= 0\}} \left(r_{i,j} -
u_{i}^Tv_{j}\right)^2 +
+\lambda \left(\sum_{i} n_{u_i} \left\lVert u_i \right\rVert^2 + \sum_{j}
n_{v_j} \left\lVert v_j \right\rVert^2 \right)$$
+
+with $\lambda$ being the regularization factor, $$n_{u_i}$$ being the
number of items the user $i$ has rated and $$n_{v_j}$$ being the number of
times the item $j$ has been rated.
+This regularization scheme to avoid overfitting is called
weighted-$\lambda$-regularization.
+Details can be found in the work of [Zhou et
al.](http://dx.doi.org/10.1007/978-3-540-68880-8_32).
+
+By fixing one of the matrices $U$ or $V$, we obtain a quadratic form which
can be solved directly.
+The solution of the modified problem is guaranteed to monotonically
decrease the overall cost function.
+By applying this step alternately to the matrices $U$ and $V$, we can
iteratively improve the matrix factorization.
+
+The matrix $R$ is given in its sparse representation as a tuple of $(i, j,
r)$ where $i$ denotes the row index, $j$ the column index and $r$ is the matrix
value at position $(i,j)$.
+
+
+## Parameters
+
+The alternating least squares implementation can be controlled by the
following parameters:
+
+ <table class="table table-bordered">
+ <thead>
+ <tr>
+ <th class="text-left" style="width: 20%">Parameters</th>
+ <th class="text-center">Description</th>
+ </tr>
+ </thead>
+
+ <tbody>
+ <tr>
+ <td><strong>NumFactors</strong></td>
+ <td>
+ <p>
+ The number of latent factors to use for the underlying model.
+ It is equivalent to the dimension of the calculated user and
item vectors.
+ (Default value: <strong>10</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Lambda</strong></td>
+ <td>
+ <p>
+ Regularization factor. Tune this value in order to avoid
overfitting or poor performance due to strong generalization.
+ (Default value: <strong>1</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Iterations</strong></td>
+ <td>
+ <p>
+ The maximum number of iterations.
+ (Default value: <strong>10</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Blocks</strong></td>
+ <td>
+ <p>
+ The number of blocks into which the user and item matrix a
grouped.
+ The fewer blocks one uses, the less data is sent redundantly.
+ However, bigger blocks entail bigger update messages which
have to be stored on the heap.
+ If the algorithm fails because of an OutOfMemoryException,
then try to increase the number of blocks.
+ (Default value: '''None''')
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>Seed</strong></td>
+ <td>
+ <p>
+ Random seed used to generate the initial item matrix for the
algorithm.
+ (Default value: <strong>0</strong>)
+ </p>
+ </td>
+ </tr>
+ <tr>
+ <td><strong>TemporaryPath</strong></td>
+ <td>
+ <p>
+ Path to a temporary directory into which intermediate results
are stored.
+ If this value is set, then the algorithm is split into two
preprocessing steps, the ALS iteration and a post-processing step which
calculates a last ALS half-step.
+ The preprocessing steps calculate the
<code>OutBlockInformation</code> and <code>InBlockInformation</code> for the
given rating matrix.
+ The result of the individual steps are stored in the specified
directory.
--- End diff --
Good catch again :-)
> Create machine learning library
> -------------------------------
>
> Key: FLINK-1695
> URL: https://issues.apache.org/jira/browse/FLINK-1695
> Project: Flink
> Issue Type: Improvement
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
>
> Create the infrastructure for Flink's machine learning library. This includes
> the creation of the module structure and the implementation of basic types
> such as vectors and matrices.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)