GitHub user debasish83 opened a pull request:
https://github.com/apache/spark/pull/3221
[MLLIB] [WIP] SPARK-2426: Quadratic Minimization for ALS
ALS is a generic algorithm for matrix factorization which is equally
applicable for both feature space and similarity space. Current ALS support L2
regularization and positivity constraint. This PR introduces userConstraint and
productConstraint to ALS and let's the user select different constraints for
user and product solves. The supported constraints are the following:
SMOOTH : default ALS with L2 regularization
ELASTIC NET: ALS with Elastic Net regularization
POSITIVE: ALS with positive factors
BOUNDS: ALS with factors bounded within upper and lower bound (default
within 0 and 1)
EQUALITY: ALS with equality constraint (default the factors sum up to 1 and
positive)
First let's focus on the problem formulation. Both implicit and explicit
feedback ALS formulation can be written as a quadratic minimization problem.
The quadratic objective can be written as xtHx + ctx. Each of the respective
constraints take the following form (for example sparsity constraint)
minimize xtHx + ctx
s.t ||x||1 <= c (SPARSE constraint)
We rewrite the objective as f(x) = xtHx + ctx and the constraint as an
indicator function g(x)
Now minimization of f(x) + g(x) can be carried out using various forward
backward splitting algorithms. We choose ADMM for this PR.
For use-cases the PR is focused on the following:
1. Sparse matrix factorization
Example run:
MASTER=spark://localhost:7077 ./bin/run-example mllib.MovieLensALS --rank
20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065
--productConstraint SPARSE --lambdaProduct 0.1 --kryo
hdfs://localhost:8020/sandbox/movielens/
2. Topic modeling using LSA
References:
2007 Sparse coding:
papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.pdf
2011 Sparse Latent Semantic Analysis LSA(some of it is implemented in
Graphlab):
https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf
2012 Sparse Coding + MR/MPI Microsoft:
http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf
Implementing the 20NG flow to validate the sparse coding result improvement
over LDA based topic modeling.
3. Topic modeling using PLSA
Reference:
Tutorial on Probabilistic Topic Modeling: Additive Regularization for
Stochastic Matrix Factorization
The EQUALITY formulation with a Quadratic loss is an approximation to the
KL divergence loss being used in PLSA. We are interested to see if it improves
the result further as compared to the Sparse coding.
Next steps:
1. Move QuadraticMinimizerSuite.scala to breeze
2. Scale the factorization rank and remove the need to construct H matrix
3. Replace the quadratic loss xtHx + ctx with a Convex loss
Detailed experiments are on the JIRA
https://issues.apache.org/jira/browse/SPARK-2426
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/debasish83/spark qp-als
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3221.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3221
----
commit 0b3f0530702b7ca54b5152a3b65530113b2d538c
Author: Debasish Das <[email protected]>
Date: 2014-06-13T03:24:24Z
jecos integrated as the default qpsolve in ALS; implicit tests are failing
commit 8ba4871ed44c5e971dddd1888be34fb5f950bf76
Author: Debasish Das <[email protected]>
Date: 2014-06-18T04:30:12Z
Qp options added to Spark ALS: unbounded Qp, Qp with pos, Qp with bounds,
Qp with smoothness, Qp with L1
commit dd912db0966c28501948cdd27e5271ff9b10a8c5
Author: Debasish Das <[email protected]>
Date: 2014-06-19T08:19:06Z
Prepared branch of ALS-QP feature/runtime testing
commit 6dd320b170ffe7d6f6487bfe686192a672fe20c9
Author: Debasish Das <[email protected]>
Date: 2014-06-20T23:01:21Z
QpProblem drivers in MovieLensALS
commit 48023c84fd4896899a2d0dfd25f7c33234551b22
Author: Debasish Das <[email protected]>
Date: 2014-06-21T01:23:41Z
debug option for octave quadprog validation
commit e7e64b7741c001735b85b25bca0dc6cea534bed3
Author: Debasish Das <[email protected]>
Date: 2014-06-21T07:50:18Z
L1 option added to ALS; Driver added to MovieLensALS
commit 84f1d67242e4ac6a7846f60c042269e31a57894a
Author: Debasish Das <[email protected]>
Date: 2014-06-21T09:33:02Z
Qp with equality and bounds added to option 4 of ECOS based QpSolver
commit 90bca10bcbf255135b4a9ec5b68b848133298dc7
Author: Debasish Das <[email protected]>
Date: 2014-06-30T22:59:08Z
Movielens runtime experiments for Spark Summit talk
commit 4e2c6235b52dd81a8a2e74182b2751c389201158
Author: Debasish Das <[email protected]>
Date: 2014-07-15T15:28:43Z
ADMM based QuadraticMinimizer in mllib.optimization;Used in ALS
commit 3f93ee5741be230e6a0fcab1bac9459243e9355e
Author: Debasish Das <[email protected]>
Date: 2014-07-16T01:50:55Z
Refactored to use com.github.ecos package
commit f2888465677028fd1e85eb3f591b2537514529a2
Author: Debasish Das <[email protected]>
Date: 2014-07-18T02:43:30Z
moved interior point based qp-als to feature/ipmqp-als branch; preparing
for distributed runs; rho=50 for equality constraint, default rho=1.0, alpha =
1.0 (no over-relaxation) for convergence study
commit 21d79901ebb6e9399ebf192afda3e5cbbe782f31
Author: Debasish Das <[email protected]>
Date: 2014-08-02T07:15:34Z
license cleanup; Copyright added to NOTICE
commit 13cb89b27963868227e940fd946e8faadfa32cb1
Author: Debasish Das <[email protected]>
Date: 2014-08-05T05:57:48Z
Merge with HEAD
commit a12d92a3c3a950bd8782592cb8c797199aa1fdfc
Author: Debasish Das <[email protected]>
Date: 2014-08-07T21:36:25Z
BSD license for Proximal algorithms
commit 02199a8939f421c10ffc688bfb0ce1e1a908e369
Author: Debasish Das <[email protected]>
Date: 2014-08-08T18:19:30Z
LICENSE and NOTICE updates as per Legal
commit f43ed66127781a5e6669e059686eb4cb9c5c2e28
Author: Debasish Das <[email protected]>
Date: 2014-08-09T05:45:47Z
Merge with master
commit c03dbeda9b3b1d0f182d3c1450bb1e8b3e0c9af3
Author: Debasish Das <[email protected]>
Date: 2014-08-13T06:42:56Z
Merge branch 'feature/qp-als' of
https://istg.vzvisp.com:8443/stash/scm/bda/spark into qp-als
commit c9d1fbf88337058f59fea958fb8f1aa17ea92c74
Author: Debasish Das <[email protected]>
Date: 2014-10-08T03:01:36Z
Redesign of ALS API; userConstraint and productConstraint separated;
commit f2cab3e63a6269d9455ac07299c69bd68b7c957c
Author: Debasish Das <[email protected]>
Date: 2014-10-29T22:58:11Z
delimiter added to MovieLensALS example;rho tuning in
QuadraticMinimizer;Elastic net formulation in ALS, elastic net parameter not
exposed to users
commit b2c9daca98f7766394596ca819813a06a8ec46b3
Author: Debasish Das <[email protected]>
Date: 2014-10-30T20:21:42Z
rho as sqrt(eigenMin*eigenMax) for sparse, sqrt(eigenMax) for other
formulations
commit c01f3e3dacec502b534a1b21f3b328f5586ddd02
Author: Debasish Das <[email protected]>
Date: 2014-10-31T20:48:39Z
NNLS bug2
commit a9412079d9f05cf9514d62978f0b4525590b871b
Author: Debasish Das <[email protected]>
Date: 2014-11-01T15:26:09Z
removed ecos dependency from mllib pom
commit 9b3951f558e5673eb475c575f14876421b5a3abc
Author: Debasish Das <[email protected]>
Date: 2014-11-05T01:23:09Z
validate user/product on MovieLens dataset through user input and compute
map measure along with rmse
commit cd3ab31cb9b244bae2b45396a6269ed1dc59151b
Author: Debasish Das <[email protected]>
Date: 2014-11-05T22:43:11Z
merged with AbstractParams serialization bug
commit 4bbae0f248ca8747b47ecf852d5aba19c9b39dab
Author: Debasish Das <[email protected]>
Date: 2014-11-05T23:23:02Z
comments fixed as per scalastyle
commit 9fa063e1eb172d68248e03797a54acc738543592
Author: Debasish Das <[email protected]>
Date: 2014-11-06T00:05:24Z
import scala.math.round
commit 10cbb37a7881867d801ae6630ffc0d09b3feebf9
Author: Debasish Das <[email protected]>
Date: 2014-11-08T06:31:40Z
provide ratio for topN product validation; generate MAP and prec@k metric
for movielens dataset
commit f38a1b59e27907f2aa9bd732c5f9147b738d3a0f
Author: Debasish Das <[email protected]>
Date: 2014-11-08T06:45:13Z
use sampleByKey for per user sampling
commit d144f57a58c9424365f1242f90961386c016641e
Author: Debasish Das <[email protected]>
Date: 2014-11-12T04:56:46Z
recommendAll API to MatrixFactorizationModel, uses topK finding using
BoundedPriorityQueue similar to RDD.top
commit 1e7e36e9158f5fcdf576ce87577fea05d11ea216
Author: Debasish Das <[email protected]>
Date: 2014-11-12T07:18:35Z
Updated qp-als with irmetrics for experiments
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]