[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

debasish83 Tue, 07 Oct 2014 20:50:33 -0700

GitHub user debasish83 opened a pull request:

    https://github.com/apache/spark/pull/2705


    [MLLIB] [WIP] SPARK-2426: Quadratic Minimization for MLlib ALS

    ALS is a generic algorithm for matrix factorization which is equally 
applicable for both feature space and similarity space. Current ALS support L2 
regularization and positivity constraint. This PR introduces userConstraint and 
productConstraint to ALS and let's the user select different constraints for 
user and product solves. The supported constraints are the following:
    
    1. SMOOTH : default ALS with L2 regularization
    2. POSITIVE: ALS with positive factors
    3. BOUNDS: ALS with factors bounded within upper and lower bound (default 
within 0 and 1)
    4. SPARSE: ALS with L1 regularization
    5. EQUALITY: ALS with equality constraint (default the factors sum up to 1 
and positive)
    
    First let's focus on the problem formulation. Both implicit and explicit 
feedback ALS formulation can be written as a quadratic minimization problem. 
The quadratic objective can be written as xtHx + ctx. Each of the respective 
constraints take the following form:
    minimize xtHx + ctx
    s.t ||x||1 <= c (SPARSE constraint)
    
    We rewrite the objective as f(x) = xtHx + ctx and the constraint as an 
indicator function g(x)
    
    Now minimization of f(x) + g(x) can be carried out using various forward 
backward splitting algorithms. We choose ADMM for the first version based on 
our experimentation with ECOS IP solver and MOSEK comparisons. I will document 
the comparisons.
    
    Details of the algorithm are in the following reference:
    http://web.stanford.edu/~boyd/papers/pdf/admm_distr_stats.pdf
    
    Right now the default parameters of alpha, rho are set as 1.0 but the 
following issues show up in experiments with MovieLens dataset:
    1. ~3X higher iterations as compared to NNLS
    2. For SPARSE we are hitting the max iterations (400) around 10% of the time
    3. For EQUALITY rho is set at 50 based on a reference from Professor Boyd 
on optimal control
    
    We choose ADMM as the baseline solver but this PR will explore the 
following solver enhancements to decrease the iteration count:
    1. Accelerated ADMM using Nesterov acceleration
    2. FISTA style forward backward splitting
    
    For use-cases the PR is focused on the following:
    
    1. Sparse matrix factorization to improve recommendation 
    On Movielens data right now the RMSE with SPARSE is 10% (1.04) lower than 
the Mahout/Spark baseline (0.9) but have not looked into map, prec@k and ndcg@k 
measures. Using the PR from @coderxiang to look into IR measures.
    Example run:
    MASTER=spark://localhost:7077 ./bin/run-example mllib.MovieLensALS --rank 
20 --numIterations 10 --userConstraint SMOOTH --lambdaUser 0.065 
--productConstraint SPARSE --lambdaProduct 0.1 --kryo 
hdfs://localhost:8020/sandbox/movielens/
       
    2. Topic modeling using LSA
    References:
    2007 Sparse coding: 
papers.nips.cc/paper/2979-efficient-sparse-coding-algorithms.pdf
    2012 Sparse Coding + MR/MPI Microsoft: 
http://web.stanford.edu/group/mmds/slides2012/s-hli.pdf
    Implementing the 20NG flow to validate the sparse coding result improvement 
over LDA based topic modeling.
    
    3. Topic modeling using PLSA
    Reference: 
    Tutorial on Probabilistic Topic Modeling: Additive Regularization for 
Stochastic Matrix Factorization
    The EQUALITY formulation with a Quadratic loss is an approximation to the 
KL divergence loss being used in PLSA. We are interested to see if it improves 
the result further as compared to the Sparse coding.
    
    Next steps:
    1. Improve the convergence rate of forward-backward splitting on quadratic 
problems
    2. Move the test-cases to QuadraticMinimizerSuite.scala
    3. Generate results for each of the use-cases and add tests related to each 
use-case
    
    Related future PRs:
    1. Scale the factorization rank and remove the need to construct H matrix
    2. Replace the quadratic loss xtHx + ctx with a Convex loss

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/debasish83/spark qp-als

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2705.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2705
    
----
commit 0b3f0530702b7ca54b5152a3b65530113b2d538c
Author: Debasish Das <[email protected]>
Date:   2014-06-13T03:24:24Z

    jecos integrated as the default qpsolve in ALS; implicit tests are failing

commit 8ba4871ed44c5e971dddd1888be34fb5f950bf76
Author: Debasish Das <[email protected]>
Date:   2014-06-18T04:30:12Z

    Qp options added to Spark ALS: unbounded Qp, Qp with pos, Qp with bounds, 
Qp with smoothness, Qp with L1

commit dd912db0966c28501948cdd27e5271ff9b10a8c5
Author: Debasish Das <[email protected]>
Date:   2014-06-19T08:19:06Z

    Prepared branch of ALS-QP feature/runtime testing

commit 6dd320b170ffe7d6f6487bfe686192a672fe20c9
Author: Debasish Das <[email protected]>
Date:   2014-06-20T23:01:21Z

    QpProblem drivers in MovieLensALS

commit 48023c84fd4896899a2d0dfd25f7c33234551b22
Author: Debasish Das <[email protected]>
Date:   2014-06-21T01:23:41Z

    debug option for octave quadprog validation

commit e7e64b7741c001735b85b25bca0dc6cea534bed3
Author: Debasish Das <[email protected]>
Date:   2014-06-21T07:50:18Z

    L1 option added to ALS; Driver added to MovieLensALS

commit 84f1d67242e4ac6a7846f60c042269e31a57894a
Author: Debasish Das <[email protected]>
Date:   2014-06-21T09:33:02Z

    Qp with equality and bounds added to option 4 of ECOS based QpSolver

commit 90bca10bcbf255135b4a9ec5b68b848133298dc7
Author: Debasish Das <[email protected]>
Date:   2014-06-30T22:59:08Z

    Movielens runtime experiments for Spark Summit talk

commit 4e2c6235b52dd81a8a2e74182b2751c389201158
Author: Debasish Das <[email protected]>
Date:   2014-07-15T15:28:43Z

    ADMM based QuadraticMinimizer in mllib.optimization;Used in ALS

commit 3f93ee5741be230e6a0fcab1bac9459243e9355e
Author: Debasish Das <[email protected]>
Date:   2014-07-16T01:50:55Z

    Refactored to use com.github.ecos package

commit f2888465677028fd1e85eb3f591b2537514529a2
Author: Debasish Das <[email protected]>
Date:   2014-07-18T02:43:30Z

    moved interior point based qp-als to feature/ipmqp-als branch; preparing 
for distributed runs; rho=50 for equality constraint, default rho=1.0, alpha = 
1.0 (no over-relaxation) for convergence study

commit 21d79901ebb6e9399ebf192afda3e5cbbe782f31
Author: Debasish Das <[email protected]>
Date:   2014-08-02T07:15:34Z

    license cleanup; Copyright added to NOTICE

commit 13cb89b27963868227e940fd946e8faadfa32cb1
Author: Debasish Das <[email protected]>
Date:   2014-08-05T05:57:48Z

    Merge with HEAD

commit a12d92a3c3a950bd8782592cb8c797199aa1fdfc
Author: Debasish Das <[email protected]>
Date:   2014-08-07T21:36:25Z

    BSD license for Proximal algorithms

commit 02199a8939f421c10ffc688bfb0ce1e1a908e369
Author: Debasish Das <[email protected]>
Date:   2014-08-08T18:19:30Z

    LICENSE and NOTICE updates as per Legal

commit f43ed66127781a5e6669e059686eb4cb9c5c2e28
Author: Debasish Das <[email protected]>
Date:   2014-08-09T05:45:47Z

    Merge with master

commit c03dbeda9b3b1d0f182d3c1450bb1e8b3e0c9af3
Author: Debasish Das <[email protected]>
Date:   2014-08-13T06:42:56Z

    Merge branch 'feature/qp-als' of 
https://istg.vzvisp.com:8443/stash/scm/bda/spark into qp-als

commit c9d1fbf88337058f59fea958fb8f1aa17ea92c74
Author: Debasish Das <[email protected]>
Date:   2014-10-08T03:01:36Z

    Redesign of ALS API; userConstraint and productConstraint separated;

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [MLLIB] [WIP] SPARK-2426: Quadratic Minimizati...

Reply via email to