[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity

debasish83 Sat, 16 May 2015 17:43:40 -0700

GitHub user debasish83 opened a pull request:

    https://github.com/apache/spark/pull/6213


    [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity

    @mengxr @srowen
    For RowMatrix with 100K columns, colSimilarity with bruteforce/dimsum 
sampling is used. This PR adds rowSimilarity to IndexedRowMatrix which outputs 
a CoordinateMatrix. For matrices where columns are > 1M, rowSimilarity flow 
scales better compared to column similarity flow.
    
    For most applications, topK similar items requirement is much less than all 
available items and therefore the rowSimilarity API takes topK and threshold as 
input. topK and threshold help in improving shuffle space.
    
    For MatrixFactorization model generally the columns for both user and 
product factors are ~50-200 and therefore the column similarity flow does not 
work for such cases. This PR also adds batch similarUsers and similarProducts 
(SPARK-4675).
    
    The following ideas are added:
    
    1. Similarity computation is abstracted as Kernel
    2. Kernel implementations for Cosine, RBF, Euclidean and Product (for 
distributed matrix multiply) are added
    3. Tests cover Cosine Kernel. More tests will be added for Euclidean, RBF 
and Product kernels.
    4. IndexedRowMatrix object adds a kernalized distributed matrix multiply 
which is used by similarity computation.
    5. In examples, MovieLensSimilarity is added that shows col and row based 
flows on MovieLens as runtime experiment.
    6. Level-1 BLAS is used so that kernel abstraction can be used. We can 
either design the Kernel abstraction with Level-3 BLAS (might be difficult) or 
use BlockMatrix for distributed matrix multiply.
    
    Next steps:
    
    1. In MovieLensSimilarity add ALS + similarItems example
    2. Use RBF similarity in power iteration clustering flow
    
    From internal experiments, we have run 6M users, 1.6M items, 351M ratings 
through row similarity flow with topK=200 in 1.1 hr with 240 cores running over 
30 nodes. We had difficult time in scaling column similarity flow since the 
topK optimization can't be added until reduce phase is done in that flow.
    
    On MovieLens-1M and Netflix dataset I will report row and col similarity 
runtime comparisons.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/debasish83/spark similarity

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6213.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6213
    
----
commit f9fd6fbfb1a55142a9eb8f2129d3729ca25ab501
Author: Debasish Das <[email protected]>
Date:   2015-05-17T00:05:52Z

    blocked kernalized row similarity calculation and tests

commit 66176f9f346c324b9c77c252be369e24f7fdd991
Author: Debasish Das <[email protected]>
Date:   2015-05-17T00:06:36Z

    Cosine, Euclidean, RBF and Product Kernel added

commit 3f96963f80a40f3a4fce6b6dbd97c20605ebaecc
Author: Debasish Das <[email protected]>
Date:   2015-05-17T00:07:28Z

    row similarity API added to drive MatrixFactorizationModel similarUsers and 
similarProducts

commit 6dc9e18d507cfe0d2ee12e768ca6bddb5c3c4b38
Author: Debasish Das <[email protected]>
Date:   2015-05-17T00:09:24Z

    MovieLens flow to demonstrate item similarity calculation using raw 
features and ALS factors

commit 71f24a4629cf54c39af4e9e598d9808d85952532
Author: Debasish Das <[email protected]>
Date:   2015-05-17T00:09:45Z

    import cleanup

commit cc4e104b7430e3fe2e6bf71489638321076428a3
Author: Debasish Das <[email protected]>
Date:   2015-05-17T00:11:15Z

    Merge branch 'similarity' of https://github.com/debasish83/spark into 
similarity

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity

Reply via email to