GitHub user debasish83 opened a pull request:
https://github.com/apache/spark/pull/6213
[MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
@mengxr @srowen
For RowMatrix with 100K columns, colSimilarity with bruteforce/dimsum
sampling is used. This PR adds rowSimilarity to IndexedRowMatrix which outputs
a CoordinateMatrix. For matrices where columns are > 1M, rowSimilarity flow
scales better compared to column similarity flow.
For most applications, topK similar items requirement is much less than all
available items and therefore the rowSimilarity API takes topK and threshold as
input. topK and threshold help in improving shuffle space.
For MatrixFactorization model generally the columns for both user and
product factors are ~50-200 and therefore the column similarity flow does not
work for such cases. This PR also adds batch similarUsers and similarProducts
(SPARK-4675).
The following ideas are added:
1. Similarity computation is abstracted as Kernel
2. Kernel implementations for Cosine, RBF, Euclidean and Product (for
distributed matrix multiply) are added
3. Tests cover Cosine Kernel. More tests will be added for Euclidean, RBF
and Product kernels.
4. IndexedRowMatrix object adds a kernalized distributed matrix multiply
which is used by similarity computation.
5. In examples, MovieLensSimilarity is added that shows col and row based
flows on MovieLens as runtime experiment.
6. Level-1 BLAS is used so that kernel abstraction can be used. We can
either design the Kernel abstraction with Level-3 BLAS (might be difficult) or
use BlockMatrix for distributed matrix multiply.
Next steps:
1. In MovieLensSimilarity add ALS + similarItems example
2. Use RBF similarity in power iteration clustering flow
From internal experiments, we have run 6M users, 1.6M items, 351M ratings
through row similarity flow with topK=200 in 1.1 hr with 240 cores running over
30 nodes. We had difficult time in scaling column similarity flow since the
topK optimization can't be added until reduce phase is done in that flow.
On MovieLens-1M and Netflix dataset I will report row and col similarity
runtime comparisons.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/debasish83/spark similarity
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6213.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6213
----
commit f9fd6fbfb1a55142a9eb8f2129d3729ca25ab501
Author: Debasish Das <[email protected]>
Date: 2015-05-17T00:05:52Z
blocked kernalized row similarity calculation and tests
commit 66176f9f346c324b9c77c252be369e24f7fdd991
Author: Debasish Das <[email protected]>
Date: 2015-05-17T00:06:36Z
Cosine, Euclidean, RBF and Product Kernel added
commit 3f96963f80a40f3a4fce6b6dbd97c20605ebaecc
Author: Debasish Das <[email protected]>
Date: 2015-05-17T00:07:28Z
row similarity API added to drive MatrixFactorizationModel similarUsers and
similarProducts
commit 6dc9e18d507cfe0d2ee12e768ca6bddb5c3c4b38
Author: Debasish Das <[email protected]>
Date: 2015-05-17T00:09:24Z
MovieLens flow to demonstrate item similarity calculation using raw
features and ALS factors
commit 71f24a4629cf54c39af4e9e598d9808d85952532
Author: Debasish Das <[email protected]>
Date: 2015-05-17T00:09:45Z
import cleanup
commit cc4e104b7430e3fe2e6bf71489638321076428a3
Author: Debasish Das <[email protected]>
Date: 2015-05-17T00:11:15Z
Merge branch 'similarity' of https://github.com/debasish83/spark into
similarity
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]