GitHub user rezazadeh opened a pull request:
https://github.com/apache/spark/pull/1778
DIMSUM: Dimension Independent Matrix Square using Mapreduce
# DIMSUM
Compute all pairs of similar vectors using a brute-force approach, and also
using the DIMSUM sampling approach.
Laying down some notation: we are looking for all pairs of similar columns
in an m x n matrix whose entries are denoted a_ij, with the i'th row denoted
r_i and the j'th column denoted c_j. There is an oversampling parameter
labeled γ that should be set to 4 log(n)/s to get provably correct results
(with high probability), where s is the similarity threshold.
The algorithm is stated with a Map and Reduce, with proofs of correctness
and efficiency in published papers [1] [2]. The reducer is simply the summation
reducer. The mapper is more interesting, and is also the heart of the scheme.
As an exercise, you should try to see why in expectation, the map-reduce below
outputs cosine similarities.
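
The mapper and reducer themselves are not reproduced in this message. As a rough illustration only, here is a single-machine Python sketch of the scheme as I understand it from [1]: for each row, every pair of nonzero entries (a_ij, a_ik) is emitted with probability min(1, γ / (||c_j|| ||c_k||)), and the reducer sums the emitted products and divides by min(γ, ||c_j|| ||c_k||). All function names are hypothetical, rows are assumed to be sparse dicts mapping column index to value, the natural logarithm is assumed for γ, and none of this is the code or API added by the pull request.

```python
import math
import random
from collections import defaultdict


def oversampling(n, s):
    """gamma = 4 log(n) / s, as quoted above (natural log is an assumption)."""
    return 4.0 * math.log(n) / s


def column_norms(rows, n):
    """L2 norm ||c_j|| of every column, from rows given as {column: value} dicts."""
    squares = [0.0] * n
    for row in rows:
        for j, a in row.items():
            squares[j] += a * a
    return [math.sqrt(v) for v in squares]


def dimsum_map(row, norms, gamma, rng):
    """Emit ((j, k), a_ij * a_ik) for sampled pairs of nonzero entries in one row."""
    entries = sorted((j, a) for j, a in row.items() if a != 0)
    for x, (j, a_ij) in enumerate(entries):
        for k, a_ik in entries[x + 1:]:
            # Pairs of high-magnitude columns are sampled less often.
            p = min(1.0, gamma / (norms[j] * norms[k]))
            if rng.random() < p:
                yield (j, k), a_ij * a_ik


def dimsum_reduce(emissions, norms, gamma):
    """Summation reducer, followed by the scaling that undoes the sampling."""
    sums = defaultdict(float)
    for key, value in emissions:
        sums[key] += value
    return {
        (j, k): total / min(gamma, norms[j] * norms[k])
        for (j, k), total in sums.items()
    }


def dimsum_similarities(rows, n, gamma, seed=0):
    """Run the map and reduce steps over all rows; returns {(j, k): estimate}."""
    rng = random.Random(seed)
    norms = column_norms(rows, n)
    emissions = (e for row in rows for e in dimsum_map(row, norms, gamma, rng))
    return dimsum_reduce(emissions, norms, gamma)
```

In this sketch, when γ is at least ||c_j|| ||c_k|| the pair is always emitted and the sum is divided by the norm product, which gives the exact cosine similarity. When γ is smaller, each product is kept with probability γ / (||c_j|| ||c_k||) and the sum is divided by γ, so the two factors of γ cancel in expectation. That is one way to approach the exercise stated above.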

[1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent
Matrix Square using MapReduce, arXiv:1304.1467
[2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent
Similarity Computation, arXiv:1206.2082
# Testing
Tests for all invocations included.
Added magnitude computation to MultivariateStatisticalSummary since it was
needed. Added a test for this.
Scaling it up now and will report back with results.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rezazadeh/spark dimsumv2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1778.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1778
----
commit 5b8cd7deb3f29d3c2533b01f496f41175471f023
Author: Reza Zadeh <[email protected]>
Date: 2014-08-04T02:19:45Z
Initial files
commit 6bebabb9364eb917dd86acbea4438a9e4d301f18
Author: Reza Zadeh <[email protected]>
Date: 2014-08-04T18:37:31Z
remove changes to MatrixSuite
commit 3726ca97ab184a8d5a9b3c0003d3afa6fd973890
Author: Reza Zadeh <[email protected]>
Date: 2014-08-04T20:47:57Z
Remove MatrixAlgebra
commit 654c4fb1136cfa856fc354b5ddb710758d38948f
Author: Reza Zadeh <[email protected]>
Date: 2014-08-04T21:38:18Z
default methods
commit 502ce526fc8ec84fd2c1f3b2b9a74b07e76c2d65
Author: Reza Zadeh <[email protected]>
Date: 2014-08-04T22:02:36Z
new interface
commit 05e59b8e883fd126dc81707b90aaf1011a2d1ee5
Author: Reza Zadeh <[email protected]>
Date: 2014-08-04T22:59:55Z
Add test
commit 75edb257e33a23f87fa379be597483d12a421626
Author: Reza Zadeh <[email protected]>
Date: 2014-08-05T01:02:33Z
All tests passing!
commit 029aa9c3d71960cb63293d721b96eebb6bdfcfbf
Author: Reza Zadeh <[email protected]>
Date: 2014-08-05T05:12:40Z
javadoc and new test
commit 139c8e1d20274322dfe1c513d6872e47f5eb5138
Author: Reza Zadeh <[email protected]>
Date: 2014-08-05T05:16:23Z
Syntax changes
----