GitHub user rezazadeh opened a pull request:

    https://github.com/apache/spark/pull/1778

    DIMSUM: Dimension Independent Matrix Square using MapReduce

    # DIMSUM
    Compute all pairs of similar vectors using a brute-force approach and also 
the DIMSUM sampling approach.
    
    Laying down some notation: we are looking for all pairs of similar columns 
in an m x n matrix whose entries are denoted a_ij, with the i’th row denoted 
r_i and the j’th column denoted c_j. There is an oversampling parameter 
labeled ɣ that should be set to 4 log(n)/s to get provably correct results 
(with high probability), where s is the similarity threshold.
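    
    As a quick illustration of this setting, here is a minimal Python sketch. 
It assumes the natural logarithm (the text above does not state the base), and 
the helper name `oversampling_gamma` is hypothetical, not part of this patch.

```python
import math


def oversampling_gamma(n, s):
    """Return 4 * log(n) / s for n columns and similarity threshold s.

    Assumption: log is the natural logarithm; the base is not stated
    in the description above.
    """
    if n < 2:
        raise ValueError("need at least two columns to compare")
    if not 0.0 < s <= 1.0:
        raise ValueError("similarity threshold s must be in (0, 1]")
    return 4.0 * math.log(n) / s


if __name__ == "__main__":
    # A lower threshold asks for more pairs, so it needs more oversampling.
    for threshold in (0.9, 0.5, 0.1):
        print(threshold, oversampling_gamma(10_000, threshold))
```

    Note that ɣ grows only logarithmically with the number of columns n and is 
inversely proportional to the threshold s.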
    
    The algorithm is stated as a map step and a reduce step, with proofs of 
correctness and efficiency in published papers [1] [2]. The reducer is simply 
the summation reducer. The mapper is more interesting and is the heart of the 
scheme. As an exercise, try to see why, in expectation, the map-reduce below 
outputs cosine similarities.
    
    
![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png)
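    
    For readers without access to the figure, the following is a rough 
single-machine Python sketch of one sampling mapper that is consistent with 
the description above (summation reducer, cosine similarities in expectation). 
It is an assumption, not a transcription of the figure or of the Scala code in 
this patch, and the names `column_norms`, `dimsum_mapper` and `dimsum` are 
hypothetical. Each pair (a_ij, a_ik) is kept with probability 
min(1, sqrt(ɣ)/||c_j||) * min(1, sqrt(ɣ)/||c_k||) and emitted divided by 
min(sqrt(ɣ), ||c_j||) * min(sqrt(ɣ), ||c_k||). In either case each column 
contributes a factor 1/||c_j|| to the expected value, so summing over rows 
gives the cosine similarity in expectation.

```python
import math
import random
from collections import defaultdict


def column_norms(rows, n):
    """Euclidean norm of each of the n columns; rows are dicts {col: value}."""
    squares = [0.0] * n
    for row in rows:
        for j, value in row.items():
            squares[j] += value * value
    return [math.sqrt(x) for x in squares]


def dimsum_mapper(row, norms, gamma, rng):
    """Sample column pairs from one sparse row and emit scaled products.

    Assumes the row stores only nonzero entries, so every norm used is > 0.
    """
    root = math.sqrt(gamma)
    cols = sorted(row)
    for pos, j in enumerate(cols):
        for k in cols[pos + 1:]:
            keep = min(1.0, root / norms[j]) * min(1.0, root / norms[k])
            if rng.random() < keep:
                scale = min(root, norms[j]) * min(root, norms[k])
                yield (j, k), row[j] * row[k] / scale


def dimsum(rows, n, gamma, seed=0):
    """Run the mapper over all rows and apply the summation reducer."""
    rng = random.Random(seed)
    norms = column_norms(rows, n)
    sums = defaultdict(float)
    for row in rows:
        for key, value in dimsum_mapper(row, norms, gamma, rng):
            sums[key] += value  # the reducer is a plain sum per (j, k) key
    return dict(sums)


if __name__ == "__main__":
    rows = [{0: 1.0, 1: 2.0}, {0: 3.0, 1: 4.0, 2: 5.0}, {1: 6.0, 2: 7.0}]
    # gamma = infinity keeps every pair, i.e. the brute-force computation.
    print(dimsum(rows, n=3, gamma=math.inf))
    # A small gamma samples pairs, giving a randomized estimate.
    print(dimsum(rows, n=3, gamma=4.0, seed=1))
```

    In this sketch, an infinite ɣ makes every keep probability 1, so the same 
code path also serves as the brute-force baseline against which sampled 
estimates can be compared.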
    
    [1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent 
Matrix Square using MapReduce, arXiv:1304.1467
    
    [2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent 
Similarity Computation, arXiv:1206.2082
    
    # Testing
    
    Tests for all invocations are included. 
    Added magnitude computation to MultivariateStatisticalSummary since it was 
needed. Added a test for this.
    
    Scaling it up now and will report back with results.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rezazadeh/spark dimsumv2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1778.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1778
    
----
commit 5b8cd7deb3f29d3c2533b01f496f41175471f023
Author: Reza Zadeh <[email protected]>
Date:   2014-08-04T02:19:45Z

    Initial files

commit 6bebabb9364eb917dd86acbea4438a9e4d301f18
Author: Reza Zadeh <[email protected]>
Date:   2014-08-04T18:37:31Z

    remove changes to MatrixSuite

commit 3726ca97ab184a8d5a9b3c0003d3afa6fd973890
Author: Reza Zadeh <[email protected]>
Date:   2014-08-04T20:47:57Z

    Remove MatrixAlgebra

commit 654c4fb1136cfa856fc354b5ddb710758d38948f
Author: Reza Zadeh <[email protected]>
Date:   2014-08-04T21:38:18Z

    default methods

commit 502ce526fc8ec84fd2c1f3b2b9a74b07e76c2d65
Author: Reza Zadeh <[email protected]>
Date:   2014-08-04T22:02:36Z

    new interface

commit 05e59b8e883fd126dc81707b90aaf1011a2d1ee5
Author: Reza Zadeh <[email protected]>
Date:   2014-08-04T22:59:55Z

    Add test

commit 75edb257e33a23f87fa379be597483d12a421626
Author: Reza Zadeh <[email protected]>
Date:   2014-08-05T01:02:33Z

    All tests passing!

commit 029aa9c3d71960cb63293d721b96eebb6bdfcfbf
Author: Reza Zadeh <[email protected]>
Date:   2014-08-05T05:12:40Z

    javadoc and new test

commit 139c8e1d20274322dfe1c513d6872e47f5eb5138
Author: Reza Zadeh <[email protected]>
Date:   2014-08-05T05:16:23Z

    Syntax changes

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
