GitHub user Yunni opened a pull request:

    https://github.com/apache/spark/pull/15148

    Spark 5992 yunn lsh

    ## What changes were proposed in this pull request?
    
    Implement Locality Sensitive Hashing along with approximate nearest 
neighbors and approximate similarity join based on the [design 
doc](https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit).
    
    Detailed changes are as follows:
    (1) Implement abstract LSH and LSHModel classes as an Estimator-Model pair
    (2) Implement approxNearestNeighbors and approxSimilarityJoin in the 
abstract LSHModel
    (3) Implement Random Projection as an LSH subclass for Euclidean distance
    (4) Implement unit test utility methods, including checkLshProperty, 
checkNearestNeighbor and checkSimilarityJoin
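
As a rough illustration of items (2) and (3), the sketch below shows one common way to hash for Euclidean distance by random projection, `h(x) = floor((x . v) / w)`, and how nearest-neighbor search and similarity join can reuse the resulting buckets. It is plain Python on in-memory tuples, not the Scala code in this pull request. The names `make_hash`, `bucket_length`, `approx_nearest_neighbors` and `approx_similarity_join` are assumptions made for the example, and a single hash function is assumed for brevity.

```python
import math
import random


def make_hash(dim, bucket_length, seed=0):
    """Build h(x) = floor((x . v) / bucket_length) for a random unit vector v.

    Hypothetical helper: the name and parameters are not taken from the PR.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]  # normalize so projections are in distance units

    def h(x):
        projection = sum(a * b for a, b in zip(x, v))
        return math.floor(projection / bucket_length)

    return h


def euclidean(a, b):
    """Exact Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def approx_nearest_neighbors(points, key, k, h):
    """Keep only points in the key's bucket, then rank them by exact distance."""
    bucket = h(key)
    candidates = [p for p in points if h(p) == bucket]
    return sorted(candidates, key=lambda p: euclidean(p, key))[:k]


def approx_similarity_join(left, right, threshold, h):
    """Pair points whose hash values match, then filter by exact distance."""
    buckets = {}
    for q in right:
        buckets.setdefault(h(q), []).append(q)
    return [
        (p, q)
        for p in left
        for q in buckets.get(h(p), [])
        if euclidean(p, q) <= threshold
    ]
```

Both operations are approximate because true neighbors that fall into a different bucket are never compared. A larger `bucket_length` reduces such misses at the cost of more candidate comparisons.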
    
    Not implemented in this pull request:
    (1) LSH subclasses for Jaccard distance, Hamming distance, and cosine distance
    (2) PySpark integration for the Scala classes and methods
    
    ## How was this patch tested?
    Unit tests are implemented for all the new classes and algorithms. A 
scalability test on Uber's dataset was performed internally.
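
The description does not say what the LSH property check measures. One plausible form, sketched here purely as an assumption, counts how often far-apart pairs collide (false positives) and how often close pairs land in different buckets (false negatives). The function name `lsh_error_rates`, its parameters, and the toy one-dimensional hash in the usage lines are hypothetical and are not the PR's `checkLshProperty`.

```python
def lsh_error_rates(points, h, dist, near, far):
    """Return (false_positive_rate, false_negative_rate) over all point pairs.

    False positive: distance > far, yet the two hash values are equal.
    False negative: distance < near, yet the two hash values differ.
    Hypothetical sketch; not the test utility from the pull request.
    """
    fp = fn = far_pairs = near_pairs = 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = dist(points[i], points[j])
            same_bucket = h(points[i]) == h(points[j])
            if d > far:
                far_pairs += 1
                fp += same_bucket
            elif d < near:
                near_pairs += 1
                fn += not same_bucket
    fp_rate = fp / far_pairs if far_pairs else 0.0
    fn_rate = fn / near_pairs if near_pairs else 0.0
    return fp_rate, fn_rate


# Toy usage with a one-dimensional bucket hash of width 1.0 (an assumption).
toy_hash = lambda x: int(x // 1.0)
toy_dist = lambda a, b: abs(a - b)
rates = lsh_error_rates([0.1, 0.2, 5.5, 5.6], toy_hash, toy_dist, near=0.5, far=2.0)
```

A test could then assert that both rates stay below chosen bounds for a given hash family and dataset.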
    
    I will also test the methods on an open dataset and write a doc covering 
the configurations and sample code.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Yunni/spark SPARK-5992-yunn-lsh

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15148.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15148
    
----
commit 1bbd48cc4c1242195d46976d8d0382d9f09bbc25
Author: Yunni <euler57...@gmail.com>
Date:   2016-09-13T15:47:42Z

    First Commit of LSH function implementation. Implement basic 
Estimator-Model class hierarchy to make RandomProjection works.

commit ca46d82214a3ebc38c0bc69a460f6cfcb6550d99
Author: Yunni <euler57...@gmail.com>
Date:   2016-09-13T16:09:03Z

    Implementation of Approximate Nearest Neighbors. Add distCol as another 
model parameters

commit c693f5b2deec621bf8dbf617d1fb2367bf8b3397
Author: Yunni <euler57...@gmail.com>
Date:   2016-09-15T05:48:35Z

    Implement approxSimilarityJoin(). Remove modelDataset and distCol as 
discussed in the Design Doc.

commit c9ee0f9222f76ee2bc77e1a0e056274444a4af5e
Author: Yunni <euler57...@gmail.com>
Date:   2016-09-19T04:10:10Z

    Add test utility method to check LSH property. Tested on random projection.

commit fc838e0de0fd560a69b4a60bec5411c00842b4bb
Author: Yunni <euler57...@gmail.com>
Date:   2016-09-19T04:55:39Z

    Add testing utility for approximate nearest neighbor. Run the testing on 
random projection.

commit aa138e8db4fab8c6cd33d465895b65c8519c88b9
Author: Yunni <euler57...@gmail.com>
Date:   2016-09-19T06:14:37Z

    Add testing utility for approximate similarity join. Run the testing on 
random projection.

----

