GitHub user Yunni opened a pull request:
https://github.com/apache/spark/pull/15874
Spark 18408 yunn api improvements
## What changes were proposed in this pull request?
(1) Change output schema to `Array of Vector` instead of `Vectors`
(2) Use `numHashTables` as the dimension of Array and `numHashFunctions` as
the dimension of Vector
(3) Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash`
to `MinHashLSH`
(4) Make `randUnitVectors/randCoefficients` private
(5) Make Multi-Probe NN Search and `hashDistance` private for future
discussion
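
To illustrate the output shape described in (1) and (2), here is a minimal, dependency-free sketch of bucketed random projection hashing. It is not the code in this pull request: the function names `make_rand_unit_vectors` and `hash_point`, the `bucket_length` parameter, and the `floor(x . v / bucket_length)` hash formula are assumptions chosen for illustration. The only point is that each input maps to `numHashTables` vectors, each of dimension `numHashFunctions`.

```python
import math
import random


def make_rand_unit_vectors(num_hash_tables, num_hash_functions, dim, seed=0):
    """Draw one random unit vector per (hash table, hash function) pair.

    Hypothetical helper for illustration; not Spark's implementation.
    """
    rng = random.Random(seed)
    tables = []
    for _ in range(num_hash_tables):
        table = []
        for _ in range(num_hash_functions):
            v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            norm = math.sqrt(sum(c * c for c in v))
            table.append([c / norm for c in v])
        tables.append(table)
    return tables


def hash_point(point, rand_unit_vectors, bucket_length):
    """Return an 'array of vectors': one vector of hash values per hash table."""
    return [
        [
            # Assumed hash: project onto the unit vector, then bucket the result.
            float(math.floor(sum(p * c for p, c in zip(point, v)) / bucket_length))
            for v in table
        ]
        for table in rand_unit_vectors
    ]


vectors = make_rand_unit_vectors(num_hash_tables=3, num_hash_functions=2, dim=4)
hashes = hash_point([1.0, 0.5, -2.0, 0.0], vectors, bucket_length=2.0)
print(len(hashes), len(hashes[0]))  # outer length = tables, inner length = functions
```

In this sketch the outer list plays the role of the `Array` and each inner list plays the role of a `Vector`, so the two parameters control the two dimensions independently.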
## How was this patch tested?
Related unit tests were modified to verify that LSH performance is
preserved and that the outputs of the APIs meet expectations.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/Yunni/spark SPARK-18408-yunn-api-improvements
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15874.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15874
----
commit 559c09904538012b70bcb3493b8bc287dd855b2d
Author: Yun Ni <[email protected]>
Date: 2016-11-07T21:30:32Z
[SPARK-18334] MinHash should use binary hash distance
commit 517a97bd16f3771d9abbcdf54957a011f5f87adc
Author: Yunni <[email protected]>
Date: 2016-11-08T06:15:24Z
Remove misleading documentation as requested
commit b546dbd207a04e73bde097f25cae8c927322c2ae
Author: Yun Ni <[email protected]>
Date: 2016-11-08T18:54:09Z
Add warning for multi-probe in MinHash
commit a3cd9281d1fb8d969cb8bdd32ae8c5b9c373ad3b
Author: Yun Ni <[email protected]>
Date: 2016-11-08T18:55:49Z
Merge branch 'SPARK-18334-yunn-minhash-bug' of
https://github.com/Yunni/spark into SPARK-18334-yunn-minhash-bug
commit c8243c7def8c270072edd5889cea7fd02677b44f
Author: Yun Ni <[email protected]>
Date: 2016-11-09T23:11:20Z
(1) Fix documentation as CR suggested (2) Fix typo in unit test
commit 6aac8b343c5ea3a91b8517a2d3f47ed055ece9ad
Author: Yun Ni <[email protected]>
Date: 2016-11-09T23:22:27Z
Fix typo in unit test
commit 98707436ea8a90599fd8615a47afff3bf29a3ae6
Author: Yun Ni <[email protected]>
Date: 2016-11-14T04:25:17Z
[SPARK-18408] API Improvements for LSH
commit 0e9250be0142691e9e085ed1260f83f8ed40f5e4
Author: Yun Ni <[email protected]>
Date: 2016-11-14T04:38:44Z
(1) Fix description for numHashFunctions (2) Make numEntries in MinHash
private
commit adbbefe1777db8fb85a0af59c11e5840d3bc91ee
Author: Yun Ni <[email protected]>
Date: 2016-11-14T04:43:30Z
Add assertion for hashFunction in BucketedRandomProjectionLSHSuite
----