GitHub user sethah commented on the issue:
https://github.com/apache/spark/pull/15148
A few high-level comments/questions:
* Should this go into the `feature` package as a feature
estimator/transformer? That is where other dimensionality reduction techniques
have gone and I'm not sure we should create a new package for this.
* Could you please point me to a specific section of a specific paper that
documents the approaches used here? AFAICT, this patch implements something
different from most of the approximate-nearest-neighbor LSH algorithms
described in the literature. For instance, the method in section 2
[here](http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf) as well as
the method on Wikipedia
[here](https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search)
both differ from the implementation in this PR. Also, the Spark package
[`spark-neighbors`](https://github.com/sethah/spark-neighbors) employs those
approaches. I'm not an expert in LSH, so I was just hoping for some
clarification.
* The implementation of the `RandomProjections` class actually follows the
"2-stable" (or, more generally, "p-stable") LSH scheme, not the "Random
Projection" algorithm in the referenced paper. At the very least, we should
clarify this. Potentially, we should think of a better name.
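To make the distinction in the last bullet concrete, here is a minimal pure-Python sketch (not Spark code; all names are illustrative) of the two hash families being conflated: the p-stable bucketed hash of Datar et al. versus a sign-based random-projection hash:

```python
import math
import random

random.seed(42)

def gaussian_vector(dim):
    # Entries drawn from N(0, 1), a 2-stable distribution.
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pstable_hash(v, a, b, w):
    # "2-stable" LSH: quantize the projection onto random direction a
    # into buckets of width w, with a random offset b in [0, w).
    return math.floor((dot(a, v) + b) / w)

def sign_hash(v, a):
    # Sign random projection (SimHash-style): keep only which side of
    # the random hyperplane v falls on, yielding a single bit.
    return 1 if dot(a, v) >= 0.0 else 0

dim, w = 3, 4.0
a = gaussian_vector(dim)
b = random.uniform(0.0, w)

v1 = [1.0, 2.0, 3.0]
v2 = [1.1, 2.1, 3.1]    # close to v1
v3 = [-5.0, 8.0, -2.0]  # far from v1

# Nearby points tend to collide under both families, but the bucket
# hash is sensitive to vector magnitude while the sign hash is not.
print(pstable_hash(v1, a, b, w), pstable_hash(v2, a, b, w))
print(sign_hash(v1, a), sign_hash(v2, a), sign_hash(v3, a))
```

The sign hash depends only on direction (it is invariant to positive scaling of the input), whereas the p-stable bucket hash changes with vector length, which is one reason the naming matters.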
@karlhigley Would you mind taking a look at the patch, or providing your
input on the comments?