GitHub user sethah commented on the issue:

https://github.com/apache/spark/pull/15148

I apologize for coming late to this, but I am taking a look at some of the documentation now. For the `RandomProjection` class there are two links: one to the Wikipedia entry on stable distributions and one to a survey paper. The Wikipedia link points to the "stable distributions" section even though the same article has a section on random projections, which is supposedly the algorithm implemented here. The paper has a "Random Projection" section as well. Neither of the Random Projection methods in the links matches the code here.

I expressed this concern before. The approach in the `RandomProjection` class does not match either the "Random Projection" method or the "P-Stable distribution" methods that I find in the literature. I summarized this in a comment way up towards the top. If this method is some well-accepted hybrid of the two, fine, but I think the references would leave users quite confused.

I think it's nice to have certainty about the practical effectiveness of this method since it has already been deployed in industry, so my main concern is really just documentation. Right now, we're linking to sources which describe algorithms distinctly different from what we have implemented. Thoughts?

For convenience, some references:

* http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf
* https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search
* https://people.csail.mit.edu/indyk/p117-andoni.pdf
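To make the distinction concrete, here is a minimal sketch of the two hash families as they are usually defined in the literature linked above. It is not the code in this PR and says nothing about what `RandomProjection` actually computes; the function names, the bucket width `w`, and the sample vector are all hypothetical and chosen only for illustration.

```python
import math
import random


def random_gaussian_vector(dim, rng):
    # Each component is drawn from a standard normal (a 2-stable distribution).
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign_random_projection_hash(x, r):
    # "Random projection" LSH for cosine similarity:
    # one bit recording which side of the random hyperplane x falls on.
    return 1 if dot(x, r) >= 0 else 0


def p_stable_hash(x, a, b, w):
    # p-stable LSH for Euclidean distance (Gaussian case):
    # project onto a, shift by offset b, and cut the line into buckets of width w.
    return math.floor((dot(x, a) + b) / w)


# Illustrative values only (assumptions, not taken from the PR).
rng = random.Random(0)
dim = 3
r = random_gaussian_vector(dim, rng)
a = random_gaussian_vector(dim, rng)
w = 4.0
b = rng.uniform(0.0, w)  # offset drawn uniformly from [0, w]
x = [1.0, 2.0, 3.0]

bit = sign_random_projection_hash(x, r)
bucket = p_stable_hash(x, a, b, w)
```

The first hash returns a single bit and ignores the length of `x`, while the second returns an integer bucket index that depends on the offset `b` and width `w`. Documentation for a method that is neither of these would need its own reference or an explicit description.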