GitHub user karlhigley commented on the issue:

https://github.com/apache/spark/pull/15148

@jkbradley: "Multi-probe" seems like a standard term, and I think this is the [original paper](http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf) that coined it.

> Terminology: For LSH, "dimensionality" = "number of hash functions" and is relevant only for amplification. Do you agree? I have yet to see a hash function used for LSH which does not have a discrete set.

I confess that I'm a little confused about what you mean by the above. There are several relevant dimensionalities:

- the dimensionality of the input points (`x`),
- the dimensionality of the computed hashes (i.e. the results of applying `g(x)`), and
- the number of hash tables computed (i.e. how many `g(x)` functions are applied), which is the dimensionality of OR-amplification (in a sense).

After wrestling with inconsistent terminology for a while, what I settled on for spark-neighbors was to refer to:

- `g(x)` as a hash function,
- the outputs of `g(x)` as hashes,
- the sub-elements of `g(x)` (`h1(x)` etc.) as whatever made sense for the particular method (e.g. `permutations` for MinHash), and
- the output of each of the L `g(x)` functions as a hash table.

While that terminology isn't necessarily standard, it helped me identify the common concepts across LSH methods clearly enough to build some abstractions around them.

Using those terms, the dimensionality of the `g(x)` hash functions and the hashes they produce is equivalent to the number of `h(x)` sub-elements they contain. Two points collide in a given hash table only if all of the sub-elements of that table's `g(x)` agree, which is the AND side of amplification. I thought of applying OR-amplification as producing multiple hash tables by using multiple `g(x)` functions, with a collision in any one hash table producing a pair of candidate neighbors.

Does that make any more (or less) sense?
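To make the terminology concrete, here is a minimal Python sketch, not code from spark-neighbors or from this PR. It assumes sign-of-random-projection sub-elements for `h(x)`, and the names `make_g`, `build_tables`, and `candidate_pairs` are hypothetical, chosen only for illustration. Each `g(x)` concatenates `k` sub-elements into one hash, and `L` such functions produce `L` hash tables.

```python
import random
from collections import defaultdict
from itertools import combinations


def make_g(input_dim, k, rng):
    """Build one hash function g(x) out of k sub-elements h_1(x)..h_k(x).

    Assumption for illustration: each h_i is a random hyperplane, and
    h_i(x) is 1 if x lies on its non-negative side, else 0.
    """
    planes = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)] for _ in range(k)]

    def g(x):
        # The hash is a k-dimensional tuple: one entry per sub-element.
        return tuple(
            int(sum(w * xi for w, xi in zip(plane, x)) >= 0) for plane in planes
        )

    return g


def build_tables(points, gs):
    """Apply each g(x) to every point, giving one hash table per g(x)."""
    tables = []
    for g in gs:
        table = defaultdict(list)
        for idx, x in enumerate(points):
            table[g(x)].append(idx)
        tables.append(table)
    return tables


def candidate_pairs(tables):
    """OR-amplification: a collision in any one table yields a candidate pair."""
    pairs = set()
    for table in tables:
        for bucket in table.values():
            pairs.update(combinations(sorted(bucket), 2))
    return pairs


rng = random.Random(0)
points = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
k, L = 4, 3  # k sub-elements per g(x); L hash functions, hence L hash tables
gs = [make_g(3, k, rng) for _ in range(L)]
tables = build_tables(points, gs)
pairs = candidate_pairs(tables)
print(sorted(pairs))
```

In this sketch, the three dimensionalities are `input_dim` (the input points), `k` (the hashes), and `L` (the hash tables). Requiring the whole `k`-tuple to match within a table is the AND step, and taking the union of collisions across tables is the OR step.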