Joseph K. Bradley created SPARK-18392:
-----------------------------------------
Summary: LSH API, algorithm, and documentation follow-ups
Key: SPARK-18392
URL: https://issues.apache.org/jira/browse/SPARK-18392
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Joseph K. Bradley
This JIRA summarizes discussions from the initial LSH PR
[https://github.com/apache/spark/pull/15148] as well as the follow-up for hash
distance [https://github.com/apache/spark/pull/15800]. This will be broken
into subtasks:
* API changes (targeted for 2.1)
* algorithmic fixes (targeted for 2.1)
* documentation improvements (ideally 2.1, but could slip)
The major issues we have mentioned are as follows:
* OR vs AND amplification
** Need to make API flexible enough to support both types of amplification in
the future
** Need to clarify which we support, including in each model function
(transform, similarity join, neighbors)
* Need to clarify which algorithms we have implemented, improve docs and
references, and fix the algorithms if needed.
These major issues are broken down into detailed issues below.
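
To make the OR vs AND distinction concrete, here is a minimal sketch of how each type of amplification changes the collision probability. It assumes the base hash functions are independent and that a pair of keys collides under a single base hash function with probability {{p}}; the function names are hypothetical and not part of any Spark API.

```python
def and_amplified(p: float, k: int) -> float:
    """Collision probability when all k base hash functions must agree."""
    return p ** k


def or_amplified(p: float, l: int) -> float:
    """Collision probability when at least one of l hash functions must agree."""
    return 1.0 - (1.0 - p) ** l


def and_then_or(p: float, k: int, l: int) -> float:
    """AND within each of l compound hash functions, then OR across them."""
    return or_amplified(and_amplified(p, k), l)
```

AND-amplification lowers the collision probability (fewer false positives), while OR-amplification raises it (fewer false negatives), which is why the two are usually combined.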
h3. LSH abstraction
* Rename {{outputDim}} to something indicative of OR-amplification.
** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used in
the future for AND amplification (Thanks [~mlnick]!)
* transform
** Update output schema to {{Array of Vector}} instead of {{Vector}}. This is
the "raw" output of all hash functions, i.e., with no aggregation for
amplification.
** Clarify meaning of output in terms of multiple hash functions and
amplification.
** Note: We will _not_ worry about users using this output for dimensionality
reduction; if anything, that use case can be explained in the User Guide.
* Documentation
** Clarify terminology used everywhere
*** hash function {{h_i}}: basic hash function without amplification
*** hash value {{h_i(key)}}: output of a hash function
*** compound hash function {{g = (h_0, h_1, ..., h_{K-1})}}: hash function with
AND-amplification using K base hash functions
*** compound hash function value {{g(key)}}: vector-valued output
*** hash table {{H = (g_0, g_1, ..., g_{L-1})}}: hash function with
OR-amplification using L compound hash functions
*** hash table value {{H(key)}}: array-of-vectors output
*** This terminology is largely pulled from Wang et al.'s survey and the
multi-probe LSH paper.
** Link clearly to documentation (Wikipedia or papers) which matches our
terminology and what we implemented
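
The terminology above can be sketched in a few lines. This is only an illustration under assumptions: {{coordinate_bucket}} is a made-up toy base hash function (it just buckets one coordinate of the key), and {{compound}}, {{hash_table}} and {{collide}} are hypothetical helper names, not the proposed Spark API.

```python
import math
from typing import Callable, List, Sequence, Tuple

Key = Sequence[float]
HashFunction = Callable[[Key], int]


def coordinate_bucket(i: int, width: float) -> HashFunction:
    """Toy base hash function h_i (an assumption): bucket index of coordinate i."""
    return lambda key: math.floor(key[i] / width)


def compound(hs: List[HashFunction]) -> Callable[[Key], Tuple[int, ...]]:
    """g = (h_0, ..., h_{K-1}); g(key) is vector-valued (a tuple here)."""
    return lambda key: tuple(h(key) for h in hs)


def hash_table(gs) -> Callable[[Key], List[Tuple[int, ...]]]:
    """H = (g_0, ..., g_{L-1}); H(key) is an array of vectors."""
    return lambda key: [g(key) for g in gs]


def collide(value_a, value_b) -> bool:
    """AND inside each compound value, OR across the L compound values."""
    return any(ga == gb for ga, gb in zip(value_a, value_b))


# K = 2 base hash functions per compound hash function, L = 2 compound functions.
g0 = compound([coordinate_bucket(0, 1.0), coordinate_bucket(1, 1.0)])
g1 = compound([coordinate_bucket(2, 1.0), coordinate_bucket(3, 1.0)])
H = hash_table([g0, g1])
```

With this layout, two keys are candidate matches only if some {{g_j}} agrees on all of its K hash values; sharing one hash value in every {{g_j}} is not enough.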
h3. RandomProjection (or P-Stable Distributions)
* Rename {{RandomProjection}}
** Options include: {{ScalarRandomProjectionLSH}},
{{BucketedRandomProjectionLSH}}, {{PStableLSH}}
* API privacy
** Make randUnitVectors private
* hashFunction
** Currently, this uses OR-amplification for single probing, as we intended.
** It does *not* do multiple probing, at least not in the sense of the original
MPLSH paper. We should fix that or at least document its behavior.
* Documentation
** Clarify this is the P-Stable Distribution LSH method listed in Wikipedia
** Also link to the multi-probe LSH paper since that explains this method very
clearly.
** Clarify hash function and distance metric
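
For the "hash function and distance metric" point, here is a rough sketch of the textbook p-stable hash for Euclidean distance as I understand it from the references: {{h(v) = floor((a . v + b) / w)}}, with {{a}} drawn from a Gaussian and {{b}} drawn uniformly from {{[0, w)}}. The names below are hypothetical, and this issue does not state whether the current {{RandomProjection}} code includes the offset {{b}}, so treat this as the reference form rather than a description of our implementation.

```python
import math
import random
from typing import Callable, Sequence


def make_pstable_hash(dim: int, bucket_width: float,
                      rng: random.Random) -> Callable[[Sequence[float]], int]:
    """Build one base hash function h(v) = floor((a . v + b) / bucket_width)."""
    # Gaussian entries are 2-stable, which matches Euclidean distance.
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Random offset so bucket boundaries are not tied to the origin.
    b = rng.uniform(0.0, bucket_width)

    def h(v: Sequence[float]) -> int:
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / bucket_width)

    return h


def euclidean_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """The distance metric this hash family is locality-sensitive for."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
```

A larger {{bucket_width}} makes nearby points more likely to share a bucket, at the cost of more false positives.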
h3. MinHash
* Rename {{MinHash}} -> {{MinHashLSH}}
* API privacy
** Make randCoefficients, numEntries private
* hashDistance (used in approxNearestNeighbors)
** Update to use average of indicators of hash collisions [SPARK-18334]
** See [Wikipedia|https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a reference
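
As a sketch of what "average of indicators of hash collisions" could look like: the fraction of positions where two MinHash signatures agree estimates the Jaccard similarity, per the Wikipedia variant with many hash functions. Turning that into a distance as one minus the fraction is my assumption here, as are the helper names and the toy hash family {{(a * x + b) % prime}}; none of this is the actual Spark code.

```python
from typing import Iterable, List, Sequence, Tuple


def minhash_signature(items: Iterable[int],
                      coefficients: Sequence[Tuple[int, int]],
                      prime: int) -> List[int]:
    """One min-hash value per (a, b) pair, using the toy hash (a * x + b) % prime."""
    items = list(items)
    return [min((a * x + b) % prime for x in items) for a, b in coefficients]


def collision_fraction(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Average of the indicators [sig_a[i] == sig_b[i]] over all hash functions."""
    if len(sig_a) != len(sig_b) or len(sig_a) == 0:
        raise ValueError("signatures must be non-empty and of equal length")
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


def hash_distance(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Assumed distance form: one minus the collision fraction."""
    return 1.0 - collision_fraction(sig_a, sig_b)
```

With only a handful of hash functions the estimate is coarse; it tightens as more hash functions are averaged.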
h3. All references
I'm just listing references I looked at.
Papers
* [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
* [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
* [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
* [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe LSH
paper
Wikipedia
* [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
* [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]