[jira] [Created] (SPARK-18392) LSH API, algorithm, and documentation follow-ups

Joseph K. Bradley (JIRA) Wed, 09 Nov 2016 16:02:29 -0800

Joseph K. Bradley created SPARK-18392:
-----------------------------------------


             Summary: LSH API, algorithm, and documentation follow-ups
                 Key: SPARK-18392
                 URL: https://issues.apache.org/jira/browse/SPARK-18392
             Project: Spark
          Issue Type: Improvement
          Components: ML
            Reporter: Joseph K. Bradley


This JIRA summarizes discussions from the initial LSH PR 
[https://github.com/apache/spark/pull/15148] as well as the follow-up PR for hash 
distance [https://github.com/apache/spark/pull/15800].  The work will be broken 
into subtasks:
* API changes (targeted for 2.1)
* algorithmic fixes (targeted for 2.1)
* documentation improvements (ideally 2.1, but could slip)

The major issues we have mentioned are as follows:
* OR vs AND amplification
** Need to make API flexible enough to support both types of amplification in 
the future
** Need to clarify which we support, including in each model function 
(transform, similarity join, neighbors)
* Need to clarify which algorithms we have implemented, improve docs and 
references, and fix the algorithms if needed.
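
For reference, the effect of the two amplification types on collision probability can be sketched as below. This is a minimal illustration, not Spark code: it assumes the hash functions are independent and that {{p}} is the collision probability of a single base hash function for a given pair of points. The function names are made up for this sketch.

```python
def and_amplified(p: float, k: int) -> float:
    """Collision probability when all k independent base hashes must agree."""
    return p ** k


def or_amplified(p: float, num_tables: int) -> float:
    """Collision probability when at least one of num_tables independent hashes agrees."""
    return 1.0 - (1.0 - p) ** num_tables


def and_then_or(p: float, k: int, num_tables: int) -> float:
    """AND-amplify k base hashes into one compound hash, then OR over num_tables of them."""
    return or_amplified(and_amplified(p, k), num_tables)


if __name__ == "__main__":
    # AND lowers the collision probability, OR raises it.
    for p in (0.2, 0.5, 0.8):
        print(p, and_amplified(p, 3), or_amplified(p, 3), and_then_or(p, 3, 3))
```

AND-amplification reduces false positives, and OR-amplification reduces false negatives, which is why the API should leave room for both.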

These major issues are broken down into detailed issues below.

h3. LSH abstraction

* Rename {{outputDim}} to something indicative of OR-amplification.
** My current top pick is {{numHashTables}}, with {{numHashFunctions}} used in 
the future for AND amplification (Thanks [~mlnick]!)

* transform
** Update output schema to {{Array of Vector}} instead of {{Vector}}.  This is 
the "raw" output of all hash functions, i.e., with no aggregation for 
amplification.
** Clarify meaning of output in terms of multiple hash functions and 
amplification.
** Note: We will _not_ worry about users using this output for dimensionality 
reduction; if anything, that use case can be explained in the User Guide.
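
To make the "raw" output concrete, here is a rough sketch of how a consumer could apply OR-amplification on top of an array-of-vectors output. It is plain Python, not the Spark API: the helper names and the tuple representation of each vector are assumptions for illustration only.

```python
from typing import List, Sequence


def tables_in_common(raw_a: Sequence[Sequence[float]],
                     raw_b: Sequence[Sequence[float]]) -> List[int]:
    """Indices of hash tables whose whole vector matches for both keys.

    Each argument is the raw transform output for one key: one vector per
    hash table. Requiring the whole vector to match is the AND step.
    """
    return [i for i, (ga, gb) in enumerate(zip(raw_a, raw_b))
            if tuple(ga) == tuple(gb)]


def is_candidate_pair(raw_a: Sequence[Sequence[float]],
                      raw_b: Sequence[Sequence[float]]) -> bool:
    """OR step: the keys are candidates if any single table matches."""
    return len(tables_in_common(raw_a, raw_b)) > 0


if __name__ == "__main__":
    a = [(1.0, 2.0), (3.0, 4.0)]
    b = [(1.0, 2.0), (9.0, 9.0)]
    print(tables_in_common(a, b), is_candidate_pair(a, b))
```

Because transform does no aggregation, this combination logic lives in the caller (similarity join, neighbors) and not in the output column.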

* Documentation
** Clarify terminology used everywhere
*** hash function {{h_i}}: basic hash function without amplification
*** hash value {{h_i(key)}}: output of a hash function
*** compound hash function {{g = (h_0, h_1, ..., h_{K-1})}}: hash function with 
AND-amplification using K base hash functions
*** compound hash function value {{g(key)}}: vector-valued output
*** hash table {{H = (g_0, g_1, ..., g_{L-1})}}: hash function with 
OR-amplification using L compound hash functions
*** hash table value {{H(key)}}: array-of-vectors output, one vector per 
compound hash function
*** This terminology is largely pulled from Wang et al.'s survey and the 
multi-probe LSH paper.
** Link clearly to documentation (Wikipedia or papers) which matches our 
terminology and what we implemented
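
The terminology above can be sketched in code as follows. This is only meant to show how {{h_i}}, {{g}}, and {{H}} nest. The base hash here is a placeholder over integer keys and is *not* locality-sensitive, and all names and constants are hypothetical.

```python
import random
from typing import Callable, List, Tuple

BaseHash = Callable[[int], int]                      # h_i
CompoundHash = Callable[[int], Tuple[int, ...]]      # g
HashTable = Callable[[int], List[Tuple[int, ...]]]   # H


def make_base_hash(rng: random.Random, num_buckets: int = 16) -> BaseHash:
    """h_i: a placeholder base hash with no amplification."""
    a = rng.randrange(1, 1009)
    b = rng.randrange(0, 1009)
    return lambda key: ((a * key + b) % 1009) % num_buckets


def make_compound_hash(rng: random.Random, k: int) -> CompoundHash:
    """g = (h_0, ..., h_{K-1}): vector-valued, used for AND-amplification."""
    hs = [make_base_hash(rng) for _ in range(k)]
    return lambda key: tuple(h(key) for h in hs)


def make_hash_table(rng: random.Random, k: int, num_tables: int) -> HashTable:
    """H = (g_0, ..., g_{L-1}): array of vectors, used for OR-amplification."""
    gs = [make_compound_hash(rng, k) for _ in range(num_tables)]
    return lambda key: [g(key) for g in gs]


if __name__ == "__main__":
    H = make_hash_table(random.Random(0), k=3, num_tables=2)
    print(H(42))  # 2 vectors (L = 2), each of length 3 (K = 3)
```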

h3. RandomProjection (or P-Stable Distributions)

* Rename {{RandomProjection}}
** Options include: {{ScalarRandomProjectionLSH}}, 
{{BucketedRandomProjectionLSH}}, {{PStableLSH}}

* API privacy
** Make randUnitVectors private

* hashFunction
** Currently, this uses OR-amplification for single probing, as we intended.
** It does *not* do multiple probing, at least not in the sense of the original 
multi-probe LSH (MPLSH) paper.  We should fix that or at least document its behavior.

* Documentation
** Clarify this is the P-Stable Distribution LSH method listed on Wikipedia
** Also link to the multi-probe LSH paper since that explains this method very 
clearly.
** Clarify hash function and distance metric
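
As a starting point for the docs, the textbook p-stable hash for Euclidean distance is {{h(x) = floor((a . x + b) / w)}}, with {{a}} drawn from a Gaussian and {{b}} uniform in {{[0, w)}}. The sketch below follows that textbook form. It is not taken from the current Spark code, which may differ (for example in whether it uses the offset {{b}}); the names {{make_pstable_hash}} and {{bucket_width}} are assumptions.

```python
import math
import random
from typing import Callable, List, Sequence


def make_pstable_hash(dim: int, bucket_width: float,
                      rng: random.Random) -> Callable[[Sequence[float]], int]:
    """Build one base hash h(x) = floor((a . x + b) / w).

    a has i.i.d. standard Gaussian entries (the Gaussian is 2-stable, so the
    matching distance metric is Euclidean); b is a uniform offset in [0, w).
    """
    a: List[float] = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b: float = rng.uniform(0.0, bucket_width)

    def h(x: Sequence[float]) -> int:
        projection = sum(ai * xi for ai, xi in zip(a, x))
        return math.floor((projection + b) / bucket_width)

    return h


if __name__ == "__main__":
    h = make_pstable_hash(dim=3, bucket_width=4.0, rng=random.Random(1))
    # Nearby points tend to share a bucket more often than distant points.
    print(h([1.0, 2.0, 3.0]), h([1.1, 2.0, 3.0]), h([50.0, -20.0, 7.0]))
```

A larger {{bucket_width}} makes collisions more likely for all pairs, so it trades recall against the number of candidates.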

h3. MinHash

* Rename {{MinHash}} -> {{MinHashLSH}}

* API privacy
** Make randCoefficients, numEntries private

* hashDistance (used in approxNearestNeighbors)
** Update to use average of indicators of hash collisions [SPARK-18334]
** See [Wikipedia|https://en.wikipedia.org/wiki/MinHash#Variant_with_many_hash_functions] for a reference
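
A minimal sketch of the "average of indicators of hash collisions" idea, following the many-hash-functions variant on Wikipedia. This is not the Spark implementation: the hash family {{(a * (x + 1) + b) mod prime}}, the choice of prime, and defining the distance as one minus the collision rate are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Set, Tuple

PRIME = 2038074743  # assumed modulus; any prime larger than the item universe works


def make_coefficients(num_hashes: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Random (a, b) pairs, one per hash function."""
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items: Set[int],
                      coeffs: Sequence[Tuple[int, int]]) -> List[int]:
    """One min-hash value per hash function; items must be non-empty."""
    return [min((a * (x + 1) + b) % PRIME for x in items) for a, b in coeffs]


def collision_rate(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Average of collision indicators; estimates Jaccard similarity."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


def hash_distance(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """Assumed distance: one minus the collision rate."""
    return 1.0 - collision_rate(sig_a, sig_b)


if __name__ == "__main__":
    coeffs = make_coefficients(128, random.Random(0))
    a = minhash_signature({1, 2, 3, 4}, coeffs)
    b = minhash_signature({3, 4, 5, 6}, coeffs)
    print(hash_distance(a, b))  # rough estimate of 1 - Jaccard(A, B)
```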

h3. All references

I'm just listing references I looked at.

Papers
* [http://cseweb.ucsd.edu/~dasgupta/254-embeddings/lawrence.pdf]
* [https://people.csail.mit.edu/indyk/p117-andoni.pdf]
* [http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf]
* [http://www.cs.princeton.edu/cass/papers/mplsh_vldb07.pdf] - Multi-probe LSH 
paper

Wikipedia
* [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search]
* [https://en.wikipedia.org/wiki/Locality-sensitive_hashing#Amplification]


