Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/15874
Other comments:
**MinHash**
Looking yet again at this, I think it's using a technically incorrect hash
function. It is *not* a perfect hash function: it can hash two distinct input
indices to the same hash bucket. (As before, see the Wikipedia page on perfect
hashing for the second stage that this construction is missing.) If we want to
fix this, we could instead precompute a random permutation of the indices,
which serves as a perfect hash function.
That said, perhaps it does not matter in practice. If numEntries
(inputDim) is large enough, then the current hash function will probably behave
similarly to a perfect hash function.
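To make the distinction concrete, here is a small illustrative sketch (not Spark's actual implementation): a modular hash of the general form `h(x) = (a*x + b) mod m` can collide when its range is smaller than the domain, while a precomputed random permutation of the indices is collision-free by construction, i.e. a perfect hash. All names here are hypothetical.

```python
import random

def modular_hash(a, b, m):
    """A hash of the general form h(x) = (a*x + b) mod m."""
    return lambda x: (a * x + b) % m

def find_collision(h, domain):
    """Return a pair of distinct inputs with h(x1) == h(x2), or None."""
    seen = {}
    for x in domain:
        y = h(x)
        if y in seen:
            return seen[y], x
        seen[y] = x
    return None

input_dim = 100
domain = range(input_dim)

# A modular hash whose range is smaller than its domain must collide
# (pigeonhole), so it cannot be a perfect hash function on that domain.
h = modular_hash(a=7, b=3, m=50)
assert find_collision(h, domain) is not None

# Alternative: precompute a random permutation of the indices. It is a
# bijection on the domain (hence a perfect hash), and still gives a
# uniformly random ordering of indices for the min-hash computation.
perm = random.sample(range(input_dim), input_dim)
assert find_collision(lambda x: perm[x], domain) is None
```

The permutation costs O(inputDim) memory per hash function, which is the practical trade-off against the cheaper modular hash.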
**approxNearestNeighbors**
This is still not what I proposed, even for single-probe queries: it can
still consider (and sort) a candidate set much larger than
numNearestNeighbors. Since we're running out of time, I'm fine with leaving
it as is for now and changing the behavior in the next release. However,
could you please add a note to the method documentation that this method is
experimental and will likely change behavior in the next release?
Thanks!
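To illustrate the concern with a toy sketch (hypothetical names, not Spark's code): a single-probe query collects every point whose hash matches the query's, and that candidate set can be far larger than k, so sorting all of it costs O(n log n) instead of the O(n log k) achievable with a bounded selection.

```python
import heapq
import random

random.seed(0)
points = [random.random() for _ in range(10_000)]
bucket = lambda v: int(v * 10)   # toy 10-bucket "hash function"
query, k = 0.42, 5

# Single-probe candidate set: everything in the query's bucket.
# With 10 buckets and 10k points this is ~1000 candidates, far more
# than the k = 5 neighbors actually requested.
candidates = [p for p in points if bucket(p) == bucket(query)]
assert len(candidates) > k

# Roughly the current behavior: sort the entire candidate set.
by_full_sort = sorted(candidates, key=lambda p: abs(p - query))[:k]

# Bounded alternative: keep only the k best via a size-k heap.
by_heap = heapq.nsmallest(k, candidates, key=lambda p: abs(p - query))
assert by_full_sort == by_heap
```

A bounded top-k selection like this is one way the behavior could change in a later release without altering the returned results.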