[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

Yunni Thu, 10 Nov 2016 16:11:04 -0800

Github user Yunni commented on the issue:

    https://github.com/apache/spark/pull/15800
  
    @jkbradley I agree with your idea to drop the full sort and use 
`approxQuantile` to find the threshold. A full sort over the whole dataset 
hurts performance significantly. Please file a ticket for this.
    
    > You're talking about enlarging search ranges, or iterations, a few times.
    
    Enlarging the search range does not necessarily mean more iterations. With 
the same threshold logic, (a) gives a larger search range than (b). Do you 
agree with this?
    
    > In both (a) and (b), you come up with some set of candidates. I was 
assuming we would compute keyDistance for those candidates and pick the top 
ones, just as in the current implementation.
    
    Agree with this part.
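    
    To make the flow we are discussing concrete, here is a minimal sketch in 
plain Python (not the Spark implementation): estimate a hash-distance threshold 
without sorting the whole dataset, keep the rows under that threshold as 
candidates, then compute the exact key distance only for those candidates and 
pick the top k. The names `approx_threshold`, `approx_nearest`, `sample_size` 
and `slack` are hypothetical, and a sampled quantile stands in for 
`approxQuantile` here, so treat this as an illustration under those assumptions.

```python
import heapq
import random


def approx_threshold(distances, k, sample_size=200, slack=2.0, seed=0):
    """Estimate a threshold expected to admit roughly slack * k rows.

    Assumption: a quantile taken from a random sample stands in for an
    approximate-quantile routine. Only the sample is sorted, never the
    full list of distances.
    """
    n = len(distances)
    if n == 0:
        raise ValueError("distances must be non-empty")
    quantile = min(1.0, slack * k / n)
    rng = random.Random(seed)
    sample = sorted(rng.sample(distances, min(sample_size, n)))
    idx = min(len(sample) - 1, int(quantile * len(sample)))
    return sample[idx]


def approx_nearest(rows, k, hash_distance, key_distance):
    """Return up to k rows: filter by hash distance, rank by key distance."""
    hash_dists = [hash_distance(r) for r in rows]
    threshold = approx_threshold(hash_dists, k)
    # Candidate set: every row whose hash distance is within the threshold.
    candidates = [r for r, d in zip(rows, hash_dists) if d <= threshold]
    # The exact distance is computed only for the candidates.
    return heapq.nsmallest(k, candidates, key=key_distance)


# Toy data (hypothetical): integers as rows, 500 as the query key.
rows = list(range(1000))
result = approx_nearest(
    rows,
    k=5,
    hash_distance=lambda r: abs(r // 10 - 50) / 100,  # coarse, bucketed
    key_distance=lambda r: abs(r - 500),              # exact
)
```

    Because the threshold is only an estimate, the candidate set can be 
larger or smaller than k. The `slack` factor is one way to bias it towards 
being larger, so that the final ranking by key distance has enough rows to 
choose from.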
    
    BTW, for one concrete example, you can run the `approxNearestNeighbors for 
min hash` test in MinHashSuite.scala after changing it to `singleProbe = false`:
     - `hashDistance` in (a) gives precision/recall of (0.95, 0.95)
     - `hashDistance` in (b) gives precision/recall of (0.6, 0.6)



