[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

sethah Fri, 11 Nov 2016 08:16:55 -0800

Github user sethah commented on the issue:

    https://github.com/apache/spark/pull/15800
  
    @jkbradley Thanks for clarifying, I see your argument now. I agree that it 
makes sense from a statistical perspective. Still, I have not seen a single 
paper that describes anything exactly like what we're proposing. I would be 
OK with disabling the multi-probe option for the 2.1 release, so we could carry 
on this discussion and continue hashing out (pun intended :) the APIs. 
    
    It is my understanding that the main benefit of multi-probe described in 
the reference paper is to reduce the storage space that many hash tables would 
otherwise require. We are not actually storing the entire hash table as a data 
structure, though, so our implementation is a bit different. I think there's 
room for discussion/tests about what the benefits are and how drastically they 
impact performance.
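
    To make the storage argument concrete, here is a minimal sketch of the 
general multi-probe idea, not of the Spark implementation. It assumes a single 
random-projection hash with integer bucket keys; the names make_hash, 
build_table, single_probe and multi_probe are hypothetical and exist only for 
this illustration. Instead of building more tables, the query also looks in 
buckets adjacent to its own.

```python
import random
from collections import defaultdict

def make_hash(dim, width, seed):
    """Hypothetical random-projection hash: floor((a . x + b) / width)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)
    def h(x):
        return int((sum(ai * xi for ai, xi in zip(a, x)) + b) // width)
    return h

def build_table(points, h):
    """One hash table: bucket key -> list of point indices."""
    table = defaultdict(list)
    for idx, p in enumerate(points):
        table[h(p)].append(idx)
    return table

def single_probe(table, h, q):
    """Candidates from the query's own bucket only."""
    return set(table.get(h(q), []))

def multi_probe(table, h, q, radius=1):
    """Candidates from the query's bucket plus its neighbors within radius."""
    key = h(q)
    candidates = set()
    for delta in range(-radius, radius + 1):
        candidates.update(table.get(key + delta, []))
    return candidates
```

    With one table, multi_probe returns a superset of single_probe's 
candidates, which is how it trades extra lookups for fewer stored tables. 
Whether that trade matters when no table is materialized is the open question 
above.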


