[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

jkbradley Fri, 11 Nov 2016 16:01:05 -0800

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/15800
  
    @Yunni I guess we should remove it from the public API.  I'm OK with 
leaving the code there and making it private for now.
    
    *One response:*
    
    > Enlarging search ranges does not necessarily mean iterations. The same 
threshold logic for (a) gives a larger search range than for (b). Do you agree 
with this?
    
    If you use the same threshold for both, then I agree.  But that's not a 
reasonable comparison, since (a) will do many times more work and communicate 
many times more data (up to L times more).  That extra cost comes from the 
posexplode step.
    
    If you compare the two approaches when each selects the same number of 
rows (on which to compute the keyDistance and select neighbors), then (b) will 
yield many more distinct candidates, since its selection contains no duplicates.
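
A rough sketch of the duplicate effect described above, in plain Python rather than Spark. It assumes that (a) ranks exploded (row, hash table) entries and that (b) ranks whole rows by a single distance; neither definition is spelled out in this message, so both are assumptions. The function names and toy data are hypothetical, and the per-table distance is the absolute bucket difference quoted further down in this comment.

```python
from typing import Dict, List, Tuple

def explode(rows: Dict[str, List[int]]) -> List[Tuple[str, int, int]]:
    """One (row_id, table_index, hash_value) entry per hash table, similar in spirit to posexplode."""
    return [(rid, i, h) for rid, hashes in rows.items() for i, h in enumerate(hashes)]

def candidates_exploded(rows: Dict[str, List[int]], query: List[int], n: int) -> List[str]:
    """Assumed strategy (a): rank exploded entries by per-table distance and keep n entries."""
    entries = explode(rows)
    entries.sort(key=lambda e: (abs(e[2] - query[e[1]]), e[0], e[1]))
    return [rid for rid, _, _ in entries[:n]]

def candidates_per_row(rows: Dict[str, List[int]], query: List[int], n: int) -> List[str]:
    """Assumed strategy (b): one distance per row (min over tables) and keep n rows."""
    def row_distance(rid: str) -> int:
        return min(abs(h - q) for h, q in zip(rows[rid], query))
    return sorted(rows, key=lambda rid: (row_distance(rid), rid))[:n]

# Toy data: 4 rows, L = 3 hash tables each.
rows = {
    "r1": [5, 9, 2],
    "r2": [5, 9, 7],
    "r3": [1, 4, 2],
    "r4": [8, 3, 6],
}
query = [5, 9, 2]

exploded_pick = candidates_exploded(rows, query, 4)
per_row_pick = candidates_per_row(rows, query, 4)

print(len(explode(rows)))   # 12 entries, i.e. L times the number of rows
print(exploded_pick)        # ['r1', 'r1', 'r1', 'r2'] -> only 2 distinct rows
print(per_row_pick)         # ['r1', 'r2', 'r3', 'r4'] -> 4 distinct rows
```

With the same budget of 4 selected entries, the exploded variant spends three of them on the same row, while the per-row variant returns four distinct candidates.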
    
    *Also, one new comment:*
    
    I'm testing vs the current implementation (min(abs(query bucket - row 
bucket))).  Weirdly, the current one is getting consistently better results 
than my proposal...even though this does not make sense to me statistically 
(and even though the current implementation isn't what any of us are proposing 
to use!).  I'm still banging my head against this...
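
For reference, a minimal sketch of the two kinds of key distance under discussion. The first function follows the formula quoted above, min(abs(query bucket - row bucket)). The second is only one possible reading of the "binary hash distance" in the issue title (the fraction of hash tables whose buckets differ). The proposal itself is not spelled out in this message, so that function, its name, and the toy values are assumptions for illustration only.

```python
from typing import Sequence

def min_abs_bucket_distance(query: Sequence[int], row: Sequence[int]) -> int:
    """Current key distance as quoted above: smallest absolute bucket difference over all tables."""
    return min(abs(q - r) for q, r in zip(query, row))

def binary_hash_distance(query: Sequence[int], row: Sequence[int]) -> float:
    """Hypothetical binary variant: fraction of hash tables where the buckets differ."""
    return sum(q != r for q, r in zip(query, row)) / len(query)

query = [12, 40, 7]
near_miss = [13, 41, 8]      # numerically close buckets, but no exact match
one_match = [12, 900, 500]   # one exact match, other buckets far away

for name, row in [("near_miss", near_miss), ("one_match", one_match)]:
    print(name, min_abs_bucket_distance(query, row), binary_hash_distance(query, row))
```

The two functions can rank rows differently: `near_miss` has a small absolute distance but no colliding bucket, while `one_match` collides in one table and so scores lower on the binary variant.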


