[GitHub] spark issue #15800: [SPARK-18334] MinHash should use binary hash distance

jkbradley Thu, 10 Nov 2016 12:06:06 -0800

Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/15800
  
    I agree that @sethah and I are on the same page.  Two clarifications about 
@Yunni's post:
    * I'm not sure what you mean by "your method will only have 1 indicator for 
each row."  I'm proposing to compute some number of buckets (which I called 
"LxK" above), compute an indicator for each bucket, and average the indicators.
    * I am not proposing multiple iterations of searching.  Sorting by hash 
distance would effectively perform those iterations in a single sort.
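
    To make the averaging concrete, here is a minimal sketch in plain Python, 
not the Spark implementation.  It assumes each row carries one hash value per 
bucket (LxK values in total) and that the indicator for a bucket is 1 when the 
query and the row disagree on that value.  The names `hash_distance`, `query` 
and `rows` are hypothetical and exist only for this illustration.

```python
def hash_distance(query_hashes, row_hashes):
    """Average of per-bucket indicators: the fraction of buckets that differ."""
    if len(query_hashes) != len(row_hashes):
        raise ValueError("signatures must have the same length")
    if not query_hashes:
        raise ValueError("signatures must not be empty")
    # Indicator is 1 when the two hash values differ, 0 when they match.
    indicators = [0 if q == r else 1 for q, r in zip(query_hashes, row_hashes)]
    return sum(indicators) / len(indicators)


# Hypothetical signatures with 4 buckets each (assumed data, for illustration).
query = [3, 7, 1, 9]
rows = {
    "a": [3, 7, 1, 9],  # matches the query in every bucket
    "b": [3, 7, 2, 8],  # matches in half of the buckets
    "c": [5, 6, 2, 8],  # matches in no bucket
}

# A single sort by hash distance visits exact matches first, then near
# matches, which is what repeated probing would otherwise do step by step.
ranked = sorted(rows, key=lambda name: hash_distance(query, rows[name]))
print(ranked)
```

    Rows that match in every bucket sort first, followed by rows that match in 
progressively fewer buckets, so one ordering covers every probing round.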
    
    I also just realized something else: for approxNearestNeighbors with 
multiple probing, we should not sort the entire dataset.  Shall we switch to 
an approach that avoids sorting all rows, such as using approxQuantiles to 
pick a distance threshold and keeping only the rows below it?  I'm OK with 
this improvement coming in a later release.  If you agree, I'll make a JIRA.
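
    A rough sketch of the threshold idea, again in plain Python rather than 
Spark.  It assumes hash distances have already been computed per row, and it 
uses a sorted random sample as a stand-in for an approximate quantile (in 
Spark this role would presumably be played by something like 
`DataFrame.stat.approxQuantile`).  The function names and the `sample_size`, 
`slack` and `seed` parameters are hypothetical.

```python
import random


def approx_threshold(distances, k, sample_size=100, slack=2.0, seed=0):
    """Estimate a distance cutoff meant to keep roughly slack * k rows.

    Only a random sample is sorted, never the full list of distances.
    """
    n = len(distances)
    if n == 0:
        raise ValueError("distances must not be empty")
    rng = random.Random(seed)
    sample = sorted(rng.sample(distances, min(sample_size, n)))
    # Target quantile, inflated by `slack` so the cutoff errs towards
    # keeping too many candidates rather than too few.
    q = min(1.0, slack * k / n)
    idx = min(len(sample) - 1, int(q * len(sample)))
    return sample[idx]


def nearest_by_threshold(distances, k, **kwargs):
    """Return up to k (distance, row_index) pairs, sorting only candidates."""
    cutoff = approx_threshold(distances, k, **kwargs)
    # One linear pass to filter; the sort then touches only the survivors.
    candidates = [(d, i) for i, d in enumerate(distances) if d <= cutoff]
    candidates.sort()
    return candidates[:k]


# Hypothetical precomputed hash distances for 10 rows.
distances = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 1.0]
print(nearest_by_threshold(distances, k=2))
```

    Because the cutoff is only an estimate, a sampled threshold can return 
fewer than k rows.  A real implementation would need a policy for that case, 
such as raising the quantile and filtering again.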


