On 17 April 2015 at 13:52, Daniel Vainsencher <daniel.vainsenc...@gmail.com>
wrote:

> On 04/16/2015 05:49 PM, Joel Nothman wrote:
> > I more or less agree. Certainly we only need to do one searchsorted per
> > query per tree, and then do linear scans. There is a question of how
> > close we stay to the original LSHForest algorithm, which relies on
> > matching prefixes rather than hamming distance. Hamming distance is
> > easier to calculate in NumPy and is probably faster to calculate in C
> > too (with or without using POPCNT). Perhaps the only advantage of using
> > Cython in your solution is to avoid the memory overhead of unpackbits.
> You obviously know more than I do about Cython vs numpy options.
>
> > However, n_candidates before and after is arguably not sufficient if one
> > side has more than n_candidates with a high prefix overlap.
> I disagree. Being able to look at 2*n_candidates that must contain
> n_candidates of the closest ones, rather than "as many as happen to
> agree on x number of bits" is a feature, not a bug. Especially if we
> then filter them by hamming distance.
>

But it need not contain the closest ones that would have been retrieved by
LSHForest (assuming we're only looking at a single tree). Let's say
n_candidates is 1, our query is 110000 and our index contains

A. 101111 agreed = 1
B. 110011 agreed = 4
C. 110100 agreed = 5

A binary search will place the query between A and B, so the
2 x n_candidates window includes only A and B. C is closer and has a
longer prefix overlap with the query than A does.
My understanding of LSHForest is that its ascent by prefix length would
necessarily find C. Your approach would not.
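
The example above can be checked with a short sketch. This is only an
illustration, not code from either implementation: the helper names
(common_prefix_len, hamming) and the "ascent order" ranking are assumptions
made for this message, and the ranking says nothing about where a real
LSHForest query would stop ascending.

```python
import numpy as np

def common_prefix_len(a: str, b: str) -> int:
    """Number of leading bits on which a and b agree (hypothetical helper)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def hamming(a: str, b: str) -> int:
    """Number of bit positions in which a and b differ (hypothetical helper)."""
    return sum(x != y for x, y in zip(a, b))

query = "110000"
index = {"A": "101111", "B": "110011", "C": "110100"}
n_candidates = 1

# A single sorted "tree": names ordered by their binary hash value.
names = sorted(index, key=index.get)
codes = np.array([int(index[name], 2) for name in names])

# Fixed-window approach: one searchsorted, then n_candidates on each side.
pos = int(np.searchsorted(codes, int(query, 2)))
window = names[max(pos - n_candidates, 0): pos + n_candidates]

# Order in which entries become reachable when ascending by prefix length,
# i.e. longest shared prefix first.
ascent_order = sorted(
    index, key=lambda name: common_prefix_len(index[name], query), reverse=True
)

prefix = {name: common_prefix_len(code, query) for name, code in index.items()}
dist = {name: hamming(code, query) for name, code in index.items()}

print("window:", window)
print("ascent order:", ascent_order)
print("prefix lengths:", prefix)
print("hamming distances:", dist)
```

The window holds A and B, whereas an ascent by prefix length reaches B, then
C, and only then A, even though C is the nearest entry by Hamming distance.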

While that may be a feature of your approach, I think we have reason to
prefer a published algorithm.