Re: [Scikit-learn-general] Performance of LSHForest

Daniel Vainsencher Mon, 20 Apr 2015 08:46:29 -0700

On 04/19/2015 08:18 AM, Joel Nothman wrote:
>
>
> On 17 April 2015 at 13:52, Daniel Vainsencher
> <daniel.vainsenc...@gmail.com> wrote:
>
>     On 04/16/2015 05:49 PM, Joel Nothman wrote:
>     > I more or less agree. Certainly we only need to do one searchsorted per
>     > query per tree, and then do linear scans. There is a question of how
>     > close we stay to the original LSHForest algorithm, which relies on
>     > matching prefixes rather than hamming distance. Hamming distance is
>     > easier to calculate in NumPy and is probably faster to calculate in C
>     > too (with or without using POPCNT). Perhaps the only advantage of using
>     > Cython in your solution is to avoid the memory overhead of unpackbits.
>     You obviously know more than I do about Cython vs numpy options.
>
>     > However, n_candidates before and after is arguably not sufficient if one
>     > side has more than n_candidates with a high prefix overlap.
>     I disagree. Being able to look at 2*n_candidates that must contain
>     n_candidates of the closest ones, rather than "as many as happen to
>     agree on x number of bits" is a feature, not a bug. Especially if we
>     then filter them by hamming distance.
>
>
> But it need not contain the closest ones that would have been retrieved
> by LSHForest (assuming we're only looking at a single tree). Let's say
> n_candidates is 1, our query is 110000 and our index contains
>
> A. 101111 agreed = 1
> B. 110011 agreed = 4
> C. 110100 agreed = 5
>
> A binary search will land between A and B. The 2 * n_candidates window includes A
> and B. C is closer and has a longer prefix overlap with the query than A
> does. My understanding of LSHForest is that its ascent by prefix length
> would necessarily find C. Your approach would not.
Agreed!
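
For concreteness, here is a minimal sketch of that example in plain Python.
It is only an illustration under stated assumptions: bisect_left stands in
for a single searchsorted call on one sorted tree, and prefix_len, hamming,
and the window slicing are hypothetical helpers written for this message,
not code from LSHForest or scikit-learn.

```python
from bisect import bisect_left

N_BITS = 6  # code length used in the example above


def prefix_len(a, b, n_bits=N_BITS):
    """Number of leading bits on which a and b agree."""
    return n_bits - (a ^ b).bit_length()


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


query = 0b110000
index = {"A": 0b101111, "B": 0b110011, "C": 0b110100}

# One "tree" is just the codes in sorted order.
names = sorted(index, key=index.get)
codes = [index[name] for name in names]

# One binary search per query, then take n_candidates entries on each side.
n_candidates = 1
pos = bisect_left(codes, query)
window = names[max(0, pos - n_candidates): pos + n_candidates]

for name in names:
    code = index[name]
    print(name, format(code, "06b"),
          "prefix =", prefix_len(query, code),
          "hamming =", hamming(query, code),
          "in window" if name in window else "missed")
```

The window holds A and B. C is left out even though it differs from the
query in a single bit and shares a longer prefix with the query than A does.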


>
> While that may be a feature of your approach, I think we have reason to
> prefer a published algorithm.
Ok, then I guess this is not the place for this idea.



> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>


