On 04/19/2015 08:18 AM, Joel Nothman wrote:
> On 17 April 2015 at 13:52, Daniel Vainsencher
> <daniel.vainsenc...@gmail.com> wrote:
>
> > On 04/16/2015 05:49 PM, Joel Nothman wrote:
> > > I more or less agree. Certainly we only need to do one searchsorted per
> > > query per tree, and then do linear scans. There is a question of how
> > > close we stay to the original LSHForest algorithm, which relies on
> > > matching prefixes rather than Hamming distance. Hamming distance is
> > > easier to calculate in NumPy and is probably faster to calculate in C
> > > too (with or without using POPCNT). Perhaps the only advantage of using
> > > Cython in your solution is to avoid the memory overhead of unpackbits.
> >
> > You obviously know more than I do about Cython vs NumPy options.
> >
> > > However, n_candidates before and after is arguably not sufficient if one
> > > side has more than n_candidates with a high prefix overlap.
> >
> > I disagree. Being able to look at 2*n_candidates that must contain
> > n_candidates of the closest ones, rather than "as many as happen to
> > agree on x number of bits" is a feature, not a bug. Especially if we
> > then filter them by Hamming distance.
>
> But it need not contain the closest ones that would have been retrieved
> by LSHForest (assuming we're only looking at a single tree). Let's say
> n_candidates is 1, our query is 110000 and our index contains
>
> A. 101111 agreed = 1
> B. 110011 agreed = 4
> C. 110100 agreed = 5
>
> A binary search will find A-B. The n-candidates x 2 window includes A
> and B. C is closer and has a longer prefix overlap with the query than A
> does. My understanding of LSHForest is that its ascent by prefix length
> would necessarily find C. Your approach would not.

Agreed!
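To make the counterexample above concrete, here is a minimal sketch in Python/NumPy. It is not the scikit-learn LSHForest code: the helper names (common_prefix_len, agreeing_bits, window_candidates, prefix_ascent) are made up for illustration, hashes are treated as plain 6-bit integers in a single sorted tree, and prefix_ascent is only a simplified stand-in for the ascent by prefix length. It compares the fixed window around the searchsorted position with a prefix ascent that is asked for the same number of candidates.

```python
import numpy as np

N_BITS = 6  # hash width used in the example (assumption: hashes fit in 6 bits)


def common_prefix_len(a, b, n_bits=N_BITS):
    """Number of leading bits shared by two n_bits-wide integers."""
    diff = int(a) ^ int(b)
    return n_bits - diff.bit_length()


def agreeing_bits(a, b, n_bits=N_BITS):
    """Number of bit positions where a and b agree (n_bits minus Hamming distance)."""
    return n_bits - bin(int(a) ^ int(b)).count("1")


def window_candidates(sorted_hashes, query, n_candidates):
    """Indices of n_candidates entries on each side of the query's sorted position."""
    pos = int(np.searchsorted(sorted_hashes, query))
    lo = max(pos - n_candidates, 0)
    hi = min(pos + n_candidates, len(sorted_hashes))
    return list(range(lo, hi))


def prefix_ascent(sorted_hashes, query, n_candidates, n_bits=N_BITS):
    """Shorten the required prefix until at least n_candidates entries match it."""
    for h in range(n_bits, -1, -1):
        low_mask = (1 << (n_bits - h)) - 1
        lo_key = query & ~low_mask  # smallest hash sharing the first h bits
        hi_key = query | low_mask   # largest hash sharing the first h bits
        start = int(np.searchsorted(sorted_hashes, lo_key, side="left"))
        stop = int(np.searchsorted(sorted_hashes, hi_key, side="right"))
        if stop - start >= n_candidates:
            return list(range(start, stop))
    return list(range(len(sorted_hashes)))


names = ["A", "B", "C"]
hashes = np.array([0b101111, 0b110011, 0b110100])  # already in sorted order
query = 0b110000

for name, h in zip(names, hashes):
    print(name, format(int(h), "06b"),
          "agreeing bits:", agreeing_bits(h, query),
          "prefix:", common_prefix_len(h, query))

window = window_candidates(hashes, query, n_candidates=1)
ascent = prefix_ascent(hashes, query, n_candidates=len(window))
print("window:", [names[i] for i in window])
print("prefix ascent:", [names[i] for i in ascent])
```

In this toy index the window keeps A and B, while the prefix ascent with the same budget keeps B and C, so only the latter contains the entry nearest to the query in Hamming distance.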
>
> While that may be a feature of your approach, I think we have reason to
> prefer a published algorithm.

Ok, then I guess this is not the place for this idea.

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general