Except apparently that commit breaks the code... Maybe I've misunderstood
something :(

On 16 April 2015 at 14:18, Joel Nothman <joel.noth...@gmail.com> wrote:

> ball tree is not vectorized in the sense of SIMD, but there is
> Python/numpy overhead in LSHForest that is not present in ball tree.
>
> I think one of the problems is the high n_candidates relative to the
> n_neighbors. This really increases the search time.
>
> Once we're dealing with large enough index and n_candidates, most time is
> spent in searchsorted in the "synchronous ascending phase", while any
> overhead around it is marginal. Currently we are searching over the whole
> array in each searchsorted, while it could be rewritten to keep better
> track of context to cut down the overall array when searching. While
> possible, I suspect this will look confusing in Python/Numpy, and Cython
> will be a clearer and faster way to present this logic.
>
> On the other hand, time spent in _compute_distances is substantial, and
> yet most of its consumption is *outside* of pairwise_distances. This
> commit
> <https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a>
> cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
> <https://github.com/scikit-learn/scikit-learn/pull/4603>!
>
> On 16 April 2015 at 12:32, Maheshakya Wijewardena <pmaheshak...@gmail.com>
> wrote:
>
>> Moreover, this drawback occurs because LSHForest does not vectorize
>> multiple queries as in 'ball_tree' or any other method. This slows the
>> exact neighbor distance calculation down significantly after approximation.
>> This will not be a problem if queries are for individual points.
>> Unfortunately, former is the more useful usage of LSHForest.
>> Are you trying individual queries or multiple queries (for n_samples)?
>>
>> On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher <
>> daniel.vainsenc...@gmail.com> wrote:
>>
>>> LHSForest is not intended for dimensions at which exact methods work
>>> well, nor for tiny datasets. Try d>500, n_points>100000, I don't remember
>>> the switchover point.
>>>
>>> The documentation should make this clear, but unfortunately I don't see
>>> that it does.
>>> On Apr 15, 2015 7:08 PM, "Joel Nothman" <joel.noth...@gmail.com> wrote:
>>>
>>>> I agree this is disappointing, and we need to work on making LSHForest
>>>> faster. Portions should probably be coded in Cython, for instance, as the
>>>> current implementation is a bit circuitous in order to work in numpy. PRs
>>>> are welcome.
>>>>
>>>> LSHForest could use parallelism to be faster, but so can (and will) the
>>>> exact neighbors methods. In theory in LSHForest, each "tree" could be
>>>> stored on entirely different machines, providing memory benefits, but
>>>> scikit-learn can't really take advantage of this.
>>>>
>>>> Having said that, I would also try with higher n_features and
>>>> n_queries. We have to limit the scale of our examples in order to limit the
>>>> overall document compilation time.
>>>>
>>>> On 16 April 2015 at 01:12, Miroslav Batchkarov <mbatchka...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> was really impressed by the speedups provided by LSHForest compared to
>>>>> brute-force search. Out of curiosity, I compared LSRForest to the existing
>>>>> ball tree implementation. The approximate algorithm is consistently slower
>>>>> (see below). Is this normal and should it be mentioned in the
>>>>> documentation? Does approximate search offer any benefits in terms of
>>>>> memory usage?
>>>>>
>>>>>
>>>>> I ran the same example
>>>>> <http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py>
>>>>>  with
>>>>> a algorithm=ball_tree. I also had to set metric=‘euclidean’ (this may
>>>>> affect results). The output is:
>>>>>
>>>>> Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy:
>>>>> 1.00 +/-0.00
>>>>> Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy:
>>>>> 0.94 +/-0.05
>>>>> Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy:
>>>>> 0.92 +/-0.07
>>>>> Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3,
>>>>> accuracy: 0.92 +/-0.07
>>>>> Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5,
>>>>> accuracy: 0.84 +/-0.10
>>>>> Index size: 100000, exact: 0.008s, LSHF: 0.016s, speedup: 0.5,
>>>>> accuracy: 0.80 +/-0.06
>>>>>
>>>>> With n_candidates=100, the output is
>>>>>
>>>>> Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy:
>>>>> 1.00 +/-0.00
>>>>> Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy:
>>>>> 0.94 +/-0.05
>>>>> Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy:
>>>>> 0.92 +/-0.07
>>>>> Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4,
>>>>> accuracy: 0.90 +/-0.11
>>>>> Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7,
>>>>> accuracy: 0.82 +/-0.13
>>>>> Index size: 100000, exact: 0.007s, LSHF: 0.013s, speedup: 0.6,
>>>>> accuracy: 0.78 +/-0.04
>>>>>
>>>>>
>>>>>
>>>>> ---
>>>>> Miroslav Batchkarov
>>>>> PhD Student,
>>>>> Text Analysis Group,
>>>>> Department of Informatics,
>>>>> University of Sussex
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------------
>>>>> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
>>>>> Develop your own process in accordance with the BPMN 2 standard
>>>>> Learn Process modeling best practices with Bonita BPM through live
>>>>> exercises
>>>>> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
>>>>> event?utm_
>>>>> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>>>>> _______________________________________________
>>>>> Scikit-learn-general mailing list
>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>
>>>>>
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
>>>> Develop your own process in accordance with the BPMN 2 standard
>>>> Learn Process modeling best practices with Bonita BPM through live
>>>> exercises
>>>> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
>>>> event?utm_
>>>> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>>>> _______________________________________________
>>>> Scikit-learn-general mailing list
>>>> Scikit-learn-general@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>
>>>>
>>>
>>> ------------------------------------------------------------------------------
>>> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
>>> Develop your own process in accordance with the BPMN 2 standard
>>> Learn Process modeling best practices with Bonita BPM through live
>>> exercises
>>> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
>>> event?utm_
>>> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
>>
>>
>> --
>>
>> *Maheshakya Wijewardena,Undergraduate,*
>> *Department of Computer Science and Engineering,*
>> *Faculty of Engineering.*
>> *University of Moratuwa,*
>> *Sri Lanka*
>>
>>
>> ------------------------------------------------------------------------------
>> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
>> Develop your own process in accordance with the BPMN 2 standard
>> Learn Process modeling best practices with Bonita BPM through live
>> exercises
>> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
>> event?utm_
>> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>
>
------------------------------------------------------------------------------
BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
Develop your own process in accordance with the BPMN 2 standard
Learn Process modeling best practices with Bonita BPM through live exercises
http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual- event?utm_
source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to