Oh. Silly mistake. Doesn't break with the correct patch, now at PR#4604...

On 16 April 2015 at 14:24, Joel Nothman <joel.noth...@gmail.com> wrote:

> Except apparently that commit breaks the code... Maybe I've misunderstood
> something :(
>
> On 16 April 2015 at 14:18, Joel Nothman <joel.noth...@gmail.com> wrote:
>
>> ball tree is not vectorized in the sense of SIMD, but there is
>> Python/numpy overhead in LSHForest that is not present in ball tree.
>>
>> I think one of the problems is the high n_candidates relative to the
>> n_neighbors. This really increases the search time.
>>
>> Once we're dealing with large enough index and n_candidates, most time is
>> spent in searchsorted in the "synchronous ascending phase", while any
>> overhead around it is marginal. Currently we are searching over the whole
>> array in each searchsorted, while it could be rewritten to keep better
>> track of context to cut down the overall array when searching. While
>> possible, I suspect this will look confusing in Python/Numpy, and Cython
>> will be a clearer and faster way to present this logic.
>>
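A rough sketch of the "keep track of context" idea above, under loud assumptions: one tree only, hashes stored as sorted Python integers, and the standard library's bisect standing in for numpy.searchsorted. The names prefix_range and ascend are hypothetical and are not LSHForest's actual code. The point is only that the range matching a shorter prefix always contains the range matching a longer one, so each later search can be restricted to the part of the array outside the previous range.

```python
from bisect import bisect_left, bisect_right


def prefix_range(sorted_hashes, query, depth, n_bits,
                 prev_start=None, prev_stop=None):
    """Return [start, stop) of hashes sharing the top `depth` bits with query.

    prev_start/prev_stop are the bounds found at a deeper depth. The
    shallower range contains the deeper one, so start <= prev_start and
    stop >= prev_stop, and each search can skip the already-known interior.
    """
    shift = n_bits - depth
    low_key = (query >> shift) << shift        # prefix followed by zeros
    high_key = low_key | ((1 << shift) - 1)    # prefix followed by ones
    n = len(sorted_hashes)
    left_hi = n if prev_start is None else prev_start
    right_lo = 0 if prev_stop is None else prev_stop
    start = bisect_left(sorted_hashes, low_key, 0, left_hi)
    stop = bisect_right(sorted_hashes, high_key, right_lo, n)
    return start, stop


def ascend(sorted_hashes, query, n_bits, min_candidates):
    """Shorten the prefix until at least `min_candidates` hashes match."""
    start = stop = None
    for depth in range(n_bits, -1, -1):
        start, stop = prefix_range(sorted_hashes, query, depth, n_bits,
                                   start, stop)
        if stop - start >= min_candidates:
            break
    return depth, start, stop
```

Whether this bookkeeping pays off in numpy, as opposed to Cython, is exactly the open question raised above; the sketch only shows that the narrowed searches return the same bounds as searching the whole array.
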
>> On the other hand, time spent in _compute_distances is substantial, and
>> yet most of its consumption is *outside* of pairwise_distances. This
>> commit
>> <https://github.com/scikit-learn/scikit-learn/commit/c1f335f70aa0f766a930f8ac54eeaa601245725a>
>> cuts a basic benchmark from 85 to 70 seconds. Vote here for merge
>> <https://github.com/scikit-learn/scikit-learn/pull/4603>!
>>
>> On 16 April 2015 at 12:32, Maheshakya Wijewardena <pmaheshak...@gmail.com> wrote:
>>
>>> Moreover, this drawback occurs because LSHForest does not vectorize
>>> multiple queries as in 'ball_tree' or any other method. This slows the
>>> exact neighbor distance calculation down significantly after approximation.
>>> This will not be a problem if queries are for individual points.
>>> Unfortunately, the former (multiple queries at once) is the more useful
>>> way to use LSHForest.
>>> Are you trying individual queries or multiple queries (for n_samples)?
>>>
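To make the pattern described above concrete, here is a minimal pure-Python sketch of the exact re-ranking step run once per query. It assumes the approximate phase has already produced a list of candidate indices for each query; rerank and kneighbors_per_query are hypothetical names for illustration, not scikit-learn functions.

```python
import heapq
import math


def rerank(points, query, candidate_ids, n_neighbors):
    """Compute exact distances for the candidates only; return the closest ids."""
    scored = ((math.dist(points[i], query), i) for i in candidate_ids)
    return [i for _, i in heapq.nsmallest(n_neighbors, scored)]


def kneighbors_per_query(points, queries, candidates_per_query, n_neighbors):
    """Re-rank each query separately: one Python-level pass per query."""
    return [rerank(points, q, c, n_neighbors)
            for q, c in zip(queries, candidates_per_query)]
```

Because every query has its own candidate set, the distance step is a loop over queries rather than one batched call, which is where the per-query overhead accumulates when many queries are issued together.
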
>>>> On Thu, Apr 16, 2015 at 6:14 AM, Daniel Vainsencher <daniel.vainsenc...@gmail.com> wrote:
>>>
>>>> LSHForest is not intended for dimensions at which exact methods work
>>>> well, nor for tiny datasets. Try d>500, n_points>100000, I don't remember
>>>> the switchover point.
>>>>
>>>> The documentation should make this clear, but unfortunately I don't see
>>>> that it does.
>>>>
>>>> On Apr 15, 2015 7:08 PM, "Joel Nothman" <joel.noth...@gmail.com> wrote:
>>>>
>>>>> I agree this is disappointing, and we need to work on making LSHForest
>>>>> faster. Portions should probably be coded in Cython, for instance, as the
>>>>> current implementation is a bit circuitous in order to work in numpy. PRs
>>>>> are welcome.
>>>>>
>>>>> LSHForest could use parallelism to be faster, but so can (and will)
>>>>> the exact neighbors methods. In theory in LSHForest, each "tree" could be
>>>>> stored on entirely different machines, providing memory benefits, but
>>>>> scikit-learn can't really take advantage of this.
>>>>>
>>>>> Having said that, I would also try with higher n_features and
>>>>> n_queries. We have to limit the scale of our examples in order to limit
>>>>> the overall document compilation time.
>>>>>
>>>>> On 16 April 2015 at 01:12, Miroslav Batchkarov <mbatchka...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I was really impressed by the speedups provided by LSHForest compared
>>>>>> to brute-force search. Out of curiosity, I compared LSHForest to the
>>>>>> existing ball tree implementation. The approximate algorithm is
>>>>>> consistently slower (see below). Is this normal, and should it be
>>>>>> mentioned in the documentation? Does approximate search offer any
>>>>>> benefits in terms of memory usage?
>>>>>>
>>>>>>
>>>>>> I ran the same example
>>>>>> <http://scikit-learn.org/stable/auto_examples/neighbors/plot_approximate_nearest_neighbors_scalability.html#example-neighbors-plot-approximate-nearest-neighbors-scalability-py>
>>>>>> with algorithm='ball_tree'. I also had to set metric='euclidean' (this
>>>>>> may affect results). The output is:
>>>>>>
>>>>>> Index size: 1000, exact: 0.000s, LSHF: 0.007s, speedup: 0.0, accuracy: 1.00 +/-0.00
>>>>>> Index size: 2511, exact: 0.001s, LSHF: 0.007s, speedup: 0.1, accuracy: 0.94 +/-0.05
>>>>>> Index size: 6309, exact: 0.001s, LSHF: 0.008s, speedup: 0.2, accuracy: 0.92 +/-0.07
>>>>>> Index size: 15848, exact: 0.002s, LSHF: 0.008s, speedup: 0.3, accuracy: 0.92 +/-0.07
>>>>>> Index size: 39810, exact: 0.005s, LSHF: 0.010s, speedup: 0.5, accuracy: 0.84 +/-0.10
>>>>>> Index size: 100000, exact: 0.008s, LSHF: 0.016s, speedup: 0.5, accuracy: 0.80 +/-0.06
>>>>>>
>>>>>> With n_candidates=100, the output is
>>>>>>
>>>>>> Index size: 1000, exact: 0.000s, LSHF: 0.006s, speedup: 0.0, accuracy: 1.00 +/-0.00
>>>>>> Index size: 2511, exact: 0.001s, LSHF: 0.006s, speedup: 0.1, accuracy: 0.94 +/-0.05
>>>>>> Index size: 6309, exact: 0.001s, LSHF: 0.005s, speedup: 0.2, accuracy: 0.92 +/-0.07
>>>>>> Index size: 15848, exact: 0.002s, LSHF: 0.007s, speedup: 0.4, accuracy: 0.90 +/-0.11
>>>>>> Index size: 39810, exact: 0.005s, LSHF: 0.008s, speedup: 0.7, accuracy: 0.82 +/-0.13
>>>>>> Index size: 100000, exact: 0.007s, LSHF: 0.013s, speedup: 0.6, accuracy: 0.78 +/-0.04
>>>>>>
>>>>>>
>>>>>>
>>>>>> ---
>>>>>> Miroslav Batchkarov
>>>>>> PhD Student,
>>>>>> Text Analysis Group,
>>>>>> Department of Informatics,
>>>>>> University of Sussex
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ------------------------------------------------------------------------------
>>>>>> BPM Camp - Free Virtual Workshop May 6th at 10am PDT/1PM EDT
>>>>>> Develop your own process in accordance with the BPMN 2 standard
>>>>>> Learn Process modeling best practices with Bonita BPM through live
>>>>>> exercises
>>>>>> http://www.bonitasoft.com/be-part-of-it/events/bpm-camp-virtual-
>>>>>> event?utm_
>>>>>> source=Sourceforge_BPM_Camp_5_6_15&utm_medium=email&utm_campaign=VA_SF
>>>>>> _______________________________________________
>>>>>> Scikit-learn-general mailing list
>>>>>> Scikit-learn-general@lists.sourceforge.net
>>>>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> *Maheshakya Wijewardena,Undergraduate,*
>>> *Department of Computer Science and Engineering,*
>>> *Faculty of Engineering.*
>>> *University of Moratuwa,*
>>> *Sri Lanka*
>>>
>>>
>>>
>>>
>>
>
