Re: [Scikit-learn-general] Speeding up K-means clustering model with fast approximate neighbor search methods

Robert Layton Wed, 16 Apr 2014 17:00:07 -0700

Good to hear. You are right -- you should follow the API closely, deviating
only if necessary -- the consistent API is a core part of the popularity of
scikit-learn.
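
A minimal sketch of why the shared interface matters, assuming an index object with `fit` and `radius_neighbors` methods in the style of `sklearn.neighbors.NearestNeighbors`. The names `BruteForceNeighbors` and `core_sample_mask` are hypothetical and are not scikit-learn code; the toy class is an exact search standing in for whatever approximate index ends up being written:

```python
import math


class BruteForceNeighbors:
    """Toy exact search exposing fit / radius_neighbors (hypothetical class)."""

    def fit(self, X):
        # Store the points; a real index would build a tree or hash tables here.
        self._X = [tuple(p) for p in X]
        return self

    def radius_neighbors(self, X, radius):
        # For each query point, return indices of fitted points within radius.
        return [
            [i for i, p in enumerate(self._X) if math.dist(p, q) <= radius]
            for q in X
        ]


def core_sample_mask(index, X, eps, min_samples):
    """DBSCAN-style core-point test that relies only on the shared interface."""
    neighborhoods = index.fit(X).radius_neighbors(X, eps)
    # A point counts itself as a neighbor, so isolated points have size 1.
    return [len(n) >= min_samples for n in neighborhoods]


X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
mask = core_sample_mask(BruteForceNeighbors(), X, eps=0.5, min_samples=3)
print(mask)
```

Because `core_sample_mask` only calls `fit` and `radius_neighbors`, any approximate index that offers the same two methods could be passed in without touching the clustering code.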



On 17 April 2014 06:51, Maheshakya Wijewardena <[email protected]> wrote:

> Hi Robert,
> As I see in the current implementation of DBSCAN, if the metric is not
> 'precomputed', then a nearest neighbor model is trained with the existing
> implementation of the neighbors module. What I meant is that since this ANN
> search will be implemented similarly to those exact neighbor search methods
> (because it must adhere to the API of the neighbors module), I think it will
> not be much of a problem to apply ANN in DBSCAN.
>
>
> On Wed, Apr 16, 2014 at 4:03 AM, Robert Layton <[email protected]> wrote:
>
>> I wrote the original DBSCAN at a time before I knew anything about
>> sparse matrices (I now know a little), so there may be artefacts in there
>> that aren't scalable -- e.g. a separate iteration over the array for
>> something, or an operation that copies the matrix.
>> It has since been updated though, and I haven't had a chance to check out
>> the new code.
>>
>> The reason I say this is that if you improve ANN, you might get a cheap
>> improvement in the other algorithms, but it would be worth ensuring that
>> the rest of the code can "handle" the increased scale.
>>
>>
>> On 16 April 2014 00:39, Maheshakya Wijewardena <[email protected]> wrote:
>>
>>> Both mean-shift and dbscan directly use
>>> `sklearn.neighbors.NearestNeighbors` to train models and get nearest
>>> neighbors, unlike k-means. So I suppose that, as the ANN will also behave
>>> similarly to NearestNeighbors, it can be used in its place without having
>>> to change the usage or semantics of those clustering methods.
>>>
>>>
>>> On Fri, Apr 11, 2014 at 3:24 PM, Lars Buitinck <[email protected]> wrote:
>>>
>>>> 2014-04-11 10:55 GMT+02:00 Daniel Vainsencher <[email protected]>:
>>>> > In any case, the approximate nature of the search raises the possibility
>>>> > of going a step further: index the data points, and adjust each cluster
>>>> > to its ANNs (in this case, for a very long list of candidates). This is
>>>> > no longer k-means (closer to a mean-shift algorithm) and may or may not
>>>> > work, but could be very fast.
>>>>
>>>> Speaking of, mean-shift is already implemented using NN. Judging from
>>>> GitHub issues, ML questions and the complexity notes in the mean-shift
>>>> docstrings, I also believe that optimizing it would be more valuable
>>>> than optimizing k-means, since we already have minibatch k-means.
>>>>
>>>> (Also k-means can still benefit from the Elkan optimization, which
>>>> doesn't change its semantics.)
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Undergraduate,
>>> Department of Computer Science and Engineering,
>>> Faculty of Engineering.
>>> University of Moratuwa,
>>> Sri Lanka
>>>
>>
>>
>>
>
>
>

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
