You might also consider looking at hdbscan:

https://github.com/scikit-learn-contrib/hdbscan


On 05/13/2018 11:07 PM, Joel Nothman wrote:
Note that this has long been documented under "Memory consumption for large sample sizes" at http://scikit-learn.org/stable/modules/clustering.html#dbscan

On 14 May 2018 at 12:59, Joel Nothman <joel.noth...@gmail.com <mailto:joel.noth...@gmail.com>> wrote:

    This is quite a common issue with our implementation of DBSCAN,
    and improvements to documentation would be very, very welcome.

    The high memory cost comes from constructing the pairwise radius
    neighbors for all points. If using a distance metric that cannot
    be indexed with a KD-tree or Ball Tree, this results in n^2 floats
    being stored in memory even before the radius neighbors are computed.

    You have the following strategies available to you currently:

    1. Calculate the radius neighborhoods using radius_neighbors_graph
    in chunks, so as to avoid all pairs being calculated and stored at
    once. This produces a sparse graph representation, which can be
    passed into dbscan with metric='precomputed'. (I've just seen
    Sebastian suggested the same.)
    2. Reduce the number of samples in your dataset and represent
    (near-)duplicate points with sample_weight (i.e. two identical
    points would be merged but would have a sample_weight of 2).

    There is also a proposal to offer an alternative memory-efficient
    mode at https://github.com/scikit-learn/scikit-learn/pull/6813
    <https://github.com/scikit-learn/scikit-learn/pull/6813>. Feedback
    is welcome.

    Cheers,

    Joel





_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn

Reply via email to