Note that this has long been documented under "Memory consumption for large
sample sizes" at
http://scikit-learn.org/stable/modules/clustering.html#dbscan

On 14 May 2018 at 12:59, Joel Nothman <joel.noth...@gmail.com> wrote:

> This is quite a common issue with our implementation of DBSCAN, and
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius neighbors
> for all points. If using a distance metric that cannot be indexed with a
> KD-tree or Ball Tree, this results in n^2 floats being stored in memory
> even before the radius neighbors are computed.
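
To put a rough number on the n^2 cost, here is a back-of-the-envelope
sketch. It assumes the distances are held as 8-byte float64 values in a
dense n x n array; the helper name dense_distance_matrix_bytes is made up
for illustration and is not part of scikit-learn.

```python
def dense_distance_matrix_bytes(n_samples, itemsize=8):
    """Bytes needed for a dense n x n distance matrix.

    itemsize=8 assumes float64 entries (an assumption, not something
    the original message specifies).
    """
    return n_samples ** 2 * itemsize


for n in (10_000, 100_000, 1_000_000):
    gib = dense_distance_matrix_bytes(n) / 2 ** 30
    print(f"{n:>9,d} samples: {gib:,.1f} GiB")
```

Under that assumption, 100,000 samples already need about 74.5 GiB before
any neighborhoods are extracted.
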
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in
> chunks, so as to avoid all pairs being calculated and stored at once. This
> produces a sparse graph representation, which can be passed into DBSCAN
> with metric='precomputed'. (I've just seen that Sebastian suggested the
> same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical points would
> be merged but would have a sample_weight of 2).
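
Strategy 1 could look roughly like the sketch below. It is a minimal
example, not a tuned implementation: chunked_radius_graph, the chunk size
and the toy data are all assumptions made for illustration. Only
NearestNeighbors.radius_neighbors_graph, scipy.sparse.vstack and DBSCAN
with metric='precomputed' are existing APIs.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def chunked_radius_graph(X, eps, chunk_size=1000, metric="euclidean"):
    """Build the sparse eps-neighborhood graph of X one row chunk at a time.

    Hypothetical helper: only chunk_size query rows are compared against
    the data at once, so no dense n x n matrix is ever materialised.
    """
    nn = NearestNeighbors(radius=eps, metric=metric).fit(X)
    blocks = []
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # mode="distance" stores distances, and only for pairs within eps.
        blocks.append(nn.radius_neighbors_graph(chunk, mode="distance"))
    return sparse.vstack(blocks, format="csr")


# Toy data: three well-separated blobs (illustrative only).
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + 0.3 * rng.randn(300, 2) for c in centers])

eps = 0.5
graph = chunked_radius_graph(X, eps, chunk_size=200)
labels = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit_predict(graph)
```

The eps used to build the graph should be at least as large as the eps
given to DBSCAN, otherwise some neighbors are missing from the graph.
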
>
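
For strategy 2, a sketch of the exact-duplicate case is below. It assumes
the duplicates are exactly identical rows; merging near-duplicates would
need a rounding or binning step that is not shown. The helper name
dbscan_on_unique_rows and the toy data are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def dbscan_on_unique_rows(X, eps, min_samples):
    """Cluster the distinct rows of X, weighting each by its multiplicity.

    Hypothetical helper: returns one label per original row of X.
    """
    X_unique, inverse, counts = np.unique(
        X, axis=0, return_inverse=True, return_counts=True
    )
    model = DBSCAN(eps=eps, min_samples=min_samples)
    # A row that occurred k times counts as k samples toward min_samples.
    unique_labels = model.fit_predict(X_unique, sample_weight=counts)
    # Map the labels of the distinct rows back onto the original rows.
    return unique_labels[inverse.ravel()]


# Toy data: blobs rounded to one decimal so that many rows coincide.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_dup = np.round(
    np.vstack([c + 0.3 * rng.randn(300, 2) for c in centers]), 1
)

labels_weighted = dbscan_on_unique_rows(X_dup, eps=0.45, min_samples=5)
```

The memory saving depends entirely on how many rows collapse; if few rows
are duplicated, this does not help much.
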
> There is also a proposal to offer an alternative memory-efficient mode at
> https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
> welcome.
>
> Cheers,
>
> Joel
>
>
>
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn