Note that this has long been documented under "Memory consumption for large sample sizes" at http://scikit-learn.org/stable/modules/clustering.html#dbscan
On 14 May 2018 at 12:59, Joel Nothman <joel.noth...@gmail.com> wrote:

> This is quite a common issue with our implementation of DBSCAN, and
> improvements to documentation would be very, very welcome.
>
> The high memory cost comes from constructing the pairwise radius
> neighbors for all points. If using a distance metric that cannot be
> indexed with a KD-tree or Ball Tree, this results in n^2 floats being
> stored in memory even before the radius neighbors are computed.
>
> You have the following strategies available to you currently:
>
> 1. Calculate the radius neighborhoods using radius_neighbors_graph in
> chunks, so as to avoid all pairs being calculated and stored at once.
> This produces a sparse graph representation, which can be passed into
> dbscan with metric='precomputed'. (I've just seen Sebastian suggested
> the same.)
> 2. Reduce the number of samples in your dataset and represent
> (near-)duplicate points with sample_weight (i.e. two identical points
> would be merged but would have a sample_weight of 2).
>
> There is also a proposal to offer an alternative memory-efficient mode
> at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
> welcome.
>
> Cheers,
>
> Joel
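For strategy 1, a rough, untested sketch of what the chunked computation
might look like; X, eps, min_samples and chunk_size below are placeholders
to substitute with your own data and parameters:

    import numpy as np
    import scipy.sparse as sp
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import NearestNeighbors

    X = np.random.rand(10000, 10)  # placeholder data
    eps = 0.5                      # placeholder radius

    nn = NearestNeighbors(radius=eps).fit(X)

    # Build the sparse radius-neighbors graph one chunk of query points
    # at a time, so that only chunk_size * n candidate distances are in
    # memory at once, rather than n * n.
    chunk_size = 1000  # tune to your memory budget
    chunks = []
    for start in range(0, X.shape[0], chunk_size):
        chunks.append(
            nn.radius_neighbors_graph(X[start:start + chunk_size],
                                      mode='distance'))
    graph = sp.vstack(chunks)  # n x n sparse matrix of distances <= eps

    # With metric='precomputed', entries absent from the sparse graph
    # are treated as being farther apart than eps.
    labels = DBSCAN(eps=eps, min_samples=5,
                    metric='precomputed').fit_predict(graph)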
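For strategy 2, a minimal sketch; the rounding step is my own addition to
turn near-duplicates into exact duplicates, so adjust or drop it to suit
your data:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(100000, 2)  # placeholder data

    # Round so near-duplicates become exact duplicates, then collapse
    # them; the duplicate counts become per-point sample weights.
    X_rounded = X.round(2)
    X_unique, inverse, counts = np.unique(
        X_rounded, axis=0, return_inverse=True, return_counts=True)

    db = DBSCAN(eps=0.05, min_samples=10)
    labels_unique = db.fit_predict(X_unique, sample_weight=counts)

    # Map the cluster labels back onto the original points.
    labels = labels_unique[inverse]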