Should we have a "low memory"/batched version of the k_neighbors_graph and
epsilon_neighbors_graph functions? I assume
those instantiate the dense matrix right now.
On 05/13/2018 10:59 PM, Joel Nothman wrote:
This is quite a common issue with our implementation of DBSCAN, and
improvements to documentation would be very, very welcome.
The high memory cost comes from constructing the pairwise radius
neighbors for all points. If using a distance metric that cannot be
indexed with a KD-tree or Ball Tree, this results in n^2 floats being
stored in memory even before the radius neighbors are computed.
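(For a sense of scale: with 100,000 samples and 8-byte floats, that is
roughly 100,000^2 * 8 bytes, i.e. about 80 GB.)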
You have the following strategies available to you currently:
1. Calculate the radius neighborhoods using radius_neighbors_graph in
chunks, so as to avoid all pairs being calculated and stored at once.
This produces a sparse graph representation, which can be passed into
dbscan with metric='precomputed'. (I've just seen Sebastian suggested
the same; a rough sketch is below, after this list.)
2. Reduce the number of samples in your dataset and represent
(near-)duplicate points with sample_weight (i.e. two identical points
would be merged but would have a sample_weight of 2); a sketch of this
is also below.
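
In case it helps, here is a rough, untested sketch of the chunked
approach. The eps, chunk_size and toy data are placeholders you would
replace with your own:

import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(10000, 5)      # your data
eps = 0.1                   # your eps
chunk_size = 1000           # tune to your memory budget

nn = NearestNeighbors(radius=eps).fit(X)

# Query the radius neighbors one chunk of points at a time; each call
# returns a sparse (chunk, n_samples) matrix of distances within eps.
chunks = [nn.radius_neighbors_graph(X[i:i + chunk_size], mode='distance')
          for i in range(0, X.shape[0], chunk_size)]
D = sparse.vstack(chunks)   # sparse (n_samples, n_samples) distance graph

labels = DBSCAN(eps=eps, metric='precomputed').fit_predict(D)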
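
And a similarly rough sketch of the duplicate-merging idea, for exact
duplicates only (near-duplicates would need a rounding/binning step of
your own first):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = np.repeat(rng.rand(300, 5), 4, axis=0)   # toy data: each row repeated

# Collapse identical rows and remember how many originals each represents.
X_unique, inverse, counts = np.unique(X, axis=0, return_inverse=True,
                                      return_counts=True)

labels_unique = DBSCAN(eps=0.3, min_samples=10).fit_predict(
    X_unique, sample_weight=counts)
labels = labels_unique[inverse]   # label for every original sample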
There is also a proposal to offer an alternative memory-efficient mode
at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
welcome.
Cheers,
Joel