Should we have a "low memory"/batched version of the k_neighbors_graph and
epsilon_neighbors_graph functions? I assume
those instantiate the dense matrix right now.
On 05/13/2018 10:59 PM, Joel Nothman wrote:
This is quite a common issue with our implementation of DBSCAN, and
improvements to documentation would be very, very welcome.
The high memory cost comes from constructing the pairwise radius
neighbors for all points. If using a distance metric that cannot be
indexed with a KD-tree or Ball Tree, this results in n^2 floats being
stored in memory even before the radius neighbors are computed.
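(For a sense of scale: with 100,000 samples and 8-byte floats, that is
roughly 100,000^2 * 8 bytes, i.e. about 80 GB.)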
You have the following strategies available to you currently:
1. Calculate the radius neighborhoods using radius_neighbors_graph in
chunks, so as to avoid all pairs being calculated and stored at once.
This produces a sparse graph representation, which can be passed into
dbscan with metric='precomputed'. (I've just seen Sebastian suggested
the same; a rough sketch is below, after this list.)
2. Reduce the number of samples in your dataset and represent
(near-)duplicate points with sample_weight (i.e. two identical points
would be merged but would have a sample_weight of 2); a sketch of this
is also below.
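
In case it helps, here is a rough, untested sketch of the chunked
approach. The eps, chunk_size and toy data are placeholders you would
replace with your own:

import numpy as np
from scipy import sparse
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(10000, 5)      # your data
eps = 0.1                   # your eps
chunk_size = 1000           # tune to your memory budget

nn = NearestNeighbors(radius=eps).fit(X)

# Query the radius neighbors one chunk of points at a time; each call
# returns a sparse (chunk, n_samples) matrix of distances within eps.
chunks = [nn.radius_neighbors_graph(X[i:i + chunk_size], mode='distance')
          for i in range(0, X.shape[0], chunk_size)]
D = sparse.vstack(chunks)   # sparse (n_samples, n_samples) distance graph

labels = DBSCAN(eps=eps, metric='precomputed').fit_predict(D)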
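
And a similarly rough sketch of the duplicate-merging idea, for exact
duplicates only (near-duplicates would need a rounding/binning step of
your own first):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X = np.repeat(rng.rand(300, 5), 4, axis=0)   # toy data: each row repeated

# Collapse identical rows and remember how many originals each represents.
X_unique, inverse, counts = np.unique(X, axis=0, return_inverse=True,
                                      return_counts=True)

labels_unique = DBSCAN(eps=0.3, min_samples=10).fit_predict(
    X_unique, sample_weight=counts)
labels = labels_unique[inverse]   # label for every original sample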
There is also a proposal to offer an alternative memory-efficient mode
at https://github.com/scikit-learn/scikit-learn/pull/6813. Feedback is
welcome.
Cheers,
Joel