Re: [D] ST_DBSCAN vs sklearn.DBSCAN: understanding tradeoffs [sedona]

via GitHub Tue, 10 Jun 2025 11:33:30 -0700


GitHub user james-willis added a comment to the discussion: ST_DBSCAN vs 
sklearn.DBSCAN: understanding tradeoffs

The execution time of DBSCAN on different implementations/approaches is
multifactorial. Specifically, it will be influenced by:
* The dataset itself
* dataset size
* spatial density
* spatial patterns in the data (e.g. some severe spatial concentration). In
the graphframes connected components algorithm this can lead to a long right
tail for execution time vs number of running tasks. In other words you whole
cluster is waiting for only a few cores to do work.
* The parameters for DBSCAN:
* larger values of epsilon result in higher selectivity of the distance join
and a larger graph for the connected components calculation
* smaller values of min pts means a larger graph for the connected components
calculation
* cluster characteristics - spark is a multihost solution so the composition of
the cluster comes in to play
* network speed - the connected components algorithm from graphframes does a
lot of shuffles and so network throughput is critical to performance.
* host size - for a given cluster size (ie cores and ram) larger hosts will
have more
[local](https://medium.com/data-engineer/understanding-spark-locality-levels-d4ab14d15be1)
tasks which will reduce the amount of network io and speed up shuffle stages.

The Sedona implementation of DBSCAN is designed to be performant and robust on
large datasets with many core points. It will process datasets that are not
feasible on sklearn. It will often not outperform sklearn when the data is
smaller, as you've found.

sklearn is a single host solution and so saves itself a lot of the overhead
that spark has as a distributed compute platform. Among other things that
impair its per-core throughput, Spark incurs a lot of overhead when there is
data shuffle. But as you've noticed sklearn is memory hungry. As is often the
case memory consumption and CPU throughput are directly in tension with each
other.

Out of the box, graphframes uses the algorithm described
[here](https://dl.acm.org/doi/pdf/10.1145/2670979.2670997) to calculate the
connected components. In graphx, they use a [message passing
approach](https://github.com/apache/spark/blob/v4.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala).
In most workloads this connected components calculation will dominate the
runtime.

When graphframes 0.9.0 releases, you will be able to change which algorithm is
being used between these two. See [this
PR](https://github.com/graphframes/graphframes/pull/563) I made. Using the
graphx algorithm will be faster but more memory hungry and less robust. You can
test if this will be a desirable middle ground for your use case.

In this write up I mostly focused on the connected components element of
DBSCAN. There are characteristics of the data that can make the distance join
element slower or faster but since you mention sklearn I assume you are working
only with point data and thus are probably getting good performance on that
front.

GitHub link:
https://github.com/apache/sedona/discussions/1965#discussioncomment-13424667

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Re: [D] ST_DBSCAN vs sklearn.DBSCAN: understanding tradeoffs [sedona]

Reply via email to