GitHub user james-willis added a comment to the discussion: ST_DBSCAN vs 
sklearn.DBSCAN: understanding tradeoffs

The execution time of DBSCAN on different implementations/approaches is 
multifactorial. Specifically, it will be influenced by:
* The dataset itself
  * dataset size
  * spatial density
  * spatial patterns in the data (e.g. some severe spatial concentration). In 
the graphframes connected components algorithm this can lead to a long right 
tail for execution time vs number of running tasks. In other words you whole 
cluster is waiting for only a few cores to do work.
* The parameters for DBSCAN:
  * larger values of epsilon result in higher selectivity of the distance join 
and a larger graph for the connected components calculation
  * smaller values of min pts means a larger graph for the connected components 
calculation
* cluster characteristics - spark is a multihost solution so the composition of 
the cluster comes in to play
  * network speed - the connected components algorithm from graphframes does a 
lot of shuffles and so network throughput is critical to performance. 
  * host size - for a given cluster size (ie cores and ram) larger hosts will 
have more 
[local](https://medium.com/data-engineer/understanding-spark-locality-levels-d4ab14d15be1)
 tasks which will reduce the amount of network io and speed up shuffle stages.

The Sedona implementation of DBSCAN is designed to be performant and robust on 
large datasets with many core points. It will process datasets that are not 
feasible on sklearn. It will often not outperform sklearn when the data is 
smaller, as you've found. 

sklearn is a single host solution and so saves itself a lot of the overhead 
that spark has as a distributed compute platform. Among other things that 
impair its per-core throughput, Spark incurs a lot of overhead when there is 
data shuffle. But as you've noticed sklearn is memory hungry. As is often the 
case memory consumption and CPU throughput are directly in tension with each 
other.


Out of the box, graphframes uses the algorithm described 
[here](https://dl.acm.org/doi/pdf/10.1145/2670979.2670997) to calculate the 
connected components. In graphx, they use a [message passing 
approach](https://github.com/apache/spark/blob/v4.0.0/graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala).
 In most workloads this connected components calculation will dominate the 
runtime.

When graphframes 0.9.0 releases, you will be able to change which algorithm is 
being used between these two. See [this 
PR](https://github.com/graphframes/graphframes/pull/563) I made. Using the 
graphx algorithm will be faster but more memory hungry and less robust. You can 
test if this will be a desirable middle ground for your use case.


In this write up I mostly focused on the connected components element of 
DBSCAN. There are characteristics of the data that can make the distance join 
element slower or faster but since you mention sklearn I assume you are working 
only with point data and thus are probably getting good performance on that 
front.

GitHub link: 
https://github.com/apache/sedona/discussions/1965#discussioncomment-13424667

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to