GitHub user ishraqrayeedbp edited a discussion: ST_DBSCAN vs sklearn.DBSCAN: 
understanding tradeoffs

Hi Sedona team,

Most of my geospatial ETL workflows run on AWS Glue, and I’ve successfully 
integrated Apache Sedona in that environment.

For one use case, I needed to run DBSCAN clustering on partitioned data, where 
each partition corresponds to, say, a city or county. The size of each batch 
varies significantly: sometimes as few as 100K rows, other times as many as 
20M rows.

I tested both:

- ST_DBSCAN from Sedona (Spark-based)
- sklearn.DBSCAN (Pandas in-memory)

What I observed:

- sklearn.DBSCAN was faster, but consumed significantly more memory, since it 
pulls the data into memory using Pandas.
- ST_DBSCAN was slower, but more memory-efficient, presumably because it 
leverages Spark's distributed processing.

Is this behavior expected?

And is there any guidance on when to prefer ST_DBSCAN vs a manual fallback to 
sklearn.DBSCAN, based on batch size or resource availability? I'd prefer not to 
maintain two separate pipelines for small vs large datasets, but optimizing for 
both memory and performance has been a challenge.
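
To make the question concrete, here is a minimal sketch of a size-based 
fallback inside a single pipeline. Everything in it is hypothetical: the names 
(Batch, choose_backend, estimate_coordinate_bytes) and the 2M-row cutoff are 
placeholders, not part of Sedona or scikit-learn, and the memory estimate 
covers only the raw coordinate array, so it is a lower bound rather than a 
real prediction of what sklearn.DBSCAN would use.

```python
from dataclasses import dataclass

# Hypothetical cutoff for illustration only; a real value would have to come
# from profiling on the actual Glue worker type.
MAX_ROWS_IN_MEMORY = 2_000_000
BYTES_PER_POINT = 16  # assumes two float64 coordinates (x, y) per row


@dataclass
class Batch:
    name: str        # partition key, e.g. a city or county
    row_count: int   # number of rows in this partition


def estimate_coordinate_bytes(row_count: int) -> int:
    """Lower bound: memory for the raw coordinate array alone."""
    return row_count * BYTES_PER_POINT


def choose_backend(batch: Batch, available_bytes: int) -> str:
    """Return "sklearn" when the batch is small enough, else "sedona"."""
    fits_rows = batch.row_count <= MAX_ROWS_IN_MEMORY
    fits_memory = estimate_coordinate_bytes(batch.row_count) <= available_bytes
    return "sklearn" if fits_rows and fits_memory else "sedona"


if __name__ == "__main__":
    four_gib = 4 * 1024**3
    for batch in (Batch("small_county", 100_000), Batch("large_city", 20_000_000)):
        print(batch.name, choose_backend(batch, four_gib))
```

With these placeholder values, the 100K-row batch would be routed to 
sklearn.DBSCAN and the 20M-row batch to ST_DBSCAN. The clustering calls 
themselves are left out on purpose; only the routing decision is shown.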

Thanks in advance for any insights!

GitHub link: https://github.com/apache/sedona/discussions/1965
