GitHub user ishraqrayeedbp edited a discussion: ST_DBSCAN vs sklearn.DBSCAN: understanding tradeoffs
Hi Sedona team,

Most of my geospatial ETL workflows run on AWS Glue, and I’ve successfully integrated Apache Sedona in that environment. For one use case, I needed to perform DBSCAN clustering on partitioned data (each partition corresponds to, say, a city or county). The size of each batch varies significantly: sometimes as low as 100K rows, other times as high as 20M rows.

I tested both:
- ST_DBSCAN from Sedona (Spark-based)
- sklearn.DBSCAN (Pandas in-memory)

What I observed:
- sklearn.DBSCAN was faster, but consumed significantly more memory, since it pulls the data into memory using Pandas.
- ST_DBSCAN was slower, but more memory-efficient, presumably because it leverages Spark's distributed processing.

Is this behavior expected? And is there any guidance on when to prefer ST_DBSCAN vs a manual fallback to sklearn.DBSCAN, based on batch size or resource availability?

I'd prefer not to maintain two separate pipelines for small vs large datasets, but optimizing for both memory and performance has been a challenge.

Thanks in advance for any insights!

GitHub link: https://github.com/apache/sedona/discussions/1965
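
To make the "manual fallback" idea concrete, here is a minimal sketch of a size-based switch that a single pipeline could use to pick an engine per batch. Everything in it is an assumption for illustration: the names `BatchProfile` and `choose_engine`, the per-row size estimate, and the `overhead_factor` default are hypothetical and not measured or recommended values. It only returns a label; calling ST_DBSCAN or sklearn.DBSCAN is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchProfile:
    """Hypothetical description of one partition (e.g. a city or county)."""
    row_count: int
    est_bytes_per_row: int  # assumed average in-memory size of one row


def choose_engine(batch: BatchProfile,
                  memory_budget_bytes: int,
                  overhead_factor: float = 3.0) -> str:
    """Return "sklearn" if the batch is expected to fit in memory, else "st_dbscan".

    overhead_factor is an assumed multiplier covering copies and index
    structures created during in-memory clustering; tune it from real runs.
    """
    if batch.row_count < 0 or batch.est_bytes_per_row <= 0:
        raise ValueError("row_count must be >= 0 and est_bytes_per_row > 0")
    if overhead_factor <= 0:
        raise ValueError("overhead_factor must be positive")

    # Rough estimate of peak memory for the in-memory path.
    estimated_bytes = batch.row_count * batch.est_bytes_per_row * overhead_factor
    return "sklearn" if estimated_bytes <= memory_budget_bytes else "st_dbscan"


if __name__ == "__main__":
    budget = 2 * 1024**3  # assumed 2 GiB available to the in-memory path
    for rows in (100_000, 20_000_000):
        profile = BatchProfile(row_count=rows, est_bytes_per_row=64)
        print(rows, choose_engine(profile, budget))
```

The switch keeps one entry point for both batch sizes, but the threshold would need to be calibrated against actual Glue worker memory rather than taken from this sketch.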
