paleolimbot commented on PR #24: URL: https://github.com/apache/sedona-db/pull/24#issuecomment-3258486907
I haven't run into DuckDB running stuff on a single thread before, although I have dealt with wildly different timings with the CLI vs Python. It's worth checking something other than ST_Contains, too, and also ensuring we're benchmarking against the forthcoming DuckDB release (`pip install duckdb --upgrade --pre`), since these will all change in a week.

The benchmark we probably want to focus on is the "user-facing default", since that is how people will perceive the speed of our engine. We might want to consider tuning the PostGIS instance to be smarter, since the defaults are very bad for geometry (the last time I did this was https://dewey.dunnington.ca/post/2024/wrangling-and-joining-130m-points-with-duckdb--the-open-source-spatial-stack/#postgis ).

Some other issues with the current benchmarks:

- They benchmark on a "table" and not a Parquet scan. We have the edge on a Parquet scan; PostGIS and DuckDB have an edge with their native table formats. The Parquet scan is probably more realistic.
- We don't benchmark predicates with realistic input (they are ST_Contains with two identical inputs). The array/scalar case is probably best to focus on (it is more likely to affect the perceived speed of our engine's first release).

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@sedona.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org