Re: [PR] Run queries in python benchmarks using only one thread [sedona-db]

via GitHub Fri, 05 Sep 2025 18:21:06 -0700


paleolimbot commented on PR #24:
URL: https://github.com/apache/sedona-db/pull/24#issuecomment-3258486907

I haven't run into DuckDB running queries on a single thread before, although
I have dealt with wildly different timings between the CLI and Python. It's
worth checking something other than ST_Contains, too, and also ensuring we're
benchmarking against the forthcoming DuckDB release
(`pip install duckdb --upgrade --pre`), since these numbers will all change in
a week.

Probably the benchmark we want to focus on is the "user-facing default",
since that is how people will perceive the speed of our engine? We might want
to consider tuning the PostGIS instance to be smarter since the defaults are
very bad for geometry (the last time I did this was
https://dewey.dunnington.ca/post/2024/wrangling-and-joining-130m-points-with-duckdb--the-open-source-spatial-stack/#postgis
).

Some other issues with the current benchmarks:

- They benchmark on a "table" and not a Parquet scan. We have the edge on a
Parquet scan, whereas PostGIS and DuckDB have the edge with their native table
formats. The Parquet scan is probably more realistic.
- We don't benchmark predicates with realistic input (they are ST_Contains
with two identical inputs). The array/scalar case is probably the best one to
focus on (it is more likely to affect the perceived speed of our engine's first
release).
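
To make the two points above concrete, here is a rough sketch of what an
array/scalar benchmark over a table versus a Parquet scan could look like. The
helper names (`build_query`, `time_query`), the `points` table, the
`points.parquet` file, and the `geometry` column are all hypothetical, and
`read_parquet(...)` is DuckDB's spelling; other engines would need their own
source expression. `run` is whatever callable executes SQL for the engine
under test.

```python
import time
from statistics import median


def build_query(source: str, scalar_wkt: str, geom_col: str = "geometry") -> str:
    """Array/scalar predicate: a geometry column tested against one constant."""
    return (
        f"SELECT count(*) FROM {source} "
        f"WHERE ST_Contains(ST_GeomFromText('{scalar_wkt}'), {geom_col})"
    )


def time_query(run, sql: str, repeats: int = 5) -> float:
    """Return the median wall-clock seconds of run(sql) over several repeats."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(sql)
        timings.append(time.perf_counter() - start)
    return median(timings)


# Hypothetical inputs: one constant polygon, the same data as a table and a file.
POLYGON = "POLYGON ((0 0, 10 0, 10 10, 0 10, 0 0))"
table_sql = build_query("points", POLYGON)
parquet_sql = build_query("read_parquet('points.parquet')", POLYGON)
```

Passing each engine's own execute function as `run` keeps the timing loop
identical across engines, so only the source (table vs. Parquet scan) and the
engine differ between measurements.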

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@sedona.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
