GitHub user paleolimbot added a comment to the discussion: Observations from R 
and Python benchmarks: performance bottlenecks and optimization ideas for 
sedona-db

Very cool! Thank you for putting together these benchmarks! I think you're 
right about these bullets: 

- We can possibly make SedonaDB->geopandas faster
- We really do need to be able to read GDAL/OGR via R. I'll open a ticket for 
this and see if I can squeeze it in...it's more complicated than Python since 
we don't have pyogrio to help.
- We'd love to add ST_LineMerge and ST_Subdivide. These are "just" GEOS 
functions and have already been merged to georust/geos ( 
https://github.com/georust/geos/blob/47afbad2483e489911ddb456417808340e9342c3/src/geometry.rs#L2789-L2801
 ). I'll open tickets for these.
- I'll fix the schema mismatch issue this week 🙂

Reading GeoPackages and converting outside the Arrow universe are always going 
to be slower than GeoParquet + staying inside the Arrow universe, and part of 
the SedonaDB effort is strengthening those ecosystems to the point that those operations 
don't have to happen (i.e., we also want to make SedonaDB->geopandas/sf and 
reading .gpkg files unnecessary most of the time by making sure we support the 
next step).

If I'm reading these correctly, each of these benchmarks is a `.gpkg` read, 
followed by some operation, with a collect back into various existing 
frameworks. I think the reason `sedonadb-sf` appears so fast is that you're 
using `sd_collect()`, which doesn't actually produce `sf` objects but something 
closer to a zero-copy ALTREP wrapper around the array (a `geoarrow_vctr`, to be 
precise). If you changed `sd_collect()` to `st_as_sf()` I think you'd see 
something more similar to `sedonadb-geopandas`.
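To make that cost asymmetry concrete in plain-Python terms (nothing here uses the SedonaDB, sf, or geoarrow APIs; a stdlib buffer just stands in for an Arrow array): handing back a view over existing memory is nearly free, while materializing a per-element object for every value scales with the size of the data.

```python
import array
import time

# A million doubles in a contiguous buffer (stand-in for an Arrow array).
buf = array.array("d", range(1_000_000))

# "Collect" as a zero-copy view over the same memory
# (analogous to a geoarrow_vctr wrapper).
start = time.perf_counter()
view = memoryview(buf)
view_time = time.perf_counter() - start

# "Collect" by materializing one Python object per element
# (analogous to converting to a fully materialized data frame).
start = time.perf_counter()
materialized = list(buf)
mat_time = time.perf_counter() - start

print(f"view: {view_time:.6f}s  materialize: {mat_time:.6f}s")
assert view.obj is buf  # the view copied no data
```

The view is created in constant time regardless of how large the buffer is; the materializing path is the one that shows up in a benchmark.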

I'm not sure why sedonadb-polars isn't identical to sedonadb-sf for the 
spatial_join benchmark (I would have expected those results to be identical). I 
think that geopandas caches the spatial index, and I'm not sure you have a 
totally "fresh" GeoDataFrame for each iteration of your benchmark (alternatively, 
this might be a case where Python/R string handling shines over Arrow, since 
there are a lot of repeated strings in the output). 16 polygons x 100k points 
is pretty small, and I'm pleased that SedonaDB adds so little overhead that its 
performance on that sort of microbenchmark is still reasonable.
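On the "fresh GeoDataFrame per iteration" point: an object that lazily builds and caches an index can make a reused instance look much faster than a fresh one, which skews per-iteration timings. A minimal stdlib sketch of that benchmarking pitfall (the `CachedIndex` class is purely illustrative, not any geopandas internals):

```python
import time

class CachedIndex:
    """Toy stand-in for an object that lazily builds and caches an index."""
    def __init__(self, items):
        self.items = items
        self._index = None  # built on first query, then reused

    def query(self, key):
        if self._index is None:
            # Simulate an expensive one-time index build (like a spatial tree).
            self._index = {v: i for i, v in enumerate(self.items)}
        return self._index[key]

items = list(range(200_000))

# Reusing one object across iterations: the index is built once.
obj = CachedIndex(items)
start = time.perf_counter()
for _ in range(5):
    obj.query(123)
reused = time.perf_counter() - start

# A fresh object per iteration: the index is rebuilt every time.
start = time.perf_counter()
for _ in range(5):
    CachedIndex(items).query(123)
fresh = time.perf_counter() - start

print(f"reused: {reused:.4f}s  fresh: {fresh:.4f}s")
```

If the benchmark harness reuses the same GeoDataFrame across iterations, it measures the "reused" path, which flatters whichever framework caches the index.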


> how I can best contribute to testing these high-performance paths!

Continuing to kick the tires and write about it is fantastic! Knowing that 
there's interest in SedonaDB for R is helpful (it's a bit of a side project 
alongside the other SedonaDB work I do, and it's motivating to know that somebody 
actually plans on using it 🙂). I'm not sure I will ever get to writing a 
GeocompX variant but in theory that's what we're trying to provide with 
SedonaDB and it's a great blueprint for stuff that SedonaDB should be able to 
do at some point.

GitHub link: 
https://github.com/apache/sedona/discussions/2576#discussioncomment-15402640
