paleolimbot opened a new pull request, #646:
URL: https://github.com/apache/sedona-db/pull/646

   In #251 we tried to use the file metadata cache and found that it actually 
slowed down queries. Hiroaki kindly benchmarked the effect of the cache against 
DuckDB to demonstrate that the file cache there is effective for queries 
against large tables. @b4l kindly showed how to do this in #604.
   
   This PR pipes through the requisite options to ensure the cache is used for 
GeoParquet reads. This is especially important because we need to pull two 
extra copies of the metadata after DataFusion has already pulled it: if we 
don't use the cached version, we issue three requests where we could have 
issued one.
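
To make the request arithmetic concrete, here is a minimal sketch. It is not SedonaDB or DataFusion code: `CountingStore`, `CachedReader`, and the file name are hypothetical stand-ins, and the three reads model DataFusion's own metadata fetch plus the two extra GeoParquet ones.

```python
from typing import Dict


class CountingStore:
    """Hypothetical stand-in for an object store that counts remote requests."""

    def __init__(self) -> None:
        self.requests = 0

    def fetch_footer(self, path: str) -> bytes:
        # Every call here models one round trip to remote storage.
        self.requests += 1
        return f"footer-for-{path}".encode()


class CachedReader:
    """Hypothetical reader that keeps fetched footers in an in-memory dict."""

    def __init__(self, store: CountingStore) -> None:
        self.store = store
        self._cache: Dict[str, bytes] = {}

    def footer(self, path: str) -> bytes:
        # Only go to the store when the footer has not been seen before.
        if path not in self._cache:
            self._cache[path] = self.store.fetch_footer(path)
        return self._cache[path]


def requests_for_three_reads(use_cache: bool) -> int:
    """Count remote requests when the same file's metadata is read three times."""
    store = CountingStore()
    reader = CachedReader(store)
    for _ in range(3):
        if use_cache:
            reader.footer("part-0.parquet")
        else:
            store.fetch_footer("part-0.parquet")
    return store.requests
```

Under these assumptions, `requests_for_three_reads(False)` counts three requests and `requests_for_three_reads(True)` counts one.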
   
   A secondary issue is that the default cache size is not well-equipped 
to deal with Overture buildings, which we were using to benchmark this. The 
buildings data requires almost 900 megabytes of cache space. Because this is a 
least-recently-used cache that is queried roughly in order three times, a 
cache even slightly smaller than what the dataset requires evicts each entry 
before it is read again, so it is 0% useful. The increase we see in time is 
probably due to contention on the mutex guarding the in-memory cache.
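
The eviction pattern can be reproduced with a toy simulation. This is a sketch under simplifying assumptions, not the DataFusion cache: `lru_hit_rate` is a hypothetical helper, entries are treated as equal-sized, and capacity is counted in entries rather than bytes.

```python
from collections import OrderedDict


def lru_hit_rate(n_items: int, capacity: int, passes: int = 3) -> float:
    """Hit rate of an LRU cache scanned in the same order `passes` times."""
    cache: "OrderedDict[int, bool]" = OrderedDict()
    hits = 0
    total = 0
    for _ in range(passes):
        for key in range(n_items):
            total += 1
            if key in cache:
                hits += 1
                cache.move_to_end(key)  # mark as most recently used
            else:
                cache[key] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return hits / total
```

With 100 entries scanned three times, a capacity of 99 never produces a hit, because each entry is evicted just before the scan comes back to it. A capacity of 100 makes the second and third passes hit every time.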
   
   ```python
   import os
   os.environ["AWS_SKIP_SIGNATURE"] = "true"
   os.environ["AWS_DEFAULT_REGION"] = "us-west-2"
   import sedona.db
   
   sd = sedona.db.connect()
   
   sd.sql("SET datafusion.runtime.metadata_cache_limit = '900M'").execute()
   
   # 16s on main, 10s on this PR with a big enough cache
   sd.read_parquet(
       "s3://overturemaps-us-west-2/release/2026-02-18.0/theme=buildings/type=building/"
   ).to_view("buildings", overwrite=True)
   
   # Second time: 16s on main, 0s with this PR
   sd.read_parquet(
       "s3://overturemaps-us-west-2/release/2026-02-18.0/theme=buildings/type=building/"
   ).to_view("buildings", overwrite=True)
   ```
   
   I took the opportunity to redo the Overture buildings documentation page to 
include this and a few other improvements we added in the last few months.
   
   Closes #250.

