paleolimbot commented on issue #686:
URL: https://github.com/apache/sedona-db/issues/686#issuecomment-4000874632

   Here's one that's related to the blog post I wrote on counting buildings by 
zipcode a while back. I used census tracts here because that's the first 
shapefile I found:
   
   ```python
   # curl -L 
https://www2.census.gov/geo/tiger/GENZ2024/shp/cb_2024_us_tract_5m.zip -o 
tracts.shp.zip
   # curl -L 
https://github.com/geoarrow/geoarrow-data/releases/download/v0.2.0/microsoft-buildings_point.parquet
 -o buildings.parquet
   import sedona.db
   
   sd = sedona.db.connect()
   sd.options.interactive = True
   sd.options.memory_limit = "unlimited"
   
   sd.read_pyogrio("tracts.shp.zip").to_view("tracts")
   sd.read_parquet("buildings.parquet").to_view("buildings")
   
   # 6.2s
   sd.sql("""
   SELECT
       t."GEOIDFQ",
       COUNT(b.geometry) AS building_count
   FROM tracts AS t
   LEFT JOIN buildings AS b
       ON ST_Contains(t.wkb_geometry, ST_Transform(b.geometry, 'epsg:4269'))
   GROUP BY t."GEOIDFQ"
   ORDER BY building_count DESC
   """).to_memtable().to_view("result", overwrite=True)
   ```
   
   I'm not sure what DuckDB is doing differently here but DuckDB (1.5.0 pre) 
takes 90s to do this:
   
   ```python
   import duckdb
   
   duckdb.load_extension("spatial")
   duckdb.sql("""
   CREATE OR REPLACE VIEW tracts AS
   SELECT *
   FROM ST_Read('tracts.shp.zip')
   """)
   
   duckdb.sql("""
   SELECT
       t."GEOIDFQ",
       COUNT(b.geometry) AS building_count
   FROM tracts AS t
   LEFT JOIN "buildings.parquet" AS b
       ON ST_Contains(t.geom, ST_Transform(b.geometry, 'epsg:4326', 
'epsg:4269'))
   GROUP BY t."GEOIDFQ"
   ORDER BY building_count DESC
   """).to_table("results")
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to