Just for completeness, here is the same sum via a pipeline:

time gdal vector pipeline ! read PARQUET:/parquet-file ! clip --geometry 'WKT' \
  ! sql --sql 'select sum(file_size) from layer_name' ! info --features --limit 1
INFO: Open of `PARQUET:/parquet-file'
      using driver `(null)' successful.

Layer name: layer-name
Geometry: None
Feature Count: 1
Layer SRS WKT:
(unknown)
SUM_file_size: Real (0.0)
OGRFeature(layer-name):0
  SUM_file_size (Real) = 1139758617

real    1m19.788s
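
The same pipeline could also be driven from Python with subprocess. This is only a sketch: the helper names build_pipeline_args and run_pipeline and their parameters are made up for illustration, the steps and flags are copied from the command above, and it assumes a GDAL release whose gdal binary provides "vector pipeline".

```python
import subprocess


def build_pipeline_args(parquet_path, wkt, column, layer):
    """Return the argv list for the `gdal vector pipeline` call shown above.

    Hypothetical helper: only the step names and flags come from the
    original command; the function and parameter names are illustrative.
    """
    return [
        "gdal", "vector", "pipeline",
        "!", "read", f"PARQUET:{parquet_path}",
        "!", "clip", "--geometry", wkt,
        "!", "sql", "--sql", f"select sum({column}) from {layer}",
        "!", "info", "--features", "--limit", "1",
    ]


def run_pipeline(parquet_path, wkt, column="file_size", layer="layer_name"):
    # Passing a list (no shell) avoids quoting problems with the WKT string
    # and with the literal "!" step separators.
    args = build_pipeline_args(parquet_path, wkt, column, layer)
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout
```

run_pipeline returns the text printed by the info step, so the sum still has to be read from the "SUM_file_size" line of that output.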

On 2/17/26, 5:48 AM, "Michael Smith" <[email protected]> wrote:


I wanted to sum the values of a column in a Parquet file, restricted by a 
spatial filter. I can easily do this with duckdb, but I was trying to do it 
via GDAL. I was able to do it by fetching features, but not with ExecuteSQL 
alone, because the spatial filter wouldn’t find the geometry column unless 
it was part of the query.


This worked:
from osgeo import gdal, ogr

gf = gdal.OpenEx(f'PARQUET:{parquet_file}')
lay = gf.GetLayer()
# Restrict iteration to features intersecting the area of interest
lay.SetSpatialFilter(ogr.CreateGeometryFromWkb(aoi.wkb))
totsize_bytes += sum(feat.GetFieldAsInteger64('file_size') for feat in lay)


This didn’t:


res = gf.ExecuteSQL('select sum(file_size) from "parquet-file"', 
ogr.CreateGeometryFromWkb(aoi.wkb))
RuntimeError: Cannot set spatial filter: no geometry field present in layer.


Is this just a limitation of OGR SQL?
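
One thing that might be worth trying is to move the spatial test into the SQL itself using the SQLite dialect, so that no spatial filter has to be set on the geometry-less aggregate result. A minimal sketch, not run against the file above, assuming a GDAL build with SpatiaLite support, a geometry column called "geometry", and a WKT form of the AOI; build_sum_sql and sum_with_sqlite_dialect are illustrative names.

```python
def build_sum_sql(layer_name, column, wkt, geom_column="geometry"):
    """Build a SQLite-dialect query that filters spatially in the WHERE clause.

    Assumptions: the geometry column is named `geometry`, and SpatiaLite's
    ST_Intersects / ST_GeomFromText are available in this GDAL build.
    """
    wkt_literal = wkt.replace("'", "''")  # escape quotes for the SQL literal
    return (
        f'select sum("{column}") from "{layer_name}" '
        f"where ST_Intersects(\"{geom_column}\", ST_GeomFromText('{wkt_literal}'))"
    )


def sum_with_sqlite_dialect(parquet_file, layer_name, wkt, column="file_size"):
    from osgeo import gdal  # lazy import: needs the GDAL Python bindings

    gdal.UseExceptions()
    ds = gdal.OpenEx(f"PARQUET:{parquet_file}")
    res = ds.ExecuteSQL(build_sum_sql(layer_name, column, wkt), dialect="SQLite")
    try:
        # The aggregate result layer holds a single feature with one field
        return res.GetNextFeature().GetField(0)
    finally:
        ds.ReleaseResultSet(res)
```

The sketch makes no claim about speed; it only avoids passing a spatial filter to a result layer that has no geometry field.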


Via duckdb:
wkb_bytes = aoi.wkb.tobytes()
sql = (f"select sum(file_size) from read_parquet('{str(parquet_file)}') "
       "where ST_Intersects_Extent(geometry, ST_GeomFromWKB(?))")
params = [wkb_bytes]
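
For reference, running that statement might look like the following. This is a sketch that assumes the duckdb Python package with its spatial extension available; build_duckdb_sql and duckdb_sum are illustrative names, not part of the original script.

```python
def build_duckdb_sql(parquet_file, column="file_size", geom_column="geometry"):
    """Return the aggregate query; the WKB filter is bound as a `?` parameter."""
    path = str(parquet_file).replace("'", "''")  # escape quotes in the path
    return (
        f"select sum({column}) from read_parquet('{path}') "
        f"where ST_Intersects_Extent({geom_column}, ST_GeomFromWKB(?))"
    )


def duckdb_sum(parquet_file, wkb_bytes):
    import duckdb  # lazy import: needs the duckdb package

    con = duckdb.connect()
    # Assumption: the spatial extension can be installed and loaded here
    con.install_extension("spatial")
    con.load_extension("spatial")
    # The single row of the aggregate result holds the sum in column 0
    return con.execute(build_duckdb_sql(parquet_file), [wkb_bytes]).fetchone()[0]
```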


Performance difference:
gdal: size: 1139758617, time: 0:01:37.471977
duck: size: 1139758617, time: 0:00:15.171584
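
Put as a ratio, the two timings above work out to roughly a 6.4x difference. A small sketch of the arithmetic (to_seconds is just an illustrative helper):

```python
def to_seconds(hms):
    """Convert an 'H:MM:SS.ffffff' string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


gdal_s = to_seconds("0:01:37.471977")  # gdal timing quoted above
duck_s = to_seconds("0:00:15.171584")  # duckdb timing quoted above
ratio = gdal_s / duck_s
print(f"gdal {gdal_s:.1f}s, duckdb {duck_s:.1f}s, ratio {ratio:.1f}x")
# prints: gdal 97.5s, duckdb 15.2s, ratio 6.4x
```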




-- 


Michael Smith 
RSGIS Center – ERDC CRREL NH 
US Army Corps 

_______________________________________________
gdal-dev mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/gdal-dev
