Thanks Blake for bringing this up! When we were adding the geospatial logical types, page-level geo stats were left out on purpose to avoid storage bloat before real-world use cases appeared.
I agree with Arnav that we may need more concrete data to justify it.

Best,
Gang

On Thu, Feb 26, 2026 at 5:03 PM Arnav Balyan <[email protected]> wrote:

> Hi Blake,
>
> Thanks for sharing the benchmarks, the results look quite compelling. Page
> level pruning seems like a promising direction, I had a couple of
> questions:
>
> 1. Storage overhead/compression ratio:
> Do you have measurements on the storage overhead by page level stats in
> this benchmark? In particular, for the 1800 pages in the geometry column:
>
>    - What was the approximate per page metadata size
>    - Did you observe any impact on compression ratio/file size compared to
>      the baseline?
>
> 2. Dataset:
> Could you share more details on how the test fixture was derived? It
> appears to be based on an Overture dataset, it would be helpful to
> understand:
>
>    - What themes the data was drawn from (buildings, places,
>      transportation, etc)
>    - Does this represent a specific region, and the mix of geometry types
>      present.
>
> Additionally, have you considered evaluating this on other real world
> datasets like OpenStreetMap to understand how the performance varies on
> different data/spatial characteristics?
>
> Thanks,
> Arnav
>
>
> On Tue, Feb 24, 2026 at 6:43 AM Blake Orth <[email protected]> wrote:
>
> > Hello all,
> >
> > I would like to start a discussion on allowing Page level statistics for
> > the new GEO column types.
> >
> > If I understand correctly, the discussion during the formalization of GEO
> > types initially included page-level statistics. However, the decision was
> > made to only allow Row Group level statistics because there was no
> > compelling evidence that Page statistics would meaningfully impact query
> > performance enough to offset their potential impact on file size. Some
> > discussions with other members of the GeoParquet community prompted me to
> > build some benchmarks exploring the effect a specialized spatial index
> > could have on query performance. The benchmarks explored the differences
> > between three cases: a base Parquet file that allows Row Group level
> > pruning, a case simulating Page level pruning using simple flat
> > statistics (like the standard Parquet page stats structures), and finally
> > a case simulating Page level pruning using a specialized spatial index.
> > The simple flat statistics performed nearly identically to the spatial
> > index, and allowing page-level pruning improved query performance by
> > almost 2x over the base Row Group pruning. Considering these results we
> > felt that pursuing a specialized index specifically for GeoParquet is
> > likely unnecessary. Allowing page-level statistics for GEO columns shows
> > meaningful query performance gains.
> >
> > The source to reproduce the benchmarks can be found here, with some
> > simple instructions in the README on how to obtain the test fixture and
> > run the benchmarks:
> > https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks
> >
> > The benchmarks leverage a modified version of GeoDatafusion to compute a
> > relatively selective geometry intersection query, filtering approximately
> > 3,000 geometries from a test fixture containing over 10,000,000 rows. The
> > file itself is approximately 1.1GB in size and has just over 1800 pages
> > in its geometry column. In this case, allowing statistics on those pages
> > should represent minimal file size overhead.
> >
> > If anyone has any additional requests for benchmarks or information on
> > the benchmarks provided please let me know!
> >
> > Thanks,
> > -Blake Orth
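
For readers following the thread, here is a minimal sketch of what "simple flat statistics" for page-level pruning could look like. It is not taken from the linked benchmarks or from the Parquet specification. It assumes each page carries one 2D bounding box stored as four 64-bit floats, and the names `BBox` and `pages_to_scan` are hypothetical. The overhead figure counts only the raw box payload for the roughly 1800 pages mentioned in the quoted message and ignores any encoding or framing cost.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    """Axis-aligned 2D bounding box (hypothetical per-page statistic)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        # Two boxes overlap only if their extents overlap on both axes.
        return (
            self.xmin <= other.xmax
            and other.xmin <= self.xmax
            and self.ymin <= other.ymax
            and other.ymin <= self.ymax
        )


def pages_to_scan(page_bboxes: List[BBox], query: BBox) -> List[int]:
    """Return indices of pages whose bounding box overlaps the query box.

    Pages that do not overlap cannot contain an intersecting geometry,
    so a reader could skip decoding them entirely.
    """
    return [i for i, box in enumerate(page_bboxes) if box.intersects(query)]


# Toy data: three pages with made-up extents, one query window.
page_bboxes = [BBox(0, 0, 10, 10), BBox(10, 0, 20, 10), BBox(50, 50, 60, 60)]
query = BBox(12, 2, 15, 5)
selected = pages_to_scan(page_bboxes, query)

# Rough payload estimate under the stated assumption of 4 doubles per page.
BYTES_PER_BBOX = 4 * 8
overhead_bytes = 1800 * BYTES_PER_BBOX

print(selected, overhead_bytes)
```

Under these assumptions the raw statistics payload is on the order of tens of kilobytes against a file of about 1.1GB. Actual measurements of the encoded size would still be needed to answer the storage overhead question raised above.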
