Thanks Blake for bringing this up!

When we were adding the geospatial logical types, page-level geo stats were
intentionally left out to avoid storage bloat until real-world use cases
appeared.
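
For anyone new to the thread, here is a minimal sketch of what page-level geo
stats would enable, assuming each page carries a flat 2D bounding box. The
names `BBox`, `intersects`, and `prune_pages` are hypothetical and are not part
of the Parquet spec or any library; the sketch also ignores antimeridian
wraparound and Z/M bounds.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    """Hypothetical flat per-page statistic: a 2D bounding box."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: BBox, b: BBox) -> bool:
    # Two axis-aligned boxes overlap only if they overlap on both axes.
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def prune_pages(page_boxes: List[BBox], query: BBox) -> List[int]:
    # Keep only the pages whose box could contain a matching geometry;
    # every other page can be skipped without being decoded.
    return [i for i, box in enumerate(page_boxes) if intersects(box, query)]


# Illustrative data only, not taken from the benchmark fixture.
pages = [BBox(0, 0, 10, 10), BBox(20, 20, 30, 30), BBox(5, 5, 25, 25)]
query = BBox(8, 8, 12, 12)
kept = prune_pages(pages, query)
print(kept)
```

A box test like this can only rule pages out; rows in the surviving pages still
need the exact geometry predicate.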

I agree with Arnav that we may need more concrete data to justify it.
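
As a starting point for that data, here is a rough back-of-envelope estimate
using the figures from Blake's message below (just over 1800 pages, roughly
1.1GB). The 32 bytes per page is an assumption (one 2D bounding box stored as
four 8-byte doubles, ignoring encoding overhead, Z/M bounds, and any geometry
type lists), not a measured value.

```python
PAGES = 1800            # "just over 1800 pages" in the geometry column
FILE_BYTES = 1.1e9      # "approximately 1.1GB", taken as decimal gigabytes
BBOX_BYTES = 4 * 8      # assumption: xmin, ymin, xmax, ymax as doubles


def stats_overhead(pages: int, bytes_per_page: int, file_bytes: float):
    """Return (total stats bytes, fraction of file size)."""
    total = pages * bytes_per_page
    return total, total / file_bytes


total, frac = stats_overhead(PAGES, BBOX_BYTES, FILE_BYTES)
print(f"{total} bytes of page stats, {frac:.4%} of the file")
```

Under these assumptions the raw bounding boxes come to 57,600 bytes, which is
well under 0.01% of the file. The ratio scales with page count, so files with
much smaller pages or several geometry columns would still need measuring.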

Best,
Gang

On Thu, Feb 26, 2026 at 5:03 PM Arnav Balyan <[email protected]> wrote:

> Hi Blake,
>
> Thanks for sharing the benchmarks, the results look quite compelling.
> Page-level pruning seems like a promising direction. I had a couple of
> questions:
>
> 1. Storage overhead/compression ratio:
> Do you have measurements of the storage overhead added by page-level stats
> in this benchmark? In particular, for the 1800 pages in the geometry column:
>
>    - What was the approximate per-page metadata size?
>    - Did you observe any impact on compression ratio/file size compared to
>    the baseline?
>
> 2. Dataset:
> Could you share more details on how the test fixture was derived? It
> appears to be based on an Overture dataset, so it would be helpful to
> understand:
>
>    - Which themes the data was drawn from (buildings, places,
>    transportation, etc.)
>    - Whether it represents a specific region, and what mix of geometry
>    types is present
>
> Additionally, have you considered evaluating this on other real-world
> datasets like OpenStreetMap to understand how performance varies with
> different data and spatial characteristics?
>
> Thanks,
> Arnav
>
>
> On Tue, Feb 24, 2026 at 6:43 AM Blake Orth <[email protected]> wrote:
>
> > Hello all,
> >
> > I would like to start a discussion on allowing Page level statistics for
> > the new GEO column types.
> >
> > If I understand correctly, the discussion during the formalization of GEO
> > types initially included page-level statistics. However, the decision was
> > made to only allow Row Group level statistics because there was no
> > compelling evidence that Page statistics would meaningfully impact query
> > performance enough to offset their potential impact on file size. Some
> > discussions with other members of the GeoParquet community prompted me to
> > build some benchmarks exploring the effect a specialized spatial index
> > could have on query performance. The benchmarks explored the differences
> > between three cases: a base Parquet file that allows Row Group level
> > pruning, a case simulating Page level pruning using simple flat
> > statistics (like the standard Parquet page stats structures), and finally
> > a case simulating Page level pruning using a specialized spatial index. The
> > simple flat statistics performed nearly identically to the spatial index,
> > and allowing page-level pruning improved query performance by almost 2x
> > over the base Row Group pruning. Considering these results we felt that
> > pursuing a specialized index specifically for GeoParquet is likely
> > unnecessary. Allowing page-level statistics for GEO columns shows
> > meaningful query performance gains.
> >
> > The source to reproduce the benchmarks can be found here, with some simple
> > instructions in the README on how to obtain the test fixture and run the
> > benchmarks:
> > https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks
> >
> > The benchmarks leverage a modified version of GeoDatafusion to compute a
> > relatively selective geometry intersection query, filtering approximately
> > 3,000 geometries from a test fixture containing over 10,000,000 rows. The
> > file itself is approximately 1.1GB in size and has just over 1800 pages in
> > its geometry column. In this case, allowing statistics on those pages
> > should represent minimal file size overhead.
> >
> > If anyone has any additional requests for benchmarks or information on the
> > benchmarks provided, please let me know!
> >
> > Thanks,
> > -Blake Orth
> >
>
