Hi Blake,

Thanks for sharing the benchmarks; the results look quite compelling.
Page-level pruning seems like a promising direction. I have a couple of
questions:

1. Storage overhead/compression ratio:
Do you have measurements of the storage overhead introduced by page-level
stats in this benchmark? In particular, for the roughly 1800 pages in the
geometry column:

   - What was the approximate per-page metadata size?
   - Did you observe any impact on compression ratio/file size compared to
   the baseline?
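
For a rough sense of scale, here is a back-of-envelope sketch. It assumes
each page carries only a 2D bounding box stored as four 8-byte doubles, and
that 1.1GB means 1.1e9 bytes; both are my assumptions, not figures from the
benchmark, and the real encoding may well be larger.

```python
# Back-of-envelope estimate only. The 32-byte bounding box layout is an
# assumption, not the encoding actually used in the benchmark.
PAGES = 1800             # "just over 1800 pages" in the geometry column
FILE_SIZE_BYTES = 1.1e9  # "approximately 1.1GB", assumed decimal gigabytes
BBOX_BYTES = 4 * 8       # xmin, ymin, xmax, ymax as 8-byte doubles (assumed)


def stats_overhead(pages, bytes_per_page, file_size):
    """Return (total stats bytes, fraction of the file they represent)."""
    total = pages * bytes_per_page
    return total, total / file_size


total, fraction = stats_overhead(PAGES, BBOX_BYTES, FILE_SIZE_BYTES)
print(f"{total} bytes of page stats, {fraction:.4%} of the file")
```

Under those assumptions the raw stats come to 57,600 bytes, well under
0.01% of the file, so measured numbers would mainly tell us how far the
actual encoding departs from this lower bound.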

2. Dataset:
Could you share more details on how the test fixture was derived? It
appears to be based on an Overture dataset, so it would be helpful to
understand:

   - Which themes the data was drawn from (buildings, places,
   transportation, etc.)
   - Whether it represents a specific region, and what mix of geometry
   types is present

Additionally, have you considered evaluating this on other real-world
datasets such as OpenStreetMap to understand how performance varies with
different data and spatial characteristics?

Thanks,
Arnav


On Tue, Feb 24, 2026 at 6:43 AM Blake Orth <[email protected]> wrote:

> Hello all,
>
> I would like to start a discussion on allowing Page level statistics for
> the new GEO column types.
>
> If I understand correctly, the discussion during the formalization of GEO
> types initially included page-level statistics. However, the decision was
> made to only allow Row Group level statistics because there was no
> compelling evidence that Page statistics would meaningfully impact query
> performance enough to offset their potential impact on file size. Some
> discussions with other members of the GeoParquet community prompted me to
> build some benchmarks exploring the effect a specialized spatial index
> could have on query performance. The benchmarks explored the
> differences between three cases: a base Parquet file that allows Row Group
> level pruning, a case simulating Page level pruning using simple flat
> statistics (like the standard Parquet page stats structures), and finally a
> case simulating Page level pruning using a specialized spatial index. The
> simple flat statistics performed nearly identically to the spatial index,
> and allowing page-level pruning improved query performance by almost 2x
> over the base Row Group pruning. Considering these results we felt that
> pursuing a specialized index specifically for GeoParquet is likely
> unnecessary. Allowing page-level statistics for GEO columns shows
> meaningful query performance gains.
>
> The source to reproduce the benchmarks can be found here, with some simple
> instructions in the README on how to obtain the test fixture and run the
> benchmarks:
> https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks
>
> The benchmarks leverage a modified version of GeoDatafusion to compute a
> relatively selective geometry intersection query, filtering approximately
> 3,000 geometries from a test fixture containing over 10,000,000 rows. The
> file itself is approximately 1.1GB in size and has just over 1800 pages in
> its geometry column. In this case, allowing statistics on those pages
> should represent minimal file size overhead.
>
> If anyone has any additional requests for benchmarks or information on the
> benchmarks provided please let me know!
>
> Thanks,
> -Blake Orth
>
