Thanks for the engagement, everyone. I'm glad there's some general
interest in exploring this idea.

Arnav, to answer your questions I'll actually take the 2nd one first, as
it adds some overall context to the discussion.
2. Dataset
As you noted, the test fixture was derived from Overture Maps Foundation
GeoParquet data. It is a single file selected from the building footprints
partition. This means it represents a collection of 2D polygons and their
associated properties. The processing dropped the "GeoParquet" specific
data (key-value metadata and dedicated bbox "covering columns") and wrote a
new parquet file containing the Parquet geometry metadata. All the
processing was done using GDAL with its default settings. Given GDAL's
overwhelming prevalence in spatial data processing, we believe this is a
pretty good representation of what we expect to see in real world use
cases. The output file maintains the row ordering of the input. This is
worth noting because Overture data is internally partitioned, placing
co-located geometries into the same Row Group (I believe based on their
level-3 GeoHash); however, the rows are not "sorted" in a traditional
sense. GDAL defaults to blindly truncating Row Groups at 65,536 rows. All
this is to say that while the test fixture is generally "well formed"
spatially, it doesn't represent a solution optimized for page-level pruning.

1. Storage overhead/compression ratio:
I don't have specific measurements now, but I can provide exact numbers for
this case if needed. I noted that the page statistics are "simulated"
because I don't actually have a prototype implementation to write them to
the file. This initial effort just collects the data that would exist in a
page index (covering bbox) for each page and stores it for use during a
scan. For discussion's sake, we can do some quick "napkin math" in the
meantime. The current spatial statistics bbox is 4 required doubles and 4
optional doubles. Unless I'm mistaken, this should yield 32 to 64 bytes per
page in the metadata. This test fixture has 2D polygons, so it will use the
32 byte bbox. Approximately 1,800 pages would result in 57,600 bytes. For a
file that's about 1.1GB, adding the bbox to the page statistics would
increase the file size by roughly 0.005%. The "geometry"
column accounts for the bulk of the file's compressed size. Row Group
statistics suggest that the geometry column's compressed size is generally
between 8MB and 9MB. With 104 Row Groups in the file it's probably safe to
assume about 850MB of compressed geometry data. Again, if we want more
exact numbers, let me know and I can provide them in a follow-up.
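
As a sanity check, the napkin math above fits in a few lines of Python. The
constants are the approximate figures from this thread, not measurements:

```python
# Rough overhead estimate for page-level bbox statistics, using the
# approximate figures from this thread (assumed, not measured).
PAGES = 1_800                 # pages in the geometry column
BBOX_BYTES = 4 * 8            # 4 required float64 values for a 2D bbox
FILE_BYTES = 1.1e9            # ~1.1GB file

overhead_bytes = PAGES * BBOX_BYTES
overhead_pct = overhead_bytes / FILE_BYTES * 100

print(f"{overhead_bytes} bytes of bbox stats, {overhead_pct:.4f}% of the file")
# → 57600 bytes of bbox stats, 0.0052% of the file
```

Doubling BBOX_BYTES to 64 for the full 3D/optional case still keeps the
overhead comfortably under a hundredth of a percent.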

I think Andrew's points are important here: the writer primarily drives the
effectiveness of page-level statistics, both in terms of compression ratio
and pruning potential. I don't think this is a geospatial-specific
observation either, though it's probably more applicable to columns
represented as Binary than to primitive types.

Andrew, regarding your recommendation on how to drive this forward: I
believe I should have time to do it. Is the request effectively to modify a
Parquet library (I'd use arrow-rs) to actually write and read spatial
statistics in the page index? I don't expect readers here to have reviewed
the code I've already provided, but I'm trying to understand how your
suggestion differs from the benchmarks I've already explored.

-Blake




On Thu, Feb 26, 2026 at 4:49 AM Andrew Lamb <[email protected]> wrote:

> Personally I think having page level statistics is a good idea and I am not
> sure we need to do a lot more empirical evaluation before doing a POC.
>
> I think the overhead of page-level statistics will depend almost entirely
> on how you configure the parquet writer. For example, if you configure the
> writer for pages with 10000 of GEO points, the overhead of statistics will
> be much lower than if you configure the writer with pages that have 100
> points.
>
> The performance benefit of having such indexes will depend heavily on how
> the data is distributed among the pages and what the query predicate is,
> which will determine how much pruning is effective.
>
> I am not surprised there are reasonable usecases where page level
> statistics make a substantial difference (which is what Blake appears to
> have shown with the benchmark).
>
> My personal suggestion for anyone who is interested in driving this forward
> is:
> 1. Create a proof of concept (add the relevant statistics to the PageIndex)
> 2. Demonstrate real world performance gains in a plausible benchmark
>
> Ideally the proof of concept would be enough to run the other experiments
> suggested by Gang and Arnav.
>
> Andrew
>
>
>
>
> On Thu, Feb 26, 2026 at 4:37 AM Gang Wu <[email protected]> wrote:
>
> > Thanks Blake for bringing this up!
> >
> > When we were adding the geospatial logical types, page-level geo stats
> > were deliberately not added, to avoid storage bloat before real world
> > use cases appear.
> >
> > I agree with Arnav that we may need more concrete data to justify it.
> >
> > Best,
> > Gang
> >
> > On Thu, Feb 26, 2026 at 5:03 PM Arnav Balyan <[email protected]>
> > wrote:
> >
> > > Hi Blake,
> > >
> > > Thanks for sharing the benchmarks, the results look quite compelling.
> > Page
> > > level pruning seems like a promising direction, I had a couple of
> > > questions:
> > >
> > > 1. Storage overhead/compression ratio:
> > > Do you have measurements on the storage overhead by page level stats in
> > > this benchmark? In particular, for the 1800 pages in the geometry
> column:
> > >
> > >    - What was the approximate per page metadata size
> > >    - Did you observe any impact on compression ratio/file size compared
> > to
> > >    the baseline?
> > >
> > > 2. Dataset:
> > > Could you share more details on how the test fixture was derived? It
> > > appears to be based on an Overture dataset, it would be helpful to
> > > understand:
> > >
> > >    - What themes the data was drawn from (buildings, places,
> > >    transportation, etc)
> > >    - Does this represent a specific region, and the mix of geometry
> types
> > >    present.
> > >
> > > Additionally, have you considered evaluating this on other real world
> > > datasets like OpenStreetMap to understand how the performance varies on
> > > different data/spatial characteristics?
> > >
> > > Thanks,
> > > Arnav
> > >
> > >
> > > On Tue, Feb 24, 2026 at 6:43 AM Blake Orth <[email protected]> wrote:
> > >
> > > > Hello all,
> > > >
> > > > I would like to start a discussion on allowing Page level statistics
> > for
> > > > the new GEO column types.
> > > >
> > > > If I understand correctly, the discussion during the formalization of
> > GEO
> > > > types initially included page-level statistics. However, the decision
> > was
> > > > made to only allow Row Group level statistics because there was no
> > > > compelling evidence that Page statistics would meaningfully impact
> > query
> > > > performance enough to offset their potential impact on file size.
> Some
> > > > discussions with other members of the GeoParquet community prompted
> me
> > to
> > > > build some benchmarks exploring the effect a specialized spatial
> index
> > > > could have on query performance. The benchmarks explored the
> > > > differences between three cases: a base Parquet file that allows Row
> > > Group
> > > > level pruning, a case simulating Page level pruning using simple flat
> > > > statistics (like the standard Parquet page stats structures), and
> > > finally a
> > > > case simulating Page level pruning using a specialized spatial index.
> > The
> > > > simple flat statistics performed nearly identically to the spatial
> > index,
> > > > and allowing page-level pruning improved query performance by almost
> 2x
> > > > over the base Row Group pruning. Considering these results we felt
> that
> > > > pursuing a specialized index specifically for GeoParquet is likely
> > > > unnecessary. Allowing page-level statistics for GEO columns shows
> > > > meaningful query performance gains.
> > > >
> > > > The source to reproduce the benchmarks can be found here, with some
> > > simple
> > > > instructions in the README on how to obtain the test fixture and run
> > the
> > > > benchmarks:
> > > > https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks
> > > >
> > > > The benchmarks leverage a modified version of GeoDatafusion to
> compute
> > a
> > > > relatively selective geometry intersection query, filtering
> > approximately
> > > > 3,000 geometries from a test fixture containing over 10,000,000 rows.
> > The
> > > > file itself is approximately 1.1GB in size and has just over 1800
> pages
> > > in
> > > > its geometry column. In this case, allowing statistics on those pages
> > > > should represent minimal file size overhead.
> > > >
> > > > If anyone has any additional requests for benchmarks or information
> on
> > > the
> > > > benchmarks provided please let me know!
> > > >
> > > > Thanks,
> > > > -Blake Orth
> > > >
> > >
> >
>