> Andrew, regarding your recommendation on how to drive this forward, which I
> believe I should have time to do. Is the request to effectively modify a
> parquet library (I'd use arrow-rs) to actually write and read spatial
> statistics in the page index? I don't expect readers here to have reviewed
> the code I've already provided, but I'm trying to understand how your
> suggestion differs from the benchmarks I've already explored.
Yes, I suggest modifying an existing system with the proposed changes so that:

1. It is 100% clear what is proposed (we can review the code)
2. We don't have to speculate about the potential costs / benefits (we can
instead measure them directly with the proposal)

On Thu, Feb 26, 2026 at 4:33 PM Blake Orth <[email protected]> wrote:

> Thanks for the engagement, everyone. I'm glad there's generally some
> interest in exploring this idea.
>
> Arnav, to address your questions I'll actually address the 2nd one first,
> as it adds some overall context to the discussion.
>
> 2. Dataset:
> As you noted, the test fixture was derived from Overture Maps Foundation
> GeoParquet data. It is a single file selected from the building footprints
> partition. This means it represents a collection of 2D polygons and their
> associated properties. The processing dropped the GeoParquet-specific data
> (key-value metadata and dedicated bbox "covering columns") and wrote a new
> Parquet file containing the Parquet geometry metadata. All the processing
> was done using GDAL with its default settings. Given GDAL's overwhelming
> prevalence in processing spatial data, we believe this is a good
> representation of what we expect to see in real-world use cases. The
> output file maintains the row ordering of the input. This is worth noting
> because Overture data is internally partitioned, placing co-located
> geometries into the same Row Group (I believe based on their level 3
> GeoHash); however, the rows are not "sorted" in a traditional sense. GDAL
> defaults to blindly truncating Row Groups to 65,536 rows. All this is to
> say that while the test fixture is generally "well formed" spatially, it
> doesn't represent a file optimized for page-level pruning.
>
> 1. Storage overhead/compression ratio:
> I don't have specific measurements now, but I can provide exact numbers
> for this case if needed.
> I noted that the page statistics are "simulated" because I don't actually
> have a prototype implementation that writes them to the file. This initial
> effort just collects the data that would exist in a page index (a covering
> bbox) for each page and stores it for use during a scan. For discussion's
> sake, we can do some quick "napkin math" in the meantime. The current
> spatial statistics bbox is 4 required doubles and 4 optional doubles.
> Unless I'm mistaken, this should yield 32 to 64 bytes per page in the
> metadata. This test fixture has 2D polygons, so it will use the 32-byte
> bbox. Approximately 1,800 pages would result in 57,600 bytes. For a file
> that's about 1.1GB, adding the bbox to the page statistics would increase
> the file size by about 0.005%. The "geometry" column accounts for the bulk
> of the file's compressed size. Row Group statistics suggest the geometry
> column's compressed size is generally between 8MB and 9MB per group. With
> 104 Row Groups in the file, it's probably safe to assume about 850MB of
> compressed geometry data. Again, if we want more exact numbers, let me
> know and I can provide them in a follow-up.
>
> I think Andrew's points are important here: the writer primarily drives
> the effectiveness of page-level statistics, both in terms of compression
> ratio and pruning potential. I don't feel this is a geospatial-specific
> statement either, though it's probably more applicable to columns
> represented as Binary than to primitive types.
>
> Andrew, regarding your recommendation on how to drive this forward, which
> I believe I should have time to do. Is the request to effectively modify a
> parquet library (I'd use arrow-rs) to actually write and read spatial
> statistics in the page index? I don't expect readers here to have reviewed
> the code I've already provided, but I'm trying to understand how your
> suggestion differs from the benchmarks I've already explored.
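As a sanity check on the napkin math above, here is a quick sketch using only the figures stated in this thread (the 32-byte figure assumes just the 4 required doubles of the bbox are written):

```python
# Napkin math for per-page bbox overhead; all inputs are figures from the thread.
BYTES_PER_DOUBLE = 8
required_doubles = 4                                # xmin, xmax, ymin, ymax
bbox_bytes = required_doubles * BYTES_PER_DOUBLE    # 32 bytes for 2D data

pages = 1_800                                       # ~1,800 pages in the geometry column
total_overhead = pages * bbox_bytes                 # 57,600 bytes of added metadata

file_size = 1.1e9                                   # ~1.1 GB fixture
overhead_pct = 100 * total_overhead / file_size

print(f"{bbox_bytes} B/page, {total_overhead} B total, {overhead_pct:.4f}% of file")
# -> 32 B/page, 57600 B total, 0.0052% of file
```

So the added metadata works out to roughly 0.005% of the file, i.e. negligible next to the ~850MB of compressed geometry data.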
>
> -Blake
>
> On Thu, Feb 26, 2026 at 4:49 AM Andrew Lamb <[email protected]>
> wrote:
>
> > Personally I think having page-level statistics is a good idea, and I
> > am not sure we need to do a lot more empirical evaluation before doing
> > a POC.
> >
> > I think the overhead of page-level statistics will depend almost
> > entirely on how you configure the Parquet writer. For example, if you
> > configure the writer for pages with 10,000 GEO points, the overhead of
> > statistics will be much lower than if you configure the writer with
> > pages that have 100 points.
> >
> > The performance benefit of having such indexes will depend heavily on
> > how the data is distributed among the pages and what the query
> > predicate is, which will determine how much pruning is effective.
> >
> > I am not surprised there are reasonable use cases where page-level
> > statistics make a substantial difference (which is what Blake appears
> > to have shown with the benchmark).
> >
> > My personal suggestion for anyone who is interested in driving this
> > forward is:
> > 1. Create a proof of concept (add the relevant statistics to the
> > PageIndex)
> > 2. Demonstrate real-world performance gains in a plausible benchmark
> >
> > Ideally the proof of concept would be enough to run the other
> > experiments suggested by Gang and Arnav.
> >
> > Andrew
> >
> > On Thu, Feb 26, 2026 at 4:37 AM Gang Wu <[email protected]> wrote:
> >
> > > Thanks Blake for bringing this up!
> > >
> > > When we were adding the geospatial logical types, page-level geo
> > > stats were deliberately not added, to avoid storage bloat before
> > > real-world use cases appear.
> > >
> > > I agree with Arnav that we may need more concrete data to justify it.
> > >
> > > Best,
> > > Gang
> > >
> > > On Thu, Feb 26, 2026 at 5:03 PM Arnav Balyan <[email protected]>
> > > wrote:
> > >
> > > > Hi Blake,
> > > >
> > > > Thanks for sharing the benchmarks, the results look quite compelling.
> > > > Page level pruning seems like a promising direction; I had a couple
> > > > of questions:
> > > >
> > > > 1. Storage overhead/compression ratio:
> > > > Do you have measurements on the storage overhead introduced by page
> > > > level stats in this benchmark? In particular, for the 1,800 pages in
> > > > the geometry column:
> > > >
> > > > - What was the approximate per-page metadata size?
> > > > - Did you observe any impact on compression ratio/file size compared
> > > >   to the baseline?
> > > >
> > > > 2. Dataset:
> > > > Could you share more details on how the test fixture was derived? It
> > > > appears to be based on an Overture dataset; it would be helpful to
> > > > understand:
> > > >
> > > > - What themes the data was drawn from (buildings, places,
> > > >   transportation, etc.)
> > > > - Does this represent a specific region, and what mix of geometry
> > > >   types is present?
> > > >
> > > > Additionally, have you considered evaluating this on other real-world
> > > > datasets like OpenStreetMap to understand how the performance varies
> > > > on different data/spatial characteristics?
> > > >
> > > > Thanks,
> > > > Arnav
> > > >
> > > > On Tue, Feb 24, 2026 at 6:43 AM Blake Orth <[email protected]> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I would like to start a discussion on allowing Page level
> > > > > statistics for the new GEO column types.
> > > > >
> > > > > If I understand correctly, the discussion during the formalization
> > > > > of GEO types initially included page-level statistics. However, the
> > > > > decision was made to only allow Row Group level statistics because
> > > > > there was no compelling evidence that Page statistics would improve
> > > > > query performance enough to offset their potential impact on file
> > > > > size.
> > > > > Some discussions with other members of the GeoParquet community
> > > > > prompted me to build some benchmarks exploring the effect a
> > > > > specialized spatial index could have on query performance. The
> > > > > benchmarks explored the differences between three cases: a base
> > > > > Parquet file that allows Row Group level pruning, a case simulating
> > > > > Page level pruning using simple flat statistics (like the standard
> > > > > Parquet page stats structures), and finally a case simulating Page
> > > > > level pruning using a specialized spatial index. The simple flat
> > > > > statistics performed nearly identically to the spatial index, and
> > > > > allowing page-level pruning improved query performance by almost 2x
> > > > > over the base Row Group pruning. Considering these results, we felt
> > > > > that pursuing a specialized index specifically for GeoParquet is
> > > > > likely unnecessary: allowing page-level statistics for GEO columns
> > > > > alone shows meaningful query performance gains.
> > > > >
> > > > > The source to reproduce the benchmarks can be found here, with some
> > > > > simple instructions in the README on how to obtain the test fixture
> > > > > and run the benchmarks:
> > > > > https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks
> > > > >
> > > > > The benchmarks leverage a modified version of GeoDatafusion to
> > > > > compute a relatively selective geometry intersection query,
> > > > > filtering approximately 3,000 geometries from a test fixture
> > > > > containing over 10,000,000 rows. The file itself is approximately
> > > > > 1.1GB in size and has just over 1,800 pages in its geometry column.
> > > > > In this case, allowing statistics on those pages should represent
> > > > > minimal file size overhead.
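The kind of page-level pruning being simulated reduces to a bbox-overlap test: keep only the pages whose covering bbox could intersect the query window. A minimal sketch, assuming each page's statistics reduce to a covering bbox (names are illustrative, not arrow-rs or GeoDatafusion API):

```python
# Illustrative bbox-based page pruning; not any library's actual API.
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def intersects(a: BBox, b: BBox) -> bool:
    """True when two bboxes overlap (touching edges count as overlap)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def prune_pages(page_bboxes: List[BBox], query: BBox) -> List[int]:
    """Return indices of pages whose covering bbox may intersect the query.

    Pages whose bbox misses the query window are skipped without being
    decoded; the surviving pages must still be filtered exactly, since a
    bbox overlap does not guarantee the geometries themselves intersect.
    """
    return [i for i, bbox in enumerate(page_bboxes) if intersects(bbox, query)]

# Toy example: 4 pages, two near the query window.
pages = [(0, 0, 1, 1), (10, 10, 11, 11), (0.5, 0.5, 2, 2), (-5, -5, -4, -4)]
print(prune_pages(pages, (0.9, 0.9, 1.5, 1.5)))  # -> [0, 2]
```

A selective query like the one in the benchmark benefits exactly when co-located geometries share pages, so most page bboxes miss the query window and are never decoded.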
> > > > >
> > > > > If anyone has any additional requests for benchmarks, or for
> > > > > information on the benchmarks provided, please let me know!
> > > > >
> > > > > Thanks,
> > > > > -Blake Orth
