> Andrew, regarding your recommendation on how to drive this forward, which I
> believe I should have time to do. Is the request to effectively modify a
> parquet library (I'd use arrow-rs) to actually write and read spatial
> statistics in the page index? I don't expect readers here to have reviewed
> the code I've already provided, but I'm trying to understand how your
> suggestion differs from the benchmarks I've already explored.
Yes, I suggest modifying an existing system with the proposed changes so that:

1. It is 100% clear what is proposed (we can review the code)
2. We don't have to speculate about the potential costs / benefits (we can
instead measure them directly with the proposal)

On Thu, Feb 26, 2026 at 4:33 PM Blake Orth <[email protected]> wrote:

> Thanks for the engagement, everyone. I'm glad there's generally some
> interest in exploring this idea.
>
> Arnav, to address your questions I'll actually address the 2nd one first,
> as it adds some overall context to the discussion.
>
> 2. Dataset:
> As you noted, the test fixture was derived from Overture Maps Foundation
> GeoParquet data. It is a single file selected from the building footprints
> partition. This means it represents a collection of 2D polygons and their
> associated properties. The processing dropped the GeoParquet-specific data
> (key-value metadata and dedicated bbox "covering columns") and wrote a new
> Parquet file containing the Parquet geometry metadata. All the processing
> was done using GDAL with its default settings. Given GDAL's overwhelming
> prevalence in processing spatial data, we believe this is a good
> representation of what we expect to see in real-world use cases. The
> output file maintains the row ordering of the input. This is worth noting
> because Overture data is internally partitioned, placing co-located
> geometries into the same Row Group (I believe based on their level 3
> GeoHash); however, the rows are not "sorted" in a traditional sense. GDAL
> defaults to blindly truncating Row Groups to 65,536 rows. All this is to
> say that while the test fixture is generally "well formed" spatially, it
> doesn't represent a file optimized for page-level pruning.
>
> 1. Storage overhead/compression ratio:
> I don't have specific measurements now, but I can provide exact numbers
> for this case if needed.
> I noted that the page statistics are "simulated" because I don't actually
> have a prototype implementation that writes them to the file. This initial
> effort just collects the data that would exist in a page index (a covering
> bbox) for each page and stores it for use during a scan. For discussion's
> sake, we can do some quick "napkin math" in the meantime. The current
> spatial statistics bbox is 4 required doubles and 4 optional doubles.
> Unless I'm mistaken, this should yield 32 to 64 bytes per page in the
> metadata. This test fixture has 2D polygons, so it will use the 32-byte
> bbox. Approximately 1,800 pages would result in 57,600 bytes. For a file
> that's about 1.1GB, adding the bbox to the page statistics would increase
> the file size by about 0.005%. The "geometry" column accounts for the bulk
> of the file's compressed size. Row Group statistics suggest the geometry
> column's compressed size is generally between 8MB and 9MB per group. With
> 104 Row Groups in the file, it's probably safe to assume about 850MB of
> compressed geometry data. Again, if we want more exact numbers, let me
> know and I can provide them in a follow-up.
>
> I think Andrew's points are important here: the writer primarily drives
> the effectiveness of page-level statistics, both in terms of compression
> ratio and pruning potential. I don't feel this is a geospatial-specific
> statement either, though it's probably more applicable to columns
> represented as Binary than to primitive types.
>
> Andrew, regarding your recommendation on how to drive this forward, which
> I believe I should have time to do. Is the request to effectively modify a
> parquet library (I'd use arrow-rs) to actually write and read spatial
> statistics in the page index? I don't expect readers here to have reviewed
> the code I've already provided, but I'm trying to understand how your
> suggestion differs from the benchmarks I've already explored.
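As a sanity check on the napkin math above, here is a quick sketch using only the figures stated in this thread (the 32-byte figure assumes just the 4 required doubles of the bbox are written):

```python
# Napkin math for per-page bbox overhead; all inputs are figures from the thread.
BYTES_PER_DOUBLE = 8
required_doubles = 4                                # xmin, xmax, ymin, ymax
bbox_bytes = required_doubles * BYTES_PER_DOUBLE    # 32 bytes for 2D data

pages = 1_800                                       # ~1,800 pages in the geometry column
total_overhead = pages * bbox_bytes                 # 57,600 bytes of added metadata

file_size = 1.1e9                                   # ~1.1 GB fixture
overhead_pct = 100 * total_overhead / file_size

print(f"{bbox_bytes} B/page, {total_overhead} B total, {overhead_pct:.4f}% of file")
# -> 32 B/page, 57600 B total, 0.0052% of file
```

So the added metadata works out to roughly 0.005% of the file, i.e. negligible next to the ~850MB of compressed geometry data.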
>
> -Blake
>
> On Thu, Feb 26, 2026 at 4:49 AM Andrew Lamb <[email protected]>
> wrote:
>
> > Personally I think having page-level statistics is a good idea, and I
> > am not sure we need to do a lot more empirical evaluation before doing
> > a POC.
> >
> > I think the overhead of page-level statistics will depend almost
> > entirely on how you configure the Parquet writer. For example, if you
> > configure the writer for pages with 10,000 GEO points, the overhead of
> > statistics will be much lower than if you configure the writer with
> > pages that have 100 points.
> >
> > The performance benefit of having such indexes will depend heavily on
> > how the data is distributed among the pages and what the query
> > predicate is, which will determine how much pruning is effective.
> >
> > I am not surprised there are reasonable use cases where page-level
> > statistics make a substantial difference (which is what Blake appears
> > to have shown with the benchmark).
> >
> > My personal suggestion for anyone who is interested in driving this
> > forward is:
> > 1. Create a proof of concept (add the relevant statistics to the
> > PageIndex)
> > 2. Demonstrate real-world performance gains in a plausible benchmark
> >
> > Ideally the proof of concept would be enough to run the other
> > experiments suggested by Gang and Arnav.
> >
> > Andrew
> >
> > On Thu, Feb 26, 2026 at 4:37 AM Gang Wu <[email protected]> wrote:
> >
> > > Thanks Blake for bringing this up!
> > >
> > > When we were adding the geospatial logical types, page-level geo
> > > stats were deliberately not added, to avoid storage bloat before
> > > real-world use cases appear.
> > >
> > > I agree with Arnav that we may need more concrete data to justify it.
> > >
> > > Best,
> > > Gang
> > >
> > > On Thu, Feb 26, 2026 at 5:03 PM Arnav Balyan <[email protected]>
> > > wrote:
> > >
> > > > Hi Blake,
> > > >
> > > > Thanks for sharing the benchmarks, the results look quite compelling.
> > > > Page level pruning seems like a promising direction; I had a couple
> > > > of questions:
> > > >
> > > > 1. Storage overhead/compression ratio:
> > > > Do you have measurements on the storage overhead introduced by page
> > > > level stats in this benchmark? In particular, for the 1,800 pages in
> > > > the geometry column:
> > > >
> > > > - What was the approximate per-page metadata size?
> > > > - Did you observe any impact on compression ratio/file size compared
> > > >   to the baseline?
> > > >
> > > > 2. Dataset:
> > > > Could you share more details on how the test fixture was derived? It
> > > > appears to be based on an Overture dataset; it would be helpful to
> > > > understand:
> > > >
> > > > - What themes the data was drawn from (buildings, places,
> > > >   transportation, etc.)
> > > > - Does this represent a specific region, and what mix of geometry
> > > >   types is present?
> > > >
> > > > Additionally, have you considered evaluating this on other real-world
> > > > datasets like OpenStreetMap to understand how the performance varies
> > > > on different data/spatial characteristics?
> > > >
> > > > Thanks,
> > > > Arnav
> > > >
> > > > On Tue, Feb 24, 2026 at 6:43 AM Blake Orth <[email protected]> wrote:
> > > >
> > > > > Hello all,
> > > > >
> > > > > I would like to start a discussion on allowing Page level
> > > > > statistics for the new GEO column types.
> > > > >
> > > > > If I understand correctly, the discussion during the formalization
> > > > > of GEO types initially included page-level statistics. However, the
> > > > > decision was made to only allow Row Group level statistics because
> > > > > there was no compelling evidence that Page statistics would improve
> > > > > query performance enough to offset their potential impact on file
> > > > > size.
> > > > > Some discussions with other members of the GeoParquet community
> > > > > prompted me to build some benchmarks exploring the effect a
> > > > > specialized spatial index could have on query performance. The
> > > > > benchmarks explored the differences between three cases: a base
> > > > > Parquet file that allows Row Group level pruning, a case simulating
> > > > > Page level pruning using simple flat statistics (like the standard
> > > > > Parquet page stats structures), and finally a case simulating Page
> > > > > level pruning using a specialized spatial index. The simple flat
> > > > > statistics performed nearly identically to the spatial index, and
> > > > > allowing page-level pruning improved query performance by almost 2x
> > > > > over the base Row Group pruning. Considering these results, we felt
> > > > > that pursuing a specialized index specifically for GeoParquet is
> > > > > likely unnecessary: allowing page-level statistics for GEO columns
> > > > > alone shows meaningful query performance gains.
> > > > >
> > > > > The source to reproduce the benchmarks can be found here, with some
> > > > > simple instructions in the README on how to obtain the test fixture
> > > > > and run the benchmarks:
> > > > > https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks
> > > > >
> > > > > The benchmarks leverage a modified version of GeoDatafusion to
> > > > > compute a relatively selective geometry intersection query,
> > > > > filtering approximately 3,000 geometries from a test fixture
> > > > > containing over 10,000,000 rows. The file itself is approximately
> > > > > 1.1GB in size and has just over 1,800 pages in its geometry column.
> > > > > In this case, allowing statistics on those pages should represent
> > > > > minimal file size overhead.
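The kind of page-level pruning being simulated reduces to a bbox-overlap test: keep only the pages whose covering bbox could intersect the query window. A minimal sketch, assuming each page's statistics reduce to a covering bbox (names are illustrative, not arrow-rs or GeoDatafusion API):

```python
# Illustrative bbox-based page pruning; not any library's actual API.
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

def intersects(a: BBox, b: BBox) -> bool:
    """True when two bboxes overlap (touching edges count as overlap)."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def prune_pages(page_bboxes: List[BBox], query: BBox) -> List[int]:
    """Return indices of pages whose covering bbox may intersect the query.

    Pages whose bbox misses the query window are skipped without being
    decoded; the surviving pages must still be filtered exactly, since a
    bbox overlap does not guarantee the geometries themselves intersect.
    """
    return [i for i, bbox in enumerate(page_bboxes) if intersects(bbox, query)]

# Toy example: 4 pages, two near the query window.
pages = [(0, 0, 1, 1), (10, 10, 11, 11), (0.5, 0.5, 2, 2), (-5, -5, -4, -4)]
print(prune_pages(pages, (0.9, 0.9, 1.5, 1.5)))  # -> [0, 2]
```

A selective query like the one in the benchmark benefits exactly when co-located geometries share pages, so most page bboxes miss the query window and are never decoded.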
> > > > >
> > > > > If anyone has any additional requests for benchmarks, or for
> > > > > information on the benchmarks provided, please let me know!
> > > > >
> > > > > Thanks,
> > > > > -Blake Orth
