Hello all,

I would like to start a discussion on allowing Page level statistics for
the new GEO column types.

If I understand correctly, the discussion during the formalization of GEO
types initially included page-level statistics. However, the decision was
made to only allow Row Group level statistics because there was no
compelling evidence that Page statistics would meaningfully impact query
performance enough to offset their potential impact on file size. Some
discussions with other members of the GeoParquet community prompted me to
build some benchmarks exploring the effect a specialized spatial index
could have on query performance. The benchmarks explored the
differences between three cases: a base Parquet file that allows Row Group
level pruning, a case simulating Page level pruning using simple flat
statistics (like the standard Parquet page stats structures), and finally a
case simulating Page level pruning using a specialized spatial index. The
simple flat statistics performed nearly identically to the spatial index,
and allowing page-level pruning improved query performance by almost 2x
over the base Row Group pruning. Given these results, we feel that
pursuing a specialized index specifically for GeoParquet is likely
unnecessary, while allowing page-level statistics for GEO columns shows
meaningful query performance gains.
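
For readers less familiar with the idea, here is a minimal sketch of what
pruning with flat page-level statistics amounts to. It is not the benchmark
code: the BBox type and the pages_to_read helper are hypothetical names, and
it assumes each page carries a single 2D bounding box for its geometries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical per-page bounding box (2D only, for illustration)."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other: "BBox") -> bool:
        # Two axis-aligned boxes overlap when they overlap on both axes.
        return (
            self.xmin <= other.xmax
            and other.xmin <= self.xmax
            and self.ymin <= other.ymax
            and other.ymin <= self.ymax
        )


def pages_to_read(page_bboxes: list[BBox], query: BBox) -> list[int]:
    """Return indices of pages whose bounding box overlaps the query box.

    Pages that do not overlap cannot contain a matching geometry, so a
    reader could skip decoding them entirely.
    """
    return [i for i, bbox in enumerate(page_bboxes) if bbox.intersects(query)]
```

A flat scan like this is linear in the number of pages, which may be part of
why a specialized index had little room to improve on it at this page count.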

The source to reproduce the benchmarks can be found here, with some simple
instructions in the README on how to obtain the test fixture and run the
benchmarks:
https://github.com/BlakeOrth/geodatafusion/tree/feature/benchmarks

The benchmarks leverage a modified version of GeoDatafusion to compute a
relatively selective geometry intersection query, filtering approximately
3,000 geometries from a test fixture containing over 10,000,000 rows. The
file itself is approximately 1.1 GB in size and has just over 1,800 pages in
its geometry column. In this case, allowing statistics on those pages
should represent minimal file size overhead.

If anyone has any additional requests for benchmarks or information on the
benchmarks provided, please let me know!

Thanks,
-Blake Orth
