jiayuasu opened a new issue, #2949:
URL: https://github.com/apache/sedona/issues/2949

   Follow-up to #2938.
   
   ## Scope
   
   The current Box2D filter pushdown (`Box2DLeafFilter`) prunes files using the 
*geometry column's* recorded bbox, on the assumption that per-row Box2D values 
equal per-row geometry envelopes. This is sound for Box2D columns produced by 
`ST_Box2D(geom)` (Sedona's writer, and most users' workflows). It is **not 
sound** when the covering Box2D column is conservatively wider than the 
geometry, which the GeoParquet 1.1 spec permits (e.g., `apache/sedona-db`'s 
Float32 writer uses `next_after` rounding): a file can then be pruned even 
though it contains rows whose Box2D intersects the query.
   
   This issue tracks the proper fix: prune using Parquet column statistics for 
the Box2D struct's `xmin/ymin/xmax/ymax` nested fields, which give a tight 
file-level bound on the Box2D values themselves regardless of how they relate 
to the geometry.
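
   To make the unsoundness concrete, here is a minimal, self-contained Python sketch. It is not Sedona or `sedona-db` code: the helper names (`to_f32`, `widen_to_f32`, `intersects`) and the exact outward-rounding rule are assumptions chosen for illustration. It shows a point whose float32-widened Box2D intersects a query box that the exact geometry bbox misses.

```python
import struct

def to_f32(x: float) -> float:
    """Round a float64 value to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def f32_next_up(x: float) -> float:
    """Next float32 above x; x must already be a finite float32 value."""
    if x == 0.0:
        # Smallest positive subnormal float32.
        return struct.unpack("<f", struct.pack("<I", 1))[0]
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Positive: larger bit pattern is a larger value.
    # Negative: smaller bit pattern is a smaller magnitude, i.e. a larger value.
    bits += 1 if x > 0 else -1
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def f32_next_down(x: float) -> float:
    return -f32_next_up(-x)

def widen_to_f32(box):
    """Hypothetical conservative writer: round mins down and maxes up to float32."""
    xmin, ymin, xmax, ymax = box

    def down(v):
        r = to_f32(v)
        return r if r <= v else f32_next_down(r)

    def up(v):
        r = to_f32(v)
        return r if r >= v else f32_next_up(r)

    return (down(xmin), down(ymin), up(xmax), up(ymax))

def intersects(a, b) -> bool:
    """Closed-interval intersection of two (xmin, ymin, xmax, ymax) boxes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

# A single point geometry at (0.1, 0.1): its exact envelope is degenerate.
geom_bbox = (0.1, 0.1, 0.1, 0.1)
box2d = widen_to_f32(geom_bbox)   # what the covering column would store

# Query box starting just to the right of the point in float64.
query = (0.100000001, 0.0, 0.2, 0.2)

prune_by_geom_bbox = not intersects(geom_bbox, query)  # file would be skipped
row_matches_box2d = intersects(box2d, query)           # but the row qualifies
```

   In this sketch the file would be skipped on the geometry bbox even though the stored Box2D value satisfies the predicate, which is the case this issue is meant to close.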
   
   ## Implementation outline
   
   - Extend the GeoParquet read path to expose per-file (and ideally 
per-row-group) statistics for the Box2D column's nested float/double fields.
   - Plumb those statistics into `Box2DLeafFilter.evaluate`, or replace 
`Box2DLeafFilter` with a stats-aware variant.
   - The pruning logic itself doesn't change: intersect the file-level union 
Box2D with the query Box2D.
   - Once this lands, the `spark.sedona.geoparquet.box2dFilterPushDown` opt-out 
conf added in #2938 can default to "always on" or be removed.
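
   The outline above can be sketched in a few lines of Python. This is an illustration of the pruning rule only, under stated assumptions: the `MinMax` type, the per-row-group dictionaries keyed by `xmin/ymin/xmax/ymax`, and the function names are hypothetical, and real statistics would come from the Parquet footer rather than hand-written values. Missing statistics are treated as "cannot prune".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass(frozen=True)
class MinMax:
    """Min/max statistics of one nested field in one row group (hypothetical)."""
    min: float
    max: float

def union_box(stats: Dict[str, MinMax]) -> Optional[Box]:
    """Union Box2D of a row group: min of the mins and max of the maxes."""
    try:
        return (stats["xmin"].min, stats["ymin"].min,
                stats["xmax"].max, stats["ymax"].max)
    except KeyError:
        return None  # statistics missing for some field

def intersects(a: Box, b: Box) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def row_group_may_match(stats: Dict[str, MinMax], query: Box) -> bool:
    box = union_box(stats)
    # Without statistics we must keep the row group to stay sound.
    return True if box is None else intersects(box, query)

def file_may_match(row_groups: List[Dict[str, MinMax]], query: Box) -> bool:
    """A file is kept if any of its row groups may contain a matching row."""
    return any(row_group_may_match(rg, query) for rg in row_groups)

# Two row groups with Box2D statistics, far apart from each other.
rg_west = {"xmin": MinMax(-10.0, -5.0), "ymin": MinMax(0.0, 2.0),
           "xmax": MinMax(-9.0, -4.0), "ymax": MinMax(1.0, 3.0)}
rg_east = {"xmin": MinMax(50.0, 55.0), "ymin": MinMax(0.0, 2.0),
           "xmax": MinMax(51.0, 56.0), "ymax": MinMax(1.0, 3.0)}

query = (52.0, 0.5, 53.0, 1.5)
keep_west = row_group_may_match(rg_west, query)
keep_east = row_group_may_match(rg_east, query)
keep_file = file_may_match([rg_west, rg_east], query)
```

   Because the bound is derived from the stored Box2D values themselves, it holds no matter how much wider those values are than the geometries they cover.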
   
   ## Why deferred
   
   Parquet column statistics for nested struct fields require working with the 
Parquet `FooterFiles` API and the Spark `ParquetFileFormat` internals, which is 
a chunkier change than the recognition logic in #2938. Better to ship the 
SQL-surface recognition first (which covers the common case soundly) and follow 
up with the universal soundness fix.

