huan233usc opened a new pull request, #16650:
URL: https://github.com/apache/iceberg/pull/16650

   > [!IMPORTANT]
   > WIP / Draft. **Stacked on top of #16607** (`API: Single-value binary 
serialization for geometry and geography`). The first commit of this PR is 
from #16607; please review only the most recent commit (`Spark 4.1: WIP 
geometry / geography end-to-end support`) here. This PR will be rebased / 
re-targeted once #16607 lands.
   
   ## Summary
   
   Wires up Iceberg's geometry and geography types end-to-end on Spark 4.1, so 
`CREATE TABLE / INSERT / SELECT / DELETE` work over Parquet with the 
in-progress geospatial column types.
   
   The change has three layers:
   
   1. **Parquet schema mapping** — round-trip `GEOMETRY` / `GEOGRAPHY` through 
`LogicalTypeAnnotation.geometryType` / `geographyType` (`TypeToMessageType`, 
`MessageTypeToType`).
   2. **Generic Parquet RW** — `BaseParquetReaders` / `BaseParquetWriter` 
surface WKB as `ByteBuffer` for the engine-agnostic data path.
   3. **Spark 4.1 integration**:
      - `TypeToSparkType` / `SparkTypeToType` bridge Iceberg ↔ Spark 
`GeometryType` / `GeographyType`, preserving CRS.
      - `PruneColumnsWithoutReordering` accepts geo column types during schema 
pruning.
      - `SparkParquetReaders` / `SparkParquetWriters` translate between Spark's 
internal `GeometryVal` / `GeographyVal` (4-byte little-endian SRID header + 
WKB) and the pure-WKB Parquet representation. SRID is derived per-column from 
the CRS via Spark's `CartesianSpatialReferenceSystemMapper` / 
`GeographicSpatialReferenceSystemMapper`.
      - `ParquetMetrics` skips lexicographic `bounds()` for geo columns 
(`Comparators.forType` defines no ordering for spatial WKB) and reports only 
value / null counts. Real spatial bounding-box stats (X:Y:Z:M) will be plumbed 
through `FieldMetrics` in a follow-up.
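   
   For reviewers unfamiliar with the byte layout, the reader / writer translation described in layer 3 can be sketched roughly as below. This is an illustration only, not the code in this PR: `SridWkbSketch` and its methods are hypothetical names, and it assumes the Spark-side value is exactly a 4-byte little-endian SRID followed by the unmodified WKB payload, as stated above. The CRS → SRID lookup is left out.
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   import java.util.Arrays;

   // Hypothetical helper for illustration; not a class from Iceberg or Spark.
   public final class SridWkbSketch {
     private static final int SRID_HEADER_BYTES = 4;

     private SridWkbSketch() {}

     /** Read path: prepend a 4-byte little-endian SRID header to pure WKB. */
     public static byte[] attachSrid(int srid, byte[] wkb) {
       return ByteBuffer.allocate(SRID_HEADER_BYTES + wkb.length)
           .order(ByteOrder.LITTLE_ENDIAN)
           .putInt(srid)
           .put(wkb)
           .array();
     }

     /** Returns the SRID stored in the first four bytes of the value. */
     public static int srid(byte[] value) {
       requireHeader(value);
       return ByteBuffer.wrap(value).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
     }

     /** Write path: drop the SRID header and keep only the WKB payload. */
     public static byte[] stripSrid(byte[] value) {
       requireHeader(value);
       return Arrays.copyOfRange(value, SRID_HEADER_BYTES, value.length);
     }

     /** Encodes a 2D point as little-endian WKB (21 bytes), used as sample data. */
     public static byte[] wkbPoint(double x, double y) {
       return ByteBuffer.allocate(1 + 4 + 8 + 8)
           .order(ByteOrder.LITTLE_ENDIAN)
           .put((byte) 1) // byte-order marker: 1 = little-endian
           .putInt(1) // WKB geometry type: 1 = Point
           .putDouble(x)
           .putDouble(y)
           .array();
     }

     private static void requireHeader(byte[] value) {
       if (value.length < SRID_HEADER_BYTES) {
         throw new IllegalArgumentException("value is shorter than the SRID header");
       }
     }
   }
   ```
   
   With this layout, `stripSrid(attachSrid(srid, wkb))` returns the original WKB bytes, which is the property the Parquet round-trip relies on. The SRID itself is not stored in the data file and has to be re-derived from the column's CRS on read.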
   
   ## What's intentionally out of scope (follow-ups)
   
   - Spatial bounding-box statistics (`GeospatialBound` X:Y:Z:M) plumbed 
through `FieldMetrics` and used for partition / file-skipping pruning.
   - Vectorized reads — geo columns force `read.parquet.vectorization.enabled = 
false` per-table for now.
   - Topological predicate pushdown (`ST_Intersects`, `ST_Within`, ...) — not 
part of stock Spark 4.1.
   - Flink / Pig / Hive / Spark 3.5 wiring.
   - ORC / Avro — Parquet-only in this PR.
   
   ## Test plan
   
   `TestSparkGeoTypes` (Spark 4.1, 11 cases):
   
   - [x] `testGeometryRoundTrip` — flat `GEOMETRY` round-trip
   - [x] `testGeographyRoundTrip` — flat `GEOGRAPHY` round-trip
   - [x] `testSridFilterRoundtrip` — `ST_Srid(geom) = N` predicate validates 
CRS → SRID re-attachment on read
   - [x] `testDeleteWithDeletionVector` — v3 + MoR DELETE produces a Puffin DV 
(`added-dvs=1`, `added-position-deletes=1`)
   - [x] `testNullGeometryValue` — NULL geometry rows mixed with non-NULL in 
the same data file
   - [x] `testMultipleGeoColumnsInOneTable` — `GEOMETRY` + `GEOGRAPHY` side by 
side, per-column CRS metadata honored
   - [x] `testStructWithGeometry` — `STRUCT<eid, loc: GEOMETRY>`
   - [x] `testArrayOfGeometry` — `ARRAY<GEOMETRY>`
   - [x] `testMapOfGeometry` — `MAP<STRING, GEOMETRY>`
   - [x] `testStructOfArrayOfGeometry` — `STRUCT<tid, points: ARRAY<GEOMETRY>>`
   - [x] `testDeleteWithDeletionVectorOnNestedGeometry` — DV path on a table 
where geometry sits inside a struct
   
   ```
   TestSparkGeoTypes: 11 tests, 0 failures (~9.5s locally)
   ```
   
   `./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:spotlessCheck 
:iceberg-parquet:spotlessCheck` is clean.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

