huan233usc opened a new pull request, #16650:
URL: https://github.com/apache/iceberg/pull/16650
> [!IMPORTANT]
> WIP / Draft. **Stacked on top of #16607** (`API: Single-value binary
> serialization for geometry and geography`). The first commit of this PR comes
> from #16607; please review only the most recent commit (`Spark 4.1: WIP
> geometry / geography end-to-end support`). This PR will be rebased and
> retargeted once #16607 lands.
## Summary
Wires up Iceberg's geometry and geography types end-to-end on Spark 4.1, so
`CREATE TABLE / INSERT / SELECT / DELETE` work over Parquet with the
in-progress geospatial column types.
The change has three layers:
1. **Parquet schema mapping** — round-trip `GEOMETRY` / `GEOGRAPHY` through
`LogicalTypeAnnotation.geometryType` / `geographyType` (`TypeToMessageType`,
`MessageTypeToType`).
2. **Generic Parquet RW** — `BaseParquetReaders` / `BaseParquetWriter`
surface WKB as `ByteBuffer` for the engine-agnostic data path.
3. **Spark 4.1 integration**:
   - `TypeToSparkType` / `SparkTypeToType` bridge Iceberg ↔ Spark
     `GeometryType` / `GeographyType`, preserving CRS.
   - `PruneColumnsWithoutReordering` accepts geo column types during schema
     pruning.
   - `SparkParquetReaders` / `SparkParquetWriters` translate between Spark's
     internal `GeometryVal` / `GeographyVal` (4-byte little-endian SRID header +
     WKB) and the pure-WKB Parquet representation. SRID is derived per-column from
     the CRS via Spark's `CartesianSpatialReferenceSystemMapper` /
     `GeographicSpatialReferenceSystemMapper`.
   - `ParquetMetrics` skips lexicographic `bounds()` for geo columns
     (`Comparators.forType` defines no ordering for spatial WKB) and reports only
     value / null counts. Real spatial bounding-box stats (X:Y:Z:M) will be
     plumbed through `FieldMetrics` in a follow-up.
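
For reviewers, the reader / writer translation in the Spark layer can be sketched roughly as below. This is an illustration only, not the code in this PR: `GeoValFraming` and its methods are hypothetical names, and the sketch assumes nothing beyond the layout stated above (a 4-byte little-endian SRID followed by the WKB payload).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Iceberg or Spark. */
final class GeoValFraming {
  static final int SRID_HEADER_BYTES = 4;

  private GeoValFraming() {}

  /** Write path: drop the SRID header so only pure WKB reaches Parquet. */
  static byte[] stripSrid(byte[] sparkValue) {
    requireHeader(sparkValue);
    return Arrays.copyOfRange(sparkValue, SRID_HEADER_BYTES, sparkValue.length);
  }

  /** Read path: prepend the column's SRID (derived from its CRS) to the WKB. */
  static byte[] attachSrid(int srid, byte[] wkb) {
    return ByteBuffer.allocate(SRID_HEADER_BYTES + wkb.length)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putInt(srid)
        .put(wkb)
        .array();
  }

  /** Decode the little-endian SRID stored in the first four bytes. */
  static int readSrid(byte[] sparkValue) {
    requireHeader(sparkValue);
    return ByteBuffer.wrap(sparkValue).order(ByteOrder.LITTLE_ENDIAN).getInt(0);
  }

  private static void requireHeader(byte[] value) {
    if (value.length < SRID_HEADER_BYTES) {
      throw new IllegalArgumentException("value is shorter than the SRID header");
    }
  }
}
```

Because the SRID is a property of the column rather than of each row, the read path can resolve it once per column and reuse it for every value.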
## What's intentionally out of scope (follow-ups)
- Spatial bounding-box statistics (`GeospatialBound` X:Y:Z:M) plumbed
through `FieldMetrics` and used for partition / file-skipping pruning.
- Vectorized reads — geo columns force `read.parquet.vectorization.enabled =
false` per-table for now.
- Topological predicate pushdown (`ST_Intersects`, `ST_Within`, ...) — not
part of stock Spark 4.1.
- Flink / Pig / Hive / Spark 3.5 wiring.
- ORC / Avro — Parquet-only in this PR.
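
To make the deferred bounding-box item concrete, here is a minimal sketch of the kind of accumulation it implies, restricted to 2D WKB `Point` values. It is not part of this PR and does not use `GeospatialBound` or `FieldMetrics`; `WkbPointBounds` is a hypothetical class, and Z / M dimensions and other geometry types are deliberately left out.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical accumulator for illustration; handles 2D WKB Points only. */
final class WkbPointBounds {
  private double minX = Double.POSITIVE_INFINITY;
  private double minY = Double.POSITIVE_INFINITY;
  private double maxX = Double.NEGATIVE_INFINITY;
  private double maxY = Double.NEGATIVE_INFINITY;

  /** Fold one WKB Point into the running bounding box. */
  void add(byte[] wkb) {
    ByteBuffer buf = ByteBuffer.wrap(wkb);
    // WKB byte-order flag: 1 = little-endian, 0 = big-endian.
    buf.order(buf.get() == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
    int type = buf.getInt();
    if (type != 1) { // 1 = 2D Point in WKB
      throw new IllegalArgumentException("sketch handles only 2D Point, got type " + type);
    }
    double x = buf.getDouble();
    double y = buf.getDouble();
    if (Double.isNaN(x) || Double.isNaN(y)) {
      return; // an empty point is encoded with NaN coordinates; it adds nothing
    }
    minX = Math.min(minX, x);
    maxX = Math.max(maxX, x);
    minY = Math.min(minY, y);
    maxY = Math.max(maxY, y);
  }

  double minX() { return minX; }
  double minY() { return minY; }
  double maxX() { return maxX; }
  double maxY() { return maxY; }

  /** Test helper: encode a little-endian 2D WKB Point (1 + 4 + 8 + 8 bytes). */
  static byte[] point(double x, double y) {
    return ByteBuffer.allocate(21)
        .order(ByteOrder.LITTLE_ENDIAN)
        .put((byte) 1)
        .putInt(1)
        .putDouble(x)
        .putDouble(y)
        .array();
  }
}
```

The bounds come from decoded coordinates, not from comparing the encoded bytes, which is why the byte-wise `bounds()` path is skipped for geo columns in the meantime.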
## Test plan
`TestSparkGeoTypes` (Spark 4.1, 11 cases):
- [x] `testGeometryRoundTrip` — flat `GEOMETRY` round-trip
- [x] `testGeographyRoundTrip` — flat `GEOGRAPHY` round-trip
- [x] `testSridFilterRoundtrip` — `ST_Srid(geom) = N` predicate validates
CRS → SRID re-attachment on read
- [x] `testDeleteWithDeletionVector` — v3 + MoR DELETE produces a Puffin DV
(`added-dvs=1`, `added-position-deletes=1`)
- [x] `testNullGeometryValue` — NULL geometry rows mixed with non-NULL in
the same data file
- [x] `testMultipleGeoColumnsInOneTable` — `GEOMETRY` + `GEOGRAPHY` side by
side, per-column CRS metadata honored
- [x] `testStructWithGeometry` — `STRUCT<eid, loc: GEOMETRY>`
- [x] `testArrayOfGeometry` — `ARRAY<GEOMETRY>`
- [x] `testMapOfGeometry` — `MAP<STRING, GEOMETRY>`
- [x] `testStructOfArrayOfGeometry` — `STRUCT<tid, points: ARRAY<GEOMETRY>>`
- [x] `testDeleteWithDeletionVectorOnNestedGeometry` — DV path on a table
where geometry sits inside a struct
```
TestSparkGeoTypes: 11 tests, 0 failures (~9.5s locally)
```
`./gradlew :iceberg-spark:iceberg-spark-4.1_2.13:spotlessCheck
:iceberg-parquet:spotlessCheck` is clean.