huan233usc opened a new pull request, #17073:
URL: https://github.com/apache/iceberg/pull/17073
Follow-up to the geo type work: the Spark type mapping (#16851) and Iceberg's
own Parquet value path (#16982) are in place, but the Spark Parquet
reader/writer did not handle geometry/geography values.
Geometry and geography columns carry a Parquet `LogicalTypeAnnotation` with no
legacy `OriginalType`. `SparkParquetReaders` and `SparkParquetWriters` dispatch
on the `OriginalType` / logical-type paths, neither of which had a geo case, so:
- the reader fell through to the physical `BINARY` case and returned a raw
`byte[]`, which is the wrong in-memory type for a geo column (Spark's
`InternalRow.getGeometry` / `getGeography` expect `GeometryVal` /
`GeographyVal`);
- the writer hit the unsupported-logical-type branch and threw.
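The reader-side fall-through described above can be modeled with a small, self-contained sketch. The enums and method names here (`Logical`, `Legacy`, `legacyOf`, `readerBefore`, `readerAfter`) are hypothetical stand-ins, not the parquet-java or Iceberg API; the sketch only illustrates why a logical type without a legacy counterpart ends up in the physical `BINARY` case until an explicit geo branch is added.

```java
import java.util.Optional;

final class GeoDispatchSketch {
  // Hypothetical stand-ins for Parquet's logical and legacy annotations.
  enum Logical { STRING, DECIMAL, GEOMETRY, GEOGRAPHY }
  enum Legacy { UTF8, DECIMAL }

  // Maps a logical type to its legacy type where one exists; geo types have none.
  static Optional<Legacy> legacyOf(Logical logical) {
    switch (logical) {
      case STRING:
        return Optional.of(Legacy.UTF8);
      case DECIMAL:
        return Optional.of(Legacy.DECIMAL);
      default:
        return Optional.empty();
    }
  }

  // Dispatch keyed on the legacy type only: geo finds no match and falls through.
  static String readerBefore(Logical logical) {
    Optional<Legacy> legacy = legacyOf(logical);
    if (legacy.isPresent()) {
      switch (legacy.get()) {
        case UTF8:
          return "string reader";
        case DECIMAL:
          return "decimal reader";
      }
    }
    return "raw byte[] reader"; // physical BINARY fallback
  }

  // Same dispatch with explicit geo branches checked first.
  static String readerAfter(Logical logical) {
    if (logical == Logical.GEOMETRY) {
      return "geometry reader";
    }
    if (logical == Logical.GEOGRAPHY) {
      return "geography reader";
    }
    return readerBefore(logical);
  }
}
```

In this model, `readerBefore(Logical.GEOMETRY)` selects the raw `byte[]` reader, while `readerAfter` routes the same column to a geo-specific reader and leaves every other type unchanged.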
This reads a WKB `BINARY` column into Spark's `GeometryVal` / `GeographyVal` and
writes those values back as their WKB bytes, mirroring the existing binary
handling. Geo values are stored as pure WKB, so no transformation is needed
beyond wrapping/unwrapping the byte payload.
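As a minimal sketch of that wrap/unwrap round trip, assuming a hypothetical `GeometryValue` holder in place of Spark's `GeometryVal` (whose actual API is not shown in this PR description), the read side wraps the stored bytes and the write side returns them unchanged. The `read`, `write`, and `pointWkb` helpers are illustrative names only. Nulls are modeled as Java `null` for simplicity; in Parquet they are carried by definition levels, not by the value bytes.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

final class GeoWkbSketch {
  // Hypothetical stand-in for Spark's GeometryVal: an opaque holder of WKB bytes.
  static final class GeometryValue {
    private final byte[] wkb;

    GeometryValue(byte[] wkb) {
      this.wkb = wkb.clone();
    }

    byte[] bytes() {
      return wkb.clone();
    }
  }

  // Read path: wrap the stored binary payload without parsing it.
  static GeometryValue read(byte[] parquetBinary) {
    return parquetBinary == null ? null : new GeometryValue(parquetBinary);
  }

  // Write path: unwrap the value back to the same WKB bytes.
  static byte[] write(GeometryValue value) {
    return value == null ? null : value.bytes();
  }

  // Builds a little-endian WKB point: byte order (1), geometry type (1 = Point), x, y.
  static byte[] pointWkb(double x, double y) {
    return ByteBuffer.allocate(21)
        .order(ByteOrder.LITTLE_ENDIAN)
        .put((byte) 1)
        .putInt(1)
        .putDouble(x)
        .putDouble(y)
        .array();
  }

  public static void main(String[] args) {
    byte[] wkb = pointWkb(1.0, 2.0);
    // The bytes survive the read/write round trip unchanged.
    System.out.println(Arrays.equals(wkb, write(read(wkb)))); // prints "true"
    // A null value stays null in both directions.
    System.out.println(write(read(null)) == null); // prints "true"
  }
}
```

Because the holder never inspects the payload, the same pattern would apply to geography values; only the wrapper type differs.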
Testing:
- Enables the shared geospatial `DataTest` coverage for the Spark Parquet
reader (`supportsGeospatial()`), exercising geometry and geography read
round-trips through `SparkParquetReaders`.
- Adds a Spark writer round-trip test (`TestSparkParquetWriter`) that writes
`GeometryVal` / `GeographyVal` through `SparkParquetWriters` and reads them
back, including null values.
Vectorized (Arrow) geo reads are out of scope and remain a follow-up.