nssalian opened a new pull request, #16292: URL: https://github.com/apache/iceberg/pull/16292
Follow-up to https://github.com/apache/iceberg/pull/16087: fixes vectorized read support for variant columns and removes the temporary patches.

## Rationale for this Change

Variant columns currently force the entire table into row-at-a-time reads because the vectorized reader doesn't handle them. This PR fixes that by reading a variant's metadata and value children as Arrow VarBinary batches.

## What changes are included in this PR?

- `VectorizedReaderBuilder` - adds `variantVisitor()`, which creates a `VectorizedVariantVisitor` scoped to each variant column's Parquet path
- `VectorizedVariantVisitor` - walks the variant's internal structure and creates Arrow readers for the metadata and value leaves
- `VectorizedArrowReader.VectorizedVariantReader` - composes the two child readers and delegates `read`/`setRowGroupInfo`/`setBatchSize`/`close` to them
- `VectorHolder.VariantVectorHolder` - carries both child holders through the batch pipeline
- `VariantColumnVector` (new) - Spark `ColumnVector` implementing `getChild(0)` = value, `getChild(1)` = metadata, per Spark's `getVariant()` contract
- `ColumnVectorBuilder` - dispatches on `VariantVectorHolder` before the `isDummy()` check
- `SparkBatch` - allows variant columns through the batch-read eligibility check
- Tests - removed the `assumeThat(vectorized).isFalse()` guards; all variant read tests now run with vectorization enabled
- Applies to both Spark 4.0 and 4.1

## Not covered (follow-up)

- Shredded variant fields are not read in vectorized mode.
- Variant inside structs/lists/maps still falls back to row-at-a-time reads (a pre-existing limitation for all complex types).

## Are these changes tested?

- `TestSparkVariantRead` (v4.0 + v4.1) - all tests now run with both `vectorized=false` and `vectorized=true`

## Are there any user-facing changes?

- When vectorization is enabled, variant columns are now read by the vectorized reader instead of forcing the table into row-at-a-time reads.
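The composition described for `VectorizedVariantReader` and the child ordering described for `VariantColumnVector` can be sketched roughly as below. This is a dependency-free illustration, not the PR's code: `LeafReader`, `InMemoryLeafReader`, `VariantBatch`, and `VariantReader` are hypothetical stand-ins that use plain `byte[]` lists in place of Arrow VarBinary vectors, and `setRowGroupInfo` is omitted. Only the delegation pattern and the `getChild(0)` = value, `getChild(1)` = metadata ordering are taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

final class VariantReaderSketch {
  /** Hypothetical stand-in for a batch reader over one Parquet leaf column. */
  interface LeafReader extends AutoCloseable {
    void setBatchSize(int batchSize);

    List<byte[]> read(int numRows);

    @Override
    void close();
  }

  /** Hypothetical leaf reader that serves binary values from memory. */
  static final class InMemoryLeafReader implements LeafReader {
    private final List<byte[]> values;
    private int position = 0;
    private int batchSize = 0;
    private boolean closed = false;

    InMemoryLeafReader(List<byte[]> values) {
      this.values = values;
    }

    @Override
    public void setBatchSize(int batchSize) {
      this.batchSize = batchSize;
    }

    @Override
    public List<byte[]> read(int numRows) {
      // Return at most batchSize rows, and never read past the end.
      int remaining = values.size() - position;
      int count = Math.min(Math.min(numRows, batchSize), remaining);
      List<byte[]> batch = new ArrayList<>(values.subList(position, position + count));
      position += count;
      return batch;
    }

    @Override
    public void close() {
      closed = true;
    }

    boolean isClosed() {
      return closed;
    }
  }

  /** Carries both child batches; ordinal 0 is value, ordinal 1 is metadata. */
  static final class VariantBatch {
    private final List<byte[]> value;
    private final List<byte[]> metadata;

    VariantBatch(List<byte[]> value, List<byte[]> metadata) {
      this.value = value;
      this.metadata = metadata;
    }

    List<byte[]> getChild(int ordinal) {
      switch (ordinal) {
        case 0:
          return value;
        case 1:
          return metadata;
        default:
          throw new IllegalArgumentException("Variant has two children, got ordinal " + ordinal);
      }
    }
  }

  /** Composes the metadata and value readers and delegates every call to both. */
  static final class VariantReader implements AutoCloseable {
    private final LeafReader metadataReader;
    private final LeafReader valueReader;

    VariantReader(LeafReader metadataReader, LeafReader valueReader) {
      this.metadataReader = metadataReader;
      this.valueReader = valueReader;
    }

    void setBatchSize(int batchSize) {
      metadataReader.setBatchSize(batchSize);
      valueReader.setBatchSize(batchSize);
    }

    VariantBatch read(int numRows) {
      // Both children advance together so row i of each batch lines up.
      return new VariantBatch(valueReader.read(numRows), metadataReader.read(numRows));
    }

    @Override
    public void close() {
      metadataReader.close();
      valueReader.close();
    }
  }
}
```

Reading both children with the same row count on every call keeps the metadata and value batches aligned row for row, which is what lets a consumer pair them up by index when materializing each variant.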
