voonhous opened a new pull request, #18938:
URL: https://github.com/apache/hudi/pull/18938
### Describe the issue this Pull Request addresses
Closes #18931.
Builds on #18065, which added variant shredding on the AVRO write path. That
PR left a fail-fast guard: when compaction or clustering read an
already-shredded base file through the AVRO record path, records arrived
shredded and the writer threw, because nothing reconstructed the unshredded
variant on read. This PR adds that read-side reconstruction and removes the
guard.
### Summary and Changelog
Reading a shredded variant base file via the AVRO record path now rebuilds
the unshredded `{metadata, value}` variant before records reach the
merger/writer, so compaction and clustering over shredded base files work. The
SPARK/InternalRow path is unchanged (Spark reconstructs variants natively).
- Add `VariantShreddingProvider.rebuildVariantRecord` (inverse of
`shredVariantRecord`). `Spark4VariantShreddingProvider` implements it using
Spark's `ShreddingUtils.rebuild` over an Avro-backed `ShreddedRow`, mirroring
the existing write-side `AvroShreddedResult`.
- `HoodieAvroParquetReader` detects shredded variant columns, reads them with
the file's shredded schema so `typed_value` is materialized, and reconstructs
each to the unshredded form per record (new `VariantReconstruction`). The
provider is resolved from `hoodie.parquet.variant.shredding.provider.class` or
auto-detected on the classpath. Reconstruction is gated on
`hoodie.parquet.variant.allow.reading.shredded`.
- Extract `stripVariantShredding` into a shared `VariantSchemaUtils` used by
both reader and writer.
- Remove the read-then-reshred guard (`assertInputNotAlreadyShredded`) from
`HoodieAvroWriteSupport` and its unit test.
- Extend the MOR compaction test in `TestVariantDataType` to write shredded,
compact, then read back, covering AVRO reconstruction and the SPARK native path
via `withRecordType`.
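
To illustrate the schema side of the changes listed above, here is a minimal sketch of detecting a shredded variant column and stripping it back to the unshredded `{metadata, value}` shape. It uses a toy `Field` model rather than Avro or Parquet schema classes, and `VariantSchemaSketch`, `Field`, and `isShreddedVariant` are hypothetical names for illustration. It assumes a shredded variant is a group carrying `metadata`, `value`, and `typed_value`, and it is not the actual `VariantSchemaUtils` implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy schema model for illustration only; the real utilities work on Avro/Parquet schemas.
final class VariantSchemaSketch {

  /** A named field: a group has children, a leaf has none. */
  static final class Field {
    final String name;
    final List<Field> children;

    Field(String name, Field... children) {
      this.name = name;
      this.children = Collections.unmodifiableList(Arrays.asList(children));
    }

    boolean hasChild(String childName) {
      for (Field c : children) {
        if (c.name.equals(childName)) {
          return true;
        }
      }
      return false;
    }
  }

  /** Assumption: a variant group is shredded when typed_value sits next to metadata and value. */
  static boolean isShreddedVariant(Field f) {
    return f.hasChild("metadata") && f.hasChild("value") && f.hasChild("typed_value");
  }

  /** Returns a copy of the schema in which shredded variant groups lose their typed_value child. */
  static Field stripVariantShredding(Field f) {
    boolean shredded = isShreddedVariant(f);
    List<Field> kept = new ArrayList<>();
    for (Field c : f.children) {
      if (shredded && c.name.equals("typed_value")) {
        continue; // drop the shredded part, keep {metadata, value}
      }
      kept.add(stripVariantShredding(c));
    }
    return new Field(f.name, kept.toArray(new Field[0]));
  }
}
```

In this sketch the reader would use the unstripped schema to materialize `typed_value`, while the stripped schema describes the records handed to the merger/writer after reconstruction.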
No code copied.
### Impact
AVRO record-type reads of shredded variant base files now return correct
unshredded variants instead of failing. No new configs: this PR reuses
`hoodie.parquet.variant.allow.reading.shredded` (default true) and
`hoodie.parquet.variant.shredding.provider.class`. No change for non-Spark
engines or the SPARK read path.
### Risk Level
Medium. Touches the AVRO base-file read path. Mitigations: reconstruction
only activates when the file actually has shredded variant columns and a
provider is available, otherwise reads proceed unchanged; it is gated by
`hoodie.parquet.variant.allow.reading.shredded`; the SPARK path is untouched.
Covered by the extended MOR compaction test (write shredded, compact, read
back) under both AVRO and SPARK record types.
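
The activation conditions described above can be sketched as a single predicate. This is a minimal illustration, not the reader's actual code: `ShreddedReadGateSketch` and `shouldReconstruct` are hypothetical names, and the two boolean inputs stand in for the reader's schema detection and provider lookup. Only the config key and its default of true come from this PR description.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of Hudi.
final class ShreddedReadGateSketch {

  static final String ALLOW_READING_SHREDDED = "hoodie.parquet.variant.allow.reading.shredded";

  /**
   * Reconstruction runs only when the config allows it, the file really has
   * shredded variant columns, and a shredding provider could be resolved.
   */
  static boolean shouldReconstruct(Properties props,
                                   boolean fileHasShreddedVariant,
                                   boolean providerAvailable) {
    // Default of "true" mirrors the default stated in the Impact section.
    boolean allowed = Boolean.parseBoolean(props.getProperty(ALLOW_READING_SHREDDED, "true"));
    return allowed && fileHasShreddedVariant && providerAvailable;
  }
}
```

With an empty `Properties`, a shredded file, and an available provider, the predicate is true; flipping any one of the three inputs makes it false, so this sketch's reconstruction step would be skipped.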
### Documentation Update
none
### Contributor's checklist
- [ ] Read through [contributor's
guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable