voonhous commented on code in PR #18938:
URL: https://github.com/apache/hudi/pull/18938#discussion_r3458800167
##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/dml/schema/TestVariantDataType.scala:
##########
@@ -142,6 +150,8 @@ class TestVariantDataType extends HoodieSparkSqlTestBase {
| primaryKey = 'id',
| type = 'mor',
| preCombineField = 'ts',
+ | hoodie.parquet.variant.write.shredding.enabled = 'true',
Review Comment:
`withRecordType` only swaps the record merger and log-block format, and
Spark compaction reads base files via `SparkReaderContextFactory` ->
`SparkFileFormatInternalRowReaderContext` (InternalRow), so
`HoodieAvroParquetReader` (where
`HoodieVariantReconstruction`/`rebuildVariantRecord` run) is never reached on
either record type.
Fixed in cc41b4f: added `TestSpark4VariantShreddingProvider`, a direct shred
-> rebuild round-trip over the provider covering scalar
(numeric/string/boolean/decimal), object, and array shapes - this exercises
`rebuildVariantRecord` and the `AvroVariantRow`/`AvroObjectRow`/`AvroArrayRow`
accessors. The two `create()` fail-fast branches are covered by
`TestHoodieVariantReconstruction` (added for your other comment). Also
corrected the misleading comment on the SQL test to say it covers the Spark
InternalRow (native) read of a shredded base file, not the Avro path.
Note on the Java-client suggestion: `Spark4VariantShreddingProvider` is the
only `VariantShreddingProvider`, so a Java/Flink compaction over a shredded
base file would hit the no-provider fail-fast branch rather than reconstruct -
which is why I went with the direct unit test for the reconstruction coverage.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]