[
https://issues.apache.org/jira/browse/HUDI-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
voon updated HUDI-7017:
-----------------------
Attachment: (was: image-2023-11-01-11-43-02-166.png)
> Prevent full schema evolution from wrongly falling back to OOB
> --------------------------------------------------------------
>
> Key: HUDI-7017
> URL: https://issues.apache.org/jira/browse/HUDI-7017
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: voon
> Assignee: voon
> Priority: Major
> Attachments: image-2023-11-01-11-41-25-604.png,
> image-2023-11-01-11-43-14-149.png
>
>
> For MOR tables with these two configurations enabled:
>
> {code:java}
> hoodie.schema.on.read.enable=true
> hoodie.datasource.read.extract.partition.values.from.path=true{code}
>
>
> BaseFileReader will use a *requiredSchemaReader* when reading some of the
> parquet files. This BaseFileReader carries an empty *internalSchemaStr*,
> causing *Spark3XLegacyHoodieParquetInputFormat* to fall back to out-of-box
> (OOB) schema evolution.
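>
> The intended fix can be sketched as a guard on the serialized internal
> schema (class and method names below are illustrative, not the actual Hudi
> API): when schema-on-read is enabled, an empty *internalSchemaStr* should be
> treated as an error rather than as a signal to fall back to OOB evolution.
> {code:java}
> // Hypothetical sketch of the guard, not the actual Hudi implementation.
> public class SchemaEvolutionGuard {
>   public enum EvolutionMode { FULL, OUT_OF_BOX }
>
>   // With hoodie.schema.on.read.enable=true, an empty serialized internal
>   // schema indicates a bug upstream; surface it instead of silently
>   // degrading to OOB schema evolution.
>   public static EvolutionMode resolve(boolean schemaOnReadEnabled,
>                                       String internalSchemaStr) {
>     if (!schemaOnReadEnabled) {
>       return EvolutionMode.OUT_OF_BOX;
>     }
>     if (internalSchemaStr == null || internalSchemaStr.isEmpty()) {
>       throw new IllegalStateException(
>           "hoodie.schema.on.read.enable=true but internalSchemaStr is "
>           + "empty; refusing to fall back to OOB schema evolution");
>     }
>     return EvolutionMode.FULL;
>   }
> }{code}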
>
> Although safeguards were added in HUDI-5400 to force the code execution
> path to use Hudi full schema evolution, we should still fix this so that
> future changes deprecating *Spark3XLegacyHoodieParquetInputFormat* do not
> cause issues.
>
> A sample test to invoke this:
> {code:java}
> test("Test wrong fallback to OOB schema evolution") {
>   withRecordType()(withTempDir { tmp =>
>     Seq("mor").foreach { tableType =>
>       val tableName = generateTableName
>       val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
>       if (HoodieSparkUtils.gteqSpark3_1) {
>         spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
>         spark.sql("set hoodie.schema.on.read.enable=true")
>         spark.sql("set hoodie.datasource.read.extract.partition.values.from.path=true")
>         // NOTE: This is required since this test uses type coercions which were
>         // only permitted in Spark 2.x and are disallowed by default in Spark 3.x
>         spark.sql("set spark.sql.storeAssignmentPolicy=legacy")
>         createAndPreparePartitionTable(spark, tableName, tablePath, tableType)
>         // date -> string
>         spark.sql(s"alter table $tableName alter column col6 type String")
>         checkAnswer(spark.sql(s"select col6 from $tableName where id = 1").collect())(
>           Seq("2021-12-25")
>         )
>       }
>     }
>   })
> } {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)