[ 
https://issues.apache.org/jira/browse/HUDI-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

voon reassigned HUDI-7017:
--------------------------

    Assignee: voon

> Prevent full schema evolution from wrongly falling back to OOB
> --------------------------------------------------------------
>
>                 Key: HUDI-7017
>                 URL: https://issues.apache.org/jira/browse/HUDI-7017
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: voon
>            Assignee: voon
>            Priority: Major
>
> For MOR tables that have these 2 configurations enabled:
>  
> {code:java}
> hoodie.schema.on.read.enable=true
> hoodie.datasource.read.extract.partition.values.from.path=true{code}
>  
> BaseFileReader will use a *requiredSchemaReader* when reading some of the 
> parquet files. This reader carries an empty *internalSchemaStr*, causing 
> *Spark3XLegacyHoodieParquetInputFormat* to fall back to out-of-box (OOB) 
> schema evolution instead of Hudi Full Schema Evolution.
>  
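> The fallback described above reduces to a guard on the serialized internal 
> schema. A minimal sketch of that decision (hypothetical names; this is not 
> Hudi's actual reader code):

```java
// Hypothetical sketch (names invented for illustration): models how an empty
// internalSchemaStr on the base file reader selects out-of-box (OOB) schema
// evolution instead of Hudi Full Schema Evolution.
public class SchemaEvolutionFallbackSketch {
    enum EvolutionMode { FULL_SCHEMA_EVOLUTION, OUT_OF_BOX }

    // The requiredSchemaReader path supplies an empty internalSchemaStr,
    // so this check wrongly takes the OOB branch.
    static EvolutionMode chooseEvolutionMode(String internalSchemaStr) {
        if (internalSchemaStr == null || internalSchemaStr.isEmpty()) {
            return EvolutionMode.OUT_OF_BOX;
        }
        return EvolutionMode.FULL_SCHEMA_EVOLUTION;
    }

    public static void main(String[] args) {
        // The buggy path: an empty schema string falls back to OOB.
        System.out.println(chooseEvolutionMode(""));
        // A populated schema string keeps full schema evolution.
        System.out.println(chooseEvolutionMode("{\"type\":\"record\"}"));
    }
}
```

> Any future code path that constructs a reader with an empty 
> *internalSchemaStr* would hit the same wrong branch, which is why the fix 
> should not rely solely on safeguards elsewhere.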
> Although safeguards were added in HUDI-5400 to force the code execution path 
> to use Hudi Full Schema Evolution, we should still fix this so that future 
> changes that deprecate *Spark3XLegacyHoodieParquetInputFormat* will not 
> cause issues.
>  
> A sample test to invoke this:
> {code:java}
> test("Test wrong fallback to OOB schema evolution") {
>   withRecordType()(withTempDir { tmp =>
>     Seq("mor").foreach { tableType =>
>       val tableName = generateTableName
>       val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
>       if (HoodieSparkUtils.gteqSpark3_1) {
>         spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
>         spark.sql("set hoodie.schema.on.read.enable=true")
>         spark.sql("set hoodie.datasource.read.extract.partition.values.from.path=true")
>         // NOTE: This is required since these tests use type coercions which were
>         //       only permitted in Spark 2.x and are disallowed by default in Spark 3.x
>         spark.sql("set spark.sql.storeAssignmentPolicy=legacy")
>         createAndPreparePartitionTable(spark, tableName, tablePath, tableType)
>         // date -> string
>         spark.sql(s"alter table $tableName alter column col6 type String")
>         checkAnswer(spark.sql(s"select col6 from $tableName where id = 1").collect())(
>           Seq("2021-12-25")
>         )
>       }
>     }
>   })
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
