[ 
https://issues.apache.org/jira/browse/HUDI-7017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

voon updated HUDI-7017:
-----------------------
    Description: 
For MOR tables that have these 2 configurations enabled:

 
{code:java}
hoodie.schema.on.read.enable=true
hoodie.datasource.read.extract.partition.values.from.path=true{code}
 

 

BaseFileReader will use a *requiredSchemaReader* when reading some of the 
parquet files. This BaseFileReader will have an empty *internalSchemaStr*, 
causing *Spark3XLegacyHoodieParquetInputFormat* to fall back to out-of-box 
(OOB) schema evolution.

 

Although safeguards were added in HUDI-5400 to force the code execution path 
to use Hudi Full Schema Evolution, we should still fix this so that future 
changes that deprecate *Spark3XLegacyHoodieParquetInputFormat* will not cause 
issues.
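A minimal sketch of the intended guard (hypothetical names, not actual Hudi code): the decision to fall back should consider whether schema-on-read is enabled for the table, rather than keying solely off whether the serialized internal schema string happens to be empty:
{code:java}
// Hypothetical sketch, not actual Hudi code: illustrates a guard that would
// prevent an accidental fallback to OOB schema evolution when the serialized
// internal schema string is empty but schema-on-read is enabled.
public class SchemaEvolutionGuard {

    /**
     * Returns true when Hudi Full Schema Evolution should be used.
     * An empty internalSchemaStr alone must not force the OOB fallback
     * if hoodie.schema.on.read.enable is set for the table.
     */
    public static boolean useFullSchemaEvolution(String internalSchemaStr,
                                                 boolean schemaOnReadEnabled) {
        if (schemaOnReadEnabled) {
            // schema-on-read tables always take the full-evolution path
            return true;
        }
        return internalSchemaStr != null && !internalSchemaStr.isEmpty();
    }
}{code}
Under this sketch, a requiredSchemaReader carrying an empty *internalSchemaStr* would still take the full-evolution path whenever *hoodie.schema.on.read.enable* is true.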

 

A sample test that triggers this:
{code:java}
test("Test wrong fallback to OOB schema evolution") {
  withRecordType()(withTempDir { tmp =>
    Seq("mor").foreach { tableType =>
      val tableName = generateTableName
      val tablePath = s"${new Path(tmp.getCanonicalPath, tableName).toUri.toString}"
      if (HoodieSparkUtils.gteqSpark3_1) {
        spark.sql("set " + SPARK_SQL_INSERT_INTO_OPERATION.key + "=upsert")
        spark.sql("set hoodie.schema.on.read.enable=true")
        spark.sql("set hoodie.datasource.read.extract.partition.values.from.path=true")
        // NOTE: This is required since these tests use type coercions which were
        //       only permitted in Spark 2.x and are disallowed by default in Spark 3.x
        spark.sql("set spark.sql.storeAssignmentPolicy=legacy")
        createAndPreparePartitionTable(spark, tableName, tablePath, tableType)
        // date -> string
        spark.sql(s"alter table $tableName alter column col6 type String")
        checkAnswer(spark.sql(s"select col6 from $tableName where id = 1").collect())(
          Seq("2021-12-25")
        )
      }
    }
  })
}{code}
 


> Prevent full schema evolution from wrongly falling back to OOB
> --------------------------------------------------------------
>
>                 Key: HUDI-7017
>                 URL: https://issues.apache.org/jira/browse/HUDI-7017
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: voon
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
