[PR] [HUDI-8486] Merge into enforces column type match for some columns [hudi]

via GitHub Wed, 06 Nov 2024 10:38:26 -0800


Davis-Zhang-Onehouse opened a new pull request, #12215:
URL: https://github.com/apache/hudi/pull/12215


   
   
   ### Change Logs
   
   
   ### Merge into update & insert actions fail the query when they detect a 
column type mismatch for primary key/partition key/precombined key
   
   Allowing implicit casting of the partition key can lead to partition path 
corruption. For example, if we run a merge into (MIT) where the source table 
has a double column and the target table is partitioned on the same column with 
int type, the partition path will contain a double value. Later select queries 
over the table will fail because the double value cannot be converted to int.
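The partition path problem described above can be illustrated without Spark or Hudi. This is a minimal sketch in plain Scala under stated assumptions: `partitionPath`, `readAsInt` and the column name `region_id` are hypothetical and are not Hudi APIs. It only shows that a double value renders differently from an int value and no longer parses as int.

```scala
import scala.util.Try

object PartitionPathSketch {
  // Hypothetical helper: renders a Hive-style partition path segment from a raw value.
  def partitionPath(column: String, value: Any): String = s"$column=$value"

  // Hypothetical helper: mimics a reader that expects the partition value to be an int.
  def readAsInt(path: String): Try[Int] = Try(path.split("=", 2)(1).toInt)

  def main(args: Array[String]): Unit = {
    val fromInt = partitionPath("region_id", 2)      // value written with the target's int type
    val fromDouble = partitionPath("region_id", 2.0) // value written with the source's double type
    println(fromInt)
    println(fromDouble)
    println(readAsInt(fromInt).isSuccess)    // the int-typed path parses back as int
    println(readAsInt(fromDouble).isSuccess) // the double-typed path does not
  }
}
```

The real failure surfaces inside the query engine rather than in a helper like this, but the cause is the same kind of mismatch between the written value and the declared column type.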
   
   Allowing implicit casting of the primary key can lead to data correctness 
issues. Here is an example:
   
           // Create target table with double primary key
           spark.sql(
             s"""
                |create table $targetTable (
                |  id double,
                |  name string,
                |  value_double double,
                |  ts long
                |) using hudi
                |location '${tmp.getCanonicalPath}/$targetTable'
                |tblproperties (
                |  type = '$tableType',
                |  primaryKey = 'id',
                |  preCombineField = 'ts'
                |)
            """.stripMargin)
   
           // Create source table with int primary key
           spark.sql(
             s"""
                |create table $sourceTable (
                |  id int,
                |  name string,
                |  value_double double,
                |  ts long
                |) using hudi
                |location '${tmp.getCanonicalPath}/$sourceTable'
                |tblproperties (
                |  type = '$tableType',
                |  primaryKey = 'id',
                |  preCombineField = 'ts'
                |)
            """.stripMargin)
   
           // Insert initial data into target table
           spark.sql(
             s"""
                |insert into $targetTable
                |select
                |  cast(1 as double) as id,
                |  'initial1' as name,
                |  1.1 as value_double,
                |  1000 as ts
            """.stripMargin)
   
           // Insert data into source table with int ids
           spark.sql(
             s"""
                |insert into $sourceTable
                |select
                |  1 as id,
                |  'updated1' as name,
                |  1.11 as value_double,
                |  1001 as ts
            """.stripMargin)
   
           // Perform merge operation
           spark.sql(
             s"""
                |merge into $targetTable t
                |using $sourceTable s
                |on t.id = s.id
                |when matched then update set *
                |when not matched then insert *
            """.stripMargin)
   
   In the end we expect the target table to contain 1 record
   (1, updated1, 1.11, 1001)
   
   but it actually contains
   (1, initial1, 1.1, 1000)
   (1, updated1, 1.11, 1001)
   
   If the target table id column is int, we get the expected behavior.
   
   For the precombined field we enforce the same data type check to avoid any 
potential data correctness issue.
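The strict check for update and insert actions can be sketched as a comparison of declared types for the key columns only. This is an illustration, not the code in the PR: `Schema`, `keyTypeMismatches` and the string type names are hypothetical simplifications of a real table schema.

```scala
object KeyTypeCheckSketch {
  // Column name -> type name; a simplified stand-in for a real table schema (assumption).
  type Schema = Map[String, String]

  // Returns one message per key column whose source type differs from the target type.
  // Columns missing from either schema are skipped in this sketch.
  def keyTypeMismatches(target: Schema, source: Schema, keyColumns: Seq[String]): Seq[String] =
    for {
      col <- keyColumns
      targetType <- target.get(col).toSeq
      sourceType <- source.get(col).toSeq
      if targetType != sourceType
    } yield s"column '$col': source type $sourceType does not match target type $targetType"

  def main(args: Array[String]): Unit = {
    // Schemas taken from the example tables in this description.
    val target = Map("id" -> "double", "name" -> "string", "value_double" -> "double", "ts" -> "long")
    val source = Map("id" -> "int", "name" -> "string", "value_double" -> "double", "ts" -> "long")
    // Primary key and precombined field of the example; it defines no partition key.
    keyTypeMismatches(target, source, Seq("id", "ts")).foreach(println)
  }
}
```

With the example schemas only `id` is reported, because `ts` is `long` on both sides.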
   
   ### Merge into delete action continues to follow today's behavior
   
   The delete action does not require strict data type matching. As long as the 
source column type can be cast to the target column type, it is allowed.
   
   - Unlike insert/update, which assign values from source to target, the 
delete operation does not assign values; it only compares them in the ON 
clause, so we just need to ensure the comparison accounts for type 
mismatches. In particular, the precombined key column is out of scope because 
it only takes effect when an assignment happens. Similarly, if the partition 
key/primary key are not involved in the ON clause, we don't need to do any check.
   - For the condition clause, the recordKeyAttributeToConditionExpression 
variable already takes care of data type handling for both columns. Today's 
behavior is a best-effort cast of the source column data type to the target 
type: if the cast succeeds everything works as expected, otherwise an 
incompatible data type error is thrown.
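The delete-action rule from the two points above can be sketched as follows. Everything here is hypothetical: the `castable` table is an invented, deliberately tiny rule set and is not Spark's actual cast logic, and `deleteActionErrors` is not a Hudi function. The sketch only shows the shape of the rule, namely that key columns are checked only when they appear in the ON clause, and that the check is castability rather than equality.

```scala
object DeleteActionCheckSketch {
  // Hypothetical castability table (NOT Spark's real cast rules): source type -> allowed target types.
  val castable: Map[String, Set[String]] = Map(
    "int" -> Set("int", "long", "double"),
    "long" -> Set("long", "double"),
    "double" -> Set("double"),
    "string" -> Set("string")
  )

  def canCast(sourceType: String, targetType: String): Boolean =
    castable.getOrElse(sourceType, Set.empty[String]).contains(targetType)

  // Only key columns referenced in the ON clause are checked, and only for castability.
  def deleteActionErrors(
      target: Map[String, String],
      source: Map[String, String],
      keyColumns: Set[String],
      onClauseColumns: Set[String]): Seq[String] =
    (keyColumns intersect onClauseColumns).toSeq.sorted.flatMap { col =>
      for {
        sourceType <- source.get(col)
        targetType <- target.get(col)
        if !canCast(sourceType, targetType)
      } yield s"column '$col': cannot cast $sourceType to $targetType"
    }

  def main(args: Array[String]): Unit = {
    val target = Map("id" -> "double", "ts" -> "long")
    val source = Map("id" -> "int", "ts" -> "string")
    // ts is a key column but is not in the ON clause, so it is not checked for delete.
    println(deleteActionErrors(target, source, Set("id", "ts"), Set("id")))
  }
}
```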
   
   ### For all other data column types, implicit type casting is allowed and 
validated.
   
   Since MIT only requires matching column types for these 3 kinds of columns 
(primary key, partition key, precombined key), spark-hudi applies implicit type 
casting to all others. Tests are written to exhaustively capture the behavior 
of this handling.
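The behavior described across these sections can be summarized as a small decision table. This is a sketch with hypothetical names (`Action`, `ColumnRole`, `Rule`, `rule`), none of which come from the Hudi code base. It encodes only what this description states: exact type match for the three key kinds under insert/update, implicit casting everywhere else.

```scala
object MergeIntoTypePolicySketch {
  sealed trait Action
  case object Insert extends Action
  case object Update extends Action
  case object Delete extends Action

  sealed trait ColumnRole
  case object PrimaryKey extends ColumnRole
  case object PartitionKey extends ColumnRole
  case object PrecombinedKey extends ColumnRole
  case object DataColumn extends ColumnRole

  sealed trait Rule
  case object ExactTypeMatch extends Rule
  case object ImplicitCastAllowed extends Rule

  // Insert/update on a key column requires identical types; every other combination
  // (delete actions, plain data columns) falls back to implicit casting.
  def rule(action: Action, role: ColumnRole): Rule = (action, role) match {
    case (Insert | Update, PrimaryKey | PartitionKey | PrecombinedKey) => ExactTypeMatch
    case _ => ImplicitCastAllowed
  }

  def main(args: Array[String]): Unit = {
    // Print the full action x role matrix.
    for {
      action <- Seq(Insert, Update, Delete)
      role <- Seq(PrimaryKey, PartitionKey, PrecombinedKey, DataColumn)
    } println(s"$action / $role -> ${rule(action, role)}")
  }
}
```

A matrix like this is also a convenient way to enumerate the combinations an exhaustive test suite needs to cover.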
   
   handle delete MIT action + exhaustive coverage
   
   ### Impact
   
   Merge into insert & update actions now enforce strict column data type 
matching for primary key, precombined key and partition key. Queries that 
previously worked might now error out.
   
   ### Risk level (write none, low medium or high below)
   
   Low
   
   ### Documentation Update
   
   We should update the merge into doc to describe this new restriction.
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


