[PR] [HUDI-8486] Merge into enforces data type match for some columns [hudi]

via GitHub Thu, 06 Feb 2025 09:32:29 -0800


Davis-Zhang-Onehouse opened a new pull request, #12798:
URL: https://github.com/apache/hudi/pull/12798


   ### Change Logs
   
   
   ### Merge into updated & insert actions will fail the query once detect 
column type mismatch for primary key/partition key/precombined key
   
   Allowing implicit casting of partition key can leads to partition path 
corruption. An example is if we do MIT where source table has a double value 
column and target table partitions on the same column with int type, after 
merge into completes the partition path will contain double value. Later select 
queries over the table will complain it cannot convert double to int.
   
   Allowing implicit casting of primary key can leads to data correctness 
issue. Here is an example:
   ```
                create table $targetTable (
                  id double,
                  name string,
                  value_double double,
                  ts long
                ) using hudi
                location '${tmp.getCanonicalPath}/$targetTable'
                tblproperties (
                  type = '$tableType',
                  primaryKey = 'id',
                  preCombineField = 'ts'
                );
   
   
                create table $sourceTable (
                  id int,
                  name string,
                  value_double double,
                  ts long
                ) using hudi
                location '${tmp.getCanonicalPath}/$sourceTable'
                tblproperties (
                  type = '$tableType',
                  primaryKey = 'id',
                  preCombineField = 'ts'
                );
   
                insert into $targetTable
                select
                  cast(1 as double) as id,
                  'initial1' as name,
                  1.1 as value_double,
                  1000 as ts;
   
   
                insert into $sourceTable
                select
                  1 as id,
                  'updated1' as name,
                  1.11 as value_double,
                  1001 as ts;
   
   
                merge into $targetTable t
                using $sourceTable s
                on t.id = s.id
                when matched then update set *
                when not matched then insert *;
   ```
   
   In the end we expect target table with 1 record
   (1, updated1, 1.1, 1001)
   
   but actually it is
   (1, initial, 1.1, 1000)
   (1, updated1, 1.1, 1001)
   
   For precombined field we enforce the same data type check to avoid any 
potential data correctness issue.
   
   If target table id column is int, we got the expected behavior.
   
   ### Impact
   
   
   Delete action does not require strict data type matching, as long as the 
column types are cast-able from source columns to target columns, it is allowed.
   
   - Unlike insert/update which contains assignment from source to target, 
delete operation does not assign values but just comparing them in the ON 
clause, we just need to ensure the comparison part accounts for type 
mismatches. Especially, the precombined key column is out of scope as it only 
takes effect when assignment happens. Similarily, if partition key/primary key 
are not involved in ON clause, we don't need to do any check.
   - For condition clause, the recordKeyAttributeToConditionExpression variable 
already takes care of data type handling for both columns. Today's behavior is 
it will do best effort type casting to match source column data type to target, 
if cast succeeds everything works as expected, otherwise incompatible data type 
error is thrown.
   
   ### For all other data column types, implicit type casting is allowed and 
validated.
   
   Since MIT only requires column type matches for the 3 types of columns, for 
others spark-hudi did implicit type casting. Tests are written to capture the 
exhaustive behavior of such handling.
   
   handle delete MIT action + exhaustive coverage
   
   ### Risk level (write none, low medium or high below)
   none
   
   ### Documentation Update
   
   need to update the new limitation we impose on MIT
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] [HUDI-8486] Merge into enforces data type match for some columns [hudi]

Reply via email to