Davis-Zhang-Onehouse opened a new pull request, #12798:
URL: https://github.com/apache/hudi/pull/12798
### Change Logs
### MERGE INTO update and insert actions fail the query when a column type mismatch is detected on the primary key, partition key, or precombine key
Allowing implicit casting of the partition key can lead to partition path
corruption. For example, suppose a MERGE INTO (MIT) statement uses a source
table with a double column, and the target table is partitioned on the same
column with type int. After the merge completes, the partition path contains a
double value, and later select queries over the table fail because the double
cannot be converted to int.
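
A minimal sketch of this failure mode, written in plain Python rather than Spark. It assumes that a partition path segment is built by stringifying the written value and that a reader parses the segment back into the declared column type. The helpers `partition_path` and `read_partition_value` are hypothetical and are not Hudi APIs.

```python
def partition_path(column: str, value) -> str:
    # Hypothetical helper: build a Hive-style path segment from the written value.
    return f"{column}={value}"


def read_partition_value(path: str, declared_type):
    # Hypothetical helper: parse the segment back into the declared column type.
    _, raw = path.split("=", 1)
    return declared_type(raw)


# The target table declares the partition column as int.
good_path = partition_path("part", 1)
# The source column is double and is written without being cast back to int.
bad_path = partition_path("part", 1.0)

print(good_path)  # part=1
print(bad_path)   # part=1.0

print(read_partition_value(good_path, int))  # 1
try:
    read_partition_value(bad_path, int)
except ValueError as err:
    # Parsing "1.0" as int fails, mirroring the read-time conversion error.
    print(f"read failed: {err}")
```

The write itself succeeds in this sketch; the problem only surfaces on a later read, which is why rejecting the mismatch at merge time is the safer choice.
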
Allowing implicit casting of the primary key can lead to data correctness
issues. Here is an example:
```
create table $targetTable (
id double,
name string,
value_double double,
ts long
) using hudi
location '${tmp.getCanonicalPath}/$targetTable'
tblproperties (
type = '$tableType',
primaryKey = 'id',
preCombineField = 'ts'
);
create table $sourceTable (
id int,
name string,
value_double double,
ts long
) using hudi
location '${tmp.getCanonicalPath}/$sourceTable'
tblproperties (
type = '$tableType',
primaryKey = 'id',
preCombineField = 'ts'
);
insert into $targetTable
select
cast(1 as double) as id,
'initial1' as name,
1.1 as value_double,
1000 as ts;
insert into $sourceTable
select
1 as id,
'updated1' as name,
1.11 as value_double,
1001 as ts;
merge into $targetTable t
using $sourceTable s
on t.id = s.id
when matched then update set *
when not matched then insert *;
```
We expect the target table to end up with 1 record:
(1, updated1, 1.11, 1001)
but it actually contains 2:
(1, initial1, 1.1, 1000)
(1, updated1, 1.11, 1001)
If the target table's id column is int, we get the expected behavior.
For the precombine field we enforce the same data type check to avoid any
potential data correctness issue.
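
One way to picture the duplicate row, sketched in Python under a stated assumption: the record key is treated as the string form of the `id` value, while the `ON` clause compares the values numerically after an implicit cast. Under that assumption the join matches, but the key that gets written differs from the stored one. The dictionary-based store and the `record_key` helper are hypothetical illustrations, not Hudi code.

```python
def record_key(value) -> str:
    # Assumption: the record key is the string form of the key column value.
    return str(value)


# Target table: id is double, so the stored key is derived from 1.0.
target = {record_key(1.0): ("initial1", 1.1, 1000)}

# Source row: id is int.
source_id, source_row = 1, ("updated1", 1.11, 1001)

# The ON clause compares values after implicit casting, so the rows match.
matched = any(float(key) == float(source_id) for key in target)

# The write is keyed by the source-derived key, which is "1", not "1.0".
target[record_key(source_id)] = source_row

print(matched)         # True
print(sorted(target))  # ['1', '1.0']
print(len(target))     # 2
```

When both sides use the same type, the two keys stringify identically and the write overwrites the existing record instead of adding a second one.
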
### Impact
The delete action does not require strict data type matching. It is allowed as
long as the source column types are cast-able to the target column types.
- Unlike insert/update, which assign values from source to target, the delete
operation does not assign values; it only compares them in the ON clause, so we
just need to ensure that the comparison accounts for type mismatches. In
particular, the precombine key column is out of scope because it only takes
effect when an assignment happens. Similarly, if the partition key or primary
key is not involved in the ON clause, no check is needed.
- For the condition clause, the recordKeyAttributeToConditionExpression variable
already handles the data types of both columns. Today it makes a best-effort
cast of the source column to the target column's data type: if the cast
succeeds, everything works as expected; otherwise an incompatible data type
error is thrown.
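
The rules described in this PR can be summarized in a small Python sketch. It is an illustration only, not the Spark-Hudi implementation: the function names, the role labels, and the tiny `CASTABLE` table are all hypothetical, and real cast-ability is decided by Spark.

```python
KEY_ROLES = {"primaryKey", "partitionKey", "preCombineField"}

# Hypothetical, deliberately small table of (source type, target type) casts.
CASTABLE = {("int", "long"), ("int", "double"), ("long", "double")}


def is_castable(source_type: str, target_type: str) -> bool:
    return source_type == target_type or (source_type, target_type) in CASTABLE


def validate(action, role, source_type, target_type, in_on_clause=True):
    """Raise ValueError when the column types are not acceptable for the action."""
    if action in ("update", "insert"):
        if role in KEY_ROLES:
            # Key columns are assigned from source to target: require an exact match.
            if source_type != target_type:
                raise ValueError(f"{role}: {source_type} != {target_type}")
            return
        # Other data columns may be implicitly cast.
        if not is_castable(source_type, target_type):
            raise ValueError(f"{role}: cannot cast {source_type} to {target_type}")
        return
    if action == "delete":
        # Nothing is assigned, so the precombine field and any key column
        # outside the ON clause are not checked at all.
        if role == "preCombineField" or not in_on_clause:
            return
        if not is_castable(source_type, target_type):
            raise ValueError(f"{role}: cannot cast {source_type} to {target_type}")
        return
    raise ValueError(f"unknown action: {action}")


validate("delete", "primaryKey", "int", "double")   # allowed: cast-able comparison
validate("update", "data", "int", "double")         # allowed: implicit cast
try:
    validate("update", "primaryKey", "int", "double")
except ValueError as err:
    print(f"rejected: {err}")
```
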
### For all other data column types, implicit type casting is allowed and validated.
Since MIT only requires matching column types for the 3 kinds of key columns
above, spark-hudi applies implicit type casting to all other columns. Tests are
written to capture this behavior exhaustively.
handle delete MIT action + exhaustive coverage
### Risk level (write none, low medium or high below)
none
### Documentation Update
The documentation needs to be updated to describe the new limitation imposed on MIT.
### Contributor's checklist
- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed