[ https://issues.apache.org/jira/browse/HUDI-6410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17734099#comment-17734099 ]
Aditya Goenka commented on HUDI-6410:
-------------------------------------
There was a mistake in the analysis: the table option was misspelled as
precombineKey. It should be preCombineField. With the correct option name,
MERGE INTO works as expected. Closing this JIRA.
spark-sql>
>
> create table spark_mor_no_pre_t8 (
> id int,
> name string,
> updated_at timestamp
> ) using hudi
> options (
> type = 'mor',
> primaryKey = 'id',
> preCombineField = 'updated_at'
> ) location 'file:///tmp/output/spark_mor_no_pre_t8';
Time taken: 0.271 seconds
spark-sql>
> merge into spark_mor_no_pre_t8 as target
> using (
> select 1 as id, 'c' as name, current_timestamp as updated_at
> union select 1 as id,'d' as name, current_timestamp as updated_at
> union select 1 as id,'e' as name, current_timestamp as updated_at
> ) source
> on target.id = source.id
> when matched then update set *
> when not matched then insert *;
23/06/19 15:32:15 WARN HoodieBackedTableMetadata: Metadata table was not found at path file:/tmp/output/spark_mor_no_pre_t8/.hoodie/metadata
Time taken: 3.903 seconds
spark-sql>
> select * from spark_mor_no_pre_t8;
20230619153215056 20230619153215056_0_0 1 06d12bb0-6bf9-4389-8fee-96fabc2a8c14-0_0-81-78_20230619153215056.parquet 1 e 2023-06-19 15:32:15.151468
Time taken: 0.36 seconds, Fetched 1 row(s)
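
As background on why the option name matters: the precombine field decides
which record survives when several incoming records share a primary key,
namely the one with the largest value in that field. The following is a
minimal Python sketch of that idea only, not Hudi's implementation. The
function name precombine, the sample timestamps, and the tie-breaking rule
are assumptions made for illustration.

```python
from datetime import datetime


def precombine(rows, key, ordering):
    """Keep one row per key: the row with the largest ordering value."""
    latest = {}
    for row in rows:
        kept = latest.get(row[key])
        # Tie-breaking (>= lets a later row win) is an assumption of this
        # sketch, not documented Hudi behaviour.
        if kept is None or row[ordering] >= kept[ordering]:
            latest[row[key]] = row
    return list(latest.values())


# Hypothetical source rows: same id, distinct timestamps for a clear winner.
source = [
    {"id": 1, "name": "c", "updated_at": datetime(2023, 6, 19, 15, 32, 15, 1)},
    {"id": 1, "name": "d", "updated_at": datetime(2023, 6, 19, 15, 32, 15, 2)},
    {"id": 1, "name": "e", "updated_at": datetime(2023, 6, 19, 15, 32, 15, 3)},
]

result = precombine(source, key="id", ordering="updated_at")
print(result)
```

In this sketch the three source rows collapse to the single row with the
largest updated_at.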
> MERGE INTO giving duplicate rows even if table have precombineKey
> -----------------------------------------------------------------
>
> Key: HUDI-6410
> URL: https://issues.apache.org/jira/browse/HUDI-6410
> Project: Apache Hudi
> Issue Type: Bug
> Reporter: Aditya Goenka
> Priority: Blocker
> Fix For: 0.14.0
>
> Attachments: image-2023-06-19-15-16-58-055.png,
> image-2023-06-19-15-37-27-202.png
>
>
> MERGE INTO produces duplicate rows even when a precombine key is configured.
>
> Example -
> spark-sql> create table spark_mor_no_pre_t5 (
> > id int,
> > name string,
> > updated_at timestamp
> > ) using hudi
> > options (
> > type = 'mor',
> > primaryKey = 'id',
> > precombineKey = 'updated_at'
> > ) location 'file:///tmp/output/spark_mor_no_pre_t4';
> Time taken: 0.363 seconds
> spark-sql>
> > merge into spark_mor_no_pre_t5 as target
> > using (
> > select 1 as id, 'c' as name, current_timestamp as updated_at
> > union select 1 as id,'d' as name, current_timestamp as updated_at
> > union select 1 as id,'e' as name, current_timestamp as updated_at
> > ) source
> > on target.id = source.id
> > when matched then update set *
> > when not matched then insert *;
> Time taken: 3.111 seconds
> spark-sql> select * from spark_mor_no_pre_t5;
> 20230619151501003 20230619151501003_0_0 1 4405350d-edd6-465b-ac43-8a68d26f957e-0_0-245-274_20230619151501003.parquet 1 e 2023-06-19 15:15:01.032766
> 20230619151501003 20230619151501003_0_1 1 4405350d-edd6-465b-ac43-8a68d26f957e-0_0-245-274_20230619151501003.parquet 1 e 2023-06-19 15:15:01.032766
> 20230619151501003 20230619151501003_0_2 1 4405350d-edd6-465b-ac43-8a68d26f957e-0_0-245-274_20230619151501003.parquet 1 e 2023-06-19 15:15:01.032766
> Time taken: 0.288 seconds, Fetched 3 row(s)
>
> Github Issue - [https://github.com/apache/hudi/issues/8916]