waywtdcc opened a new issue #4305:
URL: https://github.com/apache/hudi/issues/4305
**Describe the problem you faced**
When Flink writes with multiple tasks, records are written more than once: the
Hudi table ends up containing multiple rows with the same primary key.
**To Reproduce**
Steps to reproduce the behavior:
1. create a datagen source table

```sql
CREATE TABLE datagen_test (
  id BIGINT,
  name VARCHAR(20),
  age int,
  birthday TIMESTAMP(3),
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '20',
  'fields.id.min' = '1',
  'fields.id.max' = '10000'
);
```
2. create hudi table
```sql
CREATE TABLE datagen_hudi_test2(
  id bigint,
  name string,
  birthday TIMESTAMP(3),
  ts TIMESTAMP(3),
  `partition_str` VARCHAR(20),
  primary key(id) not enforced -- a primary key must be specified
)
PARTITIONED BY (`partition_str`)
with(
  'connector' = 'hudi',
  'path' = 'hdfs:///user/hive/warehouse/hudi.db/datagen_hudi_test2',
  'hoodie.datasource.write.recordkey.field' = 'id', -- record key
  'write.precombine.field' = 'ts', -- field used for automatic precombine
  'write.tasks' = '1',
  'compaction.tasks' = '1',
  'write.rate.limit' = '2000', -- rate limit
  'table.type' = 'MERGE_ON_READ', -- default is COPY_ON_WRITE; MERGE_ON_READ is optional
  'compaction.async.enabled' = 'true', -- whether to enable async compaction
  'compaction.trigger.strategy' = 'num_commits', -- compact by number of commits
  'compaction.delta_commits' = '5', -- default is 5
  'hive_sync.enable' = 'true',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://**:53083',
  'hive_sync.table' = 'datagen_hudi_test2_hivesync',
  'hive_sync.db' = 'hudi',
  'index.global.enabled' = 'true'
);
```
3. write to hudi table
```sql
insert into test.datagen_hudi_test2
select id, name, birthday, ts as ts, DATE_FORMAT(birthday, 'yyyyMMdd') as `partition_str`
from test.datagen_test;
```
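The duplication can be confirmed by grouping on the record key and looking for keys with more than one row. This is a sketch, assuming the table name from the DDL above and a read path on the Hudi table:

```sql
-- Hypothetical verification query: any key returned here
-- has duplicate rows despite being the declared primary key.
SELECT id, COUNT(*) AS cnt
FROM datagen_hudi_test2
GROUP BY id
HAVING COUNT(*) > 1;
```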
**Environment Description**
* Hudi version : 0.9.0
* Hive version : 2.3.6
* Flink version : 1.12.2
* Hadoop version : 2.7.7
* Storage (HDFS/S3/GCS..) : HDFS
* Running on Docker? (yes/no) : no