CamelliaYjli opened a new issue, #10486:
URL: https://github.com/apache/hudi/issues/10486

   
   **Describe the problem you faced**
   
   I use Flink to write a Hudi COW table and sync it to Hive. Hive aggregate queries (e.g. `count(*)`, `row_number() over()`) return duplicate data, but `select *` does not.
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Flink SQL writes the Hudi COW table using the `upsert` operation:
   ```java
           String hudiSinkDDL = "CREATE TABLE hudi_table(\n" +
                   "id String,\n" +
                   "name String,\n" +
                   "age Int,\n" +
                   "PRIMARY KEY (id) NOT ENFORCED \n" +
                   ") WITH (\n" +
                    // basic configuration
                   "'write.operation' = 'upsert',\n" +
                   "'write.precombine' = 'true',\n" +
                   "'connector' = 'hudi',\n" +
                   "'path'= '${basePath}',\n" +
                   "'table.type' = 'COPY_ON_WRITE',\n" +
                   "'write.tasks' = '2',\n" +
                   "'write.bucket_assign.tasks' = '2',\n" +
                    // Hive sync configuration
                    "'hive_sync.conf.dir' = '/opt/apache-hive-3.1.3-bin/conf',\n" +
                    "'hive_sync.enabled' = 'true',\n" + // register and sync the table to the Hive metastore
                    "'hive_sync.mode' = 'hms',\n" + // sync via the Hive metastore
                   "'hive_sync.metastore.uris' = 'thrift://localhost:9083',\n" +
                   "'hive_sync.db' = 'cdc_hudi',\n" +
                   "'hive_sync.table' = '${tableName}',\n" +
                    // small-file & compression configuration
                   "'clean.retain_commits' = '1',\n" + 
                   "'metadata.compaction.delta_commits' = '5',\n" +
                   "'hoodie.parquet.compression.codec' = 'gzip',\n" + 
                   "'hoodie.parquet.max.file.size' = '268435456'\n" +
                   ")";
   ```
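   
   For context, the steps below insert into MySQL and expect the change to reach Hudi, so a CDC pipeline is implied. A minimal sketch of that pipeline, assuming the Flink CDC `mysql-cdc` connector (hostname, credentials, and database names are placeholders, not from the original report):
   
   ```sql
   -- Hypothetical CDC source feeding the Hudi sink above
   CREATE TABLE mysql_source (
     id STRING,
     name STRING,
     age INT,
     PRIMARY KEY (id) NOT ENFORCED
   ) WITH (
     'connector' = 'mysql-cdc',
     'hostname' = 'localhost',
     'port' = '3306',
     'username' = '...',
     'password' = '...',
     'database-name' = 'cdc_hudi',
     'table-name' = 'table_test_duplicate_1'
   );
   
   INSERT INTO hudi_table SELECT id, name, age FROM mysql_source;
   ```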
   
   2. Insert a row into MySQL, then update it:
   
   ```sql
   -- insert
   insert into table_test_duplicate_1(id,name,age) values('dup_clean_1','Camellia',11);
   -- update
   update table_test_duplicate_1 set age = 20 where id ='dup_clean_1';
   ```
   
   3. `select * from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';` returns the expected single row:
   
   <img width="1391" alt="image" src="https://github.com/apache/hudi/assets/153248157/7a48b473-cc80-4adf-b0a2-f150b0b3b400">
   
   4. Aggregate queries return duplicate rows:
   
   ```sql
   select count(1) from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';
   ```
   
   <img width="944" alt="image" src="https://github.com/apache/hudi/assets/153248157/c9c336bf-1f5f-4a06-8966-377f0a40ddbc">
   
   ```sql
   select
   *,
   row_number() over (partition by id order by age desc) as rank
   from cdc_hudi.table_test_duplicate_1 where id = 'dup_clean_1';
   ```
   
   <img width="1388" alt="image" src="https://github.com/apache/hudi/assets/153248157/23e9f81b-d00e-4363-8456-4fdeebb50fe8">
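   
   One way to narrow this down (a diagnostic sketch, not a fix): every Hudi table exposes metadata columns such as `_hoodie_commit_time` and `_hoodie_file_name`, so selecting them reveals which commit and data file each duplicate row comes from:
   
   ```sql
   -- If count(1) > 1, this shows whether the duplicates live in
   -- different parquet files (e.g. old files not yet cleaned up)
   select _hoodie_commit_time, _hoodie_file_name, id, age
   from cdc_hudi.table_test_duplicate_1
   where id = 'dup_clean_1';
   ```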
   
   
   **Expected behavior**
   
   Why do aggregate queries and regular queries return inconsistent results? Your help is appreciated.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : no
   
   * Flink version : 1.17.0
   
   * Hive version : 3.1.3
   
   * Hadoop version : 3.3.6
   
   * Storage (HDFS/S3/GCS..) :HDFS
   
   * Running on Docker? (yes/no) :no
   
   
   
   

