[GitHub] [hudi] lvyanquan opened a new issue, #9549: [SUPPORT] Insert operation in Spark will cause inconsistent

via GitHub Sat, 26 Aug 2023 19:48:53 -0700


lvyanquan opened a new issue, #9549:
URL: https://github.com/apache/hudi/issues/9549


   **Describe the problem you faced**
   
   The insert operation in Spark does not call the tagLocation method to find a 
record's old location, which can lead to inconsistent results.
   
   For example, a table whose primaryKey is id has two files, A and B, and A 
contains a record with id = 1. I then insert a new record that also has id = 1.
   1) If file A is smaller than the target size for parquet files, the new 
record is written to A, and I get one record with id = 1.
   2) If file B is smaller than the target size for parquet files, the new 
record is written to B, and I get two records with id = 1.
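
The two cases above can be illustrated with a small toy model. This is only a sketch of the behavior described in this issue, not Hudi's actual code: `DataFile`, `size_limit`, `insert_without_lookup`, and `count_key` are hypothetical names, and the "target size" is counted in rows rather than bytes. It assumes that rewriting a small file merges rows by key, so a duplicate is collapsed only when it lands in the file that already holds that key.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Toy stand-in for one parquet base file: record key -> row."""
    name: str
    size_limit: int  # hypothetical "target file size", counted in rows
    rows: dict = field(default_factory=dict)

    def is_small(self) -> bool:
        return len(self.rows) < self.size_limit


def insert_without_lookup(files, key, row):
    """Route the record to the first small file, without checking
    which file (if any) already contains `key`."""
    target = next(f for f in files if f.is_small())
    # Writing into the chosen file merges by key, so an existing row is
    # replaced only if it happens to live in this same file.
    target.rows[key] = row
    return target.name


def count_key(files, key):
    """Number of files holding a row for `key`, i.e. visible copies."""
    return sum(1 for f in files if key in f.rows)


# Case 1: A is small and already holds id = 1, B is full.
a = DataFile("A", size_limit=10, rows={1: {"id": 1, "age": 1}})
b = DataFile("B", size_limit=1, rows={2: {"id": 2, "age": 5}})
insert_without_lookup([a, b], 1, {"id": 1, "age": 2})
case1 = count_key([a, b], 1)

# Case 2: A is full and holds id = 1, B is the small file.
a = DataFile("A", size_limit=1, rows={1: {"id": 1, "age": 1}})
b = DataFile("B", size_limit=10, rows={2: {"id": 2, "age": 5}})
insert_without_lookup([a, b], 1, {"id": 1, "age": 2})
case2 = count_key([a, b], 1)

print(case1, case2)  # 1 2
```

In this model the number of copies of id = 1 depends only on which file happens to be small at write time, which is the inconsistency reported here.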
   
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   ```
   create table test (id int, age int) using hudi 
tblproperties(primaryKey='id');
   ```
   1) 
   Insert twice; only one record with id = 1 remains.
   ```
   insert into test values(1, 1);
   insert into test values(1, 2);
   ```
   2)
   First load more records into table test.
   Table tmp has records with id ranging from 1 to 1000000.
   ```
   insert into test select * from tmp; 
   ```
   Then insert a record with id = 1:
   ```
   insert into test values(1, 2);
   ```
   I got two records with id = 1. 
   
   **Expected behavior**
   
   Is this reasonable? It also makes the result of count(*) inconsistent.
   
   


