[GitHub] [hudi] jjtjiang commented on issue #5777: [SUPPORT] Hudi table has duplicate data.

GitBox Thu, 09 Jun 2022 01:22:42 -0700


jjtjiang commented on issue #5777:
URL: https://github.com/apache/hudi/issues/5777#issuecomment-1150822255


   > > In this test we did not change the index; we only used the bloom index.
   > > During the test I saw something strange. At first the query returned
   > > duplicate rows, but when I ran the same query again a few minutes to
   > > several hours later, there were no duplicates. This happened three times
   > > in the past two days.
   > > ![image](https://user-images.githubusercontent.com/48897688/172542564-fe10605c-7e35-4a38-9773-db638a7a63fb.png)
   > > ![image](https://user-images.githubusercontent.com/48897688/172542643-402adb09-246f-43cc-b13d-f0f79a36f527.png)
   > > ![image](https://user-images.githubusercontent.com/48897688/172542773-0121594d-59ae-48b4-a4c7-199a626ac8f8.png)
   > 
   > How do you sync your hudi table? I guess your query engine may treat the 
table as normal parquet files rather than a hudi table.
   > 
   > To verify, could you use spark to read and check out the data? (i.e. 
`spark.read().format("hudi")`)
   
   When I read the table with `spark.read().format("hudi")`, it still contains
   duplicate data. I use Spark Structured Streaming to sync the data.
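
For anyone reproducing the check suggested above, here is a minimal PySpark sketch that reads the table through the Hudi data source and lists record keys that occur more than once. It relies on the Hudi metadata columns `_hoodie_record_key`, `_hoodie_partition_path` and `_hoodie_file_name`; the function names, the view name `hudi_snapshot` and the base path are hypothetical, and it assumes the Hudi Spark bundle is on the classpath.

```python
def duplicate_key_query(view_name: str) -> str:
    """Build a SQL query that lists record keys appearing more than once."""
    return (
        "SELECT _hoodie_record_key, "
        "COUNT(*) AS row_count, "
        "COUNT(DISTINCT _hoodie_partition_path) AS partition_count, "
        "COUNT(DISTINCT _hoodie_file_name) AS file_count "
        f"FROM {view_name} "
        "GROUP BY _hoodie_record_key "
        "HAVING COUNT(*) > 1"
    )


def find_duplicate_keys(base_path: str, view_name: str = "hudi_snapshot"):
    """Snapshot-read a Hudi table and return a DataFrame of duplicated keys.

    base_path is the table location (hypothetical here).
    """
    # Imported lazily so the query builder can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Read through the Hudi data source, not as plain parquet files.
    spark.read.format("hudi").load(base_path).createOrReplaceTempView(view_name)
    return spark.sql(duplicate_key_query(view_name))
```

A call such as `find_duplicate_keys("/path/to/table").show(truncate=False)` prints the affected keys. If `partition_count` is greater than 1, the same key was written to more than one partition, which the non-global bloom index does not deduplicate; if `file_count` is greater than 1 within a single partition, the copies live in different files.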


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
