RuyRoaV opened a new issue, #11959:
URL: https://github.com/apache/hudi/issues/11959

   **Describe the problem you faced**
   
   We have a copy-on-write (COW) table that is updated via UPSERT operations 
from a Glue job; the writes were initially performed with Hudi 0.11.1. The 
table is partitioned by year, month and day.
   
   Some days after upgrading to Hudi 0.14.0, we noticed that partitions 
starting from the upgrade date had fewer rows than expected. We also noticed 
that records for a given partition day were dropped with a delay of 3 days. 
We observed this behaviour when counting the records per partition using Glue 
or Athena.
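
   For anyone trying to reproduce the check, here is a minimal sketch of how 
two snapshots of the per-partition row counts could be compared to spot 
partitions that lost rows. The function name and the sample counts are 
hypothetical and only illustrate the comparison; the real counts would come 
from a `GROUP BY year, month, day` query in Glue or Athena.

```python
from typing import Dict, List, Tuple

Partition = Tuple[int, int, int]  # (year, month, day)


def find_shrunken_partitions(
    earlier: Dict[Partition, int], later: Dict[Partition, int]
) -> List[Tuple[Partition, int, int]]:
    """Return (partition, earlier_count, later_count) for each partition whose count dropped."""
    shrunken = []
    for partition, old_count in sorted(earlier.items()):
        # A partition missing from the later snapshot is treated as having 0 rows.
        new_count = later.get(partition, 0)
        if new_count < old_count:
            shrunken.append((partition, old_count, new_count))
    return shrunken


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    day_0 = {(2024, 6, 1): 1000, (2024, 6, 2): 1200}
    day_3 = {(2024, 6, 1): 950, (2024, 6, 2): 1200}
    for partition, old, new in find_shrunken_partitions(day_0, day_3):
        print(partition, old, new)
```

   Taking one snapshot per day makes the 3-day delay visible as the first day 
on which a given partition shows up in the result.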
   
   On the other hand, we also have a Redshift Spectrum subscription built 
from this table, and its row count check showed the "correct" number of rows. 
However, it also showed duplicated data.
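
   As a rough sketch of how the duplicates could be listed, the helper below 
builds a query that groups on Hudi's metadata columns `_hoodie_partition_path` 
and `_hoodie_record_key`. It assumes those columns are exposed by the external 
table; the helper name and the table name are placeholders, not our real 
setup.

```python
def duplicate_check_sql(table: str) -> str:
    """Build a query listing record keys that occur more than once within a partition."""
    return (
        "SELECT _hoodie_partition_path, _hoodie_record_key, COUNT(*) AS copies\n"
        f"FROM {table}\n"
        "GROUP BY _hoodie_partition_path, _hoodie_record_key\n"
        "HAVING COUNT(*) > 1\n"
        "ORDER BY copies DESC"
    )


if __name__ == "__main__":
    # "my_db.my_hudi_table" is a placeholder, not the real table name.
    print(duplicate_check_sql("my_db.my_hudi_table"))
```

   Running the same query in Athena and in Redshift Spectrum should show 
whether the duplicates are specific to one engine.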
   
   Furthermore, we upgraded 4 tables from Hudi 0.11.1 to Hudi 0.14.0, and 
only this table showed this behaviour.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Start with a COW table written with Hudi 0.11.1.
   2. Upgrade to Hudi 0.14.0 and keep running the UPSERT job.
   3. Wait 3 days to observe the data loss.
   
   These are the write configurations we set:
   
   ![Screenshot 2024-06-21 at 13 12 16](https://github.com/user-attachments/assets/b3136357-159e-40fb-be3a-07f7c98cfc80)
   
   
   **Expected behavior**
   
   Could you please shed some light on why this could have happened?
   
   We should see the correct number of rows in Athena / Glue.
   
   **Environment Description**
   
   * Hudi version : 0.14.0
   
   * Spark version : 3.3.0 (Glue 4)
   
   * Hive version :
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : No
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
