maabkhan opened a new issue, #11971:
URL: https://github.com/apache/hudi/issues/11971

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   One of my Spark jobs reads data from a Hudi COPY_ON_WRITE table using 
the snapshot query type. The job runs once a day, but the table is updated 
every hour. When the job reads the table while it is being updated, it loads 
instants up to an inflight commit, and all of the data it reads comes back null.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   1. Create a Hudi COPY_ON_WRITE table.
   2. Schedule a Spark job to read from the table using the snapshot query type.
   3. Update the table every hour.
   4. Run the Spark job while the table is being updated.
   
   
   **Expected behavior**
   
   The Spark job should either wait for the inflight commit to complete and read 
data from the latest completed commit, or fall back to reading from an older 
completed commit, ensuring that no null data is read.
   
   **Environment Description**
   
   * Hudi version : 0.14
   
   * Spark version : 3.4.1
   
   * Hive version : 3.1.3
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : no (Ocean for Apache Spark by Spot, Spark on EKS)
   
   
   **Additional context**
   
   The issue seems to be that the Spark job loads instants up to an inflight 
commit instead of stopping at the last completed commit, which results in null 
data being read. I am looking for a Hudi configuration that ensures the job 
either waits for inflight commits to complete before reading, or reads from an 
older completed commit.
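A possible workaround sketch, pending a proper fix: pin the snapshot read to the latest *completed* commit using Hudi's `as.of.instant` time-travel read option, so an inflight instant at the head of the timeline is never picked up. The helper below, which infers the latest completed commit from `.hoodie` timeline file names, and the `read_snapshot_as_of` function are illustrative assumptions, not an official Hudi API.

```python
def latest_completed_commit(instant_files):
    """Return the newest completed commit time from .hoodie timeline file names.

    Completed commits are written as <instant>.commit (or .replacecommit);
    requested/inflight instants carry .requested / .inflight suffixes and
    are skipped. Instant times are fixed-width timestamps, so lexicographic
    max() picks the newest one.
    """
    completed = [
        f.split(".")[0]
        for f in instant_files
        if f.endswith(".commit") or f.endswith(".replacecommit")
    ]
    return max(completed) if completed else None


def read_snapshot_as_of(spark, table_path, instant):
    # Time-travel read: Hudi serves the snapshot as of the given completed
    # instant instead of the (possibly inflight) head of the timeline.
    return (
        spark.read.format("hudi")
        .option("as.of.instant", instant)
        .load(table_path)
    )
```

Usage would be: list the table's `.hoodie/` directory on S3, pass the file names to `latest_completed_commit`, and feed the result to `read_snapshot_as_of` before the daily batch read.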
   
   **Stacktrace**
   
   ```
   2024-09-18T19:44:14.081421948Z 24/09/18 19:44:14 INFO DataSourceUtils: Getting table path..
   2024-09-18T19:44:14.082728611Z 24/09/18 19:44:14 INFO TablePathUtils: Getting table path from path : s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events
   2024-09-18T19:44:14.167799106Z 24/09/18 19:44:14 INFO DefaultSource: Obtained hudi table path: s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events
   2024-09-18T19:44:14.211559950Z 24/09/18 19:44:14 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events
   2024-09-18T19:44:14.255635781Z 24/09/18 19:44:14 INFO HoodieTableConfig: Loading table properties from s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events/.hoodie/hoodie.properties
   2024-09-18T19:44:14.335476879Z 24/09/18 19:44:14 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events
   2024-09-18T19:44:14.350218569Z 24/09/18 19:44:14 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   2024-09-18T19:44:14.464630233Z 24/09/18 19:44:14 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240918193949292__commit__INFLIGHT__20240918194221000]}
   ```
   
   

