maabkhan opened a new issue, #11971:
URL: https://github.com/apache/hudi/issues/11971

   **_Tips before filing an issue_**
   
   - Have you gone through our [FAQs](https://hudi.apache.org/learn/faq/)?
   
   - Join the mailing list to engage in conversations and get faster support at 
[email protected].
   
   - If you have triaged this as a bug, then file an 
[issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.
   
   **Describe the problem you faced**
   
   One of my Spark jobs reads data from a Hudi COPY_ON_WRITE table using 
the snapshot query type. The job runs once a day, but the table is updated 
every hour. When the job reads the table while it is being updated, it loads 
instants up to an inflight commit, and all of the data it reads comes back null.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   1. Create a Hudi COPY_ON_WRITE table.
   2. Schedule a Spark job to read from the table using the snapshot query type.
   3. Update the table every hour.
   4. Run the Spark job while the table is being updated.
   
   
   **Expected behavior**
   
   The Spark job should either wait for the inflight commit to complete and read 
data from the latest completed commit, or fall back to reading from an older 
completed commit, ensuring that no null data is read.
   
   **Environment Description**
   
   * Hudi version : 0.14
   
   * Spark version : 3.4.1
   
   * Hive version : 3.1.3
   
   * Hadoop version :
   
   * Storage (HDFS/S3/GCS..) : s3
   
   * Running on Docker? (yes/no) : no (Ocean for Apache Spark by Spot, Spark on EKS)
   
   
   **Additional context**
   
   The issue seems to be that the Spark job loads instants up to an inflight 
commit instead of stopping at the last completed commit, which results in null 
data being read. I am looking for a Hudi configuration that ensures the job 
either waits for inflight commits to complete before reading, or reads from an 
older completed commit.
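A possible workaround sketch, pending a proper fix: pin the snapshot read to the latest *completed* commit using Hudi's `as.of.instant` time-travel read option, so an inflight instant at the head of the timeline is never picked up. The helper below, which infers the latest completed commit from `.hoodie` timeline file names, and the `read_snapshot_as_of` function are illustrative assumptions, not an official Hudi API.

```python
def latest_completed_commit(instant_files):
    """Return the newest completed commit time from .hoodie timeline file names.

    Completed commits are written as <instant>.commit (or .replacecommit);
    requested/inflight instants carry .requested / .inflight suffixes and
    are skipped. Instant times are fixed-width timestamps, so lexicographic
    max() picks the newest one.
    """
    completed = [
        f.split(".")[0]
        for f in instant_files
        if f.endswith(".commit") or f.endswith(".replacecommit")
    ]
    return max(completed) if completed else None


def read_snapshot_as_of(spark, table_path, instant):
    # Time-travel read: Hudi serves the snapshot as of the given completed
    # instant instead of the (possibly inflight) head of the timeline.
    return (
        spark.read.format("hudi")
        .option("as.of.instant", instant)
        .load(table_path)
    )
```

Usage would be: list the table's `.hoodie/` directory on S3, pass the file names to `latest_completed_commit`, and feed the result to `read_snapshot_as_of` before the daily batch read.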
   
   **Stacktrace**
   
   ```
   2024-09-18T19:44:14.081421948Z 24/09/18 19:44:14 INFO DataSourceUtils: Getting table path..
   2024-09-18T19:44:14.082728611Z 24/09/18 19:44:14 INFO TablePathUtils: Getting table path from path : s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events
   2024-09-18T19:44:14.167799106Z 24/09/18 19:44:14 INFO DefaultSource: Obtained hudi table path: s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events
   2024-09-18T19:44:14.211559950Z 24/09/18 19:44:14 INFO HoodieTableMetaClient: Loading HoodieTableMetaClient from s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events
   2024-09-18T19:44:14.255635781Z 24/09/18 19:44:14 INFO HoodieTableConfig: Loading table properties from s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events/.hoodie/hoodie.properties
   2024-09-18T19:44:14.335476879Z 24/09/18 19:44:14 INFO HoodieTableMetaClient: Finished Loading Table of type COPY_ON_WRITE(version=1, baseFileFormat=PARQUET) from s3a://trusted-luna-prod/tmevents_hourly/topics/account_balance_events
   2024-09-18T19:44:14.350218569Z 24/09/18 19:44:14 INFO DefaultSource: Is bootstrapped table => false, tableType is: COPY_ON_WRITE, queryType is: snapshot
   2024-09-18T19:44:14.464630233Z 24/09/18 19:44:14 INFO HoodieActiveTimeline: Loaded instants upto : Option{val=[==>20240918193949292__commit__INFLIGHT__20240918194221000]}
   ```
   
   

