[
https://issues.apache.org/jira/browse/HUDI-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hans Eschbaum updated HUDI-9540:
--------------------------------
Attachment: image-2025-08-27-15-25-33-375.png
> Incremental Reads Issue With Flink
> ----------------------------------
>
> Key: HUDI-9540
> URL: https://issues.apache.org/jira/browse/HUDI-9540
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 1.0.2
> Reporter: Hans Eschbaum
> Priority: Major
> Attachments: image-2025-08-27-15-25-33-375.png
>
>
> Hey!
> There seems to be an issue with incremental reads with MoR tables in Flink.
>
> When I read via PySpark with, e.g.,
> {code:python}
> read_options = {
>     'hoodie.datasource.query.type': 'incremental',
>     'hoodie.datasource.read.begin.instanttime': '0',
> }{code}
> some records that appear in snapshot reads are missing from the result.
> I think this issue is likely related to the difference in MoR write paths
> between Spark and Flink: Spark always puts inserts into the base file,
> while Flink puts inserts into the log files as well. For some reason,
> some or all of the inserts that only exist in log files and haven't yet
> been compacted into base files are ignored during incremental reads.
> (My data has no deletes.)
> Hudi version: 1.0.2
> Flink Version: 1.20.0
> Storage: S3
> (Writes are via AWS Managed Flink, reads are via EMR Serverless)
> To reproduce:
> 1. Do some writes via Flink to a Hudi sink
> 2. Compare an incremental read to a snapshot read; some of the records
> are missing from the incremental result.
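> To make the comparison in step 2 concrete, here is a minimal sketch in
> plain Python. It assumes the record keys from both reads have already been
> collected (e.g. via {{df.select('_hoodie_record_key').collect()}} in
> PySpark; the key values below are hypothetical) and lists the keys that
> the snapshot read returns but the incremental read does not:

```python
def missing_from_incremental(snapshot_keys, incremental_keys):
    """Return record keys present in the snapshot read but absent
    from the incremental read (the symptom described above)."""
    return sorted(set(snapshot_keys) - set(incremental_keys))

# Hypothetical key sets collected from the two reads.
snapshot_keys = ['k1', 'k2', 'k3', 'k4']
incremental_keys = ['k1', 'k3']  # inserts still only in log files may be dropped

print(missing_from_incremental(snapshot_keys, incremental_keys))
```

> Any non-empty result here (with no deletes in the data) reproduces the bug.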
--
This message was sent by Atlassian Jira
(v8.20.10#820010)