[
https://issues.apache.org/jira/browse/HUDI-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shuo Cheng updated HUDI-9540:
-----------------------------
Description:
Hey!
There seems to be an issue with incremental reads of MoR tables in Spark.
When I read via PySpark with, e.g.,
{code:python}
read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': '0',
}{code}
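For comparison, the snapshot side only needs the query type switched (a sketch; 'snapshot' is also the default when no query type is set):
{code:python}
# Hypothetical snapshot-read options for comparing against the incremental
# read; 'snapshot' is the default query type, so an empty options dict
# behaves the same way.
snapshot_options = {
    'hoodie.datasource.query.type': 'snapshot',
}{code}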
Some records that show up in snapshot reads are missing from the incremental read.
I think this issue is likely related to the differences in how Spark and Flink
perform MoR writes: Spark always puts inserts into the base file, while Flink
writes inserts into the log files as well. For some reason, some or all of the
inserts that exist only in log files and haven't yet been compacted into base
files are ignored during incremental reads.
(My data has no deletes)
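For context, the Flink-side compaction cadence is what controls how long those inserts stay log-file-only. A sketch of the relevant Hudi Flink sink options (the values shown are illustrative assumptions, not my actual settings):
{code:python}
# Hypothetical Flink sink options that influence when log-file inserts are
# compacted into base files (option names from Hudi's Flink configuration):
flink_sink_options = {
    'compaction.async.enabled': 'true',  # run async compaction inside the Flink job
    'compaction.delta_commits': '5',     # trigger compaction after this many delta commits
}{code}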
Hudi version: 1.0.2
Flink Version: 1.20.0
Storage: S3
(Writes are via AWS Managed Flink, reads are via EMR Serverless)
To reproduce:
1. Do some writes via Flink to a Hudi sink
2. Compare an incremental read to a snapshot read; some of the records are
missing.
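The comparison in step 2 can be sketched with a small pure-Python helper over rows collected from the two reads; `_hoodie_record_key` is Hudi's record-key metadata column (a minimal sketch, not tied to any Spark session):
{code:python}
# Minimal sketch: given rows collected from a snapshot read and an incremental
# read of the same table, list the record keys missing on the incremental side.
def missing_record_keys(snapshot_rows, incremental_rows,
                        key_field='_hoodie_record_key'):
    snapshot_keys = {row[key_field] for row in snapshot_rows}
    incremental_keys = {row[key_field] for row in incremental_rows}
    return sorted(snapshot_keys - incremental_keys){code}
With PySpark this would be called as `missing_record_keys(snapshot_df.collect(), incremental_df.collect())`; a non-empty result reproduces the issue.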
was:
Hey!
There seems to be an issue with incremental reads of MoR tables in Flink.
When I read via PySpark with, e.g.,
{code:python}
read_options = {
    'hoodie.datasource.query.type': 'incremental',
    'hoodie.datasource.read.begin.instanttime': '0',
}{code}
Some records that show up in snapshot reads are missing from the incremental read.
I think this issue is likely related to the differences in how Spark and Flink
perform MoR writes: Spark always puts inserts into the base file, while Flink
writes inserts into the log files as well. For some reason, some or all of the
inserts that exist only in log files and haven't yet been compacted into base
files are ignored during incremental reads.
(My data has no deletes)
Hudi version: 1.0.2
Flink Version: 1.20.0
Storage: S3
(Writes are via AWS Managed Flink, reads are via EMR Serverless)
To reproduce:
1. Do some writes via Flink to a Hudi sink
2. Compare an incremental read to a snapshot read; some of the records are
missing.
> Incremental Reads Issue With Spark
> ----------------------------------
>
> Key: HUDI-9540
> URL: https://issues.apache.org/jira/browse/HUDI-9540
> Project: Apache Hudi
> Issue Type: Bug
> Affects Versions: 1.0.2
> Reporter: Hans Eschbaum
> Priority: Major
> Attachments: image-2025-08-27-15-25-33-375.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)