[ https://issues.apache.org/jira/browse/HUDI-9540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Danny Chen updated HUDI-9540:
-----------------------------
    Fix Version/s: 1.1.0

> Incremental Reads Issue With Spark
> ----------------------------------
>
>                 Key: HUDI-9540
>                 URL: https://issues.apache.org/jira/browse/HUDI-9540
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 1.0.2
>            Reporter: Hans Eschbaum
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.1.0
>
>         Attachments: image-2025-08-27-15-25-33-375.png
>
>
> Hey!
> There seems to be an issue with incremental reads of MoR tables in Spark.
>  
> When I read via PySpark with, e.g.,
> {code:python}
> read_options = {
>     'hoodie.datasource.query.type': 'incremental',
>     'hoodie.datasource.read.begin.instanttime': '0',
> }
> {code}
> some of the records that do show up with snapshot reads are missing.
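> For completeness, the options above are applied roughly like this (a minimal
> sketch; table_path is a placeholder for the actual S3 path):
> {code:python}
> # `spark` is the active SparkSession; `table_path` is a hypothetical S3 path.
> table_path = 's3://bucket/path/to/table'
> df = spark.read.format('hudi').options(**read_options).load(table_path)
> {code}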
> I think this is likely related to the difference in how Spark and Flink
> perform MoR writes: Spark always puts inserts into the base file, whereas
> Flink puts inserts into the log files as well. For some reason, some or all
> of the inserts that exist only in log files and have not yet been compacted
> into base files are ignored during incremental reads.
> (My data has no deletes)
> Hudi version: 1.0.2
> Flink Version: 1.20.0
> Storage: S3
> (Writes are via AWS Managed Flink, reads are via EMR Serverless)
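> For reference, the sink is defined roughly like the minimal PyFlink sketch
> below (table name, schema, and S3 path are placeholders, not the actual job):
> {code:python}
> # Minimal sketch of a MoR Hudi sink in PyFlink; the schema and path are
> # hypothetical. The real writes run on AWS Managed Flink.
> from pyflink.table import EnvironmentSettings, TableEnvironment
>
> t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
> t_env.execute_sql("""
>     CREATE TABLE hudi_sink (
>         id STRING PRIMARY KEY NOT ENFORCED,
>         data STRING,
>         ts TIMESTAMP(3)
>     ) WITH (
>         'connector' = 'hudi',
>         'path' = 's3://bucket/path/to/table',
>         'table.type' = 'MERGE_ON_READ'
>     )
> """)
> {code}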
> To reproduce:
> 1. Do some writes via Flink to a Hudi sink (as sketched above)
> 2. Compare an incremental read to a snapshot read; some of the records are
> missing from the incremental read (a comparison sketch follows below).
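> A quick way to see the discrepancy is to diff the two query types on the
> record key (a sketch, using the same placeholder table_path as above):
> {code:python}
> # Snapshot read: includes uncompacted inserts that only exist in log files.
> snapshot_df = spark.read.format('hudi') \
>     .option('hoodie.datasource.query.type', 'snapshot') \
>     .load(table_path)
> # Incremental read from the beginning of time.
> incremental_df = spark.read.format('hudi') \
>     .option('hoodie.datasource.query.type', 'incremental') \
>     .option('hoodie.datasource.read.begin.instanttime', '0') \
>     .load(table_path)
> # Keys visible to the snapshot query but missing from the incremental one.
> missing = snapshot_df.select('_hoodie_record_key') \
>     .subtract(incremental_df.select('_hoodie_record_key'))
> print(missing.count())  # anything > 0 reproduces the issue
> {code}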



--
This message was sent by Atlassian Jira
(v8.20.10#820010)