[ 
https://issues.apache.org/jira/browse/SPARK-54216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-54216:
----------------------------------
    Fix Version/s:     (was: 4.1.0)

> Cache refresh returns stale data for DataSource V2 tables with immutable 
> Table instances
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-54216
>                 URL: https://issues.apache.org/jira/browse/SPARK-54216
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.1.0
>            Reporter: Vitalii Li
>            Priority: Major
>              Labels: pull-request-available
>
> *Problem*
> After a V2 table is modified, calling `refreshTable()` or `recacheByPlan()`
> still leaves cached queries returning the pre-modification data.
> *Root Cause*
> `CacheManager.recacheByCondition()` re-executes the old cached plan, which
> holds an immutable `Table` instance pinned to a previous snapshot, so the
> cache is repopulated with stale data.
> V1 tables don't have this issue because they use mutable file indexes that 
> implicitly refresh.
> *Reproduce*
> {code:scala}
> spark.table("v2_table").cache().count()  // Cache populated
> spark.sql("INSERT INTO v2_table VALUES (3, 'new')")  // Modify table
> spark.catalog.refreshTable("v2_table")  // Refresh cache
> spark.table("v2_table").show()  // BUG: Shows old data
> {code}
> *Solution*
> - Modify `recacheByCondition` to accept an optional `freshPlan` parameter
> - Re-execute the fresh plan (with the current snapshot) instead of the old
> cached plan
> - Update the cached plan entry to use the fresh plan
> *Impact*
> Affects Delta Lake, Iceberg, and any V2 table with immutable Table instances.
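
The root cause and the proposed fix can be illustrated with a toy model that
does not depend on Spark. This is only a sketch under stated assumptions:
`TableSnapshot`, `Plan`, `Catalog`, `ToyCacheManager`, and `demo` are
hypothetical names invented for illustration, and the `recacheByCondition`
signature here is a simplification, not the one in Spark's `CacheManager`.

```scala
// Toy model only: none of these types are Spark classes.
import scala.collection.mutable

object RecacheSketch {
  // Immutable table handle pinned to one snapshot (stands in for a V2 Table).
  final case class TableSnapshot(version: Int, rows: Vector[(Int, String)])

  // A "plan" is reduced to a scan over the snapshot it was resolved against.
  final case class Plan(tableName: String, snapshot: TableSnapshot) {
    def execute(): Vector[(Int, String)] = snapshot.rows
  }

  // Writes publish a new snapshot; existing snapshots are never mutated.
  final class Catalog {
    private val tables = mutable.Map.empty[String, TableSnapshot]
    def create(name: String, rows: Vector[(Int, String)]): Unit = {
      tables(name) = TableSnapshot(1, rows)
    }
    def insert(name: String, row: (Int, String)): Unit = {
      val cur = tables(name)
      tables(name) = TableSnapshot(cur.version + 1, cur.rows :+ row)
    }
    // Resolving a plan pins it to whatever snapshot is current right now.
    def load(name: String): Plan = Plan(name, tables(name))
  }

  final case class CacheEntry(plan: Plan, data: Vector[(Int, String)])

  final class ToyCacheManager {
    private val entries = mutable.Map.empty[String, CacheEntry]

    def cache(plan: Plan): Unit = {
      entries(plan.tableName) = CacheEntry(plan, plan.execute())
    }

    def lookup(name: String): Option[Vector[(Int, String)]] =
      entries.get(name).map(_.data)

    // Without freshPlan, the old plan (old snapshot) is re-executed.
    // With freshPlan, both the stored plan and the data are replaced.
    def recacheByCondition(cond: Plan => Boolean,
                           freshPlan: Option[Plan] = None): Unit = {
      val matching = entries.values.filter(e => cond(e.plan)).toList
      matching.foreach { old =>
        val plan = freshPlan.getOrElse(old.plan)
        entries(old.plan.tableName) = CacheEntry(plan, plan.execute())
      }
    }
  }

  // Returns (row count after old-plan recache, row count after fresh-plan recache).
  def demo(): (Int, Int) = {
    val catalog = new Catalog
    catalog.create("v2_table", Vector((1, "a"), (2, "b")))
    val cm = new ToyCacheManager
    cm.cache(catalog.load("v2_table"))
    catalog.insert("v2_table", (3, "new"))

    // Recache with the old plan: still reads the first snapshot.
    cm.recacheByCondition(_.tableName == "v2_table")
    val staleCount = cm.lookup("v2_table").get.size

    // Recache with a freshly resolved plan: reads the current snapshot.
    cm.recacheByCondition(_.tableName == "v2_table",
                          Some(catalog.load("v2_table")))
    val freshCount = cm.lookup("v2_table").get.size

    (staleCount, freshCount)
  }
}
```

In this model `RecacheSketch.demo()` returns `(2, 3)`: re-executing the old
plan keeps the two original rows, while passing a freshly resolved plan picks
up the inserted row. A V1-style mutable file index would correspond to a
`Plan` that looks up the current snapshot at execution time, which is why the
problem does not appear there.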



--
This message was sent by Atlassian Jira
(v8.20.10#820010)