[ https://issues.apache.org/jira/browse/SPARK-54216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dongjoon Hyun updated SPARK-54216:
----------------------------------
Fix Version/s: (was: 4.1.0)
> Cache refresh returns stale data for DataSource V2 tables with immutable
> Table instances
> ----------------------------------------------------------------------------------------
>
> Key: SPARK-54216
> URL: https://issues.apache.org/jira/browse/SPARK-54216
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.1.0
> Reporter: Vitalii Li
> Priority: Major
> Labels: pull-request-available
>
> *Problem*
> After modifying a V2 table and calling `refreshTable()` or `recacheByPlan()`,
> cached queries return stale data instead of updated data.
> *Root Cause*
> `CacheManager.recacheByCondition()` re-executes the old cached plan, which
> contains an immutable `Table` instance pointing to a previous snapshot, so
> the re-execution reads stale data.
> V1 tables don't have this issue because they use mutable file indexes that
> implicitly refresh.
> *Reproduce*
> {code:scala}
> spark.table("v2_table").cache().count() // Cache populated
> spark.sql("INSERT INTO v2_table VALUES (3, 'new')") // Modify table
> spark.catalog.refreshTable("v2_table") // Refresh cache
> spark.table("v2_table").show() // BUG: Shows old data
> {code}
> *Solution*
> - Modify `recacheByCondition` to accept an optional `freshPlan` parameter
> - Use the fresh plan (with the current snapshot) for re-execution instead of
> the old cached plan
> - Update the cached plan entry to use the fresh plan
> *Impact*
> Affects Delta Lake, Iceberg, and any V2 table with immutable Table instances.
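
The root cause and the proposed fix can be illustrated with a small toy model. This is only a sketch under stated assumptions: `TableSnapshot`, `Plan`, `ToyCatalog`, `ToyCache`, and `recache` are hypothetical stand-ins invented for illustration, not Spark classes or the actual `CacheManager` code. The only thing borrowed from the issue is the idea of an optional `freshPlan` argument.

```scala
// Toy model only: every type here is a hypothetical stand-in, not a Spark class.

// An immutable table handle pinned to one snapshot, like a V2 `Table` instance.
final case class TableSnapshot(version: Int, rows: Vector[(Int, String)])

// A "plan" here is just a scan over the snapshot it was built with.
final case class Plan(table: TableSnapshot) {
  def execute(): Vector[(Int, String)] = table.rows
}

// Hypothetical catalog: every write produces a new immutable snapshot.
final class ToyCatalog {
  private var current = TableSnapshot(0, Vector((1, "a"), (2, "b")))
  def load(): TableSnapshot = current
  def insert(row: (Int, String)): Unit = {
    current = TableSnapshot(current.version + 1, current.rows :+ row)
  }
}

// Hypothetical single-entry cache: the plan plus its materialized result.
final class ToyCache {
  private var entry: Option[(Plan, Vector[(Int, String)])] = None

  def cache(plan: Plan): Unit = {
    entry = Some((plan, plan.execute()))
  }

  def cached: Vector[(Int, String)] = entry.map(_._2).getOrElse(Vector.empty)

  // Re-executes the old cached plan unless a fresh plan is supplied;
  // when one is supplied, it also replaces the plan stored in the entry.
  def recache(freshPlan: Option[Plan] = None): Unit = {
    entry = entry.map { case (oldPlan, _) =>
      val plan = freshPlan.getOrElse(oldPlan)
      (plan, plan.execute())
    }
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val catalog = new ToyCatalog
    val cache = new ToyCache
    cache.cache(Plan(catalog.load()))          // cache populated from snapshot 0
    catalog.insert((3, "new"))                 // table now at snapshot 1

    cache.recache()                            // old plan, still pinned to snapshot 0
    println(cache.cached.size)                 // prints 2 (stale)

    cache.recache(Some(Plan(catalog.load())))  // fresh plan built from current snapshot
    println(cache.cached.size)                 // prints 3
  }
}
```

In this model, re-running the old plan can never see the new row because the plan holds the old snapshot by value. Passing a plan built from a fresh catalog lookup is what makes the refresh effective.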