Vitalii Li created SPARK-54216:
----------------------------------
Summary: Cache refresh returns stale data for DataSource V2 tables
with immutable Table instances
Key: SPARK-54216
URL: https://issues.apache.org/jira/browse/SPARK-54216
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 4.1.0
Reporter: Vitalii Li
Fix For: 4.1.0
*Problem*
After a V2 table is modified and `refreshTable()` or `recacheByPlan()` is called,
cached queries still return the stale data instead of the updated data.
*Root Cause*
`CacheManager.recacheByCondition()` re-executes the old cached plan, which contains
an immutable `Table` instance pinned to a previous snapshot, so the re-executed
query reads stale data.
V1 tables don't have this issue because they use mutable file indexes that
implicitly refresh.
*Reproduce*
{code:scala}
spark.table("v2_table").cache().count() // Cache populated
spark.sql("INSERT INTO v2_table VALUES (3, 'new')") // Modify table
spark.catalog.refreshTable("v2_table") // Refresh cache
spark.table("v2_table").show() // BUG: Shows old data
{code}
*Solution*
- Modify `recacheByCondition` to accept an optional `freshPlan` parameter
- Use the fresh plan (with the current snapshot) for re-execution instead of the old
cached plan
- Update the cached plan entry to use the fresh plan
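The root cause and the proposed fix can be illustrated with a small toy model that does not depend on Spark. Everything in this sketch (`SnapshotTable`, `ToyCatalog`, `Plan`, `ToyCache` and its `recache` method) is a hypothetical stand-in that only mimics the behavior described in this issue; it is not the actual `CacheManager` code, and the real change may look quite different.

```scala
// Toy model only: none of these types are Spark classes.
object StaleCacheSketch {
  type Row = (Int, String)

  // An immutable "table" pinned to the snapshot that was current when it was loaded.
  final case class SnapshotTable(name: String, snapshotId: Int, rows: Vector[Row])

  // A hypothetical catalog: every write produces a new SnapshotTable instance.
  final class ToyCatalog {
    private var current = SnapshotTable("v2_table", 0, Vector((1, "a"), (2, "b")))

    def load(): SnapshotTable = current

    def insert(row: Row): Unit = {
      current = current.copy(snapshotId = current.snapshotId + 1, rows = current.rows :+ row)
    }
  }

  // A "plan" here is just a scan over one already-resolved table instance.
  final case class Plan(table: SnapshotTable) {
    def execute(): Vector[Row] = table.rows
  }

  // A single-entry cache: the plan that was cached plus its materialized rows.
  final class ToyCache {
    private var entry: Option[(Plan, Vector[Row])] = None

    def cache(plan: Plan): Unit = {
      entry = Some((plan, plan.execute()))
    }

    def cachedRows: Vector[Row] = entry.map(_._2).getOrElse(Vector.empty)

    // Without freshPlan, the old plan (and its old snapshot) is re-executed.
    // With freshPlan, the entry is rebuilt from the fresh plan and stores it.
    def recache(freshPlan: Option[Plan] = None): Unit = {
      entry = entry.map { case (oldPlan, _) =>
        val plan = freshPlan.getOrElse(oldPlan)
        (plan, plan.execute())
      }
    }
  }
}
```

In this model, calling `recache()` after `catalog.insert((3, "new"))` leaves `cachedRows` at the two rows of the old snapshot, whereas `recache(Some(Plan(catalog.load())))` re-resolves the table and picks up the third row.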
*Impact*
Affects Delta Lake, Iceberg, and any V2 table with immutable Table instances.