Ma Jian created HUDI-7264:
-----------------------------
Summary: In a Query-Only Spark Session, the Latest Visible Commit Is Not Updated
Key: HUDI-7264
URL: https://issues.apache.org/jira/browse/HUDI-7264
Project: Apache Hudi
Issue Type: Improvement
Reporter: Ma Jian
In the current version, HoodieFileIndex is a member variable of
HoodieBaseRelation. PR #7871 made Hudi's relation resolution behave more like
Spark's. In Spark, however, the resolved relation is cached as follows:
catalog.getCachedPlan(qualifiedTableName, () => {
  val dataSource = DataSource(
    sparkSession,
    userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
    partitionColumns = table.partitionColumnNames,
    bucketSpec = table.bucketSpec,
    className = table.provider.get,
    options = dsOptions,
    catalogTable = Some(table)
  )
  LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)
})
As a result, the same HoodieFileIndex instance is reused across queries.
HoodieFileIndex, however, holds cached state such as cachedAllPartitionPaths
and cachedAllInputFileSlices, which is only rebuilt when a new HoodieFileIndex
is created. A SparkSession only calls refreshTable when it performs actions
like 'insert'. So if the HoodieFileIndex is never refreshed, a SparkSession
that only runs queries will keep seeing the version that was cached by the
initial query, which is not the expected behavior. For comparison, Delta Lake
appears to attempt a snapshot update on every listFiles call, and before
PR #7871 Hudi recreated the relation on every query and therefore always saw
the latest snapshot. I believe there should be a check at the start of
listFiles to determine whether the cache (snapshot) needs to be updated.
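The proposed check can be illustrated with a minimal, self-contained sketch
(none of these classes are Hudi's actual types; Timeline and CachingFileIndex
are hypothetical stand-ins for the active timeline and for HoodieFileIndex):
before serving cached results, listFiles compares the latest completed instant
against the one recorded at the last refresh, and invalidates the cache when
they differ.

```scala
// Minimal sketch of staleness-aware caching, assuming a simplified model of
// the timeline (a single "latest completed instant" string) and of the file
// index's cached partition listing.
object StaleCacheSketch {
  // Stand-in for the table's timeline: tracks the latest completed instant.
  final class Timeline(var latestInstant: String)

  final class CachingFileIndex(timeline: Timeline) {
    private var cachedInstant: String = null
    private var cachedPartitions: Seq[String] = Seq.empty
    var refreshCount: Int = 0 // how many times the cache was rebuilt

    // Pretend partition listing; tagged with the instant it was built against.
    private def loadPartitions(): Seq[String] =
      Seq(s"partition@${timeline.latestInstant}")

    // The proposed check: refresh cached state only when the timeline has
    // advanced past the instant seen at the last refresh.
    def listFiles(): Seq[String] = {
      if (cachedInstant != timeline.latestInstant) {
        cachedPartitions = loadPartitions()
        cachedInstant = timeline.latestInstant
        refreshCount += 1
      }
      cachedPartitions
    }
  }

  def main(args: Array[String]): Unit = {
    val tl = new Timeline("001")
    val idx = new CachingFileIndex(tl)
    println(idx.listFiles())  // builds the cache for instant 001
    println(idx.listFiles())  // served from cache, no rebuild
    tl.latestInstant = "002"  // a writer in another session commits
    println(idx.listFiles())  // cache invalidated, rebuilt for 002
    println(idx.refreshCount) // 2
  }
}
```

In the real HoodieFileIndex the comparison would presumably be made against
the reloaded active timeline rather than a mutable field, but the shape of the
check (compare, refresh only on change) is the same, so an unchanged timeline
keeps the cheap cached path.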
--
This message was sent by Atlassian Jira
(v8.20.10#820010)