[
https://issues.apache.org/jira/browse/HUDI-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated HUDI-7264:
---------------------------------
Labels: pull-request-available (was: )
> In a Query-Only Spark Session, the Latest Visible Commit Is Not Updated
> ------------------------------------------------------------------------
>
> Key: HUDI-7264
> URL: https://issues.apache.org/jira/browse/HUDI-7264
> Project: Apache Hudi
> Issue Type: Improvement
> Reporter: Ma Jian
> Priority: Major
> Labels: pull-request-available
>
> In the current version, HoodieFileIndex is a member variable of
> HoodieBaseRelation. PR #7871 has made Hudi's acquisition of Relation behave
> more like Spark's. However, in Spark, the relation is cached as follows:
> catalog.getCachedPlan(qualifiedTableName, () => {
>   val dataSource = DataSource(
>     sparkSession,
>     userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
>     partitionColumns = table.partitionColumnNames,
>     bucketSpec = table.bucketSpec,
>     className = table.provider.get,
>     options = dsOptions,
>     catalogTable = Some(table)
>   )
>   LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)
> })
> This results in the continuous use of the same HoodieFileIndex instance.
> However, HoodieFileIndex contains cached items like cachedAllPartitionPaths
> and cachedAllInputFileSlices, which only reset upon creating a new
> HoodieFileIndex. A SparkSession executes refreshTable only when actions
> like 'insert' are performed. If HoodieFileIndex is never refreshed, then a
> SparkSession that only executes queries will always see the version that was
> cached during the initial query. This is not the expected behavior. In
> practice, Delta Lake seems to attempt updating the snapshot with each
> listFiles operation. Before PR #7871, Hudi would recreate the relation with
> each query, obtaining the latest snapshot. Therefore, I believe there should
> be an assessment at the start of listFiles to determine whether the cache
> (snapshot) needs to be updated.
>
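The check proposed above could look roughly like the following. This is a minimal sketch, not Hudi code: RefreshingFileIndex, Snapshot, latestInstant and loadPartitionPaths are hypothetical names, and the assumption is that the latest completed commit can be looked up cheaply and compared with the commit the cache was built from.

```scala
// Hypothetical sketch: none of these names are Hudi or Spark APIs.

// Cached listing state, tagged with the commit instant it was built from.
final case class Snapshot(instant: Option[String], partitionPaths: Seq[String])

// latestInstant:      assumed cheap lookup of the latest completed commit.
// loadPartitionPaths: assumed expensive listing that the cache exists to avoid.
final class RefreshingFileIndex(
    latestInstant: () => Option[String],
    loadPartitionPaths: () => Seq[String]) {

  private var cached: Option[Snapshot] = None

  // At the start of every listFiles call, compare the cached instant with the
  // latest one and rebuild the cache only when they differ.
  def listFiles(): Seq[String] = synchronized {
    val latest = latestInstant()
    cached match {
      case Some(snapshot) if snapshot.instant == latest =>
        snapshot.partitionPaths
      case _ =>
        val fresh = Snapshot(latest, loadPartitionPaths())
        cached = Some(fresh)
        fresh.partitionPaths
    }
  }
}
```

With this shape, a session that only runs queries would pick up a new commit on its next listFiles call, while repeated queries against an unchanged table keep reusing the cached listing. The cost is one latest-instant lookup per call, so whether this is acceptable depends on how cheap that lookup is in practice.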
--
This message was sent by Atlassian Jira
(v8.20.10#820010)