Ma Jian created HUDI-7264:
-----------------------------

             Summary: In a Query-Only Spark Session, the Latest Visible Commit Is Not Updated
                 Key: HUDI-7264
                 URL: https://issues.apache.org/jira/browse/HUDI-7264
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Ma Jian


In the current version, HoodieFileIndex is a member variable of 
HoodieBaseRelation. PR #7871 made Hudi's relation resolution behave more 
like Spark's native path. However, Spark caches the resolved relation as 
follows:
catalog.getCachedPlan(qualifiedTableName, () => {
  val dataSource = DataSource(
    sparkSession,
    userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
    partitionColumns = table.partitionColumnNames,
    bucketSpec = table.bucketSpec,
    className = table.provider.get,
    options = dsOptions,
    catalogTable = Some(table)
  )
  LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)
})
As a result, the same HoodieFileIndex instance is reused across queries.

However, HoodieFileIndex holds cached state such as cachedAllPartitionPaths 
and cachedAllInputFileSlices, which is only reset when a new HoodieFileIndex 
is created. A SparkSession executes refreshTable only for actions such as 
'insert'. So if the HoodieFileIndex is never refreshed, a SparkSession that 
only executes queries will always see the snapshot that was cached during 
its first query. This is not the expected behavior. In practice, Delta Lake 
appears to attempt to update its snapshot on every listFiles operation, 
whereas after PR #7871 Hudi would need to recreate the relation on each 
query to obtain the latest snapshot. Therefore, I believe there should be a 
check at the start of listFiles to determine whether the cache (snapshot) 
needs to be updated.
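The proposed check could look roughly like the following self-contained 
Scala sketch. It is not Hudi's actual API: CachingFileIndex, 
readLatestCommit, and loadPartitions are hypothetical stand-ins for a real 
timeline lookup and partition listing. The point is only the shape of the 
staleness check at the start of listFiles:

```scala
// Minimal sketch of a file index that caches its partition listing but
// re-validates against the latest visible commit before serving cached
// results, mirroring the refresh check proposed for HoodieFileIndex.
object StaleCacheSketch {
  final class CachingFileIndex(readLatestCommit: () => String) {
    private var cachedCommit: String = _
    private var cachedPartitionPaths: Seq[String] = Seq.empty

    // Analogous to the proposed check at the start of listFiles:
    // reload cached state only when a newer commit is visible.
    def listFiles(loadPartitions: () => Seq[String]): Seq[String] = {
      val latest = readLatestCommit()
      if (latest != cachedCommit) {
        cachedPartitionPaths = loadPartitions()
        cachedCommit = latest
      }
      cachedPartitionPaths
    }
  }

  def main(args: Array[String]): Unit = {
    var commit = "001"
    var partitions = Seq("p=1")
    val index = new CachingFileIndex(() => commit)

    println(index.listFiles(() => partitions)) // List(p=1)

    // A writer commits new data in another session...
    commit = "002"
    partitions = Seq("p=1", "p=2")

    // ...and the query-only session now sees the new snapshot
    // without ever calling refreshTable.
    println(index.listFiles(() => partitions)) // List(p=1, p=2)
  }
}
```

With this shape, a query-only session picks up new commits on its next 
listFiles call instead of being pinned to the snapshot cached at first use.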

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
