[ 
https://issues.apache.org/jira/browse/HUDI-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7264:
---------------------------------
    Labels: pull-request-available  (was: )

>  In a Query-Only Spark Session, the Latest Visible Commit Is Not Updated
> ------------------------------------------------------------------------
>
>                 Key: HUDI-7264
>                 URL: https://issues.apache.org/jira/browse/HUDI-7264
>             Project: Apache Hudi
>          Issue Type: Improvement
>            Reporter: Ma Jian
>            Priority: Major
>              Labels: pull-request-available
>
> In the current version, HoodieFileIndex is a member variable of 
> HoodieBaseRelation. PR #7871 has made Hudi's acquisition of Relation behave 
> more like Spark's. However, in Spark, the relation is cached as follows:
> catalog.getCachedPlan(qualifiedTableName, () => {
>   val dataSource = DataSource(
>     sparkSession,
>     userSpecifiedSchema = if (table.schema.isEmpty) None else Some(table.schema),
>     partitionColumns = table.partitionColumnNames,
>     bucketSpec = table.bucketSpec,
>     className = table.provider.get,
>     options = dsOptions,
>     catalogTable = Some(table)
>   )
>   LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)
> })
> This results in the continuous use of the same HoodieFileIndex instance.
> However, HoodieFileIndex contains cached items like cachedAllPartitionPaths 
> and cachedAllInputFileSlices, which are reset only when a new 
> HoodieFileIndex is created. A SparkSession executes refreshTable only when 
> actions like 'insert' are performed. If HoodieFileIndex is never refreshed, a 
> SparkSession that only executes queries will always see the version that was 
> cached during the initial query. This is not the expected behavior. In 
> practice, Delta Lake seems to attempt updating the snapshot with each 
> listFiles operation. Before PR #7871, Hudi would recreate the relation with 
> each query, obtaining the latest snapshot. Therefore, I believe listFiles 
> should begin by checking whether the cache (snapshot) needs to be updated.
>  
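
To illustrate the staleness described above and the proposed check at the
start of listFiles, here is a minimal toy sketch. It does not use any Hudi or
Spark classes: Timeline, FileIndexSketch, latestCommit, allFiles and
refreshOnList are hypothetical names invented for this example, and the
actual fix in Hudi may look quite different.

```scala
// Toy model only: none of these names are Hudi or Spark APIs.
object RefreshOnListFilesSketch {

  // Stands in for the table's commit timeline on storage.
  final class Timeline {
    private var commits: Vector[(String, Seq[String])] = Vector.empty

    def commit(instant: String, newFiles: Seq[String]): Unit =
      commits = commits :+ (instant -> newFiles)

    def latestCommit: Option[String] = commits.lastOption.map(_._1)

    def allFiles: Seq[String] = commits.flatMap(_._2)
  }

  // Stands in for a file index that one cached relation keeps reusing.
  final class FileIndexSketch(timeline: Timeline, refreshOnList: Boolean) {
    private var cachedCommit: Option[String] = None
    private var cachedFiles: Option[Seq[String]] = None

    def listFiles(): Seq[String] = {
      val latest = timeline.latestCommit
      // The proposed assessment: drop the cache if the timeline has advanced.
      if (refreshOnList && latest != cachedCommit) cachedFiles = None
      cachedFiles.getOrElse {
        val loaded = timeline.allFiles
        cachedCommit = latest
        cachedFiles = Some(loaded)
        loaded
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val timeline = new Timeline
    timeline.commit("001", Seq("f1.parquet"))

    val stale = new FileIndexSketch(timeline, refreshOnList = false)
    val fresh = new FileIndexSketch(timeline, refreshOnList = true)
    stale.listFiles() // the first query populates each cache
    fresh.listFiles()

    timeline.commit("002", Seq("f2.parquet")) // a write from another session

    println(stale.listFiles()) // still only the files from commit 001
    println(fresh.listFiles()) // also includes the file from commit 002
  }
}
```

In this sketch the check compares only the latest commit identifier, so a
query-only session pays for one cheap lookup per listFiles call and reloads
the cached listing only when something has actually changed.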



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
