[jira] [Updated] (HUDI-6802) Use completion time in Spark FileIndex for listing

Y Ethan Guo (Jira) Thu, 26 Sep 2024 11:23:07 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Y Ethan Guo updated HUDI-6802:
------------------------------
    Description: 
We need to make sure that the file index (BaseHoodieTableFileIndex, 
SparkHoodieTableFileIndex, HoodieFileIndex, subclasses, etc.) incorporates 
completion time when listing the files for different query types 
(snapshot, read-optimized, incremental, CDC) in Spark.  This should already be 
supported, as the completion-time-based listing is done in the file system 
view.  Still, we should revisit the code to make sure there is no gap on the 
Spark side.

The log file ordering will be tackled separately.  This ticket makes sure that 
we return the correct set of log files based on the completion time, 
especially when non-blocking concurrency control (NBCC) is enabled and 
concurrent writers are updating the same file group.

  was:We need to make sure that the file index (BaseHoodieTableFileIndex, 
SparkHoodieTableFileIndex, HoodieFileIndex, subclasses, etc.) incorporates 
completion time when listing the files for different query types 
(snapshot, read-optimized, incremental, CDC) in Spark.  This should already be 
supported, as the completion-time-based listing is done in the file system 
view.  Still, we should revisit the code to make sure there is no gap on the 
Spark side.


> Use completion time in Spark FileIndex for listing
> --------------------------------------------------
>
>                 Key: HUDI-6802
>                 URL: https://issues.apache.org/jira/browse/HUDI-6802
>             Project: Apache Hudi
>          Issue Type: New Feature
>            Reporter: Ethan Guo (this is the old account; please use "yihua")
>            Assignee: Jonathan Vexler
>            Priority: Blocker
>             Fix For: 1.0.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> We need to make sure that the file index (BaseHoodieTableFileIndex, 
> SparkHoodieTableFileIndex, HoodieFileIndex, subclasses, etc.) incorporates 
> completion time when listing the files for different query types 
> (snapshot, read-optimized, incremental, CDC) in Spark.  This should already 
> be supported, as the completion-time-based listing is done in the file 
> system view.  Still, we should revisit the code to make sure there is no gap 
> on the Spark side.
> The log file ordering will be tackled separately.  This ticket makes sure 
> that we return the correct set of log files based on the completion time, 
> especially when non-blocking concurrency control (NBCC) is enabled and 
> concurrent writers are updating the same file group.
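
To illustrate why the listing should key on completion time rather than on the time a commit started, here is a minimal Python sketch. It is not Hudi code: `LogFile`, `visible_log_files`, and the `completion_times` mapping are hypothetical names made up for this example, and timestamps are assumed to be fixed-width strings so that string comparison matches time order.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LogFile:
    """Hypothetical stand-in for one log file in a file group."""
    name: str
    instant_time: str  # time at which the writing commit started


def visible_log_files(
    log_files: List[LogFile],
    completion_times: Dict[str, str],
    as_of: str,
) -> List[LogFile]:
    """Return the log files whose commit completed at or before `as_of`.

    `completion_times` maps a commit's start (instant) time to its
    completion time; commits that are still in flight have no entry.
    """
    visible = []
    for log_file in log_files:
        completed = completion_times.get(log_file.instant_time)
        # No entry means the commit has not completed, so skip the file.
        if completed is not None and completed <= as_of:
            visible.append(log_file)
    return visible


# Two concurrent writers update the same file group. Writer A starts
# first (instant 001) but finishes last (005); writer B starts later
# (instant 002) and finishes first (003).
log_files = [LogFile("log-a", "001"), LogFile("log-b", "002")]
completion_times = {"001": "005", "002": "003"}
```

In this sketch, a query as of time `004` would see only `log-b`. A filter on `instant_time` alone would also return `log-a`, even though writer A had not completed by then.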



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
