[
https://issues.apache.org/jira/browse/HUDI-6802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Y Ethan Guo updated HUDI-6802:
------------------------------
Description:
We need to make sure that the file index (BaseHoodieTableFileIndex,
SparkHoodieTableFileIndex, HoodieFileIndex, subclasses, etc.) incorporates
completion time when listing files for the different query types
(snapshot, read-optimized, incremental, CDC) in Spark. This should already be
supported, as the completion-time-based listing is done in the file system view.
Still, we should revisit the code to make sure there is no gap on the Spark
side.
Log file ordering will be tackled separately. This ticket makes sure that
we return the correct set of log files based on completion time,
especially when NBCC (non-blocking concurrency control) is enabled and
concurrent writers are updating the same file group.
was:We need to make sure that the file index (BaseHoodieTableFileIndex,
SparkHoodieTableFileIndex, HoodieFileIndex, subclasses, etc.) should
incorporate completion time when listing the files for different query types
(snapshot, read-optimized, incremental, CDC) in Spark. This should already be
supported as the completion-time-based listing is done in the file system view.
Still, we should revisit the code to make sure there is no gap on the Spark
side.
> Use completion time in Spark FileIndex for listing
> --------------------------------------------------
>
> Key: HUDI-6802
> URL: https://issues.apache.org/jira/browse/HUDI-6802
> Project: Apache Hudi
> Issue Type: New Feature
> Reporter: Ethan Guo (this is the old account; please use "yihua")
> Assignee: Jonathan Vexler
> Priority: Blocker
> Fix For: 1.0.0
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> We need to make sure that the file index (BaseHoodieTableFileIndex,
> SparkHoodieTableFileIndex, HoodieFileIndex, subclasses, etc.) incorporates
> completion time when listing files for the different query types
> (snapshot, read-optimized, incremental, CDC) in Spark. This should already
> be supported, as the completion-time-based listing is done in the file system
> view. Still, we should revisit the code to make sure there is no gap on the
> Spark side.
> Log file ordering will be tackled separately. This ticket makes sure
> that we return the correct set of log files based on completion time,
> especially when NBCC (non-blocking concurrency control) is enabled and
> concurrent writers are updating the same file group.
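
To illustrate why the ticket asks for completion time rather than start
(instant) time when collecting log files, here is a minimal, language-neutral
sketch in Python. It is not Hudi code: LogFile, visible_log_files, and
visible_by_start_time are hypothetical names, and timestamps are plain
integers for simplicity. It assumes only that a log file should be visible to
a query if the commit that wrote it completed at or before the query's
timestamp.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LogFile:
    """Hypothetical stand-in for one log file in a single file group."""
    name: str
    start_time: int                 # when the writing commit was requested
    completion_time: Optional[int]  # None while the commit is still inflight


def visible_log_files(log_files: List[LogFile], as_of: int) -> List[LogFile]:
    """Keep log files whose commit completed at or before `as_of`."""
    return [
        f for f in log_files
        if f.completion_time is not None and f.completion_time <= as_of
    ]


def visible_by_start_time(log_files: List[LogFile], as_of: int) -> List[LogFile]:
    """Naive filter on start time only, shown for comparison."""
    return [f for f in log_files if f.start_time <= as_of]


# Three concurrent writers updating the same file group (assumed scenario).
logs = [
    LogFile("writer_a.log", start_time=10, completion_time=40),
    LogFile("writer_b.log", start_time=20, completion_time=30),
    LogFile("writer_c.log", start_time=25, completion_time=None),
]

by_completion = [f.name for f in visible_log_files(logs, as_of=35)]
by_start = [f.name for f in visible_by_start_time(logs, as_of=35)]

print(by_completion)  # ['writer_b.log']
print(by_start)       # ['writer_a.log', 'writer_b.log', 'writer_c.log']
```

In this assumed scenario, writer A started first but finished after the
query timestamp, and writer C has not finished at all. Filtering on start
time alone would return both of their log files, while filtering on
completion time returns only writer B's.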
--
This message was sent by Atlassian Jira
(v8.20.10#820010)