Thanks Jack. Query engines interact with Hudi through the HoodieTableMetaClient <https://github.com/apache/hudi/blob/36242ff516dbd92fa6ef16bbcc150dfc6488d815/hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java> and TableFileSystemView <https://github.com/apache/hudi/blob/36242ff516dbd92fa6ef16bbcc150dfc6488d815/hudi-common/src/main/java/org/apache/hudi/common/table/view/TableFileSystemView.java> APIs. The file system view APIs support filtering file slices based on whether the latest snapshot is being queried or a time travel query is being run. The file system view can also make use of the files index if the metadata table is enabled on the engine side.
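For illustration, here is a minimal Java sketch of resolving file slices through those APIs (not taken from Hudi's docs; the base path, partition, and instant time are placeholders, and the setConf variant of the builder depends on the Hudi version -- newer versions take a StorageConfiguration instead of a Hadoop Configuration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hudi.common.model.FileSlice;
    import org.apache.hudi.common.table.HoodieTableMetaClient;
    import org.apache.hudi.common.table.timeline.HoodieTimeline;
    import org.apache.hudi.common.table.view.HoodieTableFileSystemView;

    import java.util.stream.Stream;

    public class FileSliceLookup {
      public static void main(String[] args) {
        // Placeholders: substitute a real table path and partition.
        String basePath = "s3://bucket/warehouse/my_table";
        String partitionPath = "2024/08/07";

        HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
            .setConf(new Configuration()) // Hadoop-Configuration builder variant
            .setBasePath(basePath)
            .build();

        // Readers only see completed instants on the active timeline.
        HoodieTimeline completed = metaClient.getActiveTimeline()
            .getCommitsTimeline()
            .filterCompletedInstants();

        HoodieTableFileSystemView fsView =
            new HoodieTableFileSystemView(metaClient, completed);

        // Snapshot query: the latest file slice in every file group.
        Stream<FileSlice> latest = fsView.getLatestFileSlices(partitionPath);

        // Time travel: the latest file slice on or before a given instant
        // (illustrative instant time; the format varies by version).
        Stream<FileSlice> asOf = fsView.getLatestFileSlicesBeforeOrOn(
            partitionPath, "20240801000000", true);

        latest.forEach(slice -> System.out.println(slice.getFileId()));
        asOf.forEach(slice -> System.out.println(slice.getBaseInstantTime()));
      }
    }

When the metadata table is enabled, the same view can be backed by the files index (for example, constructed via FileSystemViewManager) instead of direct file listings.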
Hope that helps.

Thanks,
Sudha

On Wed, Aug 7, 2024 at 1:45 PM Jack Vanlightly <vanligh...@apache.org> wrote:

> Thanks Sudha, could you elaborate on how "query engines can identify the
> latest file slice in every file group"? Also, could you point me to where
> this exists in the code, and the Hudi API?
>
> Much appreciated!
> Jack
>
> On Wed, Aug 7, 2024 at 9:07 PM Bhavani Sudha <bhavanisud...@gmail.com>
> wrote:
>
> > Thanks Jack for the question and for your efforts learning these
> > concepts. There isn't a safety issue here. There are two services at
> > play: Archival and Cleaning. In Hudi, the Archival process moves older
> > commit information in the timeline to the archived directory. The
> > actual data itself is not cleaned up by the archival process; this is
> > done by the cleaner process based on the cleaner settings. The cleaner
> > only removes older versions of a file group. If there is only one file
> > slice (version) in a file group (e.g., no updates since the data was
> > first written), it will remain untouched by the cleaner.
> >
> > For snapshot queries, all query engines can identify the latest file
> > slice in every file group and read from that. Even if the older commit
> > metadata for a file group is archived, the file group itself remains
> > accessible. However, for time travel and incremental queries, the
> > commit metadata is necessary to track changes over time. Archiving
> > older commit info limits how far back you can go for these types of
> > queries, restricting them to the oldest commit in the active timeline.
> > This also has implications for rollbacks and restores. When commit
> > metadata from the timeline is archived, all side effects are removed
> > from storage. In other words, that arbitrary number is how far back we
> > keep the history of metadata in the timeline. The latest committed
> > data for all file groups is always available for querying.
> >
> > Thanks,
> >
> > Sudha
> >
> > On Tue, Aug 6, 2024 at 9:24 AM Jack Vanlightly <vanligh...@apache.org>
> > wrote:
> >
> > > Hi all,
> > >
> > > In April I wrote a formal specification for COW tables (
> > > https://github.com/Vanlightly/table-formats-tlaplus/tree/main/hudi/v5_spec/basic_cow
> > > ) and since then I was looking at possibly going back and adding MOR
> > > as well as archival and compaction.
> > >
> > > I've read the code and the docs, and there's something that I can't
> > > figure out about timeline archival - how does Hudi prevent the
> > > archive process from archiving "live" instants? If, for example, I
> > > have a primary key table with 2 file groups, and "min commits to
> > > keep" is 20 but the last 20 commits are all related to file group 2,
> > > then the commits of file group 1 would be archived, making file
> > > group 1 unreadable.
> > >
> > > Delta Lake handles log cleaning via checkpointing. Once a checkpoint
> > > has been inserted into the Delta Log, prior entries can be removed.
> > > But with Hudi, it seems you choose an arbitrary number of commits to
> > > keep, and so I am left wondering how it can be safe?
> > >
> > > I am sure I have missed something, thanks in advance.
> > >
> > > Jack Vanlightly
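For reference, the "arbitrary number of commits to keep" discussed in the quoted thread maps to the cleaner and archival configs. A minimal sketch with illustrative values (property names are from Hudi's configuration reference; defaults vary by version):

    import java.util.HashMap;
    import java.util.Map;

    public class RetentionConfigSketch {
      public static void main(String[] args) {
        Map<String, String> writeOpts = new HashMap<>();
        // Cleaner: retain file slices needed by the last N commits.
        writeOpts.put("hoodie.cleaner.policy", "KEEP_LATEST_COMMITS");
        writeOpts.put("hoodie.cleaner.commits.retained", "10");
        // Archival: keep between min and max commits on the active
        // timeline. keep.min.commits should stay above
        // cleaner.commits.retained so the active timeline always covers
        // every file slice the cleaner has retained.
        writeOpts.put("hoodie.keep.min.commits", "20");
        writeOpts.put("hoodie.keep.max.commits", "30");
        writeOpts.forEach((k, v) -> System.out.println(k + "=" + v));
      }
    }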