Thanks Sudha, could you elaborate on how "query engines can identify the
latest file slice in every file group"? Also, could you point me to where
this lives in the code and in the Hudi API?
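
For context, here is my rough guess at what that looks like at the API
level, pieced together from skimming hudi-common (the class and method
names below are my assumption and may differ between Hudi versions):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hudi.common.table.HoodieTableMetaClient;
  import org.apache.hudi.common.table.timeline.HoodieTimeline;
  import org.apache.hudi.common.table.view.HoodieTableFileSystemView;

  Configuration hadoopConf = new Configuration();
  String basePath = "/data/hudi/my_table";   // hypothetical table path
  String partitionPath = "2024/08/07";       // hypothetical partition

  // Meta client over the table's base path (builder args differ by version).
  HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
      .setConf(hadoopConf)
      .setBasePath(basePath)
      .build();

  // Only completed instants in the *active* timeline feed the view.
  HoodieTimeline completed =
      metaClient.getActiveTimeline().getCommitsTimeline().filterCompletedInstants();

  // The file system view groups data files into file groups / file slices
  // and can return the latest slice of each file group in a partition.
  HoodieTableFileSystemView fsView =
      new HoodieTableFileSystemView(metaClient, completed);
  fsView.getLatestFileSlices(partitionPath)
      .forEach(slice -> System.out.println(
          slice.getFileId() + " -> " + slice.getBaseInstantTime()));

Is that roughly the right place to be looking?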

Much appreciated!
Jack

On Wed, Aug 7, 2024 at 9:07 PM Bhavani Sudha <bhavanisud...@gmail.com>
wrote:

> Thanks Jack for the question and for your efforts learning these concepts.
> There isn’t a safety issue here. There are two services at play: Archival
> and Cleaning. In Hudi, the Archival process moves older commit information
> in the timeline to the archived directory. The actual data itself is not
> cleaned up by the archival process; this is done by the cleaner process
> based on the cleaner settings. The cleaner only removes older versions of a
> file group. If there is only one file slice (version) in a file group (e.g.,
> no updates since the data was first written), it will remain untouched by
> the cleaner.
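>
> To make the distinction concrete, the two services are tuned with separate
> settings. A rough sketch in Java (illustrative values only, not
> recommendations; the archival minimum should stay above the cleaner's
> retained commits):
>
>   import java.util.HashMap;
>   import java.util.Map;
>
>   Map<String, String> opts = new HashMap<>();
>   // Cleaner: how many commits' worth of older file slices to keep around.
>   opts.put("hoodie.cleaner.commits.retained", "10");
>   // Archival: bounds on how many commits remain in the active timeline.
>   opts.put("hoodie.keep.min.commits", "20");
>   opts.put("hoodie.keep.max.commits", "30");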
>
> For snapshot queries, all query engines can identify the latest file slice
> in every file group and read from that. Even if the older commit metadata
> for a file group is archived, the file group itself remains accessible.
> However, for time travel and incremental queries, the commit metadata is
> necessary to track changes over time. Archiving older commit info limits
> how far back you can go for these types of queries, restricting them to the
> oldest commit in the active timeline. This also has implications for
> rollbacks and restores. When commit metadata from the timeline is archived,
> all side effects are removed from storage. In other words, that arbitrary
> number is how far back we keep the history of metadata in the timeline. The
> latest committed data for all file groups is always available for querying.
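>
> As an illustration of the time travel bound (a sketch using the Spark
> DataSource API; the instant below is a made-up timestamp, and spark /
> basePath are assumed to already exist):
>
>   import org.apache.spark.sql.Dataset;
>   import org.apache.spark.sql.Row;
>
>   // Time travel reads can only go back to the oldest instant still in
>   // the active timeline; anything older has been archived.
>   Dataset<Row> asOf = spark.read().format("hudi")
>       .option("as.of.instant", "20240801000000")
>       .load(basePath);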
>
>
> Thanks,
>
> Sudha
>
> On Tue, Aug 6, 2024 at 9:24 AM Jack Vanlightly <vanligh...@apache.org>
> wrote:
>
> > Hi all,
> >
> > In April I wrote a formal specification for COW tables (
> > https://github.com/Vanlightly/table-formats-tlaplus/tree/main/hudi/v5_spec/basic_cow
> > )
> > and since then I was looking at possibly going back and adding MOR as
> > well as archival and compaction.
> >
> > I've read the code, read the docs, and there's something that I can't
> > figure out about timeline archival - how does Hudi prevent the archive
> > process from archiving "live" instants? If, for example, I have a primary
> > key table with 2 file groups, and "min commits to keep" is 20 but the last
> > 20 commits are all related to file group 2, then the commits of file group
> > 1 would be archived, making file group 1 unreadable.
> >
> > Delta Lake handles log cleaning via checkpointing. Once a checkpoint has
> > been inserted into the Delta Log, prior entries can be removed. But with
> > Hudi, it seems you choose an arbitrary number of commits to keep, and so
> > I am left wondering how it can be safe.
> >
> > I am sure I have missed something, thanks in advance.
> >
> > Jack Vanlightly
> >
>
