[
https://issues.apache.org/jira/browse/HUDI-1325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinoth Chandar updated HUDI-1325:
---------------------------------
Summary: Implement in-memory merging of metadata table with the non-synced
part of data timeline (was: Handle the corner case with syncing completed
compaction from data timeline to metadata timeline. )
> Implement in-memory merging of metadata table with the non-synced part of
> data timeline
> ---------------------------------------------------------------------------------------
>
> Key: HUDI-1325
> URL: https://issues.apache.org/jira/browse/HUDI-1325
> Project: Apache Hudi
> Issue Type: Sub-task
> Reporter: Prashant Wason
> Assignee: Vinoth Chandar
> Priority: Major
>
> Here is a corner case when syncing a completed compaction from the data
> timeline to the metadata timeline. Consider the following sequence of events:
> t0: writer schedules compaction at time instant c
> t1: Compactor starts processing c's plan
> t2: compaction finishes with c.commit published on the data timeline (not yet
> synced to metadata timeline)
> t3: In the next round of writing, the writer opens the metadata table, which
> adds the base file produced in c.commit to the metadata table.
> Any query running between t2 and t3 cannot rely on the metadata table, since
> the new base file is not yet present in it. The timeline will indicate that
> the compaction completed, and the latest file slice will be computed as
> simply the logs written to the file groups since the compaction, without the
> new base file. This will lead to incorrect results.
> If we consider the writer alone, we may be okay, since we first sync the
> metadata table before we do anything for the delta commit at t3. But in
> general, for queries, we should advise enabling metadata-table-based listings
> only after all writers, cleaners, and compactors have been enabled to use
> metadata and have been successfully publishing new/deleted files directly to
> the metadata table. In short, queries cannot rely on the metadata table while
> the syncing mechanism is the main thing that keeps the data and metadata
> timelines together.
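
The in-memory merge named in the new summary can be sketched roughly as
follows. This is a minimal illustration under stated assumptions, not Hudi's
implementation: the Instant class, the merged_listing function, the
partition path, and the file names are all hypothetical, and instant times
are assumed to be strings that sort lexicographically. The idea is that a
reader overlays the completed instants that are not yet synced on top of the
listing read from the metadata table, without modifying the table itself.

```python
from dataclasses import dataclass, field


@dataclass
class Instant:
    """A completed action on the data timeline (hypothetical, simplified)."""
    time: str     # instant time; assumed to sort lexicographically
    action: str   # e.g. "deltacommit", or "commit" for a completed compaction
    added: dict = field(default_factory=dict)    # partition -> set of new files
    deleted: dict = field(default_factory=dict)  # partition -> set of removed files


def merged_listing(metadata_files, last_synced_time, data_timeline):
    """Overlay instants newer than last_synced_time on the metadata listing."""
    # Work on a copy so the metadata table's own listing is left untouched.
    view = {partition: set(files) for partition, files in metadata_files.items()}
    for instant in sorted(data_timeline, key=lambda i: i.time):
        if instant.time <= last_synced_time:
            continue  # already reflected in the metadata table
        for partition, files in instant.added.items():
            view.setdefault(partition, set()).update(files)
        for partition, files in instant.deleted.items():
            view.setdefault(partition, set()).difference_update(files)
    return view


# State between t2 and t3: the compaction ("002") has completed on the data
# timeline, but the metadata table has only been synced up to instant "001".
partition = "2020/10/15"
metadata_files = {partition: {"fg1_base_000.parquet", "fg1_log_001"}}
timeline = [
    Instant("001", "deltacommit", added={partition: {"fg1_log_001"}}),
    Instant("002", "commit", added={partition: {"fg1_base_002.parquet"}}),
]

stale = metadata_files[partition]                       # what a query sees today
view = merged_listing(metadata_files, "001", timeline)  # merged in-memory view
```

In this sketch the stale listing lacks the base file written by the
compaction, while the merged view contains it, so a file slice built from
the merged view would include the new base file.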
--
This message was sent by Atlassian Jira
(v8.3.4#803005)