Prashant Wason created HUDI-1325:
------------------------------------
Summary: Handle the corner case with syncing completed compaction
from data timeline to metadata timeline.
Key: HUDI-1325
URL: https://issues.apache.org/jira/browse/HUDI-1325
Project: Apache Hudi
Issue Type: Sub-task
Reporter: Prashant Wason
Here is a corner case with syncing a completed compaction from the data timeline
to the metadata timeline. Consider the following sequence of events:
t0: The writer schedules a compaction at time instant c.
t1: The compactor starts processing c's plan.
t2: The compaction finishes, with c.commit published on the data timeline (not
yet synced to the metadata timeline).
t3: In the next round of writing, the writer opens the metadata table, which
adds the base file produced in c.commit to the metadata table.
Any query running between t2 and t3 cannot rely on the metadata table, since the
new base file is not yet present in it. The timeline will indicate that the
compaction completed, so the latest file slice will be computed as only the log
files written to the file groups since the compaction, without the compacted
base file. This will lead to incorrect results.
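
The effect can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not Hudi's actual file-system view: the `DataFile` type, the `latest_file_slice` helper, the file names, and the instant values ("001" for an earlier commit, "002" standing in for c) are all hypothetical. A file slice is modeled as every file that shares a base instant, and the latest slice is the one with the newest instant marked completed on the data timeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    name: str
    base_instant: str  # instant of the file slice this file belongs to (toy model)


def latest_file_slice(listing, completed_instants):
    """Return the sorted file names of the newest slice whose instant is completed."""
    visible = [f for f in listing if f.base_instant in completed_instants]
    if not visible:
        return []
    latest = max(f.base_instant for f in visible)
    return sorted(f.name for f in visible if f.base_instant == latest)


# Files that physically exist after t2 (the compaction at instant "002" is done).
filesystem_listing = [
    DataFile("base_001.parquet", "001"),
    DataFile("log_001_1", "001"),
    DataFile("log_002_1", "002"),          # log written after compaction was scheduled
    DataFile("base_002.parquet", "002"),   # base file produced by the compaction
]

# What an unsynced metadata table would list between t2 and t3:
# everything except the base file produced by the compaction.
stale_metadata_listing = [
    f for f in filesystem_listing if f.name != "base_002.parquet"
]

# The data timeline already shows the compaction as completed.
completed = {"001", "002"}

correct_slice = latest_file_slice(filesystem_listing, completed)
stale_slice = latest_file_slice(stale_metadata_listing, completed)
```

With the full listing the latest slice contains both the compacted base file and the log file; with the stale listing it contains the log file alone, which is the incorrect result described above.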
If we consider the writer alone, we may be okay, since we first sync the
metadata table before doing anything for the delta commit at t3. In general,
however, we should advise enabling metadata-table-based listings for queries
only after all writers, cleaners, and compactors have been enabled to use the
metadata table and have been successfully publishing new and deleted files
directly to it. In short, queries cannot rely on the metadata table while the
syncing mechanism is the main thing that keeps the data and metadata timelines
together.
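
The window between t2 and t3 is the period in which the data timeline holds a completed instant that the metadata timeline has not yet synced. A minimal sketch of that condition follows; `unsynced_instants` and `metadata_is_stale` are hypothetical helpers written for illustration, not part of Hudi, and the instant values are made up.

```python
def unsynced_instants(data_completed, metadata_synced):
    """Completed data-timeline instants that are missing from the metadata timeline."""
    synced = set(metadata_synced)
    return sorted(i for i in data_completed if i not in synced)


def metadata_is_stale(data_completed, metadata_synced):
    """True while at least one completed instant has not been synced to metadata."""
    return len(unsynced_instants(data_completed, metadata_synced)) > 0


# Between t2 and t3: c ("002" here) is completed on the data timeline only.
between_t2_and_t3 = metadata_is_stale(["001", "002"], ["001"])

# After t3: the writer has synced c into the metadata table.
after_t3 = metadata_is_stale(["001", "002"], ["001", "002"])
```

A reader that consults only the metadata table has no such check available by default, which is why the syncing mechanism alone is not enough for queries.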
--
This message was sent by Atlassian Jira
(v8.3.4#803005)