vinothchandar commented on pull request #2064:
URL: https://github.com/apache/hudi/pull/2064#issuecomment-696429882


   cc @bvaradar @n3nash as well
   
   @prashantwason Here is a corner case with syncing a completed compaction from
the data timeline to the metadata timeline. Consider the following sequence of
events:
   
   t0: Writer schedules compaction at instant `c`
   t1: Compactor starts processing `c`'s plan
   t2: Compaction finishes with `c.commit` published on the data timeline (not
yet synced to the metadata timeline)
   t3: Next round of writing: the writer opens the metadata table, which adds
the base file produced by `c.commit` to the metadata table
   
   Any query running between t2 and t3 cannot rely on the metadata table, since
the new base file is not yet present in it. The timeline will indicate that the
compaction completed, so the latest file slice will be computed as just the logs
written to the file groups since the compaction. This will lead to incorrect
results.
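
To make the t2..t3 window concrete, here is a toy sketch in Python. It is not Hudi code: `read_latest`, the `(kind, base_instant)` file tuples, and the instants `001`/`002` (standing in for `c`) are all made up for illustration, and the file slice logic is heavily simplified. It only assumes that logs written after compaction is scheduled carry the compaction instant as their base instant.

```python
# Toy model only: none of these names are Hudi APIs.
# A file is identified as (kind, base_instant); storage maps file -> records.

def read_latest(timeline, listing, storage):
    """Merge the latest file slice: the files whose base instant is the
    newest completed instant on the timeline, base file first, then logs."""
    base_instant = max(timeline)
    slice_files = sorted(f for f in listing if f[1] == base_instant)
    records = {}
    for f in slice_files:  # ("base", x) sorts before ("log", x)
        records.update(storage[f])
    return records

# State just before t2: compaction "002" is scheduled and running.
storage = {
    ("base", "001"): {"k1": "v1", "k2": "v2"},
    ("log", "001"): {"k1": "v1-updated"},
    ("log", "002"): {"k3": "v3"},  # written after compaction was scheduled
}
timeline = ["001"]
metadata_listing = set(storage)  # metadata table is in sync so far

# t2: compaction commits on the data timeline; metadata is NOT synced yet.
storage[("base", "002")] = {"k1": "v1-updated", "k2": "v2"}
timeline.append("002")

via_listing = read_latest(timeline, set(storage), storage)
via_metadata_stale = read_latest(timeline, metadata_listing, storage)

# t3: the next write syncs the metadata table before doing anything else.
metadata_listing = set(storage)
via_metadata_synced = read_latest(timeline, metadata_listing, storage)

print(via_listing)
print(via_metadata_stale)
print(via_metadata_synced)
```

In this toy, the read through the stale metadata listing returns only `k3`, because the new base file is missing from the listing while the timeline already says the compaction completed. The direct listing and the post-sync metadata listing both return all three keys.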
   
   If we consider the writer alone, we may be okay, since we first sync the
metadata table before we do anything for the delta commit at t3. But in general,
for queries, we should advise enabling metadata-table-based listings only after
all writers, cleaners, and compactors have been enabled to use metadata and have
been successfully using it to publish new/deleted files directly to the metadata
table. In short, queries cannot rely on the metadata table as long as the syncing
mechanism is the main thing that keeps the data and metadata timelines together.
   
   
   
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

