nbalajee commented on PR #9035:
URL: https://github.com/apache/hudi/pull/9035#issuecomment-1607857303

   > Thanks for the contribution @nbalajee. In general I'm confused why we 
need two marker files for each base file. Before the patch, we have the 
in-progress marker file and the write status real paths; we can diff out the 
corrupt/retry files by comparing the in-progress marker file handles and the 
paths recorded in WriteStatus.
   > 
   > And we also have some instant completion check in HoodieFileSystemView to 
ignore the files/file blocks that are still pending, so why could the reader 
view read data sets that are not intended to be exposed?
   
   Thanks for the review, @dannyhchen and @nsivabalan.
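   The baseline approach quoted above — diffing the in-progress marker paths 
against the paths recorded in WriteStatus — can be sketched roughly as follows 
(a minimal illustration with made-up names, not Hudi's actual API; `strayFiles` 
and the example file names are hypothetical):

   ```java
   import java.util.*;

   // Illustrative sketch: any path that has an in-progress marker but was
   // never reported back in a WriteStatus is a candidate stray file left
   // behind by a failed or retried task attempt.
   public class MarkerReconcile {
       public static Set<String> strayFiles(Set<String> markerPaths,
                                            Set<String> writeStatusPaths) {
           Set<String> stray = new HashSet<>(markerPaths);
           stray.removeAll(writeStatusPaths);
           return stray;
       }

       public static void main(String[] args) {
           Set<String> markers = Set.of(
               "f1-0_w1_c1.parquet",   // committed by the successful attempt
               "f1-0_w2_c1.parquet");  // written by a retried task attempt
           Set<String> committed = Set.of("f1-0_w1_c1.parquet");
           System.out.println(strayFiles(markers, committed)); // [f1-0_w2_c1.parquet]
       }
   }
   ```

   The gap this PR is concerned with is the window where a zombie attempt from 
a retried stage is still writing, so its output is not yet visible to this diff.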
   
   
   The following diagram summarizes the issue:
   (a) When a batch of records is given to an executor for writing, it spills 
over into multiple data files (split into multiple parts due to file size 
limits: f1-0_w1_c1.parquet, f1-1_w1_c1.parquet, etc.).
   (b) A Spark stage is retried, and as a result all of its tasks are retried 
(some tasks from previous attempts could still be running). This mainly 
happens with a Spark FetchFailed exception.
   
   ![Screenshot 2023-06-25 at 9 15 35 
PM](https://github.com/apache/hudi/assets/47542891/7121d7e6-e624-4743-ad00-004fde3e8344)
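   To make the retry scenario concrete, here is a hedged sketch (illustrative 
class and file names, not Hudi's actual code) that groups the base files from 
the diagram by file-ID part and commit time. After a stage retry, the same 
file group can carry files from more than one write token (task attempt), and 
a simple marker/WriteStatus diff cannot tell which attempt is still in flight:

   ```java
   import java.util.*;
   import java.util.stream.*;

   // Illustrative sketch: parse the example file names from the diagram
   // (fileIdPart_writeToken_commit.parquet) and flag file groups that have
   // base files from more than one task attempt (write token).
   public class RetryDetector {
       static String key(String name) {
           // "f1-0_w1_c1.parquet" -> group key "f1-0@c1"
           String[] p = name.replace(".parquet", "").split("_");
           return p[0] + "@" + p[2];
       }

       static String token(String name) {
           return name.replace(".parquet", "").split("_")[1];
       }

       // Returns fileIdPart@commit keys written by more than one attempt.
       public static Set<String> conflictingGroups(List<String> files) {
           return files.stream()
                   .collect(Collectors.groupingBy(RetryDetector::key,
                           Collectors.mapping(RetryDetector::token,
                                              Collectors.toSet())))
                   .entrySet().stream()
                   .filter(e -> e.getValue().size() > 1)
                   .map(Map.Entry::getKey)
                   .collect(Collectors.toSet());
       }

       public static void main(String[] args) {
           List<String> files = List.of(
                   "f1-0_w1_c1.parquet",  // attempt 1, part 0
                   "f1-1_w1_c1.parquet",  // attempt 1, part 1 (size rollover)
                   "f1-0_w2_c1.parquet"); // attempt 2 after a stage retry
           System.out.println(conflictingGroups(files)); // [f1-0@c1]
       }
   }
   ```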
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
