nsivabalan commented on code in PR #4913:
URL: https://github.com/apache/hudi/pull/4913#discussion_r1299397830
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##########
@@ -273,4 +280,31 @@ protected static Option<IndexedRecord> toAvroRecord(HoodieRecord record, Schema
return Option.empty();
}
}
+
+ protected class AppendLogWriteCallback implements HoodieLogFileWriteCallback {
+ // here we distinguish log files created from log files being appended. Considering following scenario:
+ // An appending task write to log file.
+ // (1) append to existing file file_instant_writetoken1.log.1
+ // (2) rollover and create file file_instant_writetoken2.log.2
+ // Then this task failed and retry by a new task.
+ // (3) append to existing file file_instant_writetoken1.log.1
+ // (4) rollover and create file file_instant_writetoken3.log.2
+ // finally file_instant_writetoken2.log.2 should not be committed to hudi, we use marker file to delete it.
+ // keep in mind that log file is not always fail-safe unless it never roll over
+
Review Comment:
I see. So it's an issue even with S3-like systems?
Let me go over the scenario you are referring to so that we are on the same page.
MDT (metadata table) disabled.
Writer1:
writer1 updates records in file group1, which already has a base file and 1 log file.
writer1 writes log file2 and log file3 (due to Spark task retries), but ideally we need only one new log file, i.e. log file2.
Writer2:
Concurrently tries to do a snapshot read from the same table.
Before the reconcile step for writer1 can execute, Hudi returns all 3 log files (log file1, log file2 and log file3) as part of the FSView.
Writer1:
goes through the reconcile logic and deletes the extraneous log file, so log file3 is deleted.
Writer2:
continues with the actual read, where it hits a file-not-found exception for log file3.
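
To make the interleaving concrete, here is a minimal sketch in plain Java that replays the sequence on a local temp directory. It uses no Hudi APIs: the class name `LogFileRaceSketch`, the method `simulate`, and the file names are all hypothetical stand-ins, and a plain directory listing stands in for the FSView. It only illustrates the ordering problem under those assumptions, not how Hudi actually lists or reconciles files.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LogFileRaceSketch {

  // Replays the scenario sequentially and returns the names of the
  // listed log files that could no longer be opened at read time.
  static List<String> simulate() throws IOException {
    // Hypothetical stand-in for file group1 on storage.
    Path fileGroup = Files.createTempDirectory("file-group1");

    // Existing state: the file group already has log file1.
    Files.write(fileGroup.resolve("logfile1"), "v1".getBytes());

    // Writer1: the original task attempt and its retry each leave a log file.
    Files.write(fileGroup.resolve("logfile2"), "v2".getBytes());
    Files.write(fileGroup.resolve("logfile3"), "v2-retry".getBytes());

    // Writer2 (the reader): lists the file group before writer1 reconciles.
    List<Path> listed;
    try (Stream<Path> files = Files.list(fileGroup)) {
      listed = files.sorted().collect(Collectors.toList());
    }

    // Writer1: the reconcile step deletes the extraneous log file.
    Files.delete(fileGroup.resolve("logfile3"));

    // Writer2: continues with the actual read using its stale listing.
    List<String> missing = new ArrayList<>();
    for (Path logFile : listed) {
      try {
        Files.readAllBytes(logFile);
      } catch (NoSuchFileException e) {
        missing.add(logFile.getFileName().toString());
      }
    }
    return missing;
  }

  public static void main(String[] args) throws IOException {
    System.out.println("missing at read time: " + simulate());
  }
}
```

The reader's listing is taken before the delete and used after it, which is the window in question.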