guanziyue commented on code in PR #4913:
URL: https://github.com/apache/hudi/pull/4913#discussion_r1299433537
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##########
@@ -273,4 +280,31 @@ protected static Option<IndexedRecord>
toAvroRecord(HoodieRecord record, Schema
return Option.empty();
}
}
+
+ protected class AppendLogWriteCallback implements HoodieLogFileWriteCallback {
+ // Here we distinguish log files created by this task from log files it appends to. Consider the following scenario:
+ // An appending task writes to a log file:
+ // (1) append to existing file file_instant_writetoken1.log.1
+ // (2) roll over and create file file_instant_writetoken2.log.2
+ // Then the task fails and is retried by a new task:
+ // (3) append to existing file file_instant_writetoken1.log.1
+ // (4) roll over and create file file_instant_writetoken3.log.2
+ // Finally, file_instant_writetoken2.log.2 must not be committed to Hudi; we use a marker file to delete it.
+ // Keep in mind that a log file is not always fail-safe unless it never rolls over.
+
Review Comment:
> One more clarification: in HDFS-like systems, we can never delete any log
file during reconcile, right? For example, writer1 could have added log file1,
but before writer1 reaches the reconcile step, writer2 could have appended more
data to the same log file. So writer1 can never delete any log file, since a
concurrent writer could have appended to it.
>
> From what we know, writer1's data block could be a duplicate due to Spark
task retries, but writer2's log block could be a valid one. So I feel that we
can never delete any files during the reconcile step in HDFS-like systems
(where appends are allowed), due to concurrent writers appending to an existing
log file.
Currently, the OCC model should fence any concurrent writes to the same file
group, so we can actually delete log files, because only one writer can
generate log files for a given file group.
What I'm not sure about is what the log file mechanism should look like. In my
experience, allowing uncommitted content to remain in Hudi brings risk to the
whole system: other components must deliberately handle a potential combination
of invalid files and invalid log blocks, and any mistake may result in a
correctness issue. I tried to keep data and metadata (commit messages and the
MDT) in Hudi as consistent as possible by clearing invalid files in the
reconcile step.
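
To make the cleanup concrete, here is a minimal sketch of the marker-based reconcile idea in the scenario from the code comment above. All names are hypothetical, not Hudi's actual marker API: a CREATE marker is assumed to be recorded whenever a rollover creates a new log file, and reconcile deletes any marked file that the successful attempt did not commit; appended-to files never receive a CREATE marker, so they are never deleted.

```java
import java.util.HashSet;
import java.util.Set;

// Simplified model of marker-based cleanup; class/method names are illustrative.
public class MarkerReconcileSketch {

  /**
   * Files to delete during reconcile: files for which a CREATE marker was
   * written but which the successful attempt did not commit. Files that were
   * only appended to carry no CREATE marker, so they are left untouched.
   */
  static Set<String> filesToDelete(Set<String> createMarkers, Set<String> committedFiles) {
    Set<String> toDelete = new HashSet<>(createMarkers);
    toDelete.removeAll(committedFiles);
    return toDelete;
  }

  public static void main(String[] args) {
    // Attempt 1 (failed): appended to writetoken1.log.1, created writetoken2.log.2.
    // Attempt 2 (succeeded): appended to writetoken1.log.1, created writetoken3.log.2.
    Set<String> createMarkers = new HashSet<>();
    createMarkers.add("file_instant_writetoken2.log.2"); // created by the failed attempt
    createMarkers.add("file_instant_writetoken3.log.2"); // created by the successful attempt

    Set<String> committed = new HashSet<>();
    committed.add("file_instant_writetoken1.log.1");
    committed.add("file_instant_writetoken3.log.2");

    // Only the orphaned rollover file from the failed attempt is deleted.
    System.out.println(filesToDelete(createMarkers, committed));
    // prints [file_instant_writetoken2.log.2]
  }
}
```

Note that this only works because OCC fencing guarantees a single writer per file group; with a concurrent appender (the case raised in the quoted comment), deleting any appended-to file would be unsafe.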
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]