nsivabalan commented on code in PR #4913:
URL: https://github.com/apache/hudi/pull/4913#discussion_r1299398169


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieWriteHandle.java:
##########
@@ -273,4 +280,31 @@ protected static Option<IndexedRecord> toAvroRecord(HoodieRecord record, Schema
       return Option.empty();
     }
   }
+
+  protected class AppendLogWriteCallback implements HoodieLogFileWriteCallback {
+    // Here we distinguish log files that are newly created from log files that are appended to.
+    // Consider the following scenario:
+    // An appending task writes to a log file.
+    // (1) It appends to the existing file file_instant_writetoken1.log.1.
+    // (2) It rolls over and creates file file_instant_writetoken2.log.2.
+    // Then this task fails and is retried by a new task.
+    // (3) The retry appends to the existing file file_instant_writetoken1.log.1.
+    // (4) It rolls over and creates file file_instant_writetoken3.log.2.
+    // Finally, file_instant_writetoken2.log.2 must not be committed to Hudi; we use a marker file to delete it.
+    // Keep in mind that a log file is not always fail-safe unless it never rolls over.
+

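The rollover scenario in the comment above can be sketched as follows. This is an illustrative model only, not Hudi's actual marker or log-file API: `rollOver`, `reconcile`, and the marker map are hypothetical names, and the file names follow the `file_instant_writetoken.log.version` pattern from the comment.

```java
import java.util.*;

// Illustrative sketch (not Hudi's actual API) of how CREATE markers let the
// reconcile step delete only files created by failed attempts, never files
// that were merely appended to.
public class MarkerReconcileSketch {

    // Roll over: create a new log file and record a CREATE marker for it.
    static String rollOver(Map<String, List<String>> markers, String writeToken, int version) {
        String file = "file_instant_" + writeToken + ".log." + version;
        markers.computeIfAbsent(writeToken, k -> new ArrayList<>()).add(file);
        return file;
    }

    // Delete everything created by attempts whose write token did not commit.
    static List<String> reconcile(Map<String, List<String>> markers, String committedToken) {
        List<String> toDelete = new ArrayList<>();
        markers.forEach((token, files) -> {
            if (!token.equals(committedToken)) {
                toDelete.addAll(files);
            }
        });
        return toDelete;
    }

    public static void main(String[] args) {
        Map<String, List<String>> markers = new LinkedHashMap<>();
        // Steps (1) and (3) append to file_instant_writetoken1.log.1, so they
        // leave no CREATE marker for it and it is never a deletion candidate.
        rollOver(markers, "writetoken2", 2); // step (2): first attempt rolls over, then fails
        rollOver(markers, "writetoken3", 2); // step (4): the retry rolls over with a new token
        // Only the failed attempt's rolled-over file is deleted:
        System.out.println(reconcile(markers, "writetoken3"));
    }
}
```

Because appends record no CREATE marker, the shared file `file_instant_writetoken1.log.1` survives reconciliation while the orphaned `file_instant_writetoken2.log.2` is cleaned up.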
Review Comment:
   One more clarification: in HDFS-like systems, we can never delete any log file during reconcile, right?
   For example, writer1 could have added log file1, but before writer1 reaches the reconcile step, writer2 could have appended more data to the same log file. So writer1 can never delete any log file, since a concurrent writer could have appended to it.
   
   From what we know, writer1's data block could be a duplicate due to Spark task retries, while writer2's log block could be a valid one.
   So I feel that we can never delete any files during the reconcile step in HDFS-like systems (where appends are allowed), due to concurrent writers appending to an existing log file.
   

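The hazard the reviewer describes can be sketched as follows. This is a hypothetical model (the `Block` record, `validBlocks`, and the instant names `t1`/`t2` are illustrative, not Hudi's API): if writer2 has appended a valid block to the same file, deleting the whole file during writer1's reconcile would lose writer2's data, whereas filtering at the block level keeps it.

```java
import java.util.*;

// Illustrative sketch of why whole-file deletion is unsafe on append-capable
// storage: a single log file can interleave an uncommitted duplicate block
// from writer1 with a valid block appended concurrently by writer2.
public class LogBlockFilterSketch {

    // A log block tagged with the instant time of the write that produced it.
    record Block(String writer, String instant) {}

    // Keep only blocks whose instant actually committed; duplicate blocks from
    // failed task attempts are skipped without touching valid neighbors.
    static List<Block> validBlocks(List<Block> logFile, Set<String> committedInstants) {
        List<Block> valid = new ArrayList<>();
        for (Block b : logFile) {
            if (committedInstants.contains(b.instant())) {
                valid.add(b);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        List<Block> logFile = List.of(
            new Block("writer1", "t1"), // duplicate block from a failed Spark task retry
            new Block("writer2", "t2")  // valid block appended concurrently by writer2
        );
        // Deleting the whole file in writer1's reconcile would drop writer2's
        // block; reading with block-level filtering preserves it:
        System.out.println(validBlocks(logFile, Set.of("t2")));
    }
}
```

The design consequence matches the reviewer's conclusion: on storage where appends by concurrent writers are possible, cleanup must operate on (or skip) individual uncommitted blocks rather than delete shared log files.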


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
