scxwhite commented on a change in pull request #5030:
URL: https://github.com/apache/hudi/pull/5030#discussion_r825649696



##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##########
@@ -280,8 +281,11 @@ HoodieCompactionPlan generateCompactionPlan(
         .getLatestFileSlices(partitionPath)
         .filter(slice -> 
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
         .map(s -> {
+          // In most business scenarios, the latest data is in the latest delta log file,
+          // so we sort the log files from largest to smallest instant time, which largely
+          // avoids rewriting data during the compact process and thus shortens compaction time.
           List<HoodieLogFile> logFiles =

Review comment:
       > What do you mean by `avoid rewriting the data in the compact process` 
here ? Shouldn't the reader have the same merged content no matter what the 
read sequence is for log files ?
   
   @danny0405  Thanks for your quick reply.
   
   What I am talking about here is that, during the delta log file reading stage, we can put the latest data into the ExternalSpillableMap of HoodieMergedLogRecordScanner#records up front.
   To explain briefly:
   
   Suppose we have a record in the base file with recordKey = 1, preCombineField = 1, and some other fields.
   **First commit: recordKey = 1, preCombineField = 2, plus some updated fields. Generates delta log1.**
   **Second commit: recordKey = 1, preCombineField = 3, plus some updated fields. Generates delta log2.**
   **Third commit: recordKey = 1, preCombineField = 4, plus some updated fields. Generates delta log3.**
   So three delta log files are generated by the three commits.
   
   When the compact operation is triggered, if the delta log files are sorted in natural order, then while reading delta log1 we first put the record with recordKey = 1 and preCombineField = 2 into the map (HoodieMergedLogRecordScanner#records). While reading delta log2, the record (recordKey = 1, preCombineField = 3) overwrites the record from delta log1 (recordKey = 1, preCombineField = 2), and so on.
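   The natural-order behavior can be sketched with a plain HashMap standing in for the ExternalSpillableMap (this is illustrative code, not Hudi's actual API; the class and method names are made up):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: scanning delta logs in natural (oldest-first) order
// means every newer version of the same key overwrites the map entry that
// was just written for the older version, so the entry is rewritten N times.
public class NaturalOrderScan {
    // Each record is {recordKey, preCombineField}, per the example above.
    static int[][] logsNaturalOrder = {
        {1, 2}, // delta log1
        {1, 3}, // delta log2
        {1, 4}  // delta log3
    };

    // Counts how many times the map entry is (re)written during the scan.
    static int countMapWrites(int[][] logs) {
        Map<Integer, Integer> records = new HashMap<>(); // stands in for ExternalSpillableMap
        int writes = 0;
        for (int[] rec : logs) {
            Integer existing = records.get(rec[0]);
            // preCombine keeps the record with the larger preCombineField
            if (existing == null || rec[1] > existing) {
                records.put(rec[0], rec[1]);
                writes++;
            }
        }
        return writes;
    }

    public static void main(String[] args) {
        // Natural order: all three versions of the key are written in turn.
        System.out.println(countMapWrites(logsNaturalOrder)); // 3
    }
}
```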
   
   However, if the delta log files are sorted in reverse order by instant time, we first put the record from delta log3 (recordKey = 1, preCombineField = 4) into the ExternalSpillableMap. Even when we read delta log2 and delta log1 afterwards, their records are not selected by HoodieRecordPayload#preCombine, because their preCombineField values are smaller, i.e. the data is older.
   This is what I meant by "avoid rewriting the data in the compact process".
   In addition, reducing the amount of rewritten data saves a lot of time when the ExternalSpillableMap overflows its memory budget and spills to disk.
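   The reverse-order case can be sketched the same way (again illustrative code, not Hudi's actual API): because the newest version of the key arrives first, the older records lose the preCombine comparison and the map entry is written only once.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: scanning delta logs newest-first means the first
// record for a key wins, and the older versions are rejected by the
// preCombine comparison instead of triggering rewrites.
public class ReverseOrderScan {
    // Each record is {recordKey, preCombineField}, newest log first.
    static int[][] logsReverseOrder = {
        {1, 4}, // delta log3
        {1, 3}, // delta log2
        {1, 2}  // delta log1
    };

    // Counts how many times the map entry is (re)written during the scan.
    static int countMapWrites(int[][] logs) {
        Map<Integer, Integer> records = new HashMap<>(); // stands in for ExternalSpillableMap
        int writes = 0;
        for (int[] rec : logs) {
            Integer existing = records.get(rec[0]);
            // preCombine keeps the record with the larger preCombineField,
            // so older records never replace the one written first.
            if (existing == null || rec[1] > existing) {
                records.put(rec[0], rec[1]);
                writes++;
            }
        }
        return writes;
    }

    public static void main(String[] args) {
        // Reverse instant-time order: only delta log3's record is written.
        System.out.println(countMapWrites(logsReverseOrder)); // 1
    }
}
```

   With a spillable map, the difference matters most once the map has spilled to disk, since each avoided rewrite is an avoided disk write.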
   