scxwhite commented on a change in pull request #5030:
URL: https://github.com/apache/hudi/pull/5030#discussion_r825649696
##########
File path:
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/HoodieCompactor.java
##########
@@ -280,8 +281,11 @@ HoodieCompactionPlan generateCompactionPlan(
.getLatestFileSlices(partitionPath)
.filter(slice ->
!fgIdsInPendingCompactionAndClustering.contains(slice.getFileGroupId()))
.map(s -> {
+      // In most business scenarios, the latest data is in the latest delta log file, so we sort the
+      // log files from large to small according to the instant time, which can largely avoid rewriting
+      // the data in the compact process, and thus reduce the compact time
       List<HoodieLogFile> logFiles =
Review comment:
> What do you mean by `avoid rewriting the data in the compact process` here? Shouldn't the reader have the same merged content no matter what the read sequence is for log files?
@danny0405 Thanks for your quick reply.
What I mean is that during the delta log file reading stage, we can put the latest data into the ExternalSpillableMap behind HoodieMergedLogRecordScanner#records up front.
Briefly:
Suppose we have a record in the base file with recordKey = 1, preCombineField = 1, and some other fields.
**First commit: recordKey = 1, preCombineField = 2, plus some updated fields. Generates delta log1.**
**Second commit: recordKey = 1, preCombineField = 3, plus some updated fields. Generates delta log2.**
**Third commit: recordKey = 1, preCombineField = 4, plus some updated fields. Generates delta log3.**
So three delta log files are generated after three commits.
When the compact operation is triggered, if the delta log files are read in natural order, then while reading delta log1 we first put the record (recordKey = 1, preCombineField = 2) into the map (HoodieMergedLogRecordScanner#records). While reading delta log2, the record (recordKey = 1, preCombineField = 3) overwrites the record from delta log1 (recordKey = 1, preCombineField = 2), and so on.
However, if the delta log files are sorted in reverse order by instant time, we first put the record from delta log3 (recordKey = 1, preCombineField = 4) into the ExternalSpillableMap. Even though we read delta log2 and delta log1 afterwards, their records are not selected by HoodieRecordPayload#preCombine, because their preCombineField is smaller and the data is older.
This is what I meant by "avoid rewriting the data in the compact process".
In addition, reducing the amount of rewritten data saves a lot of time when the ExternalSpillableMap exceeds its in-memory size and spills to disk.
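The effect can be sketched with a toy model of the scan. This is illustrative only (the class and method names here are made up, and the real scanner uses an ExternalSpillableMap plus payload-specific preCombine logic, not a plain HashMap):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of HoodieMergedLogRecordScanner#records for a single record key.
// Each "log file" contributes one preCombineField value; preCombine keeps the
// record with the larger value, as in OverwriteWithLatestAvroPayload-style
// semantics. Names here are hypothetical, not the real Hudi API.
public class LogScanOrderSketch {

  // Counts how many times an already-present record gets overwritten in the
  // map while scanning log files in the given order.
  static int countOverwrites(int[] preCombineByLog) {
    Map<String, Integer> records = new HashMap<>(); // stand-in for ExternalSpillableMap
    int overwrites = 0;
    for (int incoming : preCombineByLog) {
      Integer existing = records.get("key1");
      if (existing == null) {
        records.put("key1", incoming);        // first sighting: plain insert
      } else if (incoming > existing) {
        records.put("key1", incoming);        // preCombine: newer record wins
        overwrites++;
      }
      // else: incoming record is older, dropped without touching the map
    }
    return overwrites;
  }

  public static void main(String[] args) {
    // Natural order: log1(2), log2(3), log3(4) -> the entry is rewritten twice.
    System.out.println(countOverwrites(new int[]{2, 3, 4})); // 2
    // Reverse instant-time order: log3(4), log2(3), log1(2) -> no rewrites.
    System.out.println(countOverwrites(new int[]{4, 3, 2})); // 0
  }
}
```

With a spillable map, each avoided overwrite can also mean an avoided disk write once the map has spilled, which is where the time saving comes from.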
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]