Re: [PR] [HUDI-7512] sort input records for insert operation [hudi]

via GitHub Fri, 22 Mar 2024 07:42:53 -0700


bhat-vinay commented on code in PR #10876:
URL: https://github.com/apache/hudi/pull/10876#discussion_r1535704794



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##########
@@ -230,6 +236,10 @@ protected Partitioner getPartitioner(WorkloadProfile profile) {
   }
 
   private HoodieData<WriteStatus> mapPartitionsAsRDD(HoodieData<HoodieRecord<T>> dedupedRecords, Partitioner partitioner) {
+    if (operationRequiresSorting()) {

Review Comment:
   What does sorting mean for an 'upsert' operation? If the record is really 
being updated, won't there be an index lookup that routes the record to its 
specific file group? Or is there a benefit to supporting sorting when an upsert 
batch contains new records that are being written for the first time? This PR 
allows sorting only for the INSERT operation; 
`BaseSparkCommitActionExecutor::operationRequiresSorting(...)` takes care of 
that. If the config names need to be made unambiguous for future use cases, 
should I rename them to `WRITE_SORT_MODE`, `WRITE_SORT_OPERATIONS` and 
`WRITE_USER_DEFINED_PARTITIONER_SORT_COLUMNS`?
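
   To make the gating idea concrete, here is a minimal, standalone sketch of 
how a check like `operationRequiresSorting(...)` could restrict sorting to a 
configured set of operations. It is not the code in this PR: the `Operation` 
enum, the `SortGateSketch` class and the `sortOperations` set are hypothetical 
stand-ins (the set loosely mirrors the proposed `WRITE_SORT_OPERATIONS` 
config), and nothing here calls Hudi APIs.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration only; none of these types come from Hudi.
class SortGateSketch {

  // Assumed stand-in for the write operation type of the current commit.
  enum Operation { INSERT, UPSERT, BULK_INSERT }

  // Returns true only when the current operation is in the set of operations
  // for which sorting was enabled (assumed to be parsed from a config value
  // such as the proposed WRITE_SORT_OPERATIONS).
  static boolean operationRequiresSorting(Operation current, Set<Operation> sortOperations) {
    return sortOperations.contains(current);
  }

  public static void main(String[] args) {
    // With sorting enabled for INSERT only, as this PR intends.
    Set<Operation> sortOperations = EnumSet.of(Operation.INSERT);

    System.out.println(operationRequiresSorting(Operation.INSERT, sortOperations)); // true
    System.out.println(operationRequiresSorting(Operation.UPSERT, sortOperations)); // false
  }
}
```

   Modelling the setting as a set of operations, rather than a single boolean, 
would let UPSERT be opted in later without renaming the config again, if 
sorting turns out to help for upsert batches that contain new records.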



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
