Re: [PR] [HUDI-8474] Metadata table upsert prepped optimized [hudi]

via GitHub Thu, 08 May 2025 21:01:22 -0700


nsivabalan commented on code in PR #13005:
URL: https://github.com/apache/hudi/pull/13005#discussion_r2080875445



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##########
@@ -410,6 +410,19 @@ public void bootstrap(Option<Map<String, String>> extraMetadata) {
    */
   public abstract O upsertPreppedRecords(I preppedRecords, final String instantTime);
 
+  /**
+   * Upserts the given prepared records into the Hoodie table, at the supplied instantTime.
+   * <p>
+   * This implementation requires that the input records are already tagged, and de-duped if needed.
+   *
+   * @param preppedRecords Prepared HoodieRecords to upsert
+   * @param instantTime Instant time of the commit
+   * @return Collection of WriteStatus to inspect errors and counts
+   */
+  public O upsertPreppedRecords(I preppedRecords, final String instantTime, Option<List<Pair<String, String>>> partitionFileIdPairsHolderOpt) {

Review Comment:
   For comparison, today (prior to this patch) we use UpsertPartitioner:
   
https://github.com/apache/hudi/blob/71b223dd432181030be24a783269625875769d58/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L86
   
   We already store a BucketInfo for every Spark partition; each BucketInfo 
holds the fileId prefix, partition path, and bucket type. 
   
   We also send the [entire 
SparkHoodiePartitioner](https://github.com/apache/hudi/blob/71b223dd432181030be24a783269625875769d58/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkHoodiePartitioner.java#L29),
 which contains the HoodieTable object, to the executors. So, by comparison, 
this new flow should avoid all those unnecessary transfers. 
   
   Coming back to this patch: we need this list to initialize the partitioner 
for the metadata table, so I am not sure what we would gain by building it 
lazily. We need the list for sure, and there is no code path that does not use 
it. 
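
   To make the point concrete, here is a minimal sketch of a partitioner built 
eagerly from a list of (partition path, fileId) pairs. It is not the code in 
this PR: the class `PairBasedPartitionerSketch`, its methods, and the sample 
partition paths and file IDs are all hypothetical, and it uses plain Java 
instead of Spark's Partitioner so it can run on its own. The only state it 
holds is the pair-to-index map, which is the kind of small payload that would 
be shipped to executors.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not a Hudi class.
public class PairBasedPartitionerSketch {

  // Maps each distinct (partitionPath, fileId) pair to one partition index.
  private final Map<Map.Entry<String, String>, Integer> indexByPair = new HashMap<>();

  // The list is consumed eagerly: without it, no index can be assigned.
  public PairBasedPartitionerSketch(List<Map.Entry<String, String>> partitionFileIdPairs) {
    for (Map.Entry<String, String> pair : partitionFileIdPairs) {
      // Duplicate pairs keep the index assigned on first sight.
      indexByPair.putIfAbsent(new SimpleImmutableEntry<>(pair), indexByPair.size());
    }
  }

  // Convenience factory for a (partitionPath, fileId) pair.
  public static Map.Entry<String, String> pair(String partitionPath, String fileId) {
    return new SimpleImmutableEntry<>(partitionPath, fileId);
  }

  public int numPartitions() {
    return indexByPair.size();
  }

  // Every record routed through the partitioner needs this lookup.
  public int getPartition(String partitionPath, String fileId) {
    Integer index = indexByPair.get(pair(partitionPath, fileId));
    if (index == null) {
      throw new IllegalArgumentException(
          "Unknown (partitionPath, fileId): " + partitionPath + ", " + fileId);
    }
    return index;
  }

  public static void main(String[] args) {
    // Sample values are made up for the example.
    PairBasedPartitionerSketch partitioner = new PairBasedPartitionerSketch(List.of(
        pair("files", "file-a"),
        pair("column_stats", "file-b"),
        pair("column_stats", "file-c")));

    System.out.println(partitioner.numPartitions());
    System.out.println(partitioner.getPartition("column_stats", "file-c"));
  }
}
```

   In this sketch every call to `getPartition` reads the map, so deferring its 
construction would only move the same work to the first lookup.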
   
   Or maybe I am missing what you are suggesting. Can you shed some light on 
it? 
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
