nsivabalan commented on code in PR #13005:
URL: https://github.com/apache/hudi/pull/13005#discussion_r2080875445
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##########
@@ -410,6 +410,19 @@ public void bootstrap(Option<Map<String, String>> extraMetadata) {
*/
public abstract O upsertPreppedRecords(I preppedRecords, final String instantTime);
+ /**
+ * Upserts the given prepared records into the Hoodie table, at the supplied instantTime.
+ * <p>
+ * This implementation requires that the input records are already tagged, and de-duped if needed.
+ *
+ * @param preppedRecords Prepared HoodieRecords to upsert
+ * @param instantTime Instant time of the commit
+ * @return Collection of WriteStatus to inspect errors and counts
+ */
+ public O upsertPreppedRecords(I preppedRecords, final String instantTime, Option<List<Pair<String, String>>> partitionFileIdPairsHolderOpt) {
Review Comment:
Compared to what we do today (prior to this patch), we use UpsertPartitioner:
https://github.com/apache/hudi/blob/71b223dd432181030be24a783269625875769d58/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/UpsertPartitioner.java#L86
We already store a BucketInfo for every Spark partition, where each BucketInfo holds the fileId prefix, partition path, and bucket type.
And we actually send the [entire SparkHoodiePartitioner](https://github.com/apache/hudi/blob/71b223dd432181030be24a783269625875769d58/hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkHoodiePartitioner.java#L29), which contains the HoodieTable object, to the executors. So, comparatively, this new flow should avoid all those unnecessary transfers.

Coming back to this patch: we need this list to initialize the partitioner for the metadata table, so I am not sure what value we would get from building it lazily. We need the list for sure, and there won't be any code path that does not use it.
Or maybe I am missing what you are suggesting. Can you shed some light?
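
To make the comparison concrete, here is a rough sketch of the idea, not the code in this PR. `FileIdPairPartitioner` is a hypothetical name, and `Map.Entry<String, String>` stands in for Hudi's `Pair<String, String>` so the snippet compiles on its own. It only shows that a partitioner built from the (partition path, fileId) list needs nothing but strings and ints, so nothing as heavy as a HoodieTable would have to be shipped with it.

```java
import java.io.Serializable;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, not a Hudi class: maps each (partitionPath, fileId)
// pair to a partition number, holding only strings and ints.
final class FileIdPairPartitioner implements Serializable {
    private static final long serialVersionUID = 1L;

    // Key is "partitionPath\0fileId"; value is the assigned partition number.
    private final Map<String, Integer> bucketByKey = new HashMap<>();

    FileIdPairPartitioner(List<Map.Entry<String, String>> partitionFileIdPairs) {
        for (Map.Entry<String, String> pair : partitionFileIdPairs) {
            // Assign numbers in list order; a repeated pair keeps its first number.
            bucketByKey.putIfAbsent(key(pair.getKey(), pair.getValue()), bucketByKey.size());
        }
    }

    int numPartitions() {
        return bucketByKey.size();
    }

    int getPartition(String partitionPath, String fileId) {
        Integer bucket = bucketByKey.get(key(partitionPath, fileId));
        if (bucket == null) {
            throw new IllegalArgumentException(
                "Unknown file group: " + partitionPath + "/" + fileId);
        }
        return bucket;
    }

    private static String key(String partitionPath, String fileId) {
        // NUL separator so a path/fileId combination cannot collide with another.
        return partitionPath + '\u0000' + fileId;
    }
}
```

Since the constructor walks the whole list up front, every lookup afterwards is a plain map read, which is why the list is needed eagerly in a design like this.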