nsivabalan commented on code in PR #8914:
URL: https://github.com/apache/hudi/pull/8914#discussion_r1225422681
##########
hudi-client/hudi-flink-client/src/main/java/org/apache/hudi/metadata/FlinkHoodieBackedTableMetadataWriter.java:
##########
@@ -149,9 +161,9 @@ protected void commit(String instantTime, Map<MetadataPartitionType, HoodieData<
writeClient.getHeartbeatClient().start(instantTime);
}
- List<WriteStatus> statuses = preppedRecordList.size() > 0
- ? writeClient.upsertPreppedRecords(preppedRecordList, instantTime)
- : Collections.emptyList();
+ List<WriteStatus> statuses = isInitializing
+ ? writeClient.bulkInsertPreppedRecords(preppedRecordList, instantTime, Option.empty())
Review Comment:
The records-to-file-group mapping is deterministic, and only one file can be
written per file group. For example, if we instantiate col stats with 4 file
groups, we should spin up 4 Spark tasks, and each Spark task should get only
the records pertaining to its file group of interest (remember, records are
mapped to file groups based on hashing). So if one Spark task gets records for
all file groups, we could end up with n*m files (where n is the number of
Spark tasks and m is the number of file groups), which may not work. We need
only m files created, and m Spark tasks should spin up, with each Spark task
writing to just 1 file group.
Hope that makes sense.
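As a rough illustration of the point above (a minimal sketch, not Hudi's
actual partitioner or hash function; the class and method names here are
hypothetical), hashing record keys to file groups and grouping records so that
each of the m tasks writes exactly one file group could look like:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class FileGroupHashing {

  // Deterministically map a record key to one of numFileGroups file groups.
  // floorMod keeps the index non-negative even for negative hash codes.
  public static int fileGroupFor(String recordKey, int numFileGroups) {
    return Math.floorMod(recordKey.hashCode(), numFileGroups);
  }

  // Group record keys by their file group, yielding at most m partitions.
  // One "task" per partition then writes a single file for its file group,
  // giving m files total instead of n*m.
  public static Map<Integer, List<String>> partitionByFileGroup(List<String> keys, int m) {
    return keys.stream().collect(Collectors.groupingBy(k -> fileGroupFor(k, m)));
  }

  public static void main(String[] args) {
    List<String> keys = List.of("stats_a", "stats_b", "stats_c", "stats_d", "stats_e");
    Map<Integer, List<String>> tasks = partitionByFileGroup(keys, 4);
    // At most 4 partitions exist, so at most 4 files are written.
    System.out.println(tasks.size() <= 4);
  }
}
```

Because the mapping is a pure function of the key, the same record always
lands in the same file group, which is what makes the one-file-per-file-group
invariant achievable.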
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]