wombatu-kun commented on code in PR #19052:
URL: https://github.com/apache/hudi/pull/19052#discussion_r3463905303


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SevenToEightUpgradeHandler.java:
##########
@@ -257,23 +257,25 @@ static void upgradeToLSMTimeline(HoodieTable table, 
HoodieEngineContext engineCo
       LegacyArchivedMetaEntryReader reader = new 
LegacyArchivedMetaEntryReader(table.getMetaClient());
       StoragePath archivePath = new 
StoragePath(table.getMetaClient().getMetaPath(), "timeline/history");
       LSMTimelineWriter lsmTimelineWriter = 
LSMTimelineWriter.getInstance(config, table, Option.of(archivePath));
-      int batchSize = config.getCommitArchivalBatchSize();
+      // Use a dedicated, larger batch size for the one-time migration to 
minimize the number of parquet
+      // files created on remote storage. Each write() call involves multiple 
remote storage operations
+      // (exists check, parquet write, manifest update); using the regular 
archival batch size (default 10)
+      // with hundreds of actions creates excessive I/O that significantly 
increases the migration time.
+      int batchSize = config.getMigrationCommitArchivalBatchSize();

Review Comment:
   With both compactAndClean(engineContext) calls removed, the engineContext 
parameter of upgradeToLSMTimeline is no longer referenced anywhere in the 
method. Consider dropping it (and updating the single caller) so the signature 
does not carry a dead argument.



##########
hudi-client/hudi-client-common/src/test/java/org/apache/hudi/table/upgrade/TestSevenToEightUpgradeHandler.java:
##########
@@ -153,6 +173,46 @@ void testUpgradeMergeMode(String payloadClass, String 
preCombineField, String ex
     }
   }
 
+  @Test
+  void testUpgradeToLSMTimelineUsesMigrationBatchSize() throws Exception {

Review Comment:
   This case uses 4 actions with a batch size of 500, so only the 
final-remainder write() runs; the in-loop branch where activeActionsBatch 
reaches batchSize is never exercised, and times(1) only proves the size used is 
>= 4 rather than the configured 500. Add a case with totalActions greater than 
the migration batch size (e.g. size 2 over 4 actions, expecting write() 
times(2)) to cover the primary batching branch and pin the configured size.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to