Re: [PR] perf: Add dedicated batch size config for LSM timeline migration on u… [hudi]

via GitHub Tue, 23 Jun 2026 05:07:53 -0700


hudi-agent commented on code in PR #19052:
URL: https://github.com/apache/hudi/pull/19052#discussion_r3459565193



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/upgrade/SevenToEightUpgradeHandler.java:
##########
@@ -257,23 +257,25 @@ static void upgradeToLSMTimeline(HoodieTable table, 
HoodieEngineContext engineCo
       LegacyArchivedMetaEntryReader reader = new 
LegacyArchivedMetaEntryReader(table.getMetaClient());
       StoragePath archivePath = new 
StoragePath(table.getMetaClient().getMetaPath(), "timeline/history");
       LSMTimelineWriter lsmTimelineWriter = 
LSMTimelineWriter.getInstance(config, table, Option.of(archivePath));
-      int batchSize = config.getCommitArchivalBatchSize();
+      // Use a dedicated, larger batch size for the one-time migration to 
minimize the number of parquet
+      // files created on remote storage. Each write() call involves multiple 
remote storage operations
+      // (exists check, parquet write, manifest update); using the regular 
archival batch size (default 10)

Review Comment:
   🤖 nit: `(default 10)` hardcodes the default of a separate config that could 
drift out of sync — dropping the parenthetical or replacing it with something 
like "(its default is much smaller than what migration needs)" would keep the 
explanation durable.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -88,6 +88,15 @@ public class HoodieArchivalConfig extends HoodieConfig {
       .withDocumentation("Archiving of instants is batched in best-effort 
manner, to pack more instants into a single"
           + " archive log. This config controls such archival batch size.");
 
+  public static final ConfigProperty<Integer> 
MIGRATION_COMMITS_ARCHIVAL_BATCH_SIZE = ConfigProperty
+      .key("hoodie.migration.commits.archival.batch")
+      .defaultValue(500)

Review Comment:
   🤖 This bumps the in-memory batch default 50x (10 to 500). Buffered 
ActiveActions hold full instant metadata bytes, so 500 wide-table commits could 
risk OOM on the driver during the one-time upgrade. Would a smaller default 
like 100 better balance the I/O win against peak heap?
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/utils/TestFlinkWriteClients.java:
##########
@@ -121,6 +122,17 @@ void 
testUserConfiguredGlobalRecordIndexMinFileGroupCountIsNotOverridden() {
     assertEquals(12, writeConfig.getGlobalRecordLevelIndexMinFileGroupCount());
   }
 
+  @Test
+  void testUserConfiguredMigrationCommitArchivalBatchSizeIsPropagated() {
+    // A raw hoodie.* property set on the Flink configuration must be 
propagated to the write config
+    // (and therefore reach the upgrade handler that reads it during the v7 -> 
v8 LSM timeline migration).
+    
conf.setString(HoodieArchivalConfig.MIGRATION_COMMITS_ARCHIVAL_BATCH_SIZE.key(),
 "123");
+    HoodieWriteConfig writeConfig = 
FlinkWriteClients.getHoodieClientConfig(conf, false, false);
+    assertEquals(123, writeConfig.getMigrationCommitArchivalBatchSize());
+    // The regular archival batch size must stay independent at its own 
default.
+    assertEquals(10, writeConfig.getCommitArchivalBatchSize());

Review Comment:
   🤖 nit: the magic number `10` here is the default of 
`COMMITS_ARCHIVAL_BATCH_SIZE` — could you use 
`HoodieArchivalConfig.COMMITS_ARCHIVAL_BATCH_SIZE.defaultValue()` instead? That 
way the assertion won't silently break if the default ever changes.
   
   <sub><i>⚠️ AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] perf: Add dedicated batch size config for LSM timeline migration on u… [hudi]

Reply via email to