prashantwason opened a new pull request, #18411:
URL: https://github.com/apache/hudi/pull/18411

   ## Summary
   - In `StreamWriteOperatorCoordinator.start()`, move the table version upgrade 
(`tryUpgrade()`, `initMetadataTable()`, `restoreEvents()`) from the Pekko 
dispatcher thread to the coordinator's single-threaded FIFO executor, 
preventing heartbeat timeouts during long-running upgrades
   - Optimize `SevenToEightUpgradeHandler.upgradeToLSMTimeline()` to use a 
larger batch size (500 vs default 10) and call `compactAndClean()` once at the 
end instead of after every batch
   - Fixes https://github.com/apache/hudi/issues/18410
   
   ## Problem
   When upgrading a Hudi table with many archived timeline actions (e.g., v7→v8 
LSM timeline migration), the upgrade runs synchronously on the Pekko dispatcher 
thread in `StreamWriteOperatorCoordinator.start()`. Each batch of 10 actions 
triggers ~5 remote storage operations (parquet write, manifest update, 
compaction), and with hundreds of actions, the dispatcher thread is blocked for 
90+ seconds. This prevents heartbeat responses, causing the ResourceManager to 
disconnect the JobManager.
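
The cost described above can be estimated with a rough model. This is a sketch, not Hudi code: the class and method names are hypothetical, the ~5 operations per batch is the figure quoted above, and the 500-action count is an assumption chosen only for illustration.

```java
class UpgradeCostSketch {
    // Hypothetical cost model: every batch costs a fixed number of
    // remote storage operations (write, manifest update, compaction, ...).
    static long remoteOps(long actions, long batchSize, long opsPerBatch) {
        long batches = (actions + batchSize - 1) / batchSize; // ceiling division
        return batches * opsPerBatch;
    }

    public static void main(String[] args) {
        long actions = 500;    // assumed number of archived timeline actions
        long opsPerBatch = 5;  // approximate figure from the description above

        // Default batch size of 10: 50 batches, each paying the full cost.
        System.out.println(remoteOps(actions, 10, opsPerBatch)); // prints 250
    }
}
```

Under this model the operation count grows linearly with the number of batches, so all of that latency lands on whichever thread runs the upgrade.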
   
   ## Solution
   1. **Threading fix**: Create the `NonThrownExecutor` before the upgrade and 
submit the heavy initialization as the first FIFO task. Since all event 
handling also goes through this executor, the upgrade is guaranteed to complete 
before any events are processed. The Pekko dispatcher thread returns 
immediately, allowing heartbeats to flow.
   
   2. **I/O optimization**: Use a batch size of 500 (vs. the default of 10) and a 
single `compactAndClean()` call at the end, reducing remote storage operations 
from ~250 to ~6.
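
The ordering guarantee behind the threading fix can be illustrated with a plain JDK single-thread executor. This is a minimal sketch under stated assumptions, not the actual coordinator code: `FifoCoordinatorSketch` and its methods are hypothetical, `Executors.newSingleThreadExecutor()` stands in for `NonThrownExecutor`, and the sleep stands in for the upgrade work.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class FifoCoordinatorSketch {
    // One worker thread with an unbounded FIFO queue: tasks run in submission order.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    private final List<String> log = new CopyOnWriteArrayList<>();

    // Returns immediately; the slow initialization is queued as the first task.
    void start() {
        executor.execute(() -> {
            pause(200); // stand-in for the long-running upgrade
            log.add("upgrade");
        });
    }

    // Events go through the same executor, so they queue behind the upgrade.
    void handleEvent(String event) {
        executor.execute(() -> log.add("event:" + event));
    }

    // Waits for all queued tasks to finish, then returns the execution order.
    List<String> closeAndGetLog() {
        executor.shutdown();
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return log;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        FifoCoordinatorSketch coordinator = new FifoCoordinatorSketch();
        coordinator.start();          // caller thread is not blocked here
        coordinator.handleEvent("a"); // submitted while the upgrade is still running
        coordinator.handleEvent("b");
        System.out.println(coordinator.closeAndGetLog()); // [upgrade, event:a, event:b]
    }
}
```

Because the executor has a single worker, no extra synchronization is needed to keep events from observing a half-upgraded table; the queue order alone enforces it.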
   
   ## Test plan
   - [x] `TestStreamWriteOperatorCoordinator` — all 36 tests pass
   - [x] Verified that `setExecutor()` in the test helper calls `executor.close()`, 
which waits for task completion (`waitForTasksFinish=true`), so the upgrade task 
completes before the mock executor replaces it
   - [x] Confirmed the fix is needed on apache/master — identical vulnerable 
code present
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

