prashantwason opened a new pull request, #18904: URL: https://github.com/apache/hudi/pull/18904
### Describe the issue this Pull Request addresses

Closes #18903

`HoodieHeartbeatClient` can permanently stop generating heartbeats for an in-flight instant, causing a later commit to abort with `HoodieException: Heartbeat for instant <t> has expired` even though the writer is still alive. Two independent causes, both in `updateHeartbeat()`:

1. The heartbeat file is written synchronously on the `Timer` thread. Since the timer uses `scheduleAtFixedRate`, a slow or hung storage write blocks the thread and freezes all subsequent heartbeats for the instant.
2. When a refresh is delayed past the tolerable interval, `updateHeartbeat()` calls `Thread.currentThread().interrupt()`, which permanently kills the timer thread, turning a transient delay (GC pause, driver stall, single slow write) into a permanent blackout.

### Summary and Changelog

- Perform the heartbeat file write on a bounded daemon executor and wait with a timeout (`Future.get(heartbeatWriteTimeoutMs)`), so a slow or hung storage call can no longer block the timer thread. The write timeout is one heartbeat interval; a timed-out write does not advance the last-heartbeat time and is retried on the next tick. A cached thread pool is used so that if one write hangs, subsequent ticks proceed on a fresh thread.
- Remove the self-interrupt in `updateHeartbeat()`. Instead of `Thread.currentThread().interrupt()`, log a warning and continue refreshing. The commit-time check `HeartbeatUtils.abortIfHeartbeatExpired()` remains the sole enforcement point for staleness.
- Shut the executor down in `close()`.
- Add `TestHoodieHeartbeatClient.testTimerSurvivesHungHeartbeatWrite`, which blocks the first heartbeat write and asserts the timer keeps generating heartbeats (covering both fixes).

### Impact

No public API or config change. Heartbeat refresh becomes resilient to transient storage latency and driver pauses: a transient stall no longer permanently disables heartbeats for an instant.
Staleness is still enforced at commit time, so correctness of the concurrency guard is unchanged.

### Risk Level

low

Behavior change is confined to `HoodieHeartbeatClient`. Existing `TestHoodieHeartbeatClient` tests pass and a new regression test was added.

### Documentation Update

none

### Contributor's checklist

- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
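
For readers who want a concrete picture of the pattern described in the changelog, here is a minimal sketch of a timeout-bounded heartbeat write. It is not the code in this PR: `HeartbeatSketch`, `HeartbeatWriter`, and the constructor parameters are hypothetical names, and cancelling the timed-out future is an assumption about how a hung write might be cleaned up. The sketch only illustrates handing the write to a daemon cached thread pool, waiting with `Future.get(timeout)`, and logging a warning instead of self-interrupting.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; not the actual HoodieHeartbeatClient implementation. */
class HeartbeatSketch implements AutoCloseable {

  /** Stand-in for the storage call that writes the heartbeat file. */
  interface HeartbeatWriter {
    void write() throws Exception;
  }

  // Daemon threads, so a hung write cannot keep the JVM alive.
  private final ExecutorService writeExecutor = Executors.newCachedThreadPool(r -> {
    Thread t = new Thread(r, "heartbeat-writer");
    t.setDaemon(true);
    return t;
  });

  private final HeartbeatWriter writer;
  private final long writeTimeoutMs;
  private final long tolerableIntervalMs;
  private volatile long lastHeartbeatTimeMs = System.currentTimeMillis();

  HeartbeatSketch(HeartbeatWriter writer, long writeTimeoutMs, long tolerableIntervalMs) {
    this.writer = writer;
    this.writeTimeoutMs = writeTimeoutMs;
    this.tolerableIntervalMs = tolerableIntervalMs;
  }

  /** One timer tick. Returns true only if the heartbeat write completed in time. */
  boolean updateHeartbeat() {
    if (System.currentTimeMillis() - lastHeartbeatTimeMs > tolerableIntervalMs) {
      // Warn and keep refreshing; staleness is enforced elsewhere, at commit time.
      System.err.println("Heartbeat refresh delayed past the tolerable interval; continuing");
    }
    // Run the storage write off the calling (timer) thread.
    Future<?> pending = writeExecutor.submit(() -> {
      writer.write();
      return null;
    });
    try {
      pending.get(writeTimeoutMs, TimeUnit.MILLISECONDS);
      lastHeartbeatTimeMs = System.currentTimeMillis();
      return true;
    } catch (TimeoutException e) {
      // Assumption: cancel the hung write; the last-heartbeat time is not advanced,
      // so the next tick simply retries on a fresh pool thread.
      pending.cancel(true);
      return false;
    } catch (ExecutionException e) {
      // The write itself failed; retry on the next tick.
      return false;
    } catch (InterruptedException e) {
      // An external interrupt (not a self-interrupt): preserve the flag for the caller.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  long lastHeartbeatTimeMs() {
    return lastHeartbeatTimeMs;
  }

  @Override
  public void close() {
    writeExecutor.shutdownNow();
  }
}
```

In this sketch, a tick whose write hangs returns `false` after the timeout and leaves the last-heartbeat time untouched, while the following tick can still succeed, which mirrors the scenario the new regression test is described as covering.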
