prashantwason opened a new pull request, #18904:
URL: https://github.com/apache/hudi/pull/18904

   ### Describe the issue this Pull Request addresses
   
   Closes #18903
   
   `HoodieHeartbeatClient` can permanently stop generating heartbeats for an 
in-flight instant, causing a later commit to abort with `HoodieException: 
Heartbeat for instant <t> has expired` even though the writer is still alive. 
There are two independent causes, both in `updateHeartbeat()`:
   
   1. The heartbeat file is written synchronously on the `Timer` thread. The 
refresh task is scheduled with `scheduleAtFixedRate` and every tick runs on 
that single thread, so a slow or hung storage write blocks the thread and 
freezes all subsequent heartbeats for the instant.
   2. When a refresh is delayed past the tolerable interval, 
`updateHeartbeat()` calls `Thread.currentThread().interrupt()`, which 
permanently kills the timer thread. This turns a transient delay (GC pause, 
driver stall, single slow write) into a permanent blackout.
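
The first cause can be reproduced outside Hudi. The sketch below is not Hudi code. It assumes the `Timer` mentioned above is a `java.util.Timer`, which runs all of its tasks on one background thread, and it uses a latch to stand in for a storage write that never returns. The class and method names are hypothetical.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Hudi code: shows that one hung task stalls a java.util.Timer.
class BlockedTimerSketch {

    /** Returns how many ticks started during observeMs while the first tick was stuck. */
    static int ticksStartedWhileFirstWriteHangs(long periodMs, long observeMs) {
        AtomicInteger ticksStarted = new AtomicInteger();
        CountDownLatch hungWrite = new CountDownLatch(1); // stands in for a hung storage call
        Timer timer = new Timer("heartbeat-sketch", true); // daemon timer thread

        timer.scheduleAtFixedRate(new TimerTask() {
            @Override
            public void run() {
                // Only the first tick "hangs"; later ticks would return immediately.
                if (ticksStarted.incrementAndGet() == 1) {
                    try {
                        hungWrite.await();
                    } catch (InterruptedException e) {
                        // Nothing to clean up in this sketch.
                    }
                }
            }
        }, 0L, periodMs);

        try {
            Thread.sleep(observeMs); // many periods elapse here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int seen = ticksStarted.get();

        hungWrite.countDown(); // release the stuck tick so the timer thread can exit
        timer.cancel();
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("ticks started while first write hung: "
            + ticksStartedWhileFirstWriteHangs(20L, 300L));
    }
}
```

With a 20 ms period and a 300 ms observation window, roughly fifteen ticks are due, but only the first one starts because the timer's single thread is still inside the blocked task.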
   
   ### Summary and Changelog
   
   - Perform the heartbeat file write on a bounded daemon executor and wait 
with a timeout (`Future.get(heartbeatWriteTimeoutMs)`), so a slow or hung 
storage call can no longer block the timer thread. The write timeout is one 
heartbeat interval; a timed-out write does not advance the last-heartbeat time 
and is retried on the next tick. A cached thread pool is used so that if one 
write hangs, subsequent ticks proceed on a fresh thread.
   - Remove the self-interrupt in `updateHeartbeat()`. Instead of 
`Thread.currentThread().interrupt()`, log a warning and continue refreshing. 
The commit-time check `HeartbeatUtils.abortIfHeartbeatExpired()` remains the 
sole enforcement point for staleness.
   - Shut the executor down in `close()`.
   - Add `TestHoodieHeartbeatClient.testTimerSurvivesHungHeartbeatWrite`, which 
blocks the first heartbeat write and asserts the timer keeps generating 
heartbeats (covering both fixes).
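
The write-with-timeout pattern described in the first bullet can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the class `TimedHeartbeatWriteSketch`, the `tick` method, and the `Runnable` standing in for the heartbeat file write are all hypothetical. Only the general shape (cached daemon pool, `Future.get` with a timeout, last-heartbeat time advanced only on success) follows the description above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not Hudi code: a heartbeat tick that cannot block its caller
// for longer than writeTimeoutMs.
class TimedHeartbeatWriteSketch {
    // Cached pool: if one write hangs, the next tick gets a fresh thread.
    private final ExecutorService writeExecutor = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "heartbeat-write-sketch");
        t.setDaemon(true); // must not keep the JVM alive
        return t;
    });
    private final AtomicLong lastHeartbeatTimeMs = new AtomicLong(-1L);
    private final long writeTimeoutMs;

    TimedHeartbeatWriteSketch(long writeTimeoutMs) {
        this.writeTimeoutMs = writeTimeoutMs;
    }

    /** Runs one heartbeat write; returns true only if it finished within the timeout. */
    boolean tick(Runnable heartbeatWrite) {
        Future<?> pending = writeExecutor.submit(heartbeatWrite);
        try {
            pending.get(writeTimeoutMs, TimeUnit.MILLISECONDS);
            lastHeartbeatTimeMs.set(System.currentTimeMillis()); // advance only on success
            return true;
        } catch (TimeoutException e) {
            pending.cancel(true); // best effort; the next tick simply retries
            return false;
        } catch (ExecutionException e) {
            return false; // the write failed; last-heartbeat time is left untouched
        } catch (InterruptedException e) {
            // Reached only if the calling thread is interrupted from outside.
            pending.cancel(true);
            Thread.currentThread().interrupt();
            return false;
        }
    }

    long lastHeartbeatTimeMs() {
        return lastHeartbeatTimeMs.get();
    }

    void close() {
        writeExecutor.shutdownNow();
    }

    public static void main(String[] args) {
        TimedHeartbeatWriteSketch hb = new TimedHeartbeatWriteSketch(200L);
        CountDownLatch never = new CountDownLatch(1);
        boolean hung = hb.tick(() -> {
            try {
                never.await(); // simulated hung storage write
            } catch (InterruptedException e) {
                // cancelled by the timeout path
            }
        });
        boolean ok = hb.tick(() -> { }); // simulated fast write
        System.out.println("hung write succeeded: " + hung + ", next write succeeded: " + ok);
        hb.close();
    }
}
```

In this sketch the caller waits at most `writeTimeoutMs` per tick, and a hung write costs one pool thread rather than the scheduling thread. Staleness is deliberately not checked here, mirroring the description's choice to leave enforcement to the commit-time check.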
   
   ### Impact
   
   No public API or config change. Heartbeat refresh becomes resilient to 
transient storage latency and driver pauses: a transient stall no longer 
permanently disables heartbeats for an instant. Staleness is still enforced at 
commit time, so correctness of the concurrency guard is unchanged.
   
   ### Risk Level
   
   low
   
   Behavior change is confined to `HoodieHeartbeatClient`. Existing 
`TestHoodieHeartbeatClient` tests pass and a new regression test was added.
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

