balodesecurity opened a new pull request, #8315: URL: https://github.com/apache/hadoop/pull/8315
## Problem

In `BlockReceiver.flushOrSync()`, flush and sync durations are accumulated into a single `flushTotalNanos` counter. When the total duration exceeds the slow-IO threshold, the WARN log only reports the combined value:

```
Slow flushOrSync took 120ms ..., flushTotalNanos=120000000ns
```

This makes it impossible to tell whether the latency originates from the flush step or the fsync step, hindering production diagnosis.

## Fix

Track flush and sync durations in separate counters (`flushTotalNanos`, `syncTotalNanos`). The slow-IO WARN log now reports them independently:

```
Slow flushOrSync took 120ms ..., flushNanos=5000000ns, syncNanos=115000000ns
```

This lets operators immediately determine whether a bottleneck is in the page-cache flush or the disk fsync.

## Testing

- Added `TestBlockReceiverSlowLog#testFlushOrSyncSlowLogContainsSeparateFlushAndSyncNanos`: starts a single-DN MiniDFSCluster with slow-IO threshold set to 0 ms (triggers the log on every call), writes a file and calls `hsync()`, captures the WARN log output, and asserts both `flushNanos=` and `syncNanos=` are present.
- Test passes locally.
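For readers who want to see the shape of the change without opening the diff, the split described under Fix can be sketched roughly as below. This is a standalone illustration, not the code in `BlockReceiver`: the class `FlushSyncTimings`, its method names, and the strict `>` threshold comparison are all assumptions made for the example, and the part of the real log line that is elided as `...` above is simply left out.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of tracking flush and sync time separately.
 * Names and structure are illustrative assumptions, not Hadoop's actual code.
 */
public class FlushSyncTimings {
    // Separate accumulators, mirroring the counter names mentioned in the PR.
    private long flushTotalNanos;
    private long syncTotalNanos;

    /** Add time spent in the flush step. */
    public void addFlush(long nanos) {
        flushTotalNanos += nanos;
    }

    /** Add time spent in the sync (fsync) step. */
    public void addSync(long nanos) {
        syncTotalNanos += nanos;
    }

    /** Combined duration in milliseconds, truncated toward zero. */
    public long totalMillis() {
        return TimeUnit.NANOSECONDS.toMillis(flushTotalNanos + syncTotalNanos);
    }

    /** Assumption: the slow-IO check is a strict "greater than" on milliseconds. */
    public boolean isSlow(long thresholdMs) {
        return totalMillis() > thresholdMs;
    }

    /** Build a message that reports both components independently. */
    public String slowLogMessage() {
        return "Slow flushOrSync took " + totalMillis() + "ms"
            + ", flushNanos=" + flushTotalNanos + "ns"
            + ", syncNanos=" + syncTotalNanos + "ns";
    }

    public static void main(String[] args) {
        FlushSyncTimings timings = new FlushSyncTimings();
        timings.addFlush(5_000_000L);    // 5 ms in flush
        timings.addSync(115_000_000L);   // 115 ms in sync
        if (timings.isSlow(100L)) {      // 100 ms is an arbitrary example threshold
            // prints: Slow flushOrSync took 120ms, flushNanos=5000000ns, syncNanos=115000000ns
            System.out.println(timings.slowLogMessage());
        }
    }
}
```

With the example numbers from the PR description, the message makes it clear that 115 of the 120 ms were spent in sync, which is the distinction the single combined counter could not express.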
