balodesecurity opened a new pull request, #8315:
URL: https://github.com/apache/hadoop/pull/8315

   ## Problem
   
   In `BlockReceiver.flushOrSync()`, flush and sync durations are accumulated 
into a single `flushTotalNanos` counter. When the total duration exceeds the 
slow-IO threshold, the WARN log only reports the combined value:
   
   ```
   Slow flushOrSync took 120ms ..., flushTotalNanos=120000000ns
   ```
   
   This makes it impossible to tell whether the latency comes from the 
flush step or the fsync step, which hinders diagnosis in production.
   
   ## Fix
   
   Track flush and sync durations in separate counters (`flushTotalNanos`, 
`syncTotalNanos`). The slow-IO WARN log now reports them independently:
   
   ```
   Slow flushOrSync took 120ms ..., flushNanos=5000000ns, syncNanos=115000000ns
   ```
   
   This lets operators immediately determine whether a bottleneck is in the 
page-cache flush or the disk fsync.
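
A minimal sketch of the idea, not the actual `BlockReceiver` code: time the flush and the sync steps separately, add each to its own counter, and report both when the total exceeds the threshold. The class `FlushSyncTimingSketch` and the helpers `isSlow` and `formatSlowLog` are hypothetical names used for illustration only. The message format reproduces only the fields shown in the example above; the elided fields are left out.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration; names and structure are assumptions, not Hadoop code. */
final class FlushSyncTimingSketch {
  // Separate running totals, mirroring the counter names in the PR description.
  private long flushTotalNanos;
  private long syncTotalNanos;

  /**
   * Flushes the stream and, if requested, syncs its file descriptor,
   * timing the two steps independently.
   *
   * @return a two-element array: {flushNanos, syncNanos} for this call
   */
  long[] flushOrSync(FileOutputStream out, boolean isSync) throws IOException {
    long begin = System.nanoTime();
    out.flush();
    long flushEnd = System.nanoTime();
    long flushNanos = flushEnd - begin;

    long syncNanos = 0;
    if (isSync) {
      out.getFD().sync(); // force the data to the storage device
      syncNanos = System.nanoTime() - flushEnd;
    }

    flushTotalNanos += flushNanos;
    syncTotalNanos += syncNanos;
    return new long[] {flushNanos, syncNanos};
  }

  long getFlushTotalNanos() {
    return flushTotalNanos;
  }

  long getSyncTotalNanos() {
    return syncTotalNanos;
  }

  /** Assumed rule: slow when the combined duration in ms exceeds the threshold. */
  static boolean isSlow(long flushNanos, long syncNanos, long thresholdMs) {
    return TimeUnit.NANOSECONDS.toMillis(flushNanos + syncNanos) > thresholdMs;
  }

  /** Builds a message carrying both durations (only the fields shown in the PR). */
  static String formatSlowLog(long flushNanos, long syncNanos) {
    long tookMs = TimeUnit.NANOSECONDS.toMillis(flushNanos + syncNanos);
    return "Slow flushOrSync took " + tookMs + "ms, flushNanos=" + flushNanos
        + "ns, syncNanos=" + syncNanos + "ns";
  }

  public static void main(String[] args) throws IOException {
    File tmp = File.createTempFile("flush-sync-sketch", ".bin");
    tmp.deleteOnExit();
    FlushSyncTimingSketch sketch = new FlushSyncTimingSketch();
    try (FileOutputStream out = new FileOutputStream(tmp)) {
      out.write(new byte[4096]);
      long[] t = sketch.flushOrSync(out, true);
      System.out.println(formatSlowLog(t[0], t[1]));
    }
  }
}
```

Keeping the two measurements apart costs one extra `System.nanoTime()` call per invocation, and the combined duration can still be derived by adding them.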
   
   ## Testing
   
   - Added `TestBlockReceiverSlowLog#testFlushOrSyncSlowLogContainsSeparateFlushAndSyncNanos`, 
which starts a single-DN MiniDFSCluster with the slow-IO threshold set to 0 ms 
(triggering the log on every call), writes a file, calls `hsync()`, captures the 
WARN log output, and asserts that both `flushNanos=` and `syncNanos=` are present.
   - Test passes locally.
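
The kind of check the test makes on the captured output can be approximated without a cluster. The sketch below is an assumption-laden illustration, not the test added in this PR: `SlowLogFieldCheck` and its methods are hypothetical, and the patterns assume the `ns` suffix shown in the example log line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for checking a captured slow-IO log line. */
final class SlowLogFieldCheck {
  // Patterns assume the "<name>=<digits>ns" shape from the example log line.
  private static final Pattern FLUSH = Pattern.compile("flushNanos=(\\d+)ns");
  private static final Pattern SYNC = Pattern.compile("syncNanos=(\\d+)ns");

  /** True only when both the flush and the sync durations are reported. */
  static boolean hasSeparateFlushAndSync(String log) {
    return FLUSH.matcher(log).find() && SYNC.matcher(log).find();
  }

  /** Returns the flush duration in nanoseconds, or -1 if the field is absent. */
  static long flushNanos(String log) {
    return extract(FLUSH, log);
  }

  /** Returns the sync duration in nanoseconds, or -1 if the field is absent. */
  static long syncNanos(String log) {
    return extract(SYNC, log);
  }

  private static long extract(Pattern pattern, String log) {
    Matcher m = pattern.matcher(log);
    return m.find() ? Long.parseLong(m.group(1)) : -1L;
  }

  public static void main(String[] args) {
    String line =
        "Slow flushOrSync took 120ms ..., flushNanos=5000000ns, syncNanos=115000000ns";
    System.out.println(hasSeparateFlushAndSync(line));
    System.out.println(flushNanos(line) + " " + syncNanos(line));
  }
}
```

A line in the old format, which carries only `flushTotalNanos=`, fails this check because neither pattern matches it.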


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

