nzw921rx commented on issue #10997:
URL: https://github.com/apache/seatunnel/issues/10997#issuecomment-4610069096

   > 
   > Thanks [@nzw921rx](https://github.com/nzw921rx). I don't think 
[#10836](https://github.com/apache/seatunnel/pull/10836) applies to my case.
   > 
   > [#10836](https://github.com/apache/seatunnel/pull/10836) only triggers 
during master failover, and the fix persists the readyToCloseStartingTask set 
that a new master loses on takeover.
   > 
   > But no master switch happened in my case. As reported, the 
worker-to-master TCP connection was re-initialized 6 times without any cluster 
membership change — same master throughout, no leader election. The failover 
path that [#10836](https://github.com/apache/seatunnel/pull/10836) fixes is 
never hit here.
   
   1. You are right. Please try adjusting the following parameters to see 
whether they alleviate the issue:
   
   ```yaml
   job-metrics-backup-interval: 300              # 60s → 300s, reduces scheduled reporting by 80%
   hazelcast.operation.generic.thread.count: 100 # 50 → 100, expands the thread pool
   ```
   It is also recommended to configure `checkpoint.interval` (for example 
300000 ms) for BATCH jobs, so that the final checkpoint has timeout 
protection and the job does not hang permanently even if a barrier is lost.
   
   2. I analyzed the root cause at the source-code level. The core issue is 
that the `ReportMetricsOperation` implementation exhausts the Hazelcast 
generic-operation thread pool under large-scale workloads, which in turn 
triggers connection flapping.
   
      The specific mechanism: `TaskExecutionService.collectLocalMetricsMap()` 
serializes and sends both `finishedExecutionContexts` (completed tasks) and 
`executionContexts` (running tasks) in full to the master on every report, 
using a blocking `invoke.get()` call. Under your 1000+ job high-frequency 
submission/completion scenario, the 8 workers' scheduled reports (every 60s) 
plus the event-driven reports triggered on each task completion result in 40-60 
large-payload `ReportMetricsOperation`s hitting the master's 50 generic threads 
per minute. These operations execute slowly due to distributed IMap writes, 
saturating the thread pool. Once full, critical operations like 
`CheckpointBarrierTriggerOperation` and `BarrierFlowOperation` queue up and 
time out, ultimately triggering connection rebuilds that discard all in-flight 
operations.
   
      Your logs (1419 slow-operation warnings in one hour and 6 connection 
rebuilds) confirm this chain.
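
The saturation effect described above can be reproduced in miniature. The sketch below is a simplified model, not SeaTunnel or Hazelcast code: it uses a plain `java.util.concurrent` fixed thread pool to stand in for the generic-operation pool, and the class name, method name, pool sizes, and durations are all made up for illustration. Slow tasks play the role of large `ReportMetricsOperation`s, and a trivial task submitted afterwards plays the role of a barrier operation with a deadline.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model: a shared fixed-size pool serves both slow "report"
// operations and a fast "critical" operation that must finish before a deadline.
class GenericPoolSaturationSketch {

    // Returns true if the critical operation completes within timeoutMillis
    // after slowOps slow operations were queued ahead of it on the same pool.
    static boolean criticalOpCompletesInTime(
            int poolSize, int slowOps, long slowOpMillis, long timeoutMillis)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Stand-ins for slow, large-payload report operations.
            for (int i = 0; i < slowOps; i++) {
                pool.submit(() -> {
                    try {
                        Thread.sleep(slowOpMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Stand-in for a barrier-style operation: cheap to run, but it
            // has to wait in the same queue as the slow operations.
            Future<String> critical = pool.submit(() -> "barrier-ack");
            try {
                critical.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                critical.cancel(true); // deadline passed while still queued
                return false;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 4 threads, 8 slow ops: every thread is busy and more work is queued.
        System.out.println("saturated pool, critical op in time: "
                + criticalOpCompletesInTime(4, 8, 500, 100));
        // 4 threads, 2 slow ops: two threads are free for the critical op.
        System.out.println("pool with headroom, critical op in time: "
                + criticalOpCompletesInTime(4, 2, 500, 100));
    }
}
```

In this model the critical task itself takes almost no time; it misses its deadline only because it waits behind slow tasks. That is why both lengthening the report interval (fewer slow tasks) and raising the thread count (more headroom) are worth trying.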
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
