nzw921rx commented on issue #10997: URL: https://github.com/apache/seatunnel/issues/10997#issuecomment-4610069096
> > Thanks [@nzw921rx](https://github.com/nzw921rx). I don't think [#10836](https://github.com/apache/seatunnel/pull/10836) applies to my case.
> >
> > [#10836](https://github.com/apache/seatunnel/pull/10836) only triggers during master failover, and the fix persists the readyToCloseStartingTask set that a new master loses on takeover.
> >
> > But no master switch happened in my case. As reported, the worker-to-master TCP connection was re-initialized 6 times without any cluster membership change (same master throughout, no leader election). The failover path that [#10836](https://github.com/apache/seatunnel/pull/10836) fixes is never hit here.

1. You are right. Please try adjusting the following parameters to see whether they alleviate the issue:

   ```yaml
   job-metrics-backup-interval: 300  # 60s → 300s, reducing scheduled reporting by 80%
   hazelcast.operation.generic.thread.count: 100  # 50 → 100, expanding thread pool
   ```

   It is also recommended to configure `checkpoint.interval` (for example 300000 ms) for BATCH jobs, so that the final checkpoint has timeout protection and the job does not hang permanently even if a barrier is lost.

2. I traced the root cause in the source code. The core issue is that the `ReportMetricsOperation` implementation exhausts the Hazelcast generic-operation thread pool under large-scale workloads, which in turn triggers connection flapping.

   The specific mechanism: on every report, `TaskExecutionService.collectLocalMetricsMap()` serializes both `finishedExecutionContexts` (completed tasks) and `executionContexts` (running tasks) in full and sends them to the master using a blocking `invoke.get()` call. In your scenario of 1000+ jobs submitted and completed at high frequency, the 8 workers' scheduled reports (every 60s) plus the event-driven reports triggered on each task completion add up to 40-60 large-payload `ReportMetricsOperation`s per minute hitting the master's 50 generic threads.
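
The figures quoted above can be sanity-checked with a back-of-envelope model. The sketch below is only an illustration: it assumes reports arrive at a steady rate and applies Little's law (average busy threads = arrival rate × mean service time). The function names are hypothetical and are not part of SeaTunnel or Hazelcast, and the per-report service time is not measured anywhere in this thread, so the script only computes how long each report would need to hold a thread for the pool to be fully occupied.

```python
# Back-of-envelope model of the report load described above.
# Only these figures come from the comment: 8 workers, a 60s -> 300s backup
# interval, 40-60 reports per minute, and a 50 -> 100 generic thread pool.
# The function names are illustrative and are not SeaTunnel or Hazelcast APIs.


def scheduled_reports_per_minute(workers: int, interval_seconds: float) -> float:
    """Scheduled metric reports arriving at the master per minute."""
    return workers * 60.0 / interval_seconds


def busy_threads(ops_per_minute: float, mean_service_seconds: float) -> float:
    """Little's law: average occupied threads = arrival rate * mean service time."""
    return ops_per_minute / 60.0 * mean_service_seconds


def saturating_service_seconds(ops_per_minute: float, pool_size: int) -> float:
    """Mean time each operation must hold a thread to keep the whole pool busy."""
    return pool_size * 60.0 / ops_per_minute


before = scheduled_reports_per_minute(workers=8, interval_seconds=60)
after = scheduled_reports_per_minute(workers=8, interval_seconds=300)
reduction = 1 - after / before

print(f"scheduled reports per minute: {before:.1f} -> {after:.1f}")
print(f"reduction in scheduled reports: {reduction:.0%}")

for rate in (40, 60):
    for pool in (50, 100):
        limit = saturating_service_seconds(rate, pool)
        print(f"{rate} reports/min, {pool} threads: saturated at ~{limit:.0f}s per report")
```

The interval change only affects the scheduled reports, so the event-driven reports sent on each task completion are left out of the reduction figure. The model also treats the pool as serving reports alone; any other generic operations sharing those threads would lower the service time needed to saturate it.
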
These operations execute slowly because of distributed IMap writes, and they saturate the thread pool. Once the pool is full, critical operations such as `CheckpointBarrierTriggerOperation` and `BarrierFlowOperation` queue up and time out, ultimately triggering connection rebuilds that discard all in-flight operations. Your logs (1419 slow-operation warnings in one hour and 6 connection rebuilds) confirm this chain.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
