GitHub user BilgeKaanGencdogan edited a discussion: Task dispatch fails with 
connection refused error to worker host ip:1234

Before I describe the situation, here are the technical details of the 
system:
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* DolphinScheduler version: 3.2.0. It is standalone (not a cluster), not 
running on Docker or k8s, AND THIS IS PROD.
* Memory:
```
[user@user ~]$ free -h
              total        used        free      shared  buff/cache   available
Mem:          192Gi        64Gi       105Gi       2.4Gi        23Gi       124Gi
Swap:         8.0Gi          0B       8.0Gi

```
* Disk usage:
```
[user@user ~]$ df -h
Filesystem                 Size  Used Avail Use% Mounted on
devtmpfs                    97G     0   97G   0% /dev
tmpfs                       97G  1.1M   97G   1% /dev/shm
tmpfs                       97G  2.3G   94G   3% /run
tmpfs                       97G     0   97G   0% /sys/fs/cgroup
/dev/mapper/rhel-root      202G   16G  187G   8% /
/dev/mapper/rhel-usr        10G  5.3G  4.8G  53% /usr
/dev/mapper/vgdata-lvdata  400G   20G  381G   5% /data
/dev/sda2                  2.0G  439M  1.6G  22% /boot
/dev/sda1                  2.0G  5.9M  2.0G   1% /boot/efi
tmpfs                       20G     0   20G   0% /run/user/1007
tmpfs                       20G  8.0K   20G   1% /run/user/1006
```
* Java version:
```
[user@user ~]$ java --version
openjdk 11.0.24 2024-07-16 LTS
OpenJDK Runtime Environment (Red_Hat-11.0.24.0.8-2) (build 11.0.24+8-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-11.0.24.0.8-2) (build 11.0.24+8-LTS, mixed 
mode, sharing)
```
* `jvm_args_env.sh` of both the master and worker servers:
```
[root@user bin]#  cat /data/dolphin/master-server/bin/jvm_args_env.sh
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

-Xms32g
-Xmx32g
-Xmn16g

-XX:+IgnoreUnrecognizedVMOptions
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:gc.log

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=dump.hprof

-Duser.timezone=${SPRING_JACKSON_TIME_ZONE}
[root@user bin]# cat /data/dolphin/worker-server/bin/jvm_args_env.sh
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

-Xms32g
-Xmx32g
-Xmn16g

-XX:+IgnoreUnrecognizedVMOptions
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-Xloggc:gc.log

-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=dump.hprof

-Duser.timezone=${SPRING_JACKSON_TIME_ZONE}
```
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
* Below are two logs, one from the master-server and one from the worker-server.
`The first is the master-server's log:`
```
[WARN] 2025-10-02 09:01:26.576 +0300 org.apache.dolphinscheduler.remote.NettyRemotingClient:[321] - [WorkflowInstance-0][TaskInstance-0] - connect to Host(ip=10.200.109.63, port=1234) error
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /10.200.109.63:1234
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
        at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
        at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.lang.Thread.run(Thread.java:829)
[ERROR] 2025-10-02 09:01:26.576 +0300 org.apache.dolphinscheduler.server.master.runner.GlobalTaskDispatchWaitingQueueLooper:[87] - [WorkflowInstance-0][TaskInstance-0] - Dispatch task failed
org.apache.dolphinscheduler.server.master.exception.TaskDispatchException: Dispatch task to 10.200.109.63:1234 failed
        at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.doDispatch(BaseTaskDispatcher.java:101)
        at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.dispatchTask(BaseTaskDispatcher.java:74)
        at org.apache.dolphinscheduler.server.master.runner.GlobalTaskDispatchWaitingQueueLooper.run(GlobalTaskDispatchWaitingQueueLooper.java:79)
Caused by: org.apache.dolphinscheduler.remote.exceptions.RemotingException: connect to : Host(ip=10.200.109.63, port=1234) fail
        at org.apache.dolphinscheduler.remote.NettyRemotingClient.sendSync(NettyRemotingClient.java:210)
        at org.apache.dolphinscheduler.server.master.rpc.MasterRpcClient.sendSyncCommand(MasterRpcClient.java:49)
        at org.apache.dolphinscheduler.server.master.runner.BaseTaskDispatcher.doDispatch(BaseTaskDispatcher.java:87)
        ... 2 common frames omitted
```
`The second is the worker-server's log:`
```
[INFO] 2025-10-02 09:01:25.764 +0300 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[289] - [WorkflowInstance-59856][TaskInstance-507499] - The current execute mode isn't develop mode, will clear the task execute file: /data/dolphin/exec/process/default/15236034355840/15257910743307_11/59856/507499
[INFO] 2025-10-02 09:01:25.765 +0300 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[304] - [WorkflowInstance-59856][TaskInstance-507499] - Success clear the task execute file: /data/dolphin/exec/process/default/15236034355840/15257910743307_11/59856/507499
[INFO] 2025-10-02 09:01:25.765 +0300 org.apache.dolphinscheduler.server.worker.runner.WorkerTaskExecuteRunnable:[330] - [WorkflowInstance-59856][TaskInstance-507499] - FINALIZE_SESSION
[INFO] 2025-10-02 09:01:52.136 +0300 org.apache.zookeeper.ClientCnxn:[1171] - [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server localhost/0:0:0:0:0:0:0:1:2181.
[INFO] 2025-10-02 09:01:52.136 +0300 org.apache.zookeeper.ClientCnxn:[1173] - [WorkflowInstance-0][TaskInstance-0] - SASL config status: Will not attempt to authenticate using SASL (unknown error)
[INFO] 2025-10-02 09:01:59.281 +0300 org.apache.zookeeper.ClientCnxn:[1005] - [WorkflowInstance-0][TaskInstance-0] - Socket connection established, initiating session, client: /0:0:0:0:0:0:0:1:40934, server: localhost/0:0:0:0:0:0:0:1:2181
[INFO] 2025-10-02 09:01:59.282 +0300 org.apache.zookeeper.ClientCnxn:[1444] - [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on server localhost/0:0:0:0:0:0:0:1:2181, session id = 0x10000001be900ac, negotiated timeout = 30000
[INFO] 2025-10-02 09:01:59.282 +0300 org.apache.curator.framework.state.ConnectionStateManager:[252] - [WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED
[INFO] 2025-10-02 09:01:59.580 +0300 org.apache.dolphinscheduler.server.worker.processor.WorkerTaskUpdatePidAckProcessor:[59] - [WorkflowInstance-0][TaskInstance-507499] - task execute update pid ack command : TaskUpdateRuntimeAckMessage(success=true, taskInstanceId=507499)
[INFO] 2025-10-02 09:01:59.580 +0300 org.apache.dolphinscheduler.server.worker.processor.WorkerTaskExecuteResultAckProcessor:[58] - [WorkflowInstance-0][TaskInstance-507499] - Receive task execute response ack command : TaskExecuteResultMessageAck(super=BaseMessage(messageSenderAddress=10.200.109.63:5678, messageReceiverAddress=10.200.109.63:1234, messageSendTime=1759384886490), taskInstanceId=507499, success=true)
```

### **WHAT HAPPENED?**

The DolphinScheduler worker service on this machine experienced a critical 
failure on October 2, 2025 at 09:40 AM: port 1234 stopped listening, and the 
master server started receiving "Connection refused" errors. CPU usage climbed 
to 100%, and DolphinScheduler jobs did not finish properly and were left 
hanging, so tasks piled up like a traffic jam. However, all of the services 
stayed up the whole time.
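
The symptom described above (processes still running while the port no longer accepts connections) can be checked separately from a process check. The following is a minimal sketch, not part of DolphinScheduler: `port_is_listening` is a hypothetical helper name, and the demonstration runs against a throwaway local listener rather than the real worker.

```python
import socket

def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (nothing listening) and timeouts.
        return False

# Demonstration against a throwaway local listener (not the real worker).
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
server.listen(1)
demo_port = server.getsockname()[1]
print(port_is_listening("127.0.0.1", demo_port))   # while the listener is open
server.close()
print(port_is_listening("127.0.0.1", demo_port))   # after the listener is closed
```

For the incident here, the equivalent call would be `port_is_listening("10.200.109.63", 1234)`, run from the master host. A probe like this could be scheduled alongside the existing process checks, since the process check alone did not catch the failure.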

### **OUR FINDINGS**

The root cause was a catastrophic thread leak, not a memory shortage. The 
worker accumulated 21,466+ threads (growing at ~100 threads/minute) over 77.7 
days of operation, consuming approximately 21 GB of RAM for thread stacks 
alone. This caused garbage collection pauses to degrade from 80 ms to over 
1,100 ms, making the system unresponsive. Eventually, the Netty event executor 
terminated, port 1234 stopped listening, and the worker became non-functional. 
The system was manually restarted at 11:38 AM and has been running since, but 
the thread leak is still active and growing, so another failure is inevitable 
within days or weeks.
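
The figures above can be sanity-checked with some quick arithmetic. The sketch below takes the quoted numbers as inputs and assumes a 1 MiB stack per thread, which is the usual HotSpot default on 64-bit Linux when `-Xss` is not set; that is an assumption about this host, not something measured. Stack memory is reserved address space whose pages are normally touched lazily, so the result is an upper bound rather than measured resident RAM.

```python
# Sanity check of the numbers quoted in the findings (inputs copied from the post).
THREADS = 21_466            # observed worker thread count
UPTIME_DAYS = 77.7          # worker uptime before the failure
QUOTED_RATE_PER_MIN = 100   # quoted growth rate
STACK_MIB = 1               # ASSUMPTION: default -Xss of 1 MiB per thread

# Upper bound on address space reserved for thread stacks.
stack_gib = THREADS * STACK_MIB / 1024

# Average growth implied by the thread count and the uptime.
uptime_min = UPTIME_DAYS * 24 * 60
implied_rate_per_min = THREADS / uptime_min

# How long the quoted rate would need to reach the observed count.
hours_at_quoted_rate = THREADS / QUOTED_RATE_PER_MIN / 60

print(f"stack reservation: ~{stack_gib:.1f} GiB")
print(f"implied average growth: ~{implied_rate_per_min:.2f} threads/min")
print(f"time to {THREADS} threads at {QUOTED_RATE_PER_MIN}/min: ~{hours_at_quoted_rate:.1f} h")
```

Under these assumptions the ~21 GB figure is consistent with the thread count, but the average growth over the full uptime works out to roughly 0.2 threads per minute, while 100 threads per minute would reach the observed count in under four hours. The ~100/minute rate therefore presumably describes a recent burst rather than the whole 77.7 days, which is worth confirming with repeated thread-count measurements.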

### **OUR PROPOSED SOLUTIONS**

* Reduce Young Generation Size (prevent long GC pauses)
```
# Edit worker JVM configuration
vim /data/dolphin/worker-server/bin/jvm_args_env.sh

# Change from:
-Xmn16g

# To:
-Xmn8g

# Restart worker
```
* Reduce Concurrent Task Limit
```
# Edit worker configuration
vim /data/dolphin/worker-server/conf/application.yaml

# Find and change:
worker:
  exec-threads: 100  # Change to 50

# Add new lines:
  max-cpu-load-avg: 0.7
  reserved-memory: 0.3

# Restart worker
```
* Investigate DolphinScheduler Thread Pool Bug.
  This is the actual root cause that must be fixed. Suspected causes:
  - Bug in DolphinScheduler's thread pool implementation
  - Task completion handlers not cleaning up threads
  - Executor service not properly bounded
  - Thread factory creating threads without limit
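
To narrow down which of the suspected thread pool causes listed above applies, one option is to group the threads in a thread dump by name. The sketch below assumes a dump saved from `jstack <worker-pid>`, where each thread header line starts with the quoted thread name; the function name `count_thread_groups` and the sample dump are hypothetical and for illustration only.

```python
import re
from collections import Counter

# Thread header lines in a HotSpot thread dump start with the quoted thread name.
THREAD_HEADER = re.compile(r'^"([^"]+)"')

def count_thread_groups(dump_text: str) -> Counter:
    """Count threads per name family, with digit runs collapsed to '#'."""
    groups = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            groups[re.sub(r"\d+", "#", match.group(1))] += 1
    return groups

# HYPOTHETICAL sample; real input would be the saved output of `jstack <worker-pid>`.
sample_dump = '''\
"pool-7-thread-1" #41 prio=5 os_prio=0 tid=0x1 nid=0x2 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"pool-8-thread-1" #42 prio=5 os_prio=0 tid=0x3 nid=0x4 waiting on condition
"pool-9-thread-1" #43 prio=5 os_prio=0 tid=0x5 nid=0x6 waiting on condition
"main" #1 prio=5 os_prio=0 tid=0x7 nid=0x8 runnable
'''

# Largest families first: the leaking pool should dominate the list.
for family, count in count_thread_groups(sample_dump).most_common():
    print(f"{count:6d}  {family}")
```

A family such as `pool-#-thread-#` with thousands of members and an ever-increasing pool number would suggest executors that are created repeatedly and never shut down, whereas a named family would point at a specific component. Comparing two dumps taken some minutes apart shows which family is actually growing.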

### **WHAT TO EXPECT**

I would appreciate it if you could carefully read all the information I have 
provided and assess the solutions we have decided on. This system is critical 
to us, so we want to proceed very cautiously before implementing them. Can 
these solutions fix the problem described here? Also, from what I found on the 
web, JVM heap memory management can play a critical role here, so I would 
appreciate guidance on JVM heap management with regard to this performance 
issue.

Thanks in advance 

GitHub link: https://github.com/apache/dolphinscheduler/discussions/17571
