[GitHub] [dolphinscheduler] minyk opened a new issue, #13913: [Bug] [MasterFailoverService] syncMasterNodes does not make current slot correctly after zookeeper disconnect

via GitHub Tue, 11 Apr 2023 18:19:46 -0700


minyk opened a new issue, #13913:
URL: https://github.com/apache/dolphinscheduler/issues/13913


   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/dolphinscheduler/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### What happened
   
   We use Dolphinscheduler on K8S and deploy 3 masters. Version is 3.0.5
   
   After zookeeper disconnect, registry could reconnect successfully. but 
current slot is miscalculated. below is our logs(edited):
   
   master0:
   ```
   19:50:00.961 MasterSchedulerBootstrap:[233] - Master schedule bootstrap loop 
command success, command size: 6, current slot: 2, total slot size: 3
   19:50:21.305 ServerNodeManager:[302] - master node : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 
down.
   19:50:21.316 ServerNodeManager:[364] - update master nodes, master size: 3, 
slot: 0, addr: dolphinscheduler-master-0.dolphinscheduler-master-headless:5678
   19:50:21.323 ServerNodeManager:[298] - master node : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 
added.
   19:50:21.330 ServerNodeManager:[364] - update master nodes, master size: 3, 
slot: 0, addr: dolphinscheduler-master-0.dolphinscheduler-master-headless:5678
   19:50:22.019 MasterSchedulerBootstrap:[233] - Master schedule bootstrap loop 
command success, command size: 1, current slot: 0, total slot size: 3
   ```
   
   master1:
   ```
   19:50:21.304 ServerNodeManager:[302] - [WorkflowInstance-0][TaskInstance-0] 
- master node : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 
down.
   19:50:21.312 ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] 
- update master nodes, master size: 3, slot: 1, addr: 
dolphinscheduler-master-1.dolphinscheduler-master-headless:5678
   19:50:21.319 ServerNodeManager:[298] - [WorkflowInstance-0][TaskInstance-0] 
- master node : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 
added.
   19:50:21.327 ServerNodeManager:[364] - [WorkflowInstance-0][TaskInstance-0] 
- update master nodes, master size: 3, slot: 1, addr: 
dolphinscheduler-master-1.dolphinscheduler-master-headless:5678
   ```
   
   master2:
   ```
   19:50:01.097 MasterSchedulerBootstrap:[233] - Master schedule bootstrap loop 
command success, command size: 7, current slot: 0, total slot size: 3
   19:50:21.305 ServerNodeManager:[302] - master node : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 
down.
   19:50:21.324 ServerNodeManager:[364] - update master nodes, master size: 0, 
slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
   19:50:21.428 ServerNodeManager:[298] - master node : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 
added.
   19:50:21.434 ServerNodeManager:[364] - update master nodes, master size: 0, 
slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
   ```
   
   master2 is briefly disconnected and reconnected. after disconnect master0 
changed slot from 2 to 0. then reconnect, master2 is still slot 0, not 2 
causing our 1/3 requests are not handled by any master.
   
   master2 detailed logs are here:
   ```
   [INFO] 2023-04-09 19:50:20.701 +0900 org.apache.zookeeper.ClientCnxn:[1158] 
- [WorkflowInstance-0][TaskInstance-0] - Unable to read additional data from 
server sessionid 0x3032c34cb87185b, likely server has closed socket, closing 
socket connection and attempting reconnect
   [WARN] 2023-04-09 19:50:20.719 +0900 
org.apache.dolphinscheduler.server.master.runner.WorkflowExecuteThreadPool:[109]
 - [WorkflowInstance-577314][TaskInstance-0] - The workflow has been executed 
by another thread
   [INFO] 2023-04-09 19:50:20.804 +0900 
org.apache.curator.framework.state.ConnectionStateManager:[251] - 
[WorkflowInstance-0][TaskInstance-0] - State change: SUSPENDED
   [WARN] 2023-04-09 19:50:20.808 +0900 
org.apache.dolphinscheduler.plugin.registry.zookeeper.ZookeeperConnectionStateListener:[52]
 - [WorkflowInstance-0][TaskInstance-0] - Registry suspended
   [WARN] 2023-04-09 19:50:20.811 +0900 
org.apache.dolphinscheduler.server.master.registry.MasterConnectionStateListener:[48]
 - [WorkflowInstance-0][TaskInstance-0] - registry connection state is 
SUSPENDED, ready to retry connection
   [INFO] 2023-04-09 19:50:21.292 +0900 org.apache.zookeeper.ClientCnxn:[1025] 
- [WorkflowInstance-0][TaskInstance-0] - Opening socket connection to server 
zookeeper-1.zookeeper-headless/10.233.125.25:2181. Will not attempt to 
authenticate using SASL (unknown error)
   [INFO] 2023-04-09 19:50:21.295 +0900 org.apache.zookeeper.ClientCnxn:[879] - 
[WorkflowInstance-0][TaskInstance-0] - Socket connection established to 
zookeeper-1.zookeeper-headless/10.233.125.25:2181, initiating session
   [INFO] 2023-04-09 19:50:21.299 +0900 org.apache.zookeeper.ClientCnxn:[1299] 
- [WorkflowInstance-0][TaskInstance-0] - Session establishment complete on 
server zookeeper-1.zookeeper-headless/10.233.125.25:2181, sessionid = 
0x3032c34cb87185b, negotiated timeout = 30000
   [INFO] 2023-04-09 19:50:21.300 +0900 
org.apache.curator.framework.state.ConnectionStateManager:[251] - 
[WorkflowInstance-0][TaskInstance-0] - State change: RECONNECTED
   [INFO] 2023-04-09 19:50:21.305 +0900 
org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[302] - 
[WorkflowInstance-0][TaskInstance-0] - master node : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 
down.
   [INFO] 2023-04-09 19:50:21.304 +0900 
org.apache.dolphinscheduler.server.master.registry.MasterRegistryClient:[124] - 
[WorkflowInstance-0][TaskInstance-0] - MASTER node deleted : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
   [WARN] 2023-04-09 19:50:21.323 +0900 
org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[362] - 
[WorkflowInstance-0][TaskInstance-0] - current 
addr:dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 is not in 
active master list
   [INFO] 2023-04-09 19:50:21.324 +0900 
org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[364] - 
[WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 0, 
slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
   [INFO] 2023-04-09 19:50:21.428 +0900 
org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[298] - 
[WorkflowInstance-0][TaskInstance-0] - master node : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 
added.
   [WARN] 2023-04-09 19:50:21.434 +0900 
org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[362] - 
[WorkflowInstance-0][TaskInstance-0] - current 
addr:dolphinscheduler-master-2.dolphinscheduler-master-headless:5678 is not in 
active master list
   [INFO] 2023-04-09 19:50:21.434 +0900 
org.apache.dolphinscheduler.server.master.registry.ServerNodeManager:[364] - 
[WorkflowInstance-0][TaskInstance-0] - update master nodes, master size: 0, 
slot: 0, addr: dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
   [INFO] 2023-04-09 19:50:29.353 +0900 
org.apache.dolphinscheduler.server.master.service.FailoverService:[55] - 
[WorkflowInstance-0][TaskInstance-0] - Master failover finished, masterServer: 
dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
   [INFO] 2023-04-09 19:50:29.353 +0900 
org.apache.dolphinscheduler.server.master.registry.MasterRegistryDataListener:[63]
 - [WorkflowInstance-0][TaskInstance-0] - master node added : 
/nodes/master/dolphinscheduler-master-2.dolphinscheduler-master-headless:5678
   ``` 
   
   ### What you expected to happen
   
   The reconnected master2 should calculate current slot 2 not 0.
   
   ### How to reproduce
   
   I'm not sure how to reproduce this problem. But in our envs, this happens 
once every 3 - 4 days.
   
   ### Anything else
   
   _No response_
   
   ### Version
   
   3.0.x
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: 
[email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [dolphinscheduler] minyk opened a new issue, #13913: [Bug] [MasterFailoverService] syncMasterNodes does not make current slot correctly after zookeeper disconnect

Reply via email to