icodening opened a new issue, #10181: URL: https://github.com/apache/dubbo/issues/10181
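### Phenomenon
When a ``Provider`` goes online or offline normally, ``Consumer/Provider 1`` may also stay offline for a long time, so ``Consumer 2`` gets a "no provider available" exception when it makes a call. The call relationship is shown in the figure below.

### Environment
- Dubbo version: 2.7.16-SNAPSHOT
- Operating System version: macOS 10.15.7
- Java version: 1.8

### Steps to reproduce this issue
1. Start ZooKeeper.
2. Start a ``provider`` that exposes 100 interfaces.
3. Start a ``consumer`` configured with ``zk.session.expire=5000`` (to make testing easier). The consumer exposes 1 interface and references the provider's 100 interfaces.
4. Start another ``provider``, and while the ``consumer`` is running its ZK listener callbacks, simulate a short network glitch that lasts until the session times out. (PS: I deployed ZooKeeper and the consumer on different machines and simulated this by cutting the network.)

Pls. provide [GitHub address] to reproduce this issue.

### Expected Behavior
After the interface exposed by the ``consumer`` has been removed by ZooKeeper, it is re-registered with ZooKeeper shortly after the network recovers and can serve requests normally.

### Actual Behavior
After the interface exposed by the ``consumer`` has been removed by ZooKeeper and the network has recovered, the service published by the ``consumer`` still is not re-registered for a long time.

### Analysis
1. When a ``provider`` node comes online, every consumer that consumes that provider has its listener called back (``CuratorZookeeperClient:325``).
2. If the ZK access at this point in the code fails, ``SendThread`` publishes a ``Disconnected`` event, and the call is retried on ``EventThread`` according to the hard-coded ``RetryPolicy`` (at most 1 retry, with a 1000 ms wait) (``CuratorZookeeperClient:69``).
3. Suppose the provider publishes 100 interfaces. The consumer then runs the callback logic at ``CuratorZookeeperClient.java:325`` 100 times, so ``EventThread`` retries 100 times and can block for up to 100 seconds.
4. After the network recovers, the ZK client in the consumer reconnects to the ZK server successfully and onConnected is called back (``SendThread:1275``). This checks whether the reconnected session has already expired (``negotiatedSessionTimeout <= 0``). If it has, the current ZK client state is set to ``CLOSE``. That destroys the ``SendThread`` heartbeat thread (``SendThread:1044``) and publishes an ``Expired`` event to the ``EventThread`` event queue. Curator is supposed to consume this event, rebuild the ZooKeeper client, and reconnect to the ZK server (source location: ``org.apache.curator.ConnectionState:315``).
5. Because ``SendThread`` has stopped, the heartbeat stops as well, and the associated NIO ``selector`` is closed (``SendThread:1187``). Every subsequent ``getChildren`` call therefore fails and is then retried.
6. The retries block the ``EventThread`` event thread. The ``Expired`` event from point 4 is already in the queue, but Curator's built-in listener cannot consume it in time to rebuild the ZooKeeper client (``ConnectionState:315``). It has to wait in the queue until all earlier events have finished, that is, until the callbacks for all 100 interfaces have completed.

### Flow Chart
<img width="725" alt="image" src="https://user-images.githubusercontent.com/42876375/174644437-59d8bfe1-2f8a-4fa0-b8bf-07a7173d75ff.png">

### Thread stack
```
"main-EventThread" #17 daemon prio=5 os_prio=31 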
tid=0x00007fb4792fa000 nid=0xa203 waiting on condition [0x0000700004c0b000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at java.lang.Thread.sleep(Thread.java:340)
	at java.util.concurrent.TimeUnit.sleep(TimeUnit.java:386)
	at org.apache.curator.RetryLoop$1.sleepFor(RetryLoop.java:75)
	at org.apache.curator.retry.SleepingRetry.allowRetry(SleepingRetry.java:46)
	at org.apache.curator.retry.RetryNTimes.allowRetry(RetryNTimes.java:24)
	at org.apache.curator.RetryLoop.takeException(RetryLoop.java:174)
	at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:70)
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
	at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:228)
	at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:219)
	at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:41)
	at org.apache.dubbo.remoting.zookeeper.curator.CuratorZookeeperClient$CuratorWatcherImpl.process(CuratorZookeeperClient.java:341)
	at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:83)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:533)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:508)
```
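
The queueing effect described in Analysis points 3 and 6 can be illustrated with a minimal sketch. It is not Dubbo, Curator, or ZooKeeper code: a plain single-thread executor stands in for ``EventThread``, each queued task stands in for one watcher callback that sleeps once before retrying, and the last task stands in for the handler of the ``Expired`` event. The class and method names (`EventThreadBlockingSketch`, `worstCaseBlockMillis`, `measureExpiredDelayMillis`) are hypothetical, and the sleep is scaled down from the 1000 ms in the report so that the example finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class EventThreadBlockingSketch {

    /** Upper bound on blocking time if every callback sleeps once before its single retry. */
    public static long worstCaseBlockMillis(int callbacks, long retrySleepMillis) {
        return callbacks * retrySleepMillis;
    }

    /**
     * Queues `callbacks` sleeping tasks on one thread, then a final "Expired" task,
     * and returns how long the final task had to wait before it ran.
     */
    public static long measureExpiredDelayMillis(int callbacks, long retrySleepMillis)
            throws InterruptedException {
        // Assumption: a single-thread executor models the single event thread.
        ExecutorService eventThread = Executors.newSingleThreadExecutor();
        CountDownLatch expiredHandled = new CountDownLatch(1);
        long start = System.nanoTime();
        try {
            for (int i = 0; i < callbacks; i++) {
                // Stand-in for a watcher callback whose call fails and sleeps before retrying.
                eventThread.execute(() -> {
                    try {
                        TimeUnit.MILLISECONDS.sleep(retrySleepMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Stand-in for the Expired event: it is queued behind every callback.
            eventThread.execute(expiredHandled::countDown);
            expiredHandled.await();
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } finally {
            eventThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Numbers from the report: 100 callbacks, 1000 ms sleep per retry.
        System.out.println("worst case from the report (ms): " + worstCaseBlockMillis(100, 1000));
        // Scaled-down run: 100 callbacks, 10 ms each.
        System.out.println("measured delay, scaled down (ms): " + measureExpiredDelayMillis(100, 10));
    }
}
```

In this model the last task cannot start until every earlier task has finished sleeping, which matches the reported behaviour where the ``Expired`` event waits behind the 100 interface callbacks.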
