viongpanzi opened a new issue #7893: Will some PathChildrenCacheEvent be missed after the connection to zk disconnected
URL: https://github.com/apache/incubator-druid/issues/7893

Hi all,

We have a problem in our production cluster. Details of the cluster:

- version: 0.13.0
- number of segments: more than 6 million
- GC: G1 (a single full GC takes more than 120 seconds)
- incremental poll is enabled

After each full GC (which takes more than 120 seconds), one coordinator's connection to ZooKeeper is dropped due to a timeout. The other coordinator soon becomes the leader, and a new full GC happens after it polls all data segments from the metadata store. Its connection to ZooKeeper is then dropped as well, and the two coordinators are trapped in a loop. However, if we restart both coordinator services, they work well for days.

To find the cause, we used MAT (Eclipse Memory Analyzer Tool) to analyze a heap dump from one of the two coordinators, and it reports the following info:

After tracing the call stack to the zNodes and checking the logs of the coordinator service, we found some log entries about ZooKeeper node events that look problematic.
```
09/Jun/2019 20:49:42,970 [ServerInventoryView-0] WARN org.apache.druid.curator.inventory.CuratorInventoryManager - CuratorInventoryManager: Exception while getting data for node /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:114) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1215) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
    at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327) ~[curator-framework-4.1.0.jar:4.1.0]
    at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316) ~[curator-framework-4.1.0.jar:4.1.0]
    at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) ~[curator-client-4.1.0.jar:?]
    at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) ~[curator-client-4.1.0.jar:?]
    at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313) ~[curator-framework-4.1.0.jar:4.1.0]
    at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304) ~[curator-framework-4.1.0.jar:4.1.0]
    at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:107) ~[curator-framework-4.1.0.jar:4.1.0]
    at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:67) ~[curator-framework-4.1.0.jar:4.1.0]
    at org.apache.druid.curator.inventory.CuratorInventoryManager.getZkDataForNode(CuratorInventoryManager.java:177) [druid-server-0.13.0-ad.jar:0.13.0-ad]
    at org.apache.druid.curator.inventory.CuratorInventoryManager.access$400(CuratorInventoryManager.java:58) [druid-server-0.13.0-ad.jar:0.13.0-ad]
    at org.apache.druid.curator.inventory.CuratorInventoryManager$ContainerCacheListener$InventoryCacheListener.childEvent(CuratorInventoryManager.java:402) [druid-server-0.13.0-ad.jar:0.13.0-ad]
    at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:538) [curator-recipes-4.1.0.jar:4.1.0]
    at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:532) [curator-recipes-4.1.0.jar:4.1.0]
    at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-4.1.0.jar:4.1.0]
    at org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:435) [curator-client-4.1.0.jar:?]
    at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-4.1.0.jar:4.1.0]
    at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:530) [curator-recipes-4.1.0.jar:4.1.0]
    at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-4.1.0.jar:4.1.0]
    at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:808) [curator-recipes-4.1.0.jar:4.1.0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
09/Jun/2019 20:49:42,970 [ServerInventoryView-0] INFO org.apache.druid.curator.inventory.CuratorInventoryManager - CuratorInventoryManager: Ignoring event: Type - CHILD_UPDATED , Path - /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60 , Version - 4
```

Can some PathChildrenCacheEvents be missed after the connection to ZooKeeper is lost? If not, how can we explain the exception above, in which the coordinator attempts to process an update for a node that does not exist?
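
To make the question concrete, here is a minimal plain-Java sketch of one scenario that might produce this log without any event being lost. It is only a simulation and an assumption about the cause, not Curator or Druid code: the `znodes` map, the `pending` queue, and the class name `StaleEventDemo` are hypothetical stand-ins. The only thing taken from the stack trace is that the listener calls `getData` when the event is delivered, so a delayed `CHILD_UPDATED` event may find the node already deleted.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

public class StaleEventDemo {

    /** Replays a delayed event queue against a toy node store and returns what the listener logged. */
    static List<String> run() {
        // Hypothetical stand-in for the ZooKeeper tree: path -> data.
        Map<String, String> znodes = new HashMap<>();
        // Hypothetical stand-in for the cache's queue of undelivered events: {type, path}.
        Queue<String[]> pending = new ArrayDeque<>();
        List<String> log = new ArrayList<>();

        String path = "/druid/segments/host:8101/some-node"; // illustrative path only
        znodes.put(path, "payload");

        // 1. The node changes; an update event is queued but not yet delivered
        //    (for example, because the JVM is stuck in a long GC pause).
        pending.add(new String[] {"CHILD_UPDATED", path});

        // 2. Before the listener runs, the node is deleted and a removal event is queued.
        znodes.remove(path);
        pending.add(new String[] {"CHILD_REMOVED", path});

        // 3. The listener finally drains the queue in order.
        String[] event;
        while ((event = pending.poll()) != null) {
            String type = event[0];
            String eventPath = event[1];
            if (type.equals("CHILD_UPDATED")) {
                // The listener re-reads the node at delivery time, not at event time.
                String data = znodes.get(eventPath);
                if (data == null) {
                    log.add("NoNode, ignoring " + type + " for " + eventPath);
                } else {
                    log.add("updated " + eventPath);
                }
            } else {
                log.add("removed " + eventPath);
            }
        }
        return log;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this sketch the update event is skipped because the node is already gone, and the removal event is still delivered afterwards. Whether the same ordering holds in the real cluster after a session timeout is exactly what we are unsure about.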
