viongpanzi opened a new issue #7893: Will some PathChildrenCacheEvents be missed after the connection to ZK is disconnected?
URL: https://github.com/apache/incubator-druid/issues/7893
 
 
   hi, all~
   
   We have a problem!
   
   The information about our prod cluster:
   
   version: 0.13.0
   number of segments: more than 6 million
   GC: G1 (a single full GC takes more than 120 seconds)
   incremental poll is enabled
   
   
   After each full GC (which takes more than 120 seconds), the connection from 
one coordinator to ZooKeeper is dropped due to a timeout. The other 
coordinator soon becomes the leader, and it too goes into a full GC after 
polling all data segments from the metadata store. Its connection to ZooKeeper 
is then dropped as well, and the two coordinators are trapped in a loop. 
However, if we restart both coordinator services, they work well for days.
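
   The timing behind this loop can be illustrated with a toy model. This is 
only a sketch, not ZooKeeper's or Curator's actual implementation: the class 
and method names are hypothetical, the 30-second session timeout is an assumed 
value (the configured timeout is not stated here), and heartbeats are 
simplified to one every third of the timeout.

```java
/** Toy timing model; NOT ZooKeeper's or Curator's actual implementation. */
final class GcPauseSessionSketch {

    /**
     * Steps through time in 1-second ticks. The client sends a heartbeat every
     * timeoutMs / 3 unless it is inside the stop-the-world pause. Returns the
     * first time at which the gap since the last heartbeat exceeds the session
     * timeout, or -1 if the session survives until horizonMs.
     */
    static long expiryTimeMs(long pauseStartMs, long pauseMs,
                             long timeoutMs, long horizonMs) {
        long heartbeatEveryMs = timeoutMs / 3; // simplification, see lead-in
        long lastHeartbeatMs = 0;
        for (long now = 0; now <= horizonMs; now += 1_000) {
            boolean paused = now >= pauseStartMs && now < pauseStartMs + pauseMs;
            if (!paused && now % heartbeatEveryMs == 0) {
                lastHeartbeatMs = now; // heartbeat reaches the server
            }
            if (now - lastHeartbeatMs > timeoutMs) {
                return now; // server would consider the session expired
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long assumedTimeoutMs = 30_000; // assumption, not taken from the cluster
        // 120-second pause, as reported for the full GC
        System.out.println(expiryTimeMs(15_000, 120_000, assumedTimeoutMs, 200_000));
        // a pause shorter than the timeout, for comparison
        System.out.println(expiryTimeMs(15_000, 20_000, assumedTimeoutMs, 200_000));
    }
}
```

   Under these assumptions, any pause longer than the session timeout expires 
the session well before the pause ends, which matches the leader flip-flop 
described above.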
   
   To find the cause, we used MAT (Eclipse Memory Analyzer Tool) to analyze a 
heap dump from one of the two coordinators, and it reported the following:
   
   
![image](https://user-images.githubusercontent.com/8834263/59492148-eba48600-8eba-11e9-91c6-30e0d0a465f7.png)
   
   
   After tracing the call stack to zNodes and checking the coordinator service 
logs, we found some log entries about ZooKeeper node events that may indicate 
a problem.
   
   ```
   09/Jun/2019 20:49:42,970 [ServerInventoryView-0] WARN  org.apache.druid.curator.inventory.CuratorInventoryManager - CuratorInventoryManager: Exception while getting data for node /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60
   org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60
           at org.apache.zookeeper.KeeperException.create(KeeperException.java:114) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
           at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
           at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1215) ~[zookeeper-3.4.11.jar:3.4.11-37e277162d567b55a07d1755f0b31c32e93c01a0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:327) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:316) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64) ~[curator-client-4.1.0.jar:?]
           at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100) ~[curator-client-4.1.0.jar:?]
           at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:313) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:304) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:107) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.imps.GetDataBuilderImpl$1.forPath(GetDataBuilderImpl.java:67) ~[curator-framework-4.1.0.jar:4.1.0]
           at org.apache.druid.curator.inventory.CuratorInventoryManager.getZkDataForNode(CuratorInventoryManager.java:177) [druid-server-0.13.0-ad.jar:0.13.0-ad]
           at org.apache.druid.curator.inventory.CuratorInventoryManager.access$400(CuratorInventoryManager.java:58) [druid-server-0.13.0-ad.jar:0.13.0-ad]
           at org.apache.druid.curator.inventory.CuratorInventoryManager$ContainerCacheListener$InventoryCacheListener.childEvent(CuratorInventoryManager.java:402) [druid-server-0.13.0-ad.jar:0.13.0-ad]
           at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:538) [curator-recipes-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:532) [curator-recipes-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$DirectExecutor.execute(MoreExecutors.java:435) [curator-client-4.1.0.jar:?]
           at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:530) [curator-recipes-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-4.1.0.jar:4.1.0]
           at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:808) [curator-recipes-4.1.0.jar:4.1.0]
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
           at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
           at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_131]
           at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_131]
           at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
           at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
           at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
   09/Jun/2019 20:49:42,970 [ServerInventoryView-0] INFO  org.apache.druid.curator.inventory.CuratorInventoryManager - CuratorInventoryManager: Ignoring event: Type - CHILD_UPDATED , Path - /druid/segments/host:8101/host:8101_indexer-executor__default_tier_2019-06-09T18:45:43.225Z_e8206828c4ba4a5799956bde201eceb60 , Version - 4
   ```
   
   Can some PathChildrenCacheEvents be missed after the connection to 
ZooKeeper is lost? If not, how can we explain the exception above, in which 
the coordinator attempts to update a node that does not exist?
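
   For reference, the stack trace shows the listener calling 
getZkDataForNode (a fresh getData against ZooKeeper) while handling 
CHILD_UPDATED. One possible reading, which is not confirmed for this cluster, 
is that the event was handled late and the node had already been deleted by 
then. The toy model below sketches that ordering. It is not Curator's or 
Druid's code: FakeZk, drain, run, and the node path are hypothetical, and plain 
strings stand in for the event types.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy model of a watch-driven cache; NOT Curator's or Druid's actual code. */
final class StaleEventRaceSketch {

    /** Hypothetical stand-in for server-side state: path -> data. */
    static final class FakeZk {
        private final Map<String, String> nodes = new HashMap<>();

        void put(String path, String data) { nodes.put(path, data); }

        void delete(String path) { nodes.remove(path); }

        /** Mimics a getData call: fails if the node no longer exists. */
        String getData(String path) {
            String data = nodes.get(path);
            if (data == null) {
                throw new IllegalStateException("NoNode for " + path);
            }
            return data;
        }
    }

    /** Handles queued events in order, re-reading data at handling time. */
    static List<String> drain(Queue<String[]> events, FakeZk zk) {
        List<String> log = new ArrayList<>();
        for (String[] e; (e = events.poll()) != null; ) {
            String type = e[0];
            String path = e[1];
            if (type.equals("CHILD_UPDATED")) {
                try {
                    log.add("updated " + path + " -> " + zk.getData(path));
                } catch (IllegalStateException ex) {
                    // The node vanished between the event and the read.
                    log.add("ignored CHILD_UPDATED: " + ex.getMessage());
                }
            } else {
                log.add("removed " + path);
            }
        }
        return log;
    }

    static List<String> run() {
        FakeZk zk = new FakeZk();
        Queue<String[]> events = new ArrayDeque<>();
        String path = "/druid/segments/host:8101/node-1"; // hypothetical path

        zk.put(path, "v1");
        zk.put(path, "v2");
        events.add(new String[] {"CHILD_UPDATED", path});
        // The listener is stalled here; the node is deleted before it runs.
        zk.delete(path);
        events.add(new String[] {"CHILD_REMOVED", path});

        return drain(events, zk);
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

   In this model both events are delivered and none is missed, yet the 
CHILD_UPDATED handler still hits a NoNode error because it reads current state 
rather than the state at the time of the event. Whether this ordering is what 
happened here, or whether events were really lost across the disconnect, still 
needs to be checked.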
