[ https://issues.apache.org/jira/browse/ZOOKEEPER-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444568#comment-16444568 ]
Daniel C commented on ZOOKEEPER-3018: ------------------------------------- [~andorm] Uploaded ZK logs (2 servers cluster) for the first incident: [^zk-3018.zip] > Ephemeral node not deleted after session is gone > ------------------------------------------------ > > Key: ZOOKEEPER-3018 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3018 > Project: ZooKeeper > Issue Type: Bug > Components: server > Affects Versions: 3.4.6 > Environment: Linux 4.1.12-112.14.10.el6uek.x86_64 #2 SMP x86_64 > GNU/Linux > Reporter: Daniel C > Priority: Major > Attachments: zk-3018.zip > > > We have a live Zookeeper environment (quorum size is 2) and observed a > strange behavior: > * Kafka created 2 ephemeral nodes /brokers/ids/822712429 and > /brokers/ids/707577499 on 2018-03-12 03:30:36.933 > * The Kafka clients were long gone but as of today (20+ days after), the two > ephemeral nodes are still present > > Troubleshooting: > 1) Lists the outstanding sessions and ephemeral nodes > > {noformat} > $ echo dump | nc $SERVER1 2181 > SessionTracker dump: > org.apache.zookeeper.server.quorum.LearnerSessionTracker@6d7fd863 > ephemeral nodes dump: > Sessions with Ephemerals (2): > 0x162183ea9f70003: > /brokers/ids/822712429 > 0x162183ea9f70002: > /brokers/ids/707577499 > /controller > {noformat} > > > 2) stat on /brokers/ids/822712429 > > {noformat} > zk> stat /brokers/ids/822712429 > czxid: 4294967344 > mzxid: 4294967344 > pzxid: 4294967344 > ctime: 1520825436933 (2018-03-11T20:30:36.933-0700) > mtime: 1520825436933 (2018-03-11T20:30:36.933-0700) > version: 0 > cversion: 0 > aversion: 0 > owner: 99668799174148099 > datalen: 102 > children: 0 > {noformat} > > > 3) List full connection/session details for all clients connected > > {noformat} > $ echo cons | nc $SERVER1 2181 > /10.247.114.70:30401[0](queued=0,recved=1,sent=0) > > /10.248.88.235:40430[1](queued=0,recved=345,sent=345,sid=0x162183ea9f70c22,lop=PING,est=1522713395028,to=40000,lcxid=0x12,lzxid=0xffffffffffffffff,lresp=1522717802117,llat=0,minlat=0,avglat=0,maxlat=31) > {noformat} > > > > {noformat} > $ echo cons | nc $SERVER2 2181 > /10.196.18.61:28173[0](queued=0,recved=1,sent=0) > > /10.247.114.69:42679[1](queued=0,recved=73800,sent=73800,sid=0x262183eaa21da96,lop=PING,est=1522651352906,to=9000,lcxid=0xe49f,lzxid=0x10004683d,lresp=1522717854847,llat=0,minlat=0,avglat=0,maxlat=1235) > {noformat} > > > 4) health > > {noformat} > $ echo mntr | nc $SERVER1 2181 > zk_version 3.4.6-1569965, built on 02/20/2014 09:09 GMT > zk_avg_latency 0 > zk_max_latency 443 > zk_min_latency 0 > zk_packets_received 11158019 > zk_packets_sent 11158244 > zk_num_alive_connections 2 > zk_outstanding_requests 0 > zk_server_state follower > zk_znode_count 344 > zk_watch_count 0 > zk_ephemerals_count 3 > zk_approximate_data_size 36654 > zk_open_file_descriptor_count 33 > zk_max_file_descriptor_count 65536 > {noformat} > > > 5) Server logs with related sessions: > {noformat} > Only found these logs from Server1 related to the sessions (0x162183ea9f70002 > and 0x162183ea9f70003): > 2018-03-12 03:28:35,127 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - > Accepted socket connection from /10.196.18.60:26775 > 2018-03-12 03:28:35,131 [myid:1] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection > request from old client /10.196.18.60:26775; will be dropped if server is in > r-o mode > 2018-03-12 03:28:35,131 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client > attempting to establish new session at /10.196.18.60:26775 > 2018-03-12 03:28:35,137 [myid:1] - INFO > [CommitProcessor:1:ZooKeeperServer@617] - Established session > 0x162183ea9f70002 with negotiated timeout 9000 for client /10.196.18.60:26775 > > 2018-03-12 03:30:36,415 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - > Accepted socket connection from /10.247.114.70:39260 > 2018-03-12 03:30:36,422 [myid:1] - WARN > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@822] - Connection > request from old client /10.247.114.70:39260; will be dropped if server is in > r-o mode > 2018-03-12 03:30:36,423 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@868] - Client > attempting to establish new session at /10.247.114.70:39260 > 2018-03-12 03:30:36,428 [myid:1] - INFO > [CommitProcessor:1:ZooKeeperServer@617] - Established session > 0x162183ea9f70003 with negotiated timeout 9000 for client /10.247.114.70:39260 > > 2018-03-31 01:29:58,865 [myid:1] - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed > socket connection for client /10.247.114.70:39260 which had sessionid > 0x162183ea9f70003{noformat} > 6) Txn logs on the two ephemeral nodes /brokers/ids/707577499 and > /brokers/ids/822712429: > {noformat} > 3/11/18 8:28:35 PM PDT session 0x162183ea9f70002 cxid 0x6 zxid 0x10000001b > create '/brokers/ids,,v{s{31,s{'world,'anyone}}},F,1 > 3/11/18 8:28:35 PM PDT session 0x162183ea9f70002 cxid 0x2c zxid 0x100000028 > create > '/brokers/ids/707577499,#7b226a6d785f706f7274223a31303130332c2274696d657374616d70223a2231353230383235333135363931222c22686f7374223a22736c6331336e79692e75732e6f7261636c652e636f6d222c2276657273696f6e223a312c22706f7274223a393039327d,v{s{31,s{'world,'anyone}}},T,1 > 3/11/18 8:30:36 PM PDT session 0x162183ea9f70003 cxid 0x14 zxid 0x100000030 > create > '/brokers/ids/822712429,#7b226a6d785f706f7274223a31303130332c2274696d657374616d70223a2231353230383235343336393139222c22686f7374223a22736c6331336e796a2e75732e6f7261636c652e636f6d222c2276657273696f6e223a312c22706f7274223a393039327d,v{s{31,s{'world,'anyone}}},T,2{noformat} > > 7) Additional questions from [~andorm] > {noformat} > 1) Why is the session closed, the client closed it or the cluster expired it? > [Daniel Chan] in this case, the client got killed and we expect the session > would be expired by the cluster > > 2) which server was the session attached to - the first (44sec max > lat) or one of the others? Which server was the leader? > [Daniel Chan] The sessions creating the ephemeral nodes were attached to > Server1 (443 max latency) while Server2 is the leader > > 3) the znode exists on all 4 servers, is that right? > [Daniel Chan] The cluster has 2 members not 4, and the ephemeral nodes are > present on both servers > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)