[ https://issues.apache.org/jira/browse/ZOOKEEPER-737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102882#comment-13102882 ]
Camille Fournier commented on ZOOKEEPER-737:
--------------------------------------------

Yeah, I don't know anything about the difference between nc and telnet, or what zkdashboard is using, but this is with interactive telnet (reproducing a problem we see in zkdashboard). It's reproducible but tricky. If I run stat/dump from a remote server against a leader with a lot of connections/ephemerals, it reliably fails: it prints out some of the data and then closes the connection suddenly. For example, the end of a dump command:

14 expire at Mon Sep 12 14:14:04 EDT 2011:
    0x1325208b8c90089
    0x2325208b8c30099
    0x4325208b8ca0113
    0x1325208b8c900cd
    0x2325208b8c3009d
    0x5325211469900ae
    0x2325208b8c30091
    0x4325208b8ca00b7
    0x1325208b8c90094
    0x1325208b8c90090
    0x5325211469900d7
    0x5325211469900a4
    0x2325208b8c3009e
    0x2325208b8c3009c
0 expire at Mon Sep 12 14:14:10 EDT 2011:Connection closed by foreign host.

I tried a bit of debugging against the running server. A breakpoint anywhere inside the dump thread before the exit of closeSock() makes the problem go away, but with a breakpoint at the exit of closeSock() the problem still shows up.

"A lot" of ephemerals/sessions here means ~120 sessions and 88 ephemerals. Hardly thousands.

> some 4 letter words may fail with netcat (nc)
> ---------------------------------------------
>
>                 Key: ZOOKEEPER-737
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-737
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.0
>            Reporter: Patrick Hunt
>            Assignee: Mahadev konar
>            Priority: Blocker
>             Fix For: 3.3.1, 3.4.0
>
>         Attachments: ZOOKEEPER-737.patch, ZOOKEEPER-737.patch,
>                      ZOOKEEPER-737.patch, ZOOKEEPER-737.patch, ZOOKEEPER-737.patch,
>                      ZOOKEEPER-737.patch, ZOOKEEPER-737.patch
>
>
> nc closes the write channel as soon as it has sent its information, for
> example "echo stat|nc localhost 2181".
> in general this is fine, however the server code will close the socket as
> soon as it receives notice that nc has closed its write channel. if not all
> of the 4 letter word result has been written back to the client yet, this
> will cause some or all of the result to be lost, i.e. the client will not
> see the full result. this was introduced in 3.3.0 as part of a change to
> reduce blocking of the selector by long running 4 letter words.
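
For anyone trying to reproduce this without nc, the half-close described in the quoted report can be sketched in a few lines of Python. This is only an illustration under assumptions, not ZooKeeper code: `slow_server` is a hypothetical stand-in that keeps writing after it sees EOF from the client (the behaviour the report expects), and `half_close_query` mimics what the report says nc does, i.e. send the command, shut down the write side, then read until the server closes. All function names and the 10000-byte reply are made up for the demo.

```python
import socket
import threading
import time


def slow_server(listener: socket.socket, reply: bytes) -> None:
    """Hypothetical stand-in server: accept one client, read until EOF, then reply."""
    conn, _ = listener.accept()
    with conn:
        request = b""
        while True:
            chunk = conn.recv(1024)
            if not chunk:  # EOF: the client has shut down its write side
                break
            request += chunk
        time.sleep(0.1)  # simulate a slow four-letter-word handler
        # The connection is only half-closed, so the server can still write.
        conn.sendall(request + b": " + reply)


def half_close_query(host: str, port: int, command: bytes) -> bytes:
    """Send a command, half-close like nc in the report, and read the full reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(command)
        sock.shutdown(socket.SHUT_WR)  # close only the write channel
        chunks = []
        while True:
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection: reply is complete
                break
            chunks.append(chunk)
        return b"".join(chunks)


def demo() -> bytes:
    """Run the stand-in server on an ephemeral local port and query it once."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        thread = threading.Thread(target=slow_server, args=(listener, b"x" * 10000))
        thread.start()
        try:
            return half_close_query("127.0.0.1", port, b"stat")
        finally:
            thread.join()
```

Pointing `half_close_query` at a real server (for example host `localhost`, port 2181, command `b"stat"`) and comparing the length of what comes back against a telnet session is one way to check whether output is being truncated. A server that closes the socket as soon as it reads EOF, as described above, would return a short or empty reply here.
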
> here's an example of the logs from the server during this
>
> echo -n stat | nc localhost 2181
> 2010-04-09 21:55:36,124 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /127.0.0.1:42179
> 2010-04-09 21:55:36,124 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@968] - Processing stat command from /127.0.0.1:42179
> 2010-04-09 21:55:36,125 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@606] - EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
> 2010-04-09 21:55:36,125 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1286] - Closed socket connection for client /127.0.0.1:42179 (no session established for client)
> [phunt@gsbl90850 zookeeper-3.3.0]$ 2010-04-09 21:55:36,126 - ERROR [Thread-15:NIOServerCnxn@422] - Unexpected Exception:
> java.nio.channels.CancelledKeyException
>     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>     at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:395)
>     at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.checkFlush(NIOServerCnxn.java:907)
>     at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.flush(NIOServerCnxn.java:945)
>     at java.io.BufferedWriter.flush(BufferedWriter.java:236)
>     at java.io.PrintWriter.flush(PrintWriter.java:276)
>     at org.apache.zookeeper.server.NIOServerCnxn$2.run(NIOServerCnxn.java:1089)
> 2010-04-09 21:55:36,126 - ERROR [Thread-15:NIOServerCnxn$Factory$1@82] - Thread Thread[Thread-15,5,main] died
> java.nio.channels.CancelledKeyException
>     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:64)
>     at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.wakeup(NIOServerCnxn.java:927)
>     at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.checkFlush(NIOServerCnxn.java:909)
>     at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.flush(NIOServerCnxn.java:945)
>     at java.io.BufferedWriter.flush(BufferedWriter.java:236)
>     at java.io.PrintWriter.flush(PrintWriter.java:276)
>     at org.apache.zookeeper.server.NIOServerCnxn$2.run(NIOServerCnxn.java:1089)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira