[ https://issues.apache.org/jira/browse/ZOOKEEPER-737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102882#comment-13102882 ]
Camille Fournier commented on ZOOKEEPER-737:
--------------------------------------------
Yeah, I don't know anything about the difference between nc and telnet, or
what zkdashboard is using, but this is with interactive telnet (reproducing a
problem we see in zkdashboard). It's reproducible but tricky. If I run
stat/dump from a remote server against a leader with a lot of
connections/ephemerals, it reliably fails: it prints some of the data and
then closes the connection abruptly. For example, the end of a dump command:
14 expire at Mon Sep 12 14:14:04 EDT 2011:
0x1325208b8c90089
0x2325208b8c30099
0x4325208b8ca0113
0x1325208b8c900cd
0x2325208b8c3009d
0x5325211469900ae
0x2325208b8c30091
0x4325208b8ca00b7
0x1325208b8c90094
0x1325208b8c90090
0x5325211469900d7
0x5325211469900a4
0x2325208b8c3009e
0x2325208b8c3009c
0 expire at Mon Sep 12 14:14:10 EDT 2011:Connection closed by foreign host.
I tried a bit of debugging against the running server. A breakpoint anywhere
inside the dump thread before the exit of closeSock() makes the problem go
away, but with a breakpoint at the exit of closeSock() the problem still
occurs.
Note that "a lot" of ephemerals/sessions here means only ~120 sessions and 88
ephemerals. Hardly thousands.
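
For anyone who wants to compare the nc-style and telnet-style cases from a
script rather than by hand, something like the following might help. It is
only a sketch under assumptions: send_four_letter_word and start_stub_server
are made-up names, not part of ZooKeeper, and the stub server is just a
stand-in that finishes writing its reply even after the client has closed its
write side.

```python
import socket
import threading


def send_four_letter_word(host, port, word, half_close=True, timeout=5.0):
    """Send a four-letter word and return everything the server writes back.

    half_close=True mimics nc: the write side is shut down right after
    sending. half_close=False mimics an interactive telnet session, where
    the write side stays open until the server closes the connection.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        if half_close:
            sock.shutdown(socket.SHUT_WR)  # we are done writing, still reading
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed its side
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


def start_stub_server(reply):
    """Hypothetical stand-in for a server (NOT ZooKeeper).

    Accepts one connection, reads the 4-byte command, writes `reply` in
    full, then closes. Returns the port it is listening on.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    def serve():
        conn, _ = listener.accept()
        with conn:
            command = b""
            while len(command) < 4:
                part = conn.recv(4 - len(command))
                if not part:
                    break
                command += part
            # Keep writing even if the peer has already half-closed.
            conn.sendall(reply)
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    return port


if __name__ == "__main__":
    expected = b"0x1325208b8c90089\n" * 500
    for mode in (True, False):
        stub_port = start_stub_server(expected)
        got = send_four_letter_word("127.0.0.1", stub_port, "dump", half_close=mode)
        print("half_close=%s complete=%s" % (mode, got == expected.decode("ascii")))
```

Pointed at a real server instead of the stub, for example
send_four_letter_word("localhost", 2181, "dump", half_close=False), a return
value that stops partway through the session list would correspond to the
truncated telnet output above.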
> some 4 letter words may fail with netcat (nc)
> ---------------------------------------------
>
> Key: ZOOKEEPER-737
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-737
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.3.0
> Reporter: Patrick Hunt
> Assignee: Mahadev konar
> Priority: Blocker
> Fix For: 3.3.1, 3.4.0
>
> Attachments: ZOOKEEPER-737.patch, ZOOKEEPER-737.patch,
> ZOOKEEPER-737.patch, ZOOKEEPER-737.patch, ZOOKEEPER-737.patch,
> ZOOKEEPER-737.patch, ZOOKEEPER-737.patch
>
>
> nc closes the write channel as soon as it has sent its information, for
> example "echo stat|nc localhost 2181".
> In general this is fine; however, the server code will close the socket as
> soon as it receives notice that nc has closed its write channel. If not all
> of the 4 letter word result has been written back to the client yet, this
> will cause some or all of the result to be lost, i.e. the client will not
> see the full result. This was introduced in 3.3.0 as part of a change to
> reduce blocking of the selector by long running 4 letter words.
> Here's an example of the logs from the server during this command:
> echo -n stat | nc localhost 2181
> 2010-04-09 21:55:36,124 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket connection from /127.0.0.1:42179
> 2010-04-09 21:55:36,124 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@968] - Processing stat command from /127.0.0.1:42179
> 2010-04-09 21:55:36,125 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@606] - EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client has closed socket
> 2010-04-09 21:55:36,125 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1286] - Closed socket connection for client /127.0.0.1:42179 (no session established for client)
> 2010-04-09 21:55:36,126 - ERROR [Thread-15:NIOServerCnxn@422] - Unexpected Exception:
> java.nio.channels.CancelledKeyException
> at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
> at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
> at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:395)
> at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.checkFlush(NIOServerCnxn.java:907)
> at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.flush(NIOServerCnxn.java:945)
> at java.io.BufferedWriter.flush(BufferedWriter.java:236)
> at java.io.PrintWriter.flush(PrintWriter.java:276)
> at org.apache.zookeeper.server.NIOServerCnxn$2.run(NIOServerCnxn.java:1089)
> 2010-04-09 21:55:36,126 - ERROR [Thread-15:NIOServerCnxn$Factory$1@82] - Thread Thread[Thread-15,5,main] died
> java.nio.channels.CancelledKeyException
> at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
> at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:64)
> at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.wakeup(NIOServerCnxn.java:927)
> at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.checkFlush(NIOServerCnxn.java:909)
> at org.apache.zookeeper.server.NIOServerCnxn$SendBufferWriter.flush(NIOServerCnxn.java:945)
> at java.io.BufferedWriter.flush(BufferedWriter.java:236)
> at java.io.PrintWriter.flush(PrintWriter.java:276)
> at org.apache.zookeeper.server.NIOServerCnxn$2.run(NIOServerCnxn.java:1089)