This might help explain the root cause. I found two consumer parameters that appear to be correlated with the broker issues. Our setup spans data centers, so we followed some of the advice on the mirroring wiki page, https://cwiki.apache.org/confluence/display/KAFKA/Kafka+mirroring. In particular, we increased "socket.buffersize" to 655360 and "fetch.size" to 3072000 on our consumers. Since removing those two overrides and letting the defaults take effect, our brokers haven't died once in well over a day. Before that, they were dying every hour.
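For anyone who wants to try the same change, here is a rough sketch of what "removing the overrides" amounts to on the consumer side. It uses only java.util.Properties and no Kafka classes. The class name ConsumerOverrides, the method withoutTuning, and the "groupid" value are hypothetical and only there for illustration; the two property names and values are the ones mentioned above. How the resulting properties are passed to the consumer depends on your own setup.

```java
import java.util.Properties;

// Sketch only: plain java.util.Properties, no Kafka classes involved.
// Class and method names are hypothetical.
final class ConsumerOverrides {

    // The two consumer settings named in this message.
    static final String SOCKET_BUFFER = "socket.buffersize";
    static final String FETCH_SIZE = "fetch.size";

    // Returns a copy of the given properties with the two overrides removed,
    // so that whatever defaults the consumer ships with apply again.
    static Properties withoutTuning(Properties original) {
        Properties copy = new Properties();
        for (String name : original.stringPropertyNames()) {
            copy.setProperty(name, original.getProperty(name));
        }
        copy.remove(SOCKET_BUFFER);
        copy.remove(FETCH_SIZE);
        return copy;
    }

    public static void main(String[] args) {
        Properties tuned = new Properties();
        tuned.setProperty("groupid", "example-group"); // placeholder value
        tuned.setProperty(SOCKET_BUFFER, "655360");    // value we had set
        tuned.setProperty(FETCH_SIZE, "3072000");      // value we had set

        // Only the untouched settings remain after stripping the overrides.
        Properties defaultsOnly = withoutTuning(tuned);
        System.out.println(defaultsOnly.stringPropertyNames());
    }
}
```

The original properties object is left untouched, so the tuned and untuned configurations can be compared side by side while testing.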

On Jun 11, 2012, at 10:28 AM, Aaron Rankin wrote:

> Jun,
>
> I was using the github mirror, which appears to be active. The last commit
> there is the same as with the Apache Git mirror
> (2a59ad76c657e4aad8ee6ca67078f49d2f6017c9).
>
>
> Aaron
>
>
> On Jun 11, 2012, at 12:05 AM, Jun Rao wrote:
>
>> Aaron,
>>
>> Which Git did you try, github or the Apache git mirror? Kafka has moved to
>> Apache. So please try the 0.7 release in Apache.
>>
>> The error you saw are from ZK. Do you see lots of ZK session expiration in
>> your log?
>>
>> Thanks,
>>
>> Jun
>>
>> On Sat, Jun 9, 2012 at 8:34 AM, Aaron Rankin <aa...@sproutsocial.com> wrote:
>>
>>> Hi,
>>>
>>> We're testing Kafka and have found that the process dies often. There's
>>> little to no indication of why. We're running the latest code from Git,
>>> which we built using the instructions there. We're also running Zookeeper
>>> 3.3.5. Our setup has three brokers, producers running on the same network
>>> and consumers in another data center, a 30ms Internet ping away.
>>>
>>> Does anyone have some intuition about why this is happening?
>>>
>>> The only stack trace we're seeing is coming from Zookeeper:
>>>
>>> 1193285089 [CommitProcessor:2] ERROR
>>> org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:55)
>>>     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:59)
>>>     at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
>>>     at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
>>>     at org.apache.zookeeper.server.NIOServerCnxn.process(NIOServerCnxn.java:1545)
>>>     at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:115)
>>>     at org.apache.zookeeper.server.WatchManager.triggerWatch(WatchManager.java:87)
>>>     at org.apache.zookeeper.server.DataTree.deleteNode(DataTree.java:577)
>>>     at org.apache.zookeeper.server.DataTree.killSession(DataTree.java:829)
>>>     at org.apache.zookeeper.server.DataTree.processTxn(DataTree.java:804)
>>>     at org.apache.zookeeper.server.ZKDatabase.processTxn(ZKDatabase.java:328)
>>>     at org.apache.zookeeper.server.ZooKeeperServer.processTxn(ZooKeeperServer.java:715)
>>>     at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:107)
>>>     at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)
>>>
>>>
>>> Also, we constantly are seeing these in the logs:
>>>
>>> 1193365748 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
>>> org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for
>>> client /127.0.0.1:53426 (no session established for client)
>>> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] INFO
>>> org.apache.zookeeper.server.NIOServerCnxn - Accepted socket connection
>>> from /127.0.0.1:53428
>>> 1193425755 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] WARN
>>> org.apache.zookeeper.server.NIOServerCnxn - EndOfStreamException: Unable
>>> to read additional data from client sessionid 0x0, likely client has closed
>>> socket