Hi, I've crafted a Flume source and a Flume sink that can consume/produce events from/to Kafka.
I have a 5-node Kafka cluster and Flume is happily consuming from it. Recently I did some maintenance on one of the Kafka nodes, which involved a shutdown/restart. The following error appears in the Flume logs:

    2011-10-12 15:01:58,117 INFO kafka.consumer.SimpleConsumer: multifetch reconnect due to java.io.EOFException: Received -1 when reading from channel, socket has likely been closed.
    2011-10-12 15:01:58,122 ERROR kafka.consumer.FetcherRunnable: error in FetcherRunnable
    java.net.ConnectException: Connection refused
            at sun.nio.ch.Net.connect(Native Method)
            at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
            at kafka.consumer.SimpleConsumer.connect(SimpleConsumer.scala:51)
            at kafka.consumer.SimpleConsumer.liftedTree2$1(SimpleConsumer.scala:127)
            at kafka.consumer.SimpleConsumer.multifetch(SimpleConsumer.scala:119)
            at kafka.consumer.FetcherRunnable.run(FetcherRunnable.scala:63)
    2011-10-12 15:01:58,136 INFO kafka.consumer.FetcherRunnable: stopping fetcher FetchRunnable-4 to host aaa.bbb.ccc.ddd

So it seems the Kafka consumer attempted to reconnect to the Kafka node (aaa.bbb.ccc.ddd), but this failed (because I had shut down the node). Instead of entering a retry loop, the fetcher exits and never reconnects to the node when it comes back up. The immediate effect is that the Flume source misses all messages sent to the recently restarted Kafka node.

What is the correct way of handling such problems? Is there a flaw in the way reconnection is attempted in FetcherRunnable?

Mathas.
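P.S. To make it concrete, here is a rough sketch (in Java, untested) of the kind of retry-with-backoff loop I would have expected around the fetch. fetchOnce() is a hypothetical stand-in for the connect-and-multifetch step, not the real Kafka code (which is Scala); the point is only that a ConnectException leads to a backoff and retry instead of stopping the fetcher for good:

    import java.io.IOException;
    import java.net.ConnectException;

    public class RetryingFetcher implements Runnable {
        private static final long MAX_BACKOFF_MS = 30000L;
        private volatile boolean stopped = false;

        public void run() {
            long backoffMs = 1000L;
            while (!stopped) {
                try {
                    fetchOnce();         // would wrap SimpleConsumer.multifetch(...)
                    backoffMs = 1000L;   // reset backoff after a successful fetch
                } catch (ConnectException e) {
                    // Broker is down (e.g. during maintenance): back off
                    // exponentially and retry rather than exiting.
                    sleepQuietly(backoffMs);
                    backoffMs = Math.min(backoffMs * 2, MAX_BACKOFF_MS);
                } catch (IOException e) {
                    // Other I/O errors (EOF on a closed socket, etc.):
                    // retry on the next iteration.
                    sleepQuietly(backoffMs);
                }
            }
        }

        // Hypothetical: connect if needed, issue a fetch, hand the
        // messages to the Flume source.
        private void fetchOnce() throws IOException {
        }

        private void sleepQuietly(long ms) {
            try {
                Thread.sleep(ms);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                stopped = true;
            }
        }
    }

With something like this, the fetcher would pick the node up again as soon as it came back, instead of staying stopped until Flume is restarted.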