Hi,

I am using a storm-kafka-client jar that I built from source, since I need support for Kafka 0.10.1.1. I am testing how my system behaves if one of the workers is killed while my topology is running.

Configuration used: a topology with the storm-kafka-client spout and one bolt that processes the messages. The topology runs on two workers (say W1 and W2); the spout runs on W1 and the bolt on W2. The tuple timeout is 30 seconds. For testing purposes, storm-kafka-client's maxUncommittedOffsets is set to 40 and Storm's maxSpoutPending is set to 20.
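Roughly, the setup looks like this (a minimal sketch; the broker address, topic name, and class name are placeholders, and the exact builder API may differ slightly depending on which storm-kafka-client snapshot you build):

import org.apache.storm.Config;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;

public class SpoutConfigSketch {
    public static void main(String[] args) {
        // Spout-side limit: at most 40 polled offsets may be pending commit at once.
        KafkaSpoutConfig<String, String> spoutConfig =
                KafkaSpoutConfig.builder("kafka-broker:9092", "my-topic") // placeholders
                        .setMaxUncommittedOffsets(40)
                        .build();
        KafkaSpout<String, String> spout = new KafkaSpout<>(spoutConfig);

        // Topology-side limits.
        Config conf = new Config();
        conf.setNumWorkers(2);           // spout ends up on W1, bolt on W2
        conf.setMessageTimeoutSecs(30);  // tuple timeout of 30 seconds
        conf.setMaxSpoutPending(20);     // half of maxUncommittedOffsets
    }
}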
As the topology starts, the spout reads messages from Kafka and forwards them to the bolt. While the bolt is processing tuples, I kill W2 on purpose to simulate a crash. While W2 is restarting, the spout keeps reading messages from Kafka and emitting them, so the tuples that were in flight on W2 are lost (e.g. tuples 2-5). Once W2 is back up, the following tuples (6 onwards) reach the bolt and are acknowledged. Before the lost tuples are failed back to the spout, the spout reaches maxUncommittedOffsets and keeps logging that it has hit the maximum uncommitted limit and will not read any new messages from Kafka. From that point on, even when Storm fails the lost tuples, the spout never replays them, and it also never reads new messages from Kafka. This is a big issue for us because the topology stays stuck until I restart it.

Another issue: although I set maxSpoutPending to half the value of maxUncommittedOffsets, storm-kafka-client polled a number of records equal to maxUncommittedOffsets instead of maxSpoutPending.

Can anyone please help me with this?

PS: While debugging we found that once the number of uncommitted offsets becomes greater than or equal to maxUncommittedOffsets, the spout stops polling Kafka, and it is only in the poll path that it asks the retry service, via doSeekRetriableTopicPartitions, whether there are any messages to replay; see the sketch below.
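Roughly, the gating we observed looks like this (a simplified paraphrase of what we saw while debugging, not the actual storm-kafka-client source; names are illustrative):

public class PollGateSketch {
    private final int maxUncommittedOffsets = 40;
    private int numUncommittedOffsets; // grows as tuples are emitted but not yet committed
    private boolean waitingToEmit;     // records already fetched but not yet emitted

    // If this returns false, the spout neither polls Kafka for new records nor
    // reaches the code path that seeks back to the offsets of failed tuples
    // (doSeekRetriableTopicPartitions), so the failed tuples are never replayed.
    boolean shouldPollKafka() {
        return !waitingToEmit && numUncommittedOffsets < maxUncommittedOffsets;
    }
}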
Regards,
Punit Tiwan