Github user jfenc91 commented on a diff in the pull request:

    https://github.com/apache/storm/pull/1679#discussion_r79662312
  
    --- Diff: external/storm-kafka-client/src/main/java/org/apache/storm/kafka/spout/KafkaSpout.java ---
    @@ -479,16 +487,17 @@ public OffsetAndMetadata findNextCommitOffset() {
                 KafkaSpoutMessageId nextCommitMsg = null;     // this is a convenience variable to make it faster to create OffsetAndMetadata
     
                 for (KafkaSpoutMessageId currAckedMsg : ackedMsgs) {  // complexity is that of a linear scan on a TreeMap
    -                if ((currOffset = currAckedMsg.offset()) == initialFetchOffset || currOffset == nextCommitOffset + 1) {            // found the next offset to commit
    +                if ((currOffset = currAckedMsg.offset()) == nextCommitOffset + 1) {            // found the next offset to commit
                         found = true;
                         nextCommitMsg = currAckedMsg;
                         nextCommitOffset = currOffset;
                     } else if (currAckedMsg.offset() > nextCommitOffset + 1) {    // offset found is not continuous to the offsets listed to go in the next commit, so stop search
                         LOG.debug("topic-partition [{}] has non-continuous offset [{}]. It will be processed in a subsequent batch.", tp, currOffset);
                         break;
                     } else {
    -                    LOG.debug("topic-partition [{}] has unexpected offset [{}].", tp, currOffset);
    -                    break;
    +                    //Received a redundant ack. Ignore and continue processing.
    --- End diff --
    
    I saw this a couple of times before I figured it out. I have not taken the time to reproduce it in a toy/test case, but given the error message this is clearly a storm or storm-kafka-client issue. I got into this state within about 30 minutes of running a topology processing 300k-800k tuples a minute with about 10s of latency. The input to the topology was on the order of 2k-10k tuples per minute, with a bolt that split each input into multiple tuples. At startup there was a high number of failures after the split (I was making requests against an unwarmed ELB). I would guess that is enough to reproduce with random data/failures.

