Dmitry created KAFKA-10013:
------------------------------

             Summary: Consumer hang-up in case of unclean leader election
                 Key: KAFKA-10013
                 URL: https://issues.apache.org/jira/browse/KAFKA-10013
             Project: Kafka
          Issue Type: Bug
          Components: consumer
    Affects Versions: 2.4.1, 2.5.0, 2.3.1, 2.4.0, 2.3.0
            Reporter: Dmitry


Starting from kafka 2.3 new offset reset negotiation algorithm added 
(org.apache.kafka.clients.consumer.internals.Fetcher#validateOffsetsAsync)

During this validation, Fetcher 
`org.apache.kafka.clients.consumer.internals.SubscriptionState` is held in 
`AWAIT_VALIDATION` fetch state.

This effectively means that fetch requests are not issued and consumption 
stopped.
In case if unclean leader election is happening during this time, 
`LogTruncationException` is thrown from future listener in method 
`validateOffsetsAsync` (probably in order to turn on the logic defined by 
`auto.offset.reset` parameter).

The main problem is that this exception (thrown from listener of future) is 
effectively swallowed by 
`org.apache.kafka.clients.consumer.internals.AsyncClient#sendAsyncRequest`
by this part of code


} catch (RuntimeException e) {
  if (!future.isDone()) {
    future.raise(e);
  }
}

In the end the result is: The only way to get out of AWAIT_VALIDATION and 
continue consumption is to successfully finish validation, but it can not be 
finished.
However - consumer is alive, but is consuming nothing. The only way to resume 
consumption is to terminate consumer and start another one.

We discovered this situation by means of kstreams application, where valid 
value of `auto.offset.reset` provided by our code is replaced by `None` value 
for a purpose of position reset 
(org.apache.kafka.streams.processor.internals.StreamThread#create).
And with kstreams it is even worse, as application may be working, logging warn 
messages of format `Truncation detected for partition ...,` but data is not 
generated for a long time and in the end is lost, making kstreams application 
unreliable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to