[ 
https://issues.apache.org/jira/browse/KAFKA-12951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damien Gasparina updated KAFKA-12951:
-------------------------------------
    Description: 
We have encountered this issue a few times in some of our Kafka Streams
applications. After an unexpected restart of the applications, some instances
were not able to resume operating.

They got stuck while trying to restore the state store of a GlobalKTable. The
only way to resume operating was to manually delete their {{state.dir}}.

We observed the following timeline:
 * After the restart of the Kafka Streams application, it tries to restore its
GlobalKTable
 * It seeks to the last checkpoint available in the {{state.dir}}: 382
([https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L259])
 * The watermark ({{endOffset}} result) returned offset 383
{code:java}
handling ListOffsetResponse response for XX. Fetched offset 383, timestamp 
-1{code}

 * We enter the restore loop (a simplified sketch of this loop follows the list):
[https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L279]
 * Then we invoke {{poll()}}, but it returns nothing, so we reach
[https://github.com/apache/kafka/blob/2.7.0/streams/src/main/java/org/apache/kafka/streams/processor/internals/GlobalStateManagerImpl.java#L306]
 and we crash with:
{code:java}
Global task did not make progress to restore state within 300000 ms.{code}

 * The pod restarts, and we encounter the same issue until we manually delete
the {{state.dir}}
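
For context, here is a simplified sketch of how I understand the restore loop
linked above (this is not the actual {{GlobalStateManagerImpl}} code, and the
method and variable names are illustrative): progress is only made when
{{poll()}} returns data records, so when the only remaining message is a
transaction COMMIT marker, the restored offset never reaches the reported end
offset and the timeout fires.
{code:java}
import java.time.Duration;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.streams.errors.StreamsException;

// Simplified sketch of the restore loop, NOT the actual GlobalStateManagerImpl
// code. In our case checkpointOffset = 382 and endOffset = 383.
static void restoreSketch(final Consumer<byte[], byte[]> globalConsumer,
                          final TopicPartition partition,
                          final long checkpointOffset,
                          final long endOffset) {
    long offset = checkpointOffset;
    final long deadlineMs = System.currentTimeMillis() + 300_000L;

    globalConsumer.seek(partition, offset);
    while (offset < endOffset) {
        final ConsumerRecords<byte[], byte[]> records = globalConsumer.poll(Duration.ofMillis(100));
        for (final ConsumerRecord<byte[], byte[]> record : records.records(partition)) {
            // restore the record into the state store, then advance the offset
            offset = record.offset() + 1;
        }
        // Offset 382 is a transaction COMMIT marker: poll() returns no data
        // records, offset never reaches 383, and we always end up here.
        if (records.isEmpty() && System.currentTimeMillis() > deadlineMs) {
            throw new StreamsException(
                "Global task did not make progress to restore state within 300000 ms.");
        }
    }
}
{code}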

 

Regarding the topic, using the {{DumpLogSegments}} tool I can see:
 * {{Offset 381}} - Last business message received
 * {{Offset 382}} - Txn COMMIT (last message)

I think the real culprit is that the checkpoint is {{383}} instead of {{382}}.
For information, this is a compacted topic, and just before the outage we
encountered some ISR shrinking and leader changes.

While experimenting with the API, it seems that the {{consumer.position()}}
call is a bit tricky: after a {{seek()}} but before a {{poll()}},
{{position()}} returns the seek position; after the {{poll()}} call, even if
no data is returned, {{position()}} returns the LSO. I put an example on
[https://gist.github.com/Dabz/9aa0b4d1804397af6e7b6ad8cba82dcb].
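
To make the behaviour easier to reproduce, here is a minimal sketch along the
lines of the linked gist (broker address, topic name and offsets are
placeholders; it assumes the records after the seek position are only
transaction markers):
{code:java}
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class PositionAfterPoll {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class);

        final TopicPartition tp = new TopicPartition("my-global-topic", 0);
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 382L);

            // Before poll(): position() simply reports the seek offset (382).
            System.out.println("position after seek: " + consumer.position(tp));

            // poll() returns no data records (the remaining message is a
            // transaction marker), but the fetch still advances the position
            // to the LSO (383).
            consumer.poll(Duration.ofSeconds(1));
            System.out.println("position after poll: " + consumer.position(tp));
        }
    }
}
{code}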



> Infinite loop while restoring a GlobalKTable
> --------------------------------------------
>
>                 Key: KAFKA-12951
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12951
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Damien Gasparina
>            Priority: Major
>



