[ https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044885#comment-16044885 ]

Guozhang Wang edited comment on KAFKA-5415 at 6/9/17 10:59 PM:
---------------------------------------------------------------

[~apurva] [~hachikuji] The root cause for KAFKA-5416 (not this JIRA, but I'd 
still just keep it here for the record) is twofold:

1) First, in {{TransactionStateManager#appendTransactionToLog}}, when we 
receive an error while trying to append to the log (in our case it is 
{{NotEnoughReplicas}}), we reset the pending state: {{metadata.pendingState 
= None}}. This behavior is correct for all of its callers EXCEPT the one from 
{{TransactionMarkerChannelManager#tryAppendToLog}}, whose {{appendCallback}} 
will retry the append, but by then the pending state has already been reset.

So when the retry finally succeeds, it calls 
{{TransactionMetadata#completeTransitionTo}}, which throws an exception since 
the pending state is {{None}}.
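
As a quick illustration, here is a hedged sketch of that sequence. The names 
({{TxnMetadata}}, {{appendToLog}}, {{retry}}) are simplified stand-ins rather than 
the actual Kafka classes, but the reset-then-retry ordering follows the description 
above:

{code}
// Hedged sketch in Scala: simplified stand-ins, not the real TransactionStateManager code.
case class TxnMetadata(var state: String, var pendingState: Option[String])

// Mirrors the idea of TransactionMetadata#completeTransitionTo: the transition is only
// legal if it was previously prepared, i.e. pendingState is still set to the target.
def completeTransitionTo(metadata: TxnMetadata, target: String): Unit = {
  if (!metadata.pendingState.contains(target))
    throw new IllegalStateException(s"No pending transition to $target")
  metadata.state = target
  metadata.pendingState = None
}

// Mirrors the idea of appendTransactionToLog: on a retriable error the pending state is
// reset, but the caller from the marker channel manager retries the append anyway.
def appendToLog(metadata: TxnMetadata, target: String, succeeds: Boolean)(retry: () => Unit): Unit = {
  if (!succeeds) {
    metadata.pendingState = None   // reset on error, e.g. NotEnoughReplicas
    retry()                        // tryAppendToLog's appendCallback schedules a retry
  } else {
    completeTransitionTo(metadata, target)   // throws: pendingState was cleared earlier
  }
}

val metadata = TxnMetadata(state = "Ongoing", pendingState = Some("PrepareCommit"))
appendToLog(metadata, "PrepareCommit", succeeds = false) { () =>
  // the retry eventually succeeds, but pendingState is already None, so this throws
  appendToLog(metadata, "PrepareCommit", succeeds = true)(() => ())
}
{code}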

2) This exception is thrown all the way up to {{KafkaApi#handleFetchRequest}}, 
because it is that thread which handles the follower fetch request, then 
increments the HW, and then calls {{tryCompleteDelayedRequests}}, which throws 
the exception. The exception is handled in {{handleError}}, which does not 
print ANY logging but simply returns an error to the follower fetcher. 
Hence from the logs it looks like a mystery that the callback was never 
called; in fact it did get called, but threw an exception that was silently 
swallowed.
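
To make the silent swallowing concrete, here is a hedged sketch with made-up names 
(not the actual {{KafkaApis}} code): the error handler only maps the throwable to an 
error code for the fetch response, nothing is logged on the broker that threw it, and 
the failure only surfaces as an {{UnknownServerException}} on the follower:

{code}
// Hedged sketch: simplified names, not the real KafkaApis/handleError code.
def handleFollowerFetch(tryCompleteDelayedRequests: () => Unit): Short = {
  try {
    // The follower fetch advanced the HW; checking the delayed operations ends up
    // calling completeTransitionTo, which throws because pendingState is None.
    tryCompleteDelayedRequests()
    0.toShort          // NONE
  } catch {
    case _: Throwable =>
      // Analogous to handleError above: no log line here, just an error code that the
      // follower fetcher then reports as UnknownServerException.
      (-1).toShort     // UNKNOWN (server error)
  }
}
{code}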

The key log entries that disclose this are in worker7: 

{code}
[2017-06-09 01:16:54,132] DEBUG Partition [__transaction_state,37] on broker 1: 
High watermark for partition [__transaction_state,37] updated to 86 [0 : 14667] 
(kafka.cluster.Partition)
[2017-06-09 01:16:54,132] DEBUG [Replica Manager on Broker 1]: Request key 
__transaction_state-37 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
{code}

Note that there is no "unblocked x produce requests" entry after "unblocked 0 
fetch requests", even though in {{tryCompleteDelayedRequests}} we always try to 
complete fetch, then produce, then delete requests (see the sketch after the 
worker2 excerpt below); and in worker2:

{code}
[2017-06-09 01:16:54,136] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,22] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,136] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,8] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,21] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,4] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,7] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,9] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,46] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,25] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,35] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,41] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,33] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,23] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,138] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,49] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
......
{code}

That is the error code that {{handleError}} decided to send back while silently 
swallowing the exception.
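
To tie the two excerpts together, here is a hedged sketch (made-up names, not the 
actual {{Partition}}/{{ReplicaManager}} code) of the fetch-then-produce-then-delete 
completion order, showing why the "unblocked 0 fetch requests" line appears in 
worker7's log but the produce line never does:

{code}
// Hedged sketch: simplified stand-ins, not the real tryCompleteDelayedRequests code.
def tryCompleteDelayedRequests(key: String,
                               completeFetch: () => Int,
                               completeProduce: () => Int,   // runs the callback that throws
                               completeDelete: () => Int,
                               debug: String => Unit): Unit = {
  val fetches = completeFetch()
  debug(s"Request key $key unblocked $fetches fetch requests.")     // present in worker7's log
  val produces = completeProduce()                                   // throws here ...
  debug(s"Request key $key unblocked $produces produce requests.")  // ... so never printed
  val deletes = completeDelete()
  debug(s"Request key $key unblocked $deletes delete requests.")
}
{code}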



> TransactionCoordinator doesn't complete transition to PrepareCommit state
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-5415
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5415
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>            Priority: Blocker
>              Labels: exactly-once
>             Fix For: 0.11.0.0
>
>         Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.


