[
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044885#comment-16044885
]
Guozhang Wang edited comment on KAFKA-5415 at 6/9/17 10:59 PM:
---------------------------------------------------------------
[~apurva] [~hachikuji] The root cause for KAFKA-5416 (not this JIRA, but I'd
still keep it here for the record) is twofold:
1) First, in {{TransactionStateManager#appendTransactionToLog}}, when we
receive an error while trying to append to the log (in our case
`NotEnoughReplicas`), we reset the pending state: {{metadata.pendingState
= None}}. This behavior is correct for all of its callers EXCEPT the one from
{{TransactionMarkerChannelManager#tryAppendToLog}}, whose {{appendCallback}}
retries the append, but by then the pending state has already been reset.
So when the retry finally succeeds and calls
{{TransactionMetadata#completeTransitionTo}}, we throw an exception because
the pending state is `None`.
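To make the race concrete, here is a minimal, self-contained sketch (simplified
stand-in names, not the actual coordinator classes): the failed append clears
the pending state, the callback retries, and the successful retry then blows up
in {{completeTransitionTo}}:
{code}
object PendingStateBugSketch extends App {
  // Simplified stand-in for TransactionMetadata: only the fields needed here.
  case class TxnMetadata(var state: String, var pendingState: Option[String]) {
    def completeTransitionTo(target: String): Unit = pendingState match {
      case Some(s) if s == target =>
        state = target
        pendingState = None
      case _ =>
        // This is the exception seen on the coordinator: the retry succeeded,
        // but the pending state was cleared by the first (failed) append.
        throw new IllegalStateException(s"no pending transition to $target")
    }
  }

  val metadata = TxnMetadata(state = "Ongoing", pendingState = Some("PrepareCommit"))

  // Stand-in for appendTransactionToLog: on error (e.g. NotEnoughReplicas) it
  // resets the pending state before invoking the callback -- the problematic part.
  def appendToLog(succeed: Boolean)(callback: Boolean => Unit): Unit = {
    if (!succeed) metadata.pendingState = None
    callback(succeed)
  }

  // Stand-in for tryAppendToLog's appendCallback: on error, just retry.
  def appendCallback(succeeded: Boolean): Unit =
    if (succeeded) metadata.completeTransitionTo("PrepareCommit") // throws here
    else appendToLog(succeed = true)(appendCallback)              // retry is now doomed

  // Running this throws IllegalStateException, modeling the broker-side failure.
  appendToLog(succeed = false)(appendCallback)
}
{code}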
2) This exception is thrown all the way up to {{KafkaApis#handleFetchRequest}},
because that is the thread handling the follower fetch request: it increments
the HW and then calls {{tryCompleteDelayedRequests}}, which throws the
exception. The exception is then handled in {{handleError}}, where we do not
print ANY logging but simply return an error to the follower fetcher.
Hence from the logs it looks like a mystery that the callback was never
called; in fact it did get called, but threw an exception that was silently
swallowed.
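The control flow, again as a rough sketch rather than the real {{KafkaApis}}
code, looks like this: the exception escapes the delayed-operation completion
on the fetch-handling thread and gets mapped to an error response with no
broker-side logging:
{code}
object SwallowedExceptionSketch extends App {
  // Completing delayed operations runs the transaction-marker append callback,
  // which throws because pendingState was already reset (see part 1).
  def tryCompleteDelayedRequests(): Unit =
    throw new IllegalStateException("no pending transition to PrepareCommit")

  // The follower fetch advanced the HW, so the fetch-handling thread tries to
  // complete the delayed operations and inherits the exception.
  def handleFetchRequest(): Unit =
    tryCompleteDelayedRequests()

  // No logging here: the only visible trace is the UNKNOWN_SERVER_ERROR that
  // the follower's ReplicaFetcherThread prints on its side (worker2 log below).
  def handleError(e: Throwable): String =
    "UNKNOWN_SERVER_ERROR"

  val errorCode =
    try { handleFetchRequest(); "NONE" }
    catch { case e: Throwable => handleError(e) }

  println(s"error code returned to the follower fetcher: $errorCode")
}
{code}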
The key log entries that disclose this are in worker7:
{code}
[2017-06-09 01:16:54,132] DEBUG Partition [__transaction_state,37] on broker 1:
High watermark for partition [__transaction_state,37] updated to 86 [0 : 14667]
(kafka.cluster.Partition)
[2017-06-09 01:16:54,132] DEBUG [Replica Manager on Broker 1]: Request key
__transaction_state-37 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
{code}
Note that there is no "unblocked x produce requests" entry after "unblocked 0
fetch requests", even though in {{tryCompleteDelayedRequests}} we always try to
complete fetch, then produce, then delete requests; and in worker2:
{code}
[2017-06-09 01:16:54,136] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,22] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,136] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,8] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,21] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,4] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,7] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,9] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,46] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,25] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,35] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,41] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,33] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,23] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,138] ERROR [ReplicaFetcherThread-0-1]: Error for partition
[__consumer_offsets,49] to broker
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced
an unexpected error when processing the request
(kafka.server.ReplicaFetcherThread)
......
{code}
This is the error code that {{handleError}} decided to send back while silently
swallowing the exception.
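Putting the two log fragments together, the likely sequence is that the
exception escaped while completing the delayed produce, which is why its
"unblocked ... produce requests" line never got printed. A rough sketch of that
ordering (simplified names, not the real {{ReplicaManager}} code):
{code}
object CompletionOrderSketch extends App {
  // Stand-in for one purgatory's checkAndComplete plus its debug log line.
  def completeDelayed(purgatory: String)(complete: () => Int): Unit = {
    val unblocked = complete()
    println(s"Request key __transaction_state-37 unblocked $unblocked $purgatory requests.")
  }

  try {
    completeDelayed("fetch")(() => 0)          // printed: "unblocked 0 fetch requests"
    completeDelayed("produce") { () =>
      // completing the delayed produce runs the marker append callback, which
      // throws because pendingState was already reset -- so the produce line
      // is never printed
      throw new IllegalStateException("no pending transition to PrepareCommit")
    }
    completeDelayed("delete")(() => 0)         // never reached
  } catch {
    case e: Throwable =>
      println(s"exception escapes to the fetch handler: ${e.getMessage}")
  }
}
{code}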
> TransactionCoordinator doesn't complete transition to PrepareCommit state
> -------------------------------------------------------------------------
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
> Issue Type: Bug
> Reporter: Apurva Mehta
> Assignee: Apurva Mehta
> Priority: Blocker
> Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on jenkins.
> The transaction coordinator seems to get into a path during the handling of
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or
> COORDINATOR_NOT_AVAILABLE error, to be revealed by
> https://github.com/apache/kafka/pull/3278) to the client. However, due to
> network instability, the producer is disconnected before it receives this
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future
> `EndTxn` requests sent by the client after reconnecting result in a
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the
> transaction never finishes, as expiration isn't done from a PrepareXX state.