[ https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044885#comment-16044885 ]
Guozhang Wang edited comment on KAFKA-5415 at 6/9/17 10:59 PM:
---------------------------------------------------------------

[~apurva] [~hachikuji] The root cause for KAFKA-5416 (not this JIRA, but I'd still keep it here for the record) is twofold:

1) First, in {{TransactionStateManager#appendTransactionToLog}}, when we receive an error while trying to append to the log (in our case it is `NotEnoughReplicas`), we reset the pending state: {{metadata.pendingState = None}}. This behavior is correct for all of its callers EXCEPT the one from {{TransactionMarkerChannelManager#tryAppendToLog}}, in which {{appendCallback}} retries the append, but by then the pending state has already been reset. So when the retry finally succeeds and calls {{TransactionMetadata#completeTransitionTo}}, we throw an exception since the pending state is `None`. A simplified sketch of this race follows below.
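To make the interleaving concrete, here is a minimal, hypothetical model of the race. The names mirror the real classes and fields, but this is deliberately simplified sketch code under my own assumptions, not the actual broker implementation:

{code}
// Hypothetical, heavily simplified model of the race in 1); not broker code.
object PendingStateRace extends App {
  // Stand-in for TransactionMetadata.pendingState.
  var pendingState: Option[String] = None

  // Stand-in for TransactionMetadata#completeTransitionTo: the transition
  // can only complete if the matching pending state is still set.
  def completeTransitionTo(target: String): Unit = pendingState match {
    case Some(`target`) => println(s"transition to $target completed")
    case other => throw new IllegalStateException(
      s"cannot complete transition to $target: pending state is $other")
  }

  def appendToLog(firstAttempt: Boolean): Unit =
    if (firstAttempt) {
      // First append fails with NotEnoughReplicas; appendTransactionToLog
      // resets the pending state -- correct for its other callers, but
      // wrong for the retrying caller in tryAppendToLog.
      pendingState = None
      appendToLog(firstAttempt = false) // appendCallback retries
    } else {
      // The retry succeeds, but the pending state was already cleared,
      // so completing the transition now throws.
      completeTransitionTo("CompleteCommit")
    }

  // tryAppendToLog (simplified): set the pending state, then append.
  pendingState = Some("CompleteCommit")
  try appendToLog(firstAttempt = true)
  catch { case e: IllegalStateException =>
    // In the real broker this exception escaped to the fetch-handling
    // thread and was swallowed by handleError; see 2) below.
    println(s"thrown from the retry: $e")
  }
}
{code}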
2) This exception is thrown all the way up to {{KafkaApi#handleFetchRequest}}, because that is the thread that handles the follower fetch request, then increments the HW, and then calls {{tryCompleteDelayedRequests}}, which throws the exception. The exception is handled in {{handleError}}, which does not print ANY logging but simply returns an error to the follower fetcher. Hence, from the logs it looks like a mystery that the callback was seemingly never called; in fact it did get called but threw an exception that got silently swallowed. The key log entries that disclose this are on worker7:

{code}
[2017-06-09 01:16:54,132] DEBUG Partition [__transaction_state,37] on broker 1: High watermark for partition [__transaction_state,37] updated to 86 [0 : 14667] (kafka.cluster.Partition)
[2017-06-09 01:16:54,132] DEBUG [Replica Manager on Broker 1]: Request key __transaction_state-37 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
{code}

Note that there is no "unblocked x produce requests" entry after "unblocked 0 fetch requests", even though {{tryCompleteDelayedRequests}} always tries to complete fetch requests, then produce requests, then delete requests. And on worker2:

{code}
[2017-06-09 01:16:54,136] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,22] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,136] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,8] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,21] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,4] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,7] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,9] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,46] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,25] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,35] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,41] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,33] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,137] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,23] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,138] ERROR [ReplicaFetcherThread-0-1]: Error for partition [__consumer_offsets,49] to broker 1:org.apache.kafka.common.errors.UnknownServerException: The server experienced an unexpected error when processing the request (kafka.server.ReplicaFetcherThread)
......
{code}
This is the error code that {{handleError}} decided to send back while silently swallowing the exception.
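For clarity, here is a hypothetical, condensed model of that swallow path under my own simplifying assumptions; it is not the actual {{KafkaApis}} code, only an illustration of why the broker log stays silent while the follower sees {{UnknownServerException}}:

{code}
// Hypothetical condensed view of the swallow path in 2); not broker code.
object SilentSwallow extends App {
  // Stand-in for the delayed-request completion triggered by the HW move;
  // this is where completeTransitionTo blew up in 1).
  def tryCompleteDelayedRequests(): Unit =
    throw new IllegalStateException("pending state is None")

  // Stand-in for handleError: the exception is mapped to an error code and
  // nothing is logged, so it vanishes from the broker's own logs.
  def handleError(e: Throwable): String = "UNKNOWN_SERVER_ERROR"

  // handleFetchRequest (simplified): the follower fetch advances the HW,
  // the completion callback throws, and only an opaque error code goes
  // back to the follower fetcher.
  val errorCode =
    try { tryCompleteDelayedRequests(); "NONE" }
    catch { case e: Throwable => handleError(e) }
  println(s"returned to follower fetcher: $errorCode")
}
{code}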
> TransactionCoordinator doesn't complete transition to PrepareCommit state
> -------------------------------------------------------------------------
>
>                 Key: KAFKA-5415
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5415
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>            Priority: Blocker
>              Labels: exactly-once
>             Fix For: 0.11.0.0
>
>         Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins.
> The transaction coordinator seems to get into a path during the handling of the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or COORDINATOR_NOT_AVAILABLE error, to be revealed by https://github.com/apache/kafka/pull/3278) to the client. However, due to network instability, the producer is disconnected before it receives this error.
> As a result, the transaction remains in a `PrepareXX` state, and future `EndTxn` requests sent by the client after reconnecting result in a `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the transaction never finishes, as expiration isn't done from a PrepareXX state.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)