[jira] [Commented] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16045051#comment-16045051 ]

Apurva Mehta commented on KAFKA-5415:
-------------------------------------

[~guozhang]'s explanation above actually applies to KAFKA-5416.

> TransactionCoordinator gets stuck in PrepareCommit state
> ---------------------------------------------------------
>
>                 Key: KAFKA-5415
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5415
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>            Priority: Blocker
>              Labels: exactly-once
>             Fix For: 0.11.0.0
>
>         Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins.
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTIONS` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-09 Thread Guozhang Wang (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044885#comment-16044885 ]

Guozhang Wang commented on KAFKA-5415:
--------------------------------------

[~apurva] [~hachikuji] The root cause is twofold:

1) First, in {{TransactionStateManager#appendTransactionToLog}}, when we 
receive an error while trying to append to the log (in our case 
`NotEnoughReplicas`), we reset the pending state: {{metadata.pendingState 
= None}}. This behavior is correct for all of its callers EXCEPT the one from 
{{TransactionMarkerChannelManager#tryAppendToLog}}, whose {{appendCallback}} 
retries the append, but by then the pending state has already been reset.

So when the retry finally succeeds and calls 
{{TransactionMetadata#completeTransitionTo}}, we throw an exception because 
the pending state is `None`.
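
To make the interaction concrete, here is a minimal, hypothetical Scala sketch 
of that step. The class and method names follow the real ones mentioned above, 
but the body is simplified for illustration; this is not the actual broker code:

{code}
// Simplified, illustrative sketch (not the real TransactionMetadata).
class TransactionMetadata {
  // Set when the EndTxn transition begins, e.g. Some("PrepareCommit").
  var pendingState: Option[String] = Some("PrepareCommit")

  def completeTransitionTo(target: String): Unit = pendingState match {
    case Some(_) => pendingState = None   // normal completion path
    case None =>
      // The exception from 1): the failed first append already reset
      // pendingState, so the retried append cannot complete its transition.
      throw new IllegalStateException(s"No pending transition to $target")
  }
}
{code}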

2) This exception is thrown all the way up to {{KafkaApis#handleFetchRequest}}, 
because that is the thread that handles the follower fetch request, increments 
the HW, and then calls {{tryCompleteDelayedRequests}}, which throws the 
exception. The exception is handled in {{handleError}}, in which we do not 
print ANY logging but simply return an error to the follower fetcher. Hence 
from the logs it is a mystery why the callback seems never to have been 
called; in fact it did get called, but it threw an exception that was silently 
swallowed.
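
To see why nothing shows up in the leader's log, here is a hypothetical sketch 
of that error path. {{Errors.forException}} is the real mapping used by Kafka's 
protocol layer; the enclosing object and method are made-up stand-ins for the 
actual response handling:

{code}
import org.apache.kafka.common.protocol.Errors

// Illustrative only; not the real KafkaApis error handling.
object ErrorSwallowSketch {
  // An exception escaping a purgatory callback is mapped straight to an
  // error code with no logging, so the IllegalStateException's stack trace
  // is lost on the leader. The follower only sees the generic code, which
  // it logs as UnknownServerException (see the worker2 entries below).
  def handleError(e: Throwable): Errors = Errors.forException(e)
}
{code}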

The key log entries that disclose this are in worker7: 

{code}
[2017-06-09 01:16:54,132] DEBUG Partition [__transaction_state,37] on broker 1: 
High watermark for partition [__transaction_state,37] updated to 86 [0 : 14667] 
(kafka.cluster.Partition)
[2017-06-09 01:16:54,132] DEBUG [Replica Manager on Broker 1]: Request key 
__transaction_state-37 unblocked 0 fetch requests. (kafka.server.ReplicaManager)
{code}

Note that it does not log any "unblocked x produce requests" after "unblocked 0 
fetch requests", even though in {{tryCompleteDelayedRequests}} we always try to 
complete fetch requests, then produce requests, then delete requests, as 
sketched below.
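
A hypothetical outline of that completion order (method names modeled on the 
ones in {{ReplicaManager}}, with bodies reduced to stubs) shows why the produce 
step never gets to log its "unblocked" line:

{code}
// Illustrative ordering only; not the actual Partition/ReplicaManager code.
object DelayedCompletionSketch {
  def tryCompleteDelayedFetch(): Unit =
    println("unblocked 0 fetch requests")   // worker7 shows this line
  def tryCompleteDelayedProduce(): Unit =
    // The transaction-log append callback runs during produce completion.
    throw new IllegalStateException("no pending transition")   // from 1) above
  def tryCompleteDelayedDeleteRecords(): Unit = ()   // never reached

  def tryCompleteDelayedRequests(): Unit = {
    tryCompleteDelayedFetch()
    tryCompleteDelayedProduce()   // throws before logging
                                  // "unblocked x produce requests"
    tryCompleteDelayedDeleteRecords()
  }
}
{code}

And in worker2: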

{code}
[2017-06-09 01:16:54,136] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,22] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
[2017-06-09 01:16:54,136] ERROR [ReplicaFetcherThread-0-1]: Error for partition 
[__consumer_offsets,8] to broker 
1:org.apache.kafka.common.errors.UnknownServerException: The server experienced 
an unexpected error when processing the request 
(kafka.server.ReplicaFetcherThread)
... (identical UnknownServerException entries follow for partitions 21, 4, 7, 
9, 46, 25, 35, 41, ...)
{code}

[jira] [Commented] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16044861#comment-16044861 ]

Apurva Mehta commented on KAFKA-5415:
-------------------------------------

[~guozhang] has found the bug and will file the PR. Assigning to him.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-08 Thread Apurva Mehta (JIRA)

[ https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16043758#comment-16043758 ]

Apurva Mehta commented on KAFKA-5415:
-------------------------------------

The interesting thing is that it reproduces consistently on Jenkins only when 
it is part of a multi-test run. Running the test in isolation makes it look 
stable.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)