[jira] [Created] (KAFKA-6053) NoSuchMethodError when creating ProducerRecord in upgrade system tests

2017-10-11 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-6053:
---

 Summary: NoSuchMethodError when creating ProducerRecord in upgrade 
system tests
 Key: KAFKA-6053
 URL: https://issues.apache.org/jira/browse/KAFKA-6053
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta
Assignee: Apurva Mehta


This patch https://github.com/apache/kafka/pull/4029 used a new constructor for 
{{ProducerRecord}} which doesn't exist in older clients. Hence system tests 
which use older clients fail with: 

{noformat}
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.kafka.clients.producer.ProducerRecord.(Ljava/lang/String;Ljava/lang/Integer;Ljava/lang/Long;Ljava/lang/Object;Ljava/lang/Object;)V
at 
org.apache.kafka.tools.VerifiableProducer.send(VerifiableProducer.java:232)
at 
org.apache.kafka.tools.VerifiableProducer.run(VerifiableProducer.java:462)
at 
org.apache.kafka.tools.VerifiableProducer.main(VerifiableProducer.java:500)
{"timestamp":1507711495458,"name":"shutdown_complete"}
{"timestamp":1507711495459,"name":"tool_data","sent":0,"acked":0,"target_throughput":1,"avg_throughput":0.0}
amehta-macbook-pro:worker6 apurva$
{noformat}

A trivial fix is to use the new constructor only when a message create time is 
explicitly passed to the VerifiableProducer, since older versions of the test 
never pass one.
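
As a rough illustration, a minimal sketch of the conditional construction (the 
{{createTime}} parameter name is an assumption for illustration, not 
necessarily the actual VerifiableProducer field):

{code:java}
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: use the old constructor unless a create time was explicitly
// configured, so the tool still runs against older client jars.
static ProducerRecord<String, String> buildRecord(String topic, Integer partition,
                                                  Long createTime, String key, String value) {
    if (createTime == null)
        // Constructor signature available in all client versions.
        return new ProducerRecord<>(topic, partition, key, value);
    // Timestamp constructor: only present in newer clients.
    return new ProducerRecord<>(topic, partition, createTime, key, value);
}
{code}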





[jira] [Created] (KAFKA-6016) Use the idempotent producer in the reassign_partitions_test

2017-10-04 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-6016:
---

 Summary: Use the idempotent producer in the 
reassign_partitions_test
 Key: KAFKA-6016
 URL: https://issues.apache.org/jira/browse/KAFKA-6016
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.1, 1.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.1.0


Currently, the reassign partitions test doesn't use the idempotent producer, 
which means that bugs like KAFKA-6003 have gone unnoticed. We should update the 
test to use the idempotent producer so that the conditions behind that bug are 
recreated on a regular basis and we fully test all code paths.
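
For reference, a minimal sketch of the producer configuration the test would 
use (standard client configs; the bootstrap address is illustrative):

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArraySerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.ByteArraySerializer");
// Turn on idempotence; this requires acks=all on the producer.
props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
props.put(ProducerConfig.ACKS_CONFIG, "all");
KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
{code}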





[jira] [Created] (KAFKA-6015) NPE in RecordAccumulator

2017-10-04 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-6015:
---

 Summary: NPE in RecordAccumulator
 Key: KAFKA-6015
 URL: https://issues.apache.org/jira/browse/KAFKA-6015
 Project: Kafka
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
Priority: Blocker
 Fix For: 1.0.0


I found this inadvertently while trying to create a system test to reproduce 
KAFKA-6003.

{noformat}
java.lang.NullPointerException
at 
org.apache.kafka.clients.producer.internals.RecordAccumulator.drain(RecordAccumulator.java:542)
at 
org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:270)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
{noformat}

The problem is with this line:

{code:java}
if (first.hasSequence()
    && first.baseSequence() != transactionManager.nextBatchBySequence(first.topicPartition).baseSequence())
{code}

It is possible for the producer state to be reset (for instance, if retries are 
exhausted), in which case the transaction manager will drop the in-flight 
batches it is tracking. However, these batches remain in the accumulator with a 
sequence, so {{nextBatchBySequence}} returns null for them and the background 
thread hits an NPE on this line.

It would be better to drain the batches with the old PID/sequence in this case: 
either they are accepted, or they are returned to the user with an 
{{OutOfOrderSequenceException}}, which is clearer.
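
A minimal defensive sketch (assuming, per the above, that 
{{nextBatchBySequence}} can return null once the tracked batches are dropped; 
the surrounding drain logic is elided):

{code:java}
// Sketch only: guard against the transaction manager having dropped its
// in-flight batches after a producer-state reset.
ProducerBatch tracked = transactionManager.nextBatchBySequence(first.topicPartition);
if (first.hasSequence() && tracked != null && first.baseSequence() != tracked.baseSequence()) {
    // Out of order relative to the tracked batch: do not drain this one yet.
    break;
}
// If tracked is null, fall through and drain with the old PID/sequence: the
// broker either accepts it or fails it with OutOfOrderSequenceException.
{code}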






[jira] [Resolved] (KAFKA-5552) testTransactionalProducerTopicAuthorizationExceptionInCommit fails

2017-09-28 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta resolved KAFKA-5552.
-
   Resolution: Cannot Reproduce
Fix Version/s: (was: 1.1.0)

> testTransactionalProducerTopicAuthorizationExceptionInCommit fails 
> ---
>
> Key: KAFKA-5552
> URL: https://issues.apache.org/jira/browse/KAFKA-5552
> Project: Kafka
>  Issue Type: Bug
>  Components: core
>Affects Versions: 0.11.0.0
>Reporter: Eno Thereska
>
> Got a unit test error: 
> https://builds.apache.org/job/kafka-pr-jdk7-scala2.11/5877/
> Error Message
> org.apache.kafka.common.KafkaException: Cannot execute transactional method 
> because we are in an error state
> Stacktrace
> org.apache.kafka.common.KafkaException: Cannot execute transactional method 
> because we are in an error state
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager.maybeFailWithError(TransactionManager.java:524)
>   at 
> org.apache.kafka.clients.producer.internals.TransactionManager.beginCommit(TransactionManager.java:190)
>   at 
> org.apache.kafka.clients.producer.KafkaProducer.commitTransaction(KafkaProducer.java:583)
>   at 
> kafka.api.AuthorizerIntegrationTest.testTransactionalProducerTopicAuthorizationExceptionInCommit(AuthorizerIntegrationTest.scala:1027)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.runTestClass(JUnitTestClassExecuter.java:114)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassExecuter.execute(JUnitTestClassExecuter.java:57)
>   at 
> org.gradle.api.internal.tasks.testing.junit.JUnitTestClassProcessor.processTestClass(JUnitTestClassProcessor.java:66)
>   at 
> org.gradle.api.internal.tasks.testing.SuiteTestClassProcessor.processTestClass(SuiteTestClassProcessor.java:51)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:24)
>   at 
> org.gradle.internal.dispatch.ContextClassLoaderDispatch.dispatch(ContextClassLoaderDispatch.java:32)
>   at 
> org.gradle.internal.dispatch.ProxyDispatchAdapter$DispatchingInvocationHandler.invoke(ProxyDispatchAdapter.java:93)
>   at com.sun.proxy.$Proxy2.processTestClass(Unknown Source)
>   at 
> org.gradle.api.internal.tasks.testing.worker.TestWorker.processTestClass(TestWorker.java:109)
>   at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.gradle.internal.dispatch.ReflectionDispatch.dispatch(ReflectionDispatch.java:35)

[jira] [Resolved] (KAFKA-5865) Expiring batches with idempotence enabled could cause data loss.

2017-09-25 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta resolved KAFKA-5865.
-
Resolution: Fixed

This is fixed in 1.0.0 by the changes in 
https://github.com/apache/kafka/pull/3743

> Expiring batches with idempotence enabled could cause data loss.
> 
>
> Key: KAFKA-5865
> URL: https://issues.apache.org/jira/browse/KAFKA-5865
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
> Fix For: 1.0.0
>
>
> Currently we have a problem with this line:
> https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L282
> Because we can reset the producer id and return after draining batches, it 
> means that we can drain batches for some partitions, then find a batch has 
> expired, and then return. But the batches which were drained are now no 
> longer in the producer queue, and haven't been sent. Thus they are totally 
> lost, and their callbacks will never be invoked.
> This is already fixed in https://github.com/apache/kafka/pull/3743 , but 
> opening this in case we want to fix it in 0.11.0.2 as well.





[jira] [Created] (KAFKA-5914) Return MessageFormatVersion and MessageMaxBytes in MetadataResponse

2017-09-15 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5914:
---

 Summary: Return MessageFormatVersion and MessageMaxBytes in 
MetadataResponse
 Key: KAFKA-5914
 URL: https://issues.apache.org/jira/browse/KAFKA-5914
 Project: Kafka
  Issue Type: Sub-task
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


As part of KIP-192, we want to send two additional fields in the 
{{TopicMetadata}} which is part of the {{MetadataResponse}}. These fields are 
the {{MessageFormatVersion}} and the {{MessageMaxBytes}}.

The {{MessageFormatVersion}} is required to implement 
https://issues.apache.org/jira/browse/KAFKA-5794 . KAFKA-5794 itself will be 
implemented in a future release, but with the changes proposed here, that 
future release will be backward compatible with 1.0.0.
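
Illustratively, the per-topic metadata would then carry something like the 
following (a sketch, not the actual protocol schema; field types are assumed 
from the corresponding topic configs):

{code:java}
// Sketch of the proposed additions to TopicMetadata (illustrative only).
class TopicMetadataSketch {
    String topic;
    byte messageFormatVersion; // magic value of the topic's message format, e.g. 2 for v2
    int messageMaxBytes;       // effective max.message.bytes for the topic
    // ... existing fields: error code, isInternal, partition metadata, etc.
}
{code}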





[jira] [Created] (KAFKA-5913) Add RecordMetadataNotAvailableException to indicate that ProduceResponse did not contain offset and timestamp information

2017-09-15 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5913:
---

 Summary: Add RecordMetadataNotAvailableException to indicate that 
ProduceResponse did not contain offset and timestamp information
 Key: KAFKA-5913
 URL: https://issues.apache.org/jira/browse/KAFKA-5913
 Project: Kafka
  Issue Type: Sub-task
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


One of the changes in KIP-192 is to send a {{DUPLICATE_SEQUENCE}} error code 
with a {{ProduceResponse}} when we detect a duplicate on the broker but don't 
have the batch metadata for the sequence in question in memory.

To handle this on the client, we mark the batch as successful, but we cannot 
return the offset and timestamp information in the {{RecordMetadata}} returned 
in the produce future. Instead of returning implicit invalid values (like -1), 
we should throw a {{RecordMetadataNotAvailableException}} so that 
applications don't silently process invalid metadata.
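
A minimal sketch of the intended accessor behavior (the exception type is the 
one proposed here; the accessor shape is illustrative):

{code:java}
// Illustrative sketch: the accessor raises instead of returning a sentinel
// when the broker deduplicated the batch but had no cached metadata for it.
public long offset() {
    if (this.offset < 0) // sentinel meaning "metadata not available"
        throw new RecordMetadataNotAvailableException(
            "The broker deduplicated this batch, so no offset is available");
    return this.offset;
}
{code}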





[jira] [Created] (KAFKA-5912) Trogdor AgentTest.testAgentActivatesFaults is flaky

2017-09-15 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5912:
---

 Summary: Trogdor AgentTest.testAgentActivatesFaults is flaky
 Key: KAFKA-5912
 URL: https://issues.apache.org/jira/browse/KAFKA-5912
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta
Assignee: Colin P. McCabe


I have seen the following failures occasionally in the PR builder.

{noformat}
Error Message

java.lang.AssertionError: Condition not met within timeout 15000. Timed out 
waiting for expected fault specs {bar: {state: 
{"stateName":"done","doneMs":7,"errorStr":""}}, baz: {state: 
{"stateName":"running","startedMs":7}}, foo: {state: 
{"stateName":"done","doneMs":3,"errorStr":""}}}
Stacktrace

java.lang.AssertionError: Condition not met within timeout 15000. Timed out 
waiting for expected fault specs {bar: {state: 
{"stateName":"done","doneMs":7,"errorStr":""}}, baz: {state: 
{"stateName":"running","startedMs":7}}, foo: {state: 
{"stateName":"done","doneMs":3,"errorStr":""}}}
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:275)
at org.apache.kafka.test.TestUtils.waitForCondition(TestUtils.java:253)
at 
org.apache.kafka.trogdor.common.ExpectedFaults.waitFor(ExpectedFaults.java:119)
at 
org.apache.kafka.trogdor.common.ExpectedFaults.waitFor(ExpectedFaults.java:109)
at 
org.apache.kafka.trogdor.agent.AgentTest.testAgentActivatesFaults(AgentTest.java:152)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:748)

{noformat}





[jira] [Created] (KAFKA-5897) The producerId can be reset unnecessarily

2017-09-14 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5897:
---

 Summary: The producerId can be reset unnecessarily
 Key: KAFKA-5897
 URL: https://issues.apache.org/jira/browse/KAFKA-5897
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.11.0.0, 1.0.0
Reporter: Apurva Mehta


Currently, we expire batches and reset the producer id in cases where we don't 
need to.

For instance, if a batch which has been previously sent is expired in the 
accumulator, the producerId is reset (or the transaction aborted) 
unconditionally. However, if the batch failed with certain error codes like 
{{NOT_LEADER_FOR_PARTITION}}, etc., which definitively indicate that the write 
never succeeded, we don't need to reset the producer state since the status of 
the batch is not in doubt. 

Essentially, we would like smarter 'reset based on failure mode' logic.
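
A rough sketch of what such a decision could look like (the error set shown is 
illustrative, not exhaustive):

{code:java}
import org.apache.kafka.common.protocol.Errors;

// Sketch: only reset producer state when the fate of the batch is in doubt.
static boolean requiresProducerReset(Errors lastError) {
    switch (lastError) {
        case NOT_LEADER_FOR_PARTITION: // write definitively never succeeded
        case NOT_ENOUGH_REPLICAS:      // likewise rejected before append
            return false;              // safe to keep the pid/sequence state
        default:
            return true;               // fate unknown: reset to be safe
    }
}
{code}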





[jira] [Created] (KAFKA-5888) Transactions system test should check for message order

2017-09-13 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5888:
---

 Summary: Transactions system test should check for message order
 Key: KAFKA-5888
 URL: https://issues.apache.org/jira/browse/KAFKA-5888
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


Currently, the transactions system test doesn't check for correct ordering of 
the messages in a transaction. With KAFKA-5494, we can have multiple inflight 
requests for a single transaction, which could lead to out-of-order messages 
in the log if there are bugs. So we should assert that order is maintained in 
our system tests.
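
A minimal sketch of such an assertion (assuming the consumed values from each 
partition encode a monotonically increasing counter, as the transactional copy 
tests typically do):

{code:java}
import java.util.List;

// Sketch: verify that consumed values from a partition are strictly increasing.
static void assertOrdered(List<Long> consumedValues) {
    for (int i = 1; i < consumedValues.size(); i++) {
        if (consumedValues.get(i) <= consumedValues.get(i - 1))
            throw new AssertionError("Out-of-order message at index " + i + ": "
                + consumedValues.get(i) + " <= " + consumedValues.get(i - 1));
    }
}
{code}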





[jira] [Resolved] (KAFKA-5621) The producer should retry expired batches when retries are enabled

2017-09-12 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta resolved KAFKA-5621.
-
Resolution: Won't Fix

> The producer should retry expired batches when retries are enabled
> --
>
> Key: KAFKA-5621
> URL: https://issues.apache.org/jira/browse/KAFKA-5621
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Sumant Tambe
> Fix For: 1.0.0
>
>
> Today, when a batch is expired in the accumulator, a {{TimeoutException}} is 
> raised to the user.
> It might be better for the producer to retry the expired batch, up to the 
> configured number of retries. This is more intuitive from the user's point of 
> view.
> Further, the proposed behavior makes it easier for applications like mirror 
> maker to provide ordering guarantees even when batches expire. Today, they 
> would resend the expired batch and it would get added to the back of the 
> queue, causing the output ordering to be different from the input ordering.





[jira] [Created] (KAFKA-5870) Idempotent producer: a producerId reset causes undesirable behavior for inflight batches to other partitions

2017-09-11 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5870:
---

 Summary: Idempotent producer: a producerId reset causes 
undesirable behavior for inflight batches to other partitions
 Key: KAFKA-5870
 URL: https://issues.apache.org/jira/browse/KAFKA-5870
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


Currently, if we have to reset the producer id for any reason (for instance, if 
batches to a partition get expired, or if we get an 
{{OutOfOrderSequenceException}}, etc.) we could cause batches to other, 
healthy partitions to fail with a spurious {{OutOfOrderSequenceException}}.

This is detailed in this PR discussion: 
https://github.com/apache/kafka/pull/3743#discussion_r137907630

Ideally, we would want all inflight batches to be handled to completion rather 
than potentially failing them prematurely. Further, since we want to tighten up 
the semantics of the {{OutOfOrderSequenceException}}, at the very least we 
should raise another exception in this case, because there is no data loss on 
the broker when the client gives up. 





[jira] [Created] (KAFKA-5865) Expiring batches with idempotence enabled could cause data loss.

2017-09-08 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5865:
---

 Summary: Expiring batches with idempotence enabled could cause 
data loss.
 Key: KAFKA-5865
 URL: https://issues.apache.org/jira/browse/KAFKA-5865
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


Currently we have a problem with this line:
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java#L282

Because we can reset the producer id and return after draining batches, it 
means that we can drain batches for some partitions, then find a batch has 
expired, and then return. But the batches which were drained are now no longer 
in the producer queue, and haven't been sent. Thus they are totally lost, and 
their callbacks will never be invoked.
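
A simplified sketch of the problematic control flow in the sender loop (not 
the actual {{Sender}} code; method names are illustrative):

{code:java}
// Illustrative only: the shape of the bug, not the real Sender implementation.
void sendProducerData() {
    List<ProducerBatch> drained = accumulator.drain(); // batches leave the queue here
    if (accumulator.hasExpiredBatches()) {
        transactionManager.resetProducerId();
        return; // BUG: 'drained' is discarded without being sent or re-enqueued,
                // so those batches are silently lost and callbacks never fire.
    }
    sendBatches(drained);
}
{code}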

This is already fixed in https://github.com/apache/kafka/pull/3743 , but 
opening this in case we want to fix it in 0.11.0.2 as well.






[jira] [Resolved] (KAFKA-5640) Look into making acks=all the default setting

2017-08-25 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta resolved KAFKA-5640.
-
Resolution: Duplicate

This is a dup of https://issues.apache.org/jira/browse/KAFKA-5796

> Look into making acks=all the default setting
> -
>
> Key: KAFKA-5640
> URL: https://issues.apache.org/jira/browse/KAFKA-5640
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
> Fix For: 1.0.0
>
>
> KAFKA-5494 proposed dropping the requirement for 
> {{max.inflight.requests.per.connection=1}} for the idempotent producer. 
> That is a stepping stone to enabling the idempotent producer by default 
> without sacrificing performance.
> A further step would be making {{acks=all}} the default setting as well. 
> Then, with {{enable.idempotence=true}}, 
> {{max.inflight.requests.per.connection=5}}, {{acks=all}}, 
> {{retries=MAX_INT}}, we would have exactly once semantics with strong 
> durability guarantees. 
> This particular ticket is about investigating the performance degradation 
> caused by {{acks=all}}. How much does throughput degrade? If it is 
> significant, are there low hanging fruit in terms of code or config changes 
> which would allow us to bridge most of the gap?





[jira] [Created] (KAFKA-5796) Understand performance implications of acks=all and potential ways to reduce it

2017-08-25 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5796:
---

 Summary: Understand performance implications of acks=all and 
potential ways to reduce it
 Key: KAFKA-5796
 URL: https://issues.apache.org/jira/browse/KAFKA-5796
 Project: Kafka
  Issue Type: Sub-task
Reporter: Apurva Mehta
Assignee: Apurva Mehta


To get exactly once semantics, we need acks=all. However, we know that there 
is a latency and throughput impact with acks=all when compared with acks=1.

The impact is quantified here:
https://cwiki.apache.org/confluence/display/KAFKA/An+analysis+of+the+impact+of+max.in.flight.requests.per.connection+and+acks+on+Producer+performance

However, we can't explain some of that data, nor do we know the causes of some 
of the degradation. At a minimum, we would like to understand the performance 
of acks=all before making it the default producer setting.








[jira] [Created] (KAFKA-5795) Make the idempotent producer the default producer setting

2017-08-25 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5795:
---

 Summary: Make the idempotent producer the default producer setting
 Key: KAFKA-5795
 URL: https://issues.apache.org/jira/browse/KAFKA-5795
 Project: Kafka
  Issue Type: Improvement
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


We would like to turn on idempotence by default. The KIP is here: 
https://cwiki.apache.org/confluence/display/KAFKA/KIP-185%3A+Make+exactly+once+in+order+delivery+per+partition+the+default+producer+setting





[jira] [Created] (KAFKA-5794) Introduce new idempotence mode to gracefully deal with topics on the older message format

2017-08-25 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5794:
---

 Summary: Introduce new idempotence mode to gracefully deal with 
topics on the older message format
 Key: KAFKA-5794
 URL: https://issues.apache.org/jira/browse/KAFKA-5794
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


In the discussion of KIP-185 (Make exactly once in order delivery per partition 
the default producer setting), it was realized that we don't have graceful 
handling when an idempotence-enabled producer is writing to a broker with a 
message format older than v2 (i.e. the 0.11.0 message format).

In particular, if we enable idempotence, any produce requests to topics with an 
older message format will fail with an {{UnsupportedVersionException}}. Thus, if 
the idempotent producer were made the default, the out-of-the-box producer 
would fail to produce when used with clusters which haven't upgraded the 
message format yet.

This is particularly problematic since the recommended upgrade path is to 
upgrade broker code while keeping the message format at the older version, then 
upgrade all clients, and only finally upgrade the message format on the server. 
With the current behavior, the middle step is actually untenable if we enable 
idempotence as the default.

More details available at: 
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Exactly+Once+-+Dealing+with+older+message+formats+when+idempotence+is+enabled





[jira] [Created] (KAFKA-5793) Tighten up situations where OutOfOrderSequence may be returned

2017-08-25 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5793:
---

 Summary: Tighten up situations where OutOfOrderSequence may be 
returned
 Key: KAFKA-5793
 URL: https://issues.apache.org/jira/browse/KAFKA-5793
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


Details of the problem are provided here: 
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Exactly+Once+-+Solving+the+problem+of+spurious+OutOfOrderSequence+errors

A quick summary follows:

In the discussion of KIP-185 (Make exactly once in order delivery per partition 
the default producer setting), the following points regarding the 
OutOfOrderSequenceException were raised:

1. The OutOfOrderSequenceException indicates that there has been data loss on 
the broker, i.e. a previously acknowledged message no longer exists. For the 
most part, this should only occur in rare situations (simultaneous power 
outages, multiple disk losses, software bugs resulting in data corruption, 
etc.).
2. However, there is another perfectly normal scenario where data is removed: 
in particular, data could be deleted because it is old and crosses the 
retention threshold. Hence, if a producer remains inactive for longer than a 
topic's retention period, we could get an OutOfOrderSequence which is a false 
positive: the data was removed through a valid process, and this isn't an 
error.
3. We would like to eliminate the possibility of getting spurious 
OutOfOrderSequenceExceptions: when you get one, it should always mean data loss 
and should be taken very seriously.






[jira] [Created] (KAFKA-5792) Transient failure in KafkaAdminClientTest.testHandleTimeout

2017-08-25 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5792:
---

 Summary: Transient failure in 
KafkaAdminClientTest.testHandleTimeout
 Key: KAFKA-5792
 URL: https://issues.apache.org/jira/browse/KAFKA-5792
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta


The {{KafkaAdminClientTest.testHandleTimeout}} test occasionally fails with the 
following:

{noformat}
java.util.concurrent.ExecutionException: 
org.apache.kafka.common.errors.TimeoutException: Timed out waiting for a node 
assignment.
at 
org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
at 
org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
at 
org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
at 
org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:213)
at 
org.apache.kafka.clients.admin.KafkaAdminClientTest.testHandleTimeout(KafkaAdminClientTest.java:356)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.TimeoutException: Timed out waiting 
for a node assignment.
{noformat}





[jira] [Created] (KAFKA-5679) Add logging to distinguish between internally and externally initiated shutdown of Kafka

2017-07-30 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5679:
---

 Summary: Add logging to distinguish between internally and 
externally initiated shutdown of Kafka
 Key: KAFKA-5679
 URL: https://issues.apache.org/jira/browse/KAFKA-5679
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


Currently, if there is an internal error that triggers a shutdown of the Kafka 
server, the {{Exit}} class is used, which begins the shutdown procedure. The 
other way a shutdown is triggered is by {{SIGTERM}} or some other signal.

We would like to distinguish between shutdowns due to internal errors and 
external signals. This helps when debugging. In particular, a natural question 
when a broker shuts down unexpectedly is: "did the deployment system send the 
signal, or is there some unlogged fatal error in the broker?"

Today, we rely on callers of {{Exit}} to log the error before making the call. 
However, this won't always have 100% coverage. It would be good to add a log 
message in {{Exit}} to record that an exit method was invoked explicitly. 

We could also add a signal handler to log when {{SIGTERM}}, {{SIGINT}}, etc. 
are received (note that {{SIGKILL}} cannot be caught).

This would make operating Kafka a bit easier.
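
For the signal-handler piece, a minimal sketch using the JDK's 
{{sun.misc.Signal}} API (an internal API, so this is illustrative only):

{code:java}
import sun.misc.Signal;

// Sketch: log catchable termination signals before exiting. Note that SIGKILL
// cannot be caught, so a kill -9 remains visible only as an absence of logging.
public static void registerSignalLogger(org.slf4j.Logger log) {
    for (String name : new String[]{"TERM", "INT", "HUP"}) {
        Signal.handle(new Signal(name), signal -> {
            log.info("Received signal {}. Initiating shutdown.", signal.getName());
            System.exit(128 + signal.getNumber());
        });
    }
}
{code}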





[jira] [Created] (KAFKA-5663) LogDirFailureTest system test fails

2017-07-26 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5663:
---

 Summary: LogDirFailureTest system test fails
 Key: KAFKA-5663
 URL: https://issues.apache.org/jira/browse/KAFKA-5663
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta
Assignee: Dong Lin


The recently added JBOD system test failed last night.

{noformat}
Producer failed to produce messages for 20s.
Traceback (most recent call last):
  File 
"/home/jenkins/workspace/system-test-kafka-trunk/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
 line 123, in run
data = self.run_test()
  File 
"/home/jenkins/workspace/system-test-kafka-trunk/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
 line 176, in run_test
return self.test_context.function(self.test)
  File 
"/home/jenkins/workspace/system-test-kafka-trunk/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
 line 321, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File 
"/home/jenkins/workspace/system-test-kafka-trunk/kafka/tests/kafkatest/tests/core/log_dir_failure_test.py",
 line 166, in test_replication_with_disk_failure
self.start_producer_and_consumer()
  File 
"/home/jenkins/workspace/system-test-kafka-trunk/kafka/tests/kafkatest/tests/produce_consume_validate.py",
 line 75, in start_producer_and_consumer
self.producer_start_timeout_sec)
  File 
"/home/jenkins/workspace/system-test-kafka-trunk/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/utils/util.py",
 line 36, in wait_until
raise TimeoutError(err_msg)
TimeoutError: Producer failed to produce messages for 20s.
{noformat}

Complete logs here:

http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-07-26--001.1501074756--apache--trunk--91c207c/LogDirFailureTest/test_replication_with_disk_failure/bounce_broker=False.security_protocol=PLAINTEXT.broker_type=follower/48.tgz





[jira] [Created] (KAFKA-5662) We should be able to specify min.insync.replicas for the __consumer_offsets topic

2017-07-26 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5662:
---

 Summary: We should be able to specify min.insync.replicas for the 
__consumer_offsets topic
 Key: KAFKA-5662
 URL: https://issues.apache.org/jira/browse/KAFKA-5662
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta


The transaction log has a {{transaction.state.log.min.isr}} setting to control 
the min.isr for the transaction log (by default, the min.isr is 2 and the 
replication.factor is 3).

Unfortunately, we don't have a similar setting for the offsets topic. We should 
add an analogous {{offsets.topic.min.isr}} setting and default it to 2 so 
that we have the same durability on the offsets topic.
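
A sketch of the intended broker configuration ({{offsets.topic.min.isr}} is 
the setting proposed here, not an existing config):

{noformat}
# Existing durability settings for the transaction log:
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2

# Proposed analogous settings for the __consumer_offsets topic:
offsets.topic.replication.factor=3
offsets.topic.min.isr=2
{noformat}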





[jira] [Created] (KAFKA-5661) Develop an understanding of how to tune transactions for optimal performance

2017-07-26 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5661:
---

 Summary: Develop an understanding of how to tune transactions for 
optimal performance
 Key: KAFKA-5661
 URL: https://issues.apache.org/jira/browse/KAFKA-5661
 Project: Kafka
  Issue Type: Sub-task
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


Currently, we don't have an idea of the throughput curve for transactions 
across a range of workloads.

Thus, we would like to understand how to tune transactions so that they are 
viable across a broad range of workloads. For instance, what knobs can you 
tweak to still get acceptable transactional performance when using small 
messages? We don't understand the performance curve across variables like 
message size, batch size, transaction duration, linger.ms, etc., and it would 
be good to get an understanding of this area and publish recommended 
configurations for different workloads.





[jira] [Created] (KAFKA-5640) Look into making acks=all the default setting.

2017-07-25 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5640:
---

 Summary: Look into making acks=all the default setting. 
 Key: KAFKA-5640
 URL: https://issues.apache.org/jira/browse/KAFKA-5640
 Project: Kafka
  Issue Type: Sub-task
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 1.0.0


KAFKA-5494 proposed dropping the requirement for 
{{max.inflight.requests.per.connection=1}} for the idempotent producer. 

That is a stepping stone to enabling the idempotent producer by default without 
sacrificing performance.

A further step would be making {{acks=all}} the default setting as well. Then, 
with {{enable.idempotence=true}}, {{max.inflight.requests.per.connection=5}}, 
{{acks=all}}, {{retries=MAX_INT}}, we would have exactly once semantics with 
strong durability guarantees. 

This particular ticket is about investigating the performance degradation 
caused by {{acks=all}}. How much does throughput degrade? If it is significant, 
are there low hanging fruit in terms of code or config changes which would 
allow us to bridge most of the gap?






[jira] [Created] (KAFKA-5621) The producer should retry expired batches when retries are enabled

2017-07-20 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5621:
---

 Summary: The producer should retry expired batches when retries 
are enabled
 Key: KAFKA-5621
 URL: https://issues.apache.org/jira/browse/KAFKA-5621
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta


Today, when a batch is expired in the accumulator, a {{TimeoutException}} is 
raised to the user.

It might be better for the producer to retry the expired batch, up to the 
configured number of retries. This is more intuitive from the user's point of 
view.

Further, the proposed behavior makes it easier for applications like mirror 
maker to provide ordering guarantees even when batches expire. Today, they 
would resend the expired batch and it would get added to the back of the queue, 
causing the output ordering to be different from the input ordering.





[jira] [Created] (KAFKA-5610) KafkaApis.handleWriteTxnMarkerRequest can return UNSUPPORTED_FOR_MESSAGE_FORMAT error on partition emigration

2017-07-19 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5610:
---

 Summary: KafkaApis.handleWriteTxnMarkerRequest can return 
UNSUPPORTED_FOR_MESSAGE_FORMAT error on partition emigration
 Key: KAFKA-5610
 URL: https://issues.apache.org/jira/browse/KAFKA-5610
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
Priority: Critical
 Fix For: 0.11.0.1


This bug was revealed by the following system test failure 
http://confluent-systest.s3-website-us-west-2.amazonaws.com/confluent-kafka-system-test-results/?prefix=2017-07-18--001.1500383975--apache--trunk--28c83d9/

What happened was that a commit marker to the offsets topic (sent as part of 
the {{producer.sendOffsetsToTransaction}} method) was lost, causing data to be 
reprocessed, and hence causing the test to fail. 

The bug is that the wrong error code is returned from the handleWriteTxnMarker 
request when there is partition emigration. In particular, we have: 

{code:scala}
for (marker <- markers.asScala) {
  val producerId = marker.producerId
  val (goodPartitions, partitionsWithIncorrectMessageFormat) =
    marker.partitions.asScala.partition { partition =>
      replicaManager.getMagic(partition) match {
        case Some(magic) if magic >= RecordBatch.MAGIC_VALUE_V2 => true
        case _ => false
      }
    }

  if (partitionsWithIncorrectMessageFormat.nonEmpty) {
    val currentErrors = new ConcurrentHashMap[TopicPartition, Errors]()
    partitionsWithIncorrectMessageFormat.foreach { partition =>
      currentErrors.put(partition, Errors.UNSUPPORTED_FOR_MESSAGE_FORMAT)
    }
    updateErrors(producerId, currentErrors)
  }
{code}

But the {{replicaManager.getMagic()}} call will return {{None}} when the 
partition emigrates, causing the {{handleWriteTxnMarkersRequest}} to return an 
{{UNSUPPORTED_FOR_MESSAGE_FORMAT}} error. 

From the log, we see that the partition did emigrate a few milliseconds before 
the {{WriteTxnMarkerRequest}} was sent.

On the old leader, worker10:
{noformat}
./worker10/debug/server.log:32245:[2017-07-18 05:43:20,950] INFO 
[GroupCoordinator 2]: Unloading group metadata for 
transactions-test-consumer-group with generation 0 
(kafka.coordinator.group.GroupCoordinator)
{noformat}

On the client: 
{noformat}
[2017-07-18 05:43:20,959] INFO [Transaction Marker Request Completion Handler 
1]: Sending my-first-transactional-id's transaction marker from partition 
__consumer_offsets-47 has failed with  UNSUPPORTED_FOR_MESSAGE_FORMAT. This 
partition will be removed from the set of partitions waiting for completion 
(kafka.coordinator.transaction.TransactionMarkerRequestCompletionHandler)
{noformat}

As you can see, the client received the response 9 ms after the emigration was 
initiated on the server.

Since it is perfectly acceptable for the LeaderISR metadata to be propagated 
asynchronously, we should have more robust handling of emigration in KafkaApis 
so that it returns the right error code when handling a request for a partition 
for which the broker is no longer the leader.





[jira] [Created] (KAFKA-5604) All producer methods should raise `ProducerFencedException` after the first time.

2017-07-17 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5604:
---

 Summary: All producer methods should raise 
`ProducerFencedException` after the first time.
 Key: KAFKA-5604
 URL: https://issues.apache.org/jira/browse/KAFKA-5604
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta
Assignee: Apurva Mehta


Currently, when a `ProducerFencedException` is raised from a transactional 
producer, the expectation is that the application should call `close` 
immediately. However, if the application calls other producer methods, they 
would get a `KafkaException`. This is a bit confusing, and results in tickets 
like https://issues.apache.org/jira/browse/KAFKA-5603.

We should update the producer so that calls to any method other than `close` 
raise a `ProducerFencedException` after the first time it is raised.
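
For reference, a minimal sketch of the application-side handling this change 
is meant to support (standard producer API; `producer` and `record` are 
assumed to be in scope):

{code:java}
import org.apache.kafka.common.errors.ProducerFencedException;

// Sketch: once fenced, every producer method should keep signalling
// ProducerFencedException, making this catch block the single recovery path.
try {
    producer.beginTransaction();
    producer.send(record);
    producer.commitTransaction();
} catch (ProducerFencedException e) {
    // A newer producer with the same transactional.id has taken over;
    // the only valid action is to close this instance.
    producer.close();
}
{code}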





[jira] [Created] (KAFKA-5543) We don't remove the LastStableOffsetLag metric when a partition is moved away from a broker

2017-06-29 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5543:
---

 Summary: We don't remove the LastStableOffsetLag metric when a 
partition is moved away from a broker
 Key: KAFKA-5543
 URL: https://issues.apache.org/jira/browse/KAFKA-5543
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 0.11.0.1


Reported by [~junrao], we have a small leak where the `LastStableOffsetLag` 
metric is not removed along with the other metrics in the 
`Partition.removeMetrics` method. This could create a leak when partitions are 
reassigned or a topic is deleted.





[jira] [Created] (KAFKA-5494) Idempotent producer should not require max.in.flight.requests.per.connection=1 and acks=all

2017-06-21 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5494:
---

 Summary: Idempotent producer should not require 
max.in.flight.requests.per.connection=1 and acks=all
 Key: KAFKA-5494
 URL: https://issues.apache.org/jira/browse/KAFKA-5494
 Project: Kafka
  Issue Type: Improvement
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta


Currently, the idempotent producer (and hence transactional producer) requires 
max.in.flight.requests.per.connection=1.

This was due to simplifying the implementation on the client and server. With 
some additional work, we can satisfy the idempotence guarantees even with any 
number of in-flight requests. The changes on the client can be summarized as 
follows:
 
# We increment sequence numbers when batches are drained.
# If, for some reason, a batch fails with a retriable error, we know that all 
future batches would fail with an out of order sequence exception. 
# As such, the client should treat some OutOfOrderSequence errors as retriable. 
In particular, we should maintain the 'last acked sequence'. If the batch 
succeeding the last acked sequence gets an OutOfOrderSequence, that is a fatal 
error; if a future batch fails with OutOfOrderSequence, it should be 
re-enqueued (see the sketch below).
# With the changes above, the producer queues should become priority queues 
ordered by the sequence numbers. 
# The partition is not ready unless the front of the queue has the next 
expected sequence.
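
A rough sketch of the retriable-vs-fatal decision from point 3 (names are 
illustrative; the real client would track this per partition):

{code:java}
// Sketch: an OutOfOrderSequence on the batch immediately following the last
// acked sequence means acknowledged data is missing, which is fatal. For any
// later batch it is just fallout from an earlier failure, so re-enqueue it.
static boolean isFatalOutOfOrderSequence(int batchBaseSequence, int lastAckedSequence) {
    return batchBaseSequence == lastAckedSequence + 1;
}
{code}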

With the changes above, we would get the benefits of multiple inflights in 
normal cases. When there are failures, we automatically constrain to a single 
inflight until we get back in sequence. 

With multiple inflights, we now have the possibility of getting duplicates for 
batches other than the last appended batch. In order to return the record 
metadata (including offset) of the duplicates inside the log, we would require 
a log scan at the tail to get the metadata at the tail. This can be optimized 
by caching the metadata for the last 'n' batches. For instance, if the default 
max.inflight is 5, we could cache the record metadata of the last 5 batches, 
and fall back to a scan if the duplicate is not within those 5. 

----

The reason to have acks=all is to protect against OutOfOrderSequence exceptions 
in the case where the leader fails before replication happens. In that case, 
the next batch sent by the producer would get an OutOfOrderSequence because the 
new leader would not have the last message. 

This may be OK: for applications which really care about avoiding duplicates, 
they have to handle fatal errors of this sort anyway. In particular, the 
recommendation is to close the producer in the callback on a fatal error and 
then check the tail of the log for the last committed message, and then start 
sending from there. 

By making acks=all, this application logic would just be exercised more 
frequently.





[jira] [Created] (KAFKA-5491) The ProducerPerformance tool should support transactions

2017-06-21 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5491:
---

 Summary: The ProducerPerformance tool should support transactions
 Key: KAFKA-5491
 URL: https://issues.apache.org/jira/browse/KAFKA-5491
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta


We should allow users of the ProducerPerformance tool to run transactional 
sends.
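
For context, a transactional send in the Java producer looks like the 
following (standard API; a sketch of the loop the tool would need to drive, 
with `producer`, `topic`, and `recordsPerTransaction` assumed in scope):

{code:java}
import org.apache.kafka.clients.producer.ProducerRecord;

// Sketch: the transactional send pattern the tool would wrap. Assumes the
// producer was configured with a transactional.id.
producer.initTransactions();
producer.beginTransaction();
for (int i = 0; i < recordsPerTransaction; i++) {
    producer.send(new ProducerRecord<>(topic, "payload-" + i));
}
producer.commitTransaction();
{code}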





[jira] [Created] (KAFKA-5482) A CONCURRENT_TRANSACTIONS error for the first AddPartitionsToTxn request slows down transactions significantly

2017-06-20 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5482:
---

 Summary: A CONCURRENT_TRANSACTIONS error for the first 
AddPartitionsToTxn request slows down transactions significantly
 Key: KAFKA-5482
 URL: https://issues.apache.org/jira/browse/KAFKA-5482
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 0.11.0.1


Here is the issue.

# When we do a commit transaction, the producer sends an `EndTxn` request to 
the coordinator. The coordinator writes the `PrepareCommit` message to the 
transaction log and then returns the response to the client. It writes the 
transaction markers and the final 'CompleteCommit' message asynchronously. 
# In the meantime, if the client starts another transaction, it will send an 
`AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
been written yet, then the coordinator will return a retriable 
`CONCURRENT_TRANSACTIONS` error to the client.
# The current behavior in the producer is to sleep for `retryBackoffMs` before 
retrying the request. The current default for this is 100ms. So the producer 
will sleep for 100ms before sending the `AddPartitions` again. This puts a 
floor on the latency for back to back transactions.

This has been worked around in https://issues.apache.org/jira/browse/KAFKA-5477 
by reducing the retryBackoff for the first AddPartitions request. But we need a 
stronger solution, like having the commit block until the transaction is 
complete, or delaying the AddPartitions until batches for the transaction are 
actually ready to be sent.





[jira] [Created] (KAFKA-5477) TransactionalProducer sleeps unnecessarily long during back to back transactions

2017-06-19 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5477:
---

 Summary: TransactionalProducer sleeps unnecessarily long during 
back to back transactions
 Key: KAFKA-5477
 URL: https://issues.apache.org/jira/browse/KAFKA-5477
 Project: Kafka
  Issue Type: Bug
Affects Versions: 0.11.0.0
Reporter: Apurva Mehta
Assignee: Apurva Mehta


I am running some perf tests for EOS and there is a severe perf impact with our 
default configs. 

Here is the issue.

# When we do a commit transaction, the producer sends an `EndTxn` request to 
the coordinator. The coordinator writes the `PrepareCommit` message to the 
transaction log and then returns the response to the client. It writes the 
transaction markers and the final 'CompleteCommit' message asynchronously. 
# In the meantime, if the client starts another transaction, it will send an 
`AddPartitions` request on the next `Sender.run` loop. If the markers haven't 
been written yet, then the coordinator will return a retriable 
`CONCURRENT_TRANSACTIONS` error to the client.
# The current behavior in the producer is to sleep for `retryBackoffMs` before 
retrying the request. The current default for this is 100ms. So the producer 
will sleep for 100ms before sending the `AddPartitions` again. This puts a 
floor on the latency for transactions.

Ideally, we don't want to sleep the full 100ms in this particular case, 
because the retry is 'expected'.

The options are: 
# do nothing, let streams override the retry.backoff.ms in their producer to 10 
when EOS is enabled (since they have a HOTFIX patch out anyway).
# Introduce a special 'transactionRetryBackoffMs' non-configurable variable and 
hard code that to a low value which applies to all transactional requests.
# do nothing and fix it properly in 0.11.0.1 





[jira] [Created] (KAFKA-5457) RecordAccumulator.hasRoomFor doesn't take into account the headers while computing available space

2017-06-15 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5457:
---

 Summary: RecordAccumulator.hasRoomFor doesn't take into account 
the headers while computing available space
 Key: KAFKA-5457
 URL: https://issues.apache.org/jira/browse/KAFKA-5457
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta
Assignee: Apurva Mehta
Priority: Blocker
 Fix For: 0.11.0.0


This could result in a BufferFull or similar exception when headers are 
actually being used, since the available-space estimate undercounts the record 
size.
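
A sketch of the corrected size estimate (the helper shape is illustrative, not 
the actual client internals):

{code:java}
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.header.Header;

// Sketch: include header keys and values in the per-record size estimate
// used by the "does this batch have room" check.
static int estimateRecordSize(byte[] key, byte[] value, Header[] headers) {
    int size = (key == null ? 0 : key.length) + (value == null ? 0 : value.length);
    if (headers != null) {
        for (Header h : headers) {
            size += h.key().getBytes(StandardCharsets.UTF_8).length
                  + (h.value() == null ? 0 : h.value().length);
        }
    }
    return size + 16; // rough per-record overhead (varints etc.); illustrative
}
{code}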





[jira] [Updated] (KAFKA-5449) Flaky test TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData

2017-06-15 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5449:

Priority: Blocker  (was: Major)

> Flaky test TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData
> --
>
> Key: KAFKA-5449
> URL: https://issues.apache.org/jira/browse/KAFKA-5449
> Project: Kafka
>  Issue Type: Bug
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Jason Gustafson
>Priority: Blocker
>  Labels: exactly-once, test
>
> {noformat}
> java.lang.AssertionError: Consumed 6 records until timeout instead of the 
> expected 8 records
>   at kafka.utils.TestUtils$.fail(TestUtils.scala:333)
>   at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:847)
>   at kafka.utils.TestUtils$.consumeRecords(TestUtils.scala:1357)
>   at 
> kafka.api.TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData(TransactionsTest.scala:143)
> {noformat}





[jira] [Assigned] (KAFKA-5449) Flaky test TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData

2017-06-15 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reassigned KAFKA-5449:
---

Assignee: Jason Gustafson  (was: Apurva Mehta)

> Flaky test TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData
> --
>
> Key: KAFKA-5449
> URL: https://issues.apache.org/jira/browse/KAFKA-5449
> Project: Kafka
>  Issue Type: Bug
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Jason Gustafson
>  Labels: exactly-once, test
>
> {noformat}
> java.lang.AssertionError: Consumed 6 records until timeout instead of the 
> expected 8 records
>   at kafka.utils.TestUtils$.fail(TestUtils.scala:333)
>   at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:847)
>   at kafka.utils.TestUtils$.consumeRecords(TestUtils.scala:1357)
>   at 
> kafka.api.TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData(TransactionsTest.scala:143)
> {noformat}





[jira] [Created] (KAFKA-5455) Update java docs for consumer and producer to be up to date for EOS

2017-06-15 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5455:
---

 Summary: Update java docs for consumer and producer to be up to 
date for EOS
 Key: KAFKA-5455
 URL: https://issues.apache.org/jira/browse/KAFKA-5455
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta
Assignee: Apurva Mehta
 Fix For: 0.11.0.0


The current java docs are copy/pasted from the KIPs and are just placeholders.





[jira] [Updated] (KAFKA-5455) Update java docs for consumer and producer to be up to date for EOS

2017-06-15 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5455:

Labels: exactly-once  (was: )

> Update java docs for consumer and producer to be up to date for EOS
> ---
>
> Key: KAFKA-5455
> URL: https://issues.apache.org/jira/browse/KAFKA-5455
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> The current java docs are copy/pasted from the KIPs and are just 
> placeholders.





[jira] [Commented] (KAFKA-5449) Flaky test TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData

2017-06-14 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049800#comment-16049800
 ] 

Apurva Mehta commented on KAFKA-5449:
-

With the latest patch, it appears that the producer which leaves its 
transaction open gets a producer fenced exception, while the producer which 
calls the commit gets the invalid transition exception: 

{noformat}
[2017-06-14 22:06:15,648] ERROR [TransactionalId transactional-producer] 
Aborting producer batches due to fatal error 
(org.apache.kafka.clients.producer.internals.Sender:384)
org.apache.kafka.common.errors.ProducerFencedException: Producer attempted an 
operation with an old epoch. Either there is a newer producer with the same 
transactionalId, or the producer's transaction has been expired by the broker
...
[2017-06-14 22:06:34,841] ERROR Uncaught error in kafka producer I/O thread:  
(org.apache.kafka.clients.producer.internals.Sender:164)
org.apache.kafka.common.KafkaException: [TransactionalId other] Invalid 
transition attempted from state READY to state ABORTABLE_ERROR
at 
org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:477)
at 
org.apache.kafka.clients.producer.internals.TransactionManager.transitionToAbortableError(TransactionManager.java:289)
at 
org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:604)
at 
org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:272)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:223)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
{noformat}

Very strange. Unfortunately I changed the wrong log4j.properties, so we didn't 
get the trace logging for that failure.

> Flaky test TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData
> --
>
> Key: KAFKA-5449
> URL: https://issues.apache.org/jira/browse/KAFKA-5449
> Project: Kafka
>  Issue Type: Bug
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Apurva Mehta
>  Labels: exactly-once, test
>
> {noformat}
> java.lang.AssertionError: Consumed 6 records until timeout instead of the 
> expected 8 records
>   at kafka.utils.TestUtils$.fail(TestUtils.scala:333)
>   at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:847)
>   at kafka.utils.TestUtils$.consumeRecords(TestUtils.scala:1357)
>   at 
> kafka.api.TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData(TransactionsTest.scala:143)
> {noformat}





[jira] [Assigned] (KAFKA-5449) Flaky test TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData

2017-06-14 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reassigned KAFKA-5449:
---

Assignee: Apurva Mehta

> Flaky test TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData
> --
>
> Key: KAFKA-5449
> URL: https://issues.apache.org/jira/browse/KAFKA-5449
> Project: Kafka
>  Issue Type: Bug
>  Components: core, unit tests
>Reporter: Matthias J. Sax
>Assignee: Apurva Mehta
>  Labels: exactly-once, test
>
> {noformat}
> java.lang.AssertionError: Consumed 6 records until timeout instead of the 
> expected 8 records
>   at kafka.utils.TestUtils$.fail(TestUtils.scala:333)
>   at kafka.utils.TestUtils$.waitUntilTrue(TestUtils.scala:847)
>   at kafka.utils.TestUtils$.consumeRecords(TestUtils.scala:1357)
>   at 
> kafka.api.TransactionsTest.testReadCommittedConsumerShouldNotSeeUndecidedData(TransactionsTest.scala:143)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-5020) Update protocol documentation to mention message format v2

2017-06-13 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reassigned KAFKA-5020:
---

Assignee: Apurva Mehta

> Update protocol documentation to mention message format v2
> --
>
> Key: KAFKA-5020
> URL: https://issues.apache.org/jira/browse/KAFKA-5020
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Ismael Juma
>Assignee: Apurva Mehta
>Priority: Critical
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> Sections 5.3, 5.4 and 5.5 should be updated:
> https://kafka.apache.org/documentation/#messages
> We may want to mention record batches along with message sets here:
> https://kafka.apache.org/protocol#protocol_message_sets
> And we should update the wiki page linked from the protocol documentation:
> https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-Messagesets



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (KAFKA-5317) Update KIP-98 to reflect changes during implementation.

2017-06-13 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta resolved KAFKA-5317.
-
Resolution: Fixed

> Update KIP-98 to reflect changes during implementation.
> ---
>
> Key: KAFKA-5317
> URL: https://issues.apache.org/jira/browse/KAFKA-5317
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> While implementing the EOS design, there are some minor (or major?) tweaks we 
> made as we worked hands-on with the code base. We will compile all these 
> changes in a single pass once the code is ready, and this JIRA is for keeping 
> track of all the changes we made.
> 02/27/2017: collapse the two types of transactional log messages into a 
> single type with key as transactional id, and value as [pid, epoch, 
> transaction_timeout, transaction_state, [topic partition ] ]. Also using a 
> single memory map in cache instead of two on the TC.
> 03/01/2017: for pid expiration, we decided to use min(transactional id 
> expiration timeout, topic retention). For topics enabled for compaction only, 
> we just use the transactional timeout. If the retention setting is larger 
> than the transactional id expiration timeout, then the pid will be 
> "logically" expired (i.e. we will remove it from the cached pid mapping and 
> ignore it when rebuilding the cache from the log)
> 03/20/2017: add a new exception type in `o.a.k.common.errors` for invalid 
> transaction timeout values.
> 03/25/2017: extend WriteTxnMarkerRequest to contain multiple markers for 
> multiple PIDs with a single request.
> 04/20/2017: add transactionStartTime to TransactionMetadata
> 04/26/2017: added a new retriable error: Errors.CONCURRENT_TRANSACTIONS
> 04/01/2017: We also enforce acks=all on the client when idempotence is 
> enabled. Without this, we cannot guarantee idempotence (see the sketch after 
> this list).
> 04/10/2017: We also don't pass the underlying exception to 
> `RetriableOffsetCommitFailedException` anymore: 
> https://issues.apache.org/jira/browse/KAFKA-5052
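
A minimal sketch of the acks constraint from the 04/01/2017 entry (broker 
address is illustrative): with enable.idempotence=true the producer requires 
acks=all, and weaker settings fail client-side configuration validation:

{code:java}
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;

public class IdempotentProducerConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("enable.idempotence", "true");
        props.put("acks", "all"); // anything weaker fails config validation
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        new KafkaProducer<String, String>(props).close();
    }
}
{code}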



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-5317) Update KIP-98 to reflect changes during implementation.

2017-06-13 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16048515#comment-16048515
 ] 

Apurva Mehta commented on KAFKA-5317:
-

I updated both the wiki and the google doc to reflect the changes on this list. 
Closing this out.

> Update KIP-98 to reflect changes during implementation.
> ---
>
> Key: KAFKA-5317
> URL: https://issues.apache.org/jira/browse/KAFKA-5317
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> While implementing the EOS design, there are some minor (or major?) tweaks we 
> made as we worked hands-on with the code base. We will compile all these 
> changes in a single pass once the code is ready, and this JIRA is for keeping 
> track of all the changes we made.
> 02/27/2017: collapse the two types of transactional log messages into a 
> single type with key as transactional id, and value as [pid, epoch, 
> transaction_timeout, transaction_state, [topic partition ] ]. Also using a 
> single memory map in cache instead of two on the TC.
> 03/01/2017: for pid expiration, we decided to use min(transactional id 
> expiration timeout, topic retention). For topics enabled for compaction only, 
> we just use the transactional timeout. If the retention setting is larger 
> than the transactional id expiration timeout, then the pid will be 
> "logically" expired (i.e. we will remove it from the cached pid mapping and 
> ignore it when rebuilding the cache from the log)
> 03/20/2017: add a new exception type in `o.a.k.common.errors` for invalid 
> transaction timeout values.
> 03/25/2017: extend WriteTxnMarkerRequest to contain multiple markers for 
> multiple PIDs with a single request.
> 04/20/2017: add transactionStartTime to TransactionMetadata
> 04/26/2017: added a new retriable error: Errors.CONCURRENT_TRANSACTIONS
> 04/01/2017: We also enforce acks=all on the client when idempotence is 
> enabled. Without this, we cannot guarantee idempotence.
> 04/10/2017: We also don't pass the underlying exception to 
> `RetriableOffsetCommitFailedException` anymore: 
> https://issues.apache.org/jira/browse/KAFKA-5052



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5052) We shouldn't pass the underlying exception to RetriableCommitFailedException when an async offset commit fails.

2017-06-13 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5052:

Description: 
* This is a follow-up to: https://issues.apache.org/jira/browse/KAFKA-3527

We currently wrap retriable exceptions encountered during offset commits in a 
`RetriableOffsetCommitException`. The problem is that today we also pass the 
underlying internal exception on to the user. There isn't a really good reason 
to do this, since the user will not handle each individual exception 
differently: they will just retry anyway.

We should not pass on the underlying internal exception. It makes the API 
simpler, and also allows us to change things underneath with more flexibility.

  was:
This is a follow-up to: https://issues.apache.org/jira/browse/KAFKA-3527

We currently wrap retriable exceptions encountered during offset commits in a 
`RetriableOffsetCommitException`. The problem is that today we also pass the 
underlying internal exception on to the user. There isn't a really good reason 
to do this, since the user will not handle each individual exception 
differently: they will just retry anyway.

We should not pass on the underlying internal exception. It makes the API 
simpler, and also allows us to change things underneath with more flexibility.


> We shouldn't pass the underlying exception to RetriableCommitFailedException 
> when an async offset commit fails.
> ---
>
> Key: KAFKA-5052
> URL: https://issues.apache.org/jira/browse/KAFKA-5052
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
> Fix For: 0.11.0.0
>
>
> * This is a follow-up to: https://issues.apache.org/jira/browse/KAFKA-3527
> We currently wrap retriable exceptions encountered during offset commits in a 
> `RetriableOffsetCommitException`. The problem is that today we also pass the 
> underlying internal exception on to the user. There isn't a really good 
> reason to do this, since the user will not handle each individual exception 
> differently: they will just retry anyway.
> We should not pass on the underlying internal exception. It makes the API 
> simpler, and also allows us to change things underneath with more flexibility.
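
A sketch of the caller-side pattern this simplification supports (configs and 
topic name are illustrative; note the actual client class is 
RetriableCommitFailedException): the callback only needs to know whether the 
failure is retriable, never which internal exception caused it:

{code:java}
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.RetriableCommitFailedException;

public class CommitCallbackSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "example-group");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("example-topic"));
        consumer.poll(1000);
        consumer.commitAsync((offsets, exception) -> {
            if (exception == null) {
                return;                      // commit succeeded
            }
            if (exception instanceof RetriableCommitFailedException) {
                consumer.commitAsync(null);  // retriable: just try again
            } else {
                exception.printStackTrace(); // non-retriable: surface it
            }
        });
        consumer.close();
    }
}
{code}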



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5441) Fix transaction marker grouping by producerId in TC

2017-06-13 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5441:

Labels: exactly-once  (was: )

> Fix transaction marker grouping by producerId in TC
> ---
>
> Key: KAFKA-5441
> URL: https://issues.apache.org/jira/browse/KAFKA-5441
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Jason Gustafson
>  Labels: exactly-once
> Fix For: 0.11.0.1
>
>
> It seems in some cases the WriteTxnMarker request can be sent with multiple 
> entries for the same ProducerId. This is the cause of KAFKA-5438. This is not 
> necessarily a correctness problem, but it seems unintentional and should 
> probably be fixed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (KAFKA-5317) Update KIP-98 to reflect changes during implementation.

2017-06-12 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reassigned KAFKA-5317:
---

Assignee: Apurva Mehta

> Update KIP-98 to reflect changes during implementation.
> ---
>
> Key: KAFKA-5317
> URL: https://issues.apache.org/jira/browse/KAFKA-5317
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: documentation, exactly-once
> Fix For: 0.11.0.0
>
>
> While implementing the EOS design, there are some minor (or major?) tweaks we 
> made as we worked hands-on with the code base. We will compile all these 
> changes in a single pass once the code is ready, and this JIRA is for keeping 
> track of all the changes we made.
> 02/27/2017: collapse the two types of transactional log messages into a 
> single type with key as transactional id, and value as [pid, epoch, 
> transaction_timeout, transaction_state, [topic partition ] ]. Also using a 
> single memory map in cache instead of two on the TC.
> 03/01/2017: for pid expiration, we decided to use min(transactional id 
> expiration timeout, topic retention). For topics enabled for compaction only, 
> we just use the transactional timeout. If the retention setting is larger 
> than the transactional id expiration timeout, then the pid will be 
> "logically" expired (i.e. we will remove it from the cached pid mapping and 
> ignore it when rebuilding the cache from the log)
> 03/20/2017: add a new exception type in `o.a.k.common.errors` for invalid 
> transaction timeout values.
> 03/25/2017: extend WriteTxnMarkerRequest to contain multiple markers for 
> multiple PIDs with a single request.
> 04/20/2017: add transactionStartTime to TransactionMetadata
> 04/26/2017: added a new retriable error: Errors.CONCURRENT_TRANSACTIONS
> 04/01/2017: We also enforce acks=all on the client when idempotence is 
> enabled. Without this, we cannot guarantee idempotence.
> 04/10/2017: We also don't pass the underlying exception to 
> `RetriableOffsetCommitFailedException` anymore: 
> https://issues.apache.org/jira/browse/KAFKA-5052



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5438) UnsupportedOperationException in WriteTxnMarkers handler

2017-06-12 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5438:

Labels: exactly-once  (was: )

> UnsupportedOperationException in WriteTxnMarkers handler
> 
>
> Key: KAFKA-5438
> URL: https://issues.apache.org/jira/browse/KAFKA-5438
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Jason Gustafson
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.1.0
>
>
> {code}
> [2017-06-10 19:16:36,280] ERROR [KafkaApi-2] Error when handling request 
> {replica_id=1,max_wait_time=500,min_bytes=1,max_bytes=10485760,isolation_level=0,topics=[{topic=__transaction_state,partitions=[{partition=45,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=15,f
> etch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=20,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=3,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=21,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=29,fetch_offset=0,lo
> g_start_offset=0,max_bytes=1048576},{partition=39,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=38,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=9,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=33,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=32,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=17,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=23,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=47,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=26,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=5,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=27,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=41,fetch_offset=48,log_start_offset=0,max_bytes=1048576},{partition=35,fetch_offset=0,log_start_offset=0,max_bytes=1048576}]},{topic=__consumer_offsets,partitions=[{partition=8,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=21,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=27,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=9,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=41,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=33,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=23,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=47,fetch_offset=41,log_start_offset=0,max_bytes=1048576},{partition=3,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=15,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=17,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=11,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=14,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=20,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=39,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=45,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=5,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=26,fetch_offset=0,log_start_offset=0,max_bytes=1048576},{partition=29,fetch_offset=0,log_start_offset=0,max_bytes=1048576}]},{topic=output-topic,partitions=[{partition=1,fetch_offset=4522,log_start_offset=0,max_bytes=1048576}]}]}
>  (kafka.server.KafkaApis)
> java.lang.UnsupportedOperationException
> at java.util.AbstractMap.put(AbstractMap.java:203)
> at java.util.AbstractMap.putAll(AbstractMap.java:273)
> at 
> kafka.server.KafkaApis.kafka$server$KafkaApis$$sendResponseCallback$13(KafkaApis.scala:1509)
> at 
> kafka.server.KafkaApis$$anonfun$handleWriteTxnMarkersRequest$2$$anonfun$apply$20.apply(KafkaApis.scala:1571)
> at 
> kafka.server.KafkaApis$$anonfun$handleWriteTxnMarkersRequest$2$$anonfun$apply$20.apply(KafkaApis.scala:1571)
> at kafka.server.DelayedProduce.onComplete(DelayedProduce.scala:131)
> at 
> kafka.server.DelayedOperation.forceComplete(DelayedOperation.scala:66)
> at kafka.server.DelayedProduce.tryComplete(DelayedProduce.scala:113)
> at 
> kafka.server.DelayedProduce.safeTryComplete(DelayedProduce.scala:76)
> at 
> kafka.server.DelayedOperationPurgatory$Watchers.tryCompleteWatched(DelayedOperation.scala:338)
> at 
> kafka.server.DelayedOperationPurgatory.checkAndComplete(DelayedOperation.scala:244)
> at 
> kafka.server.ReplicaManager.tryCompleteDelayedProduce(ReplicaManager.scala:250)
> at 
> kafka.cluster.Partition.tryCompleteDelayedRequests(Partition.scala:418)
> at 
> 
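
The trace above is the classic signature of calling putAll on a read-only 
java.util.AbstractMap view (for example, the Java wrapper around an immutable 
Scala map): AbstractMap.putAll delegates to put, which throws unless a 
subclass overrides it. A self-contained illustration (not the broker code 
itself; the map key is illustrative):

{code:java}
import java.util.AbstractMap;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class ReadOnlyMapPitfall {
    // A map that, like some read-only wrappers, does not override put().
    static class ReadOnlyMap extends AbstractMap<String, Integer> {
        @Override
        public Set<Entry<String, Integer>> entrySet() {
            return Collections.emptySet();
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> target = new ReadOnlyMap();
        Map<String, Integer> updates = new HashMap<>();
        updates.put("__transaction_state-45", 0);
        // AbstractMap.putAll -> AbstractMap.put -> UnsupportedOperationException,
        // matching the top two frames of the trace above.
        target.putAll(updates);
    }
}
{code}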

[jira] [Assigned] (KAFKA-5436) NullPointerException when loading producer snapshot

2017-06-12 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reassigned KAFKA-5436:
---

Assignee: Jason Gustafson  (was: Apurva Mehta)

> NullPointerException when loading producer snapshot
> ---
>
> Key: KAFKA-5436
> URL: https://issues.apache.org/jira/browse/KAFKA-5436
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apurva Mehta
>Assignee: Jason Gustafson
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> This was seen in the server.log of a recent system test failure:
> {noformat}
> java.lang.NullPointerException
> at kafka.log.Log.updateFirstUnstableOffset(Log.scala:689)
> at kafka.log.Log.loadProducerState(Log.scala:466)
> at kafka.log.Log.(Log.scala:185)
> at 
> kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$5$$anonfun$apply$12$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:172)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The actual snapshot file had these contents:
> {noformat}
> amehta-macbook-pro:kafka-data-logs apurva$ 
> ~/workspace/confluent/kafka/bin/kafka-run-class.sh 
> kafka.tools.DumpLogSegments -files 
> output-topic-1/5776.snapshot
> Dumping output-topic-1/5776.snapshot
> producerId: 1001 producerEpoch: 0 lastSequence: 3249 lastOffset: 6026 
> offsetDelta: 0 lastTimestamp: 1497127651606 coordinatorEpoch: 2 
> currentTxnFirstOffset: None
> producerId: 0 producerEpoch: 0 lastSequence: 2749 lastOffset: 6025 
> offsetDelta: 249 lastTimestamp: 1497127647269 coordinatorEpoch: 1 
> currentTxnFirstOffset: Some(5776)
> {noformat}
> The log file ended at offset 5775:
> {noformat}
> baseOffset: 4770 lastOffset: 4770 baseSequence: -1 lastSequence: -1 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 58134 CreateTime: 1497127646732 isvalid: true size: 78 magic: 2 
> compresscodec: NONE crc: 3476040058
> baseOffset: 4771 lastOffset: 5020 baseSequence: 2750 lastSequence: 2999 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 58212 CreateTime: 1497127646719 isvalid: true size: 2997 
> magic: 2 compresscodec: NONE crc: 1010131725
> baseOffset: 5021 lastOffset: 5021 baseSequence: -1 lastSequence: -1 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 61209 CreateTime: 1497127646923 isvalid: true size: 78 magic: 
> 2 compresscodec: NONE crc: 4001115747
> baseOffset: 5022 lastOffset: 5271 baseSequence: 2000 lastSequence: 2249 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 61287 CreateTime: 1497127646760 isvalid: true size: 2997 magic: 2 
> compresscodec: NONE crc: 2903697477
> baseOffset: 5272 lastOffset: 5272 baseSequence: -1 lastSequence: -1 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 64284 CreateTime: 1497127646982 isvalid: true size: 78 magic: 2 
> compresscodec: NONE crc: 2599716061
> baseOffset: 5273 lastOffset: 5522 baseSequence: 3000 lastSequence: 3249 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 64362 CreateTime: 1497127646948 isvalid: true size: 3080 
> magic: 2 compresscodec: NONE crc: 3176858298
> baseOffset: 5523 lastOffset: 5523 baseSequence: -1 lastSequence: -1 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 67442 CreateTime: 1497127647152 isvalid: true size: 78 magic: 
> 2 compresscodec: NONE crc: 3464913172
> baseOffset: 5524 lastOffset: 5773 baseSequence: 2250 lastSequence: 2499 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 67520 CreateTime: 1497127647001 isvalid: true size: 3081 magic: 2 
> compresscodec: NONE crc: 712707513
> baseOffset: 5774 lastOffset: 5774 baseSequence: -1 lastSequence: -1 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 70601 CreateTime: 1497127647248 isvalid: true size: 78 magic: 2 
> compresscodec: NONE crc: 1043637525
> baseOffset: 5775 lastOffset: 5775 baseSequence: -1 lastSequence: -1 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 70679 CreateTime: 1497127647352 isvalid: true size: 78 magic: 
> 2 compresscodec: 

[jira] [Updated] (KAFKA-5436) NullPointerException when loading producer snapshot

2017-06-12 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5436:

Issue Type: Sub-task  (was: Bug)
Parent: KAFKA-4815

> NullPointerException when loading producer snapshot
> ---
>
> Key: KAFKA-5436
> URL: https://issues.apache.org/jira/browse/KAFKA-5436
> Project: Kafka
>  Issue Type: Sub-task
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> This was seen in the server.log of a recent system test failure:
> {noformat}
> java.lang.NullPointerException
> at kafka.log.Log.updateFirstUnstableOffset(Log.scala:689)
> at kafka.log.Log.loadProducerState(Log.scala:466)
> at kafka.log.Log.(Log.scala:185)
> at 
> kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$5$$anonfun$apply$12$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:172)
> at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The actual snapshot file had these contents:
> {noformat}
> amehta-macbook-pro:kafka-data-logs apurva$ 
> ~/workspace/confluent/kafka/bin/kafka-run-class.sh 
> kafka.tools.DumpLogSegments -files 
> output-topic-1/5776.snapshot
> Dumping output-topic-1/5776.snapshot
> producerId: 1001 producerEpoch: 0 lastSequence: 3249 lastOffset: 6026 
> offsetDelta: 0 lastTimestamp: 1497127651606 coordinatorEpoch: 2 
> currentTxnFirstOffset: None
> producerId: 0 producerEpoch: 0 lastSequence: 2749 lastOffset: 6025 
> offsetDelta: 249 lastTimestamp: 1497127647269 coordinatorEpoch: 1 
> currentTxnFirstOffset: Some(5776)
> {noformat}
> The log file ended at offset 5775:
> {noformat}
> baseOffset: 4770 lastOffset: 4770 baseSequence: -1 lastSequence: -1 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 58134 CreateTime: 1497127646732 isvalid: true size: 78 magic: 2 
> compresscodec: NONE crc: 3476040058
> baseOffset: 4771 lastOffset: 5020 baseSequence: 2750 lastSequence: 2999 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 58212 CreateTime: 1497127646719 isvalid: true size: 2997 
> magic: 2 compresscodec: NONE crc: 1010131725
> baseOffset: 5021 lastOffset: 5021 baseSequence: -1 lastSequence: -1 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 61209 CreateTime: 1497127646923 isvalid: true size: 78 magic: 
> 2 compresscodec: NONE crc: 4001115747
> baseOffset: 5022 lastOffset: 5271 baseSequence: 2000 lastSequence: 2249 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 61287 CreateTime: 1497127646760 isvalid: true size: 2997 magic: 2 
> compresscodec: NONE crc: 2903697477
> baseOffset: 5272 lastOffset: 5272 baseSequence: -1 lastSequence: -1 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 64284 CreateTime: 1497127646982 isvalid: true size: 78 magic: 2 
> compresscodec: NONE crc: 2599716061
> baseOffset: 5273 lastOffset: 5522 baseSequence: 3000 lastSequence: 3249 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 64362 CreateTime: 1497127646948 isvalid: true size: 3080 
> magic: 2 compresscodec: NONE crc: 3176858298
> baseOffset: 5523 lastOffset: 5523 baseSequence: -1 lastSequence: -1 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 67442 CreateTime: 1497127647152 isvalid: true size: 78 magic: 
> 2 compresscodec: NONE crc: 3464913172
> baseOffset: 5524 lastOffset: 5773 baseSequence: 2250 lastSequence: 2499 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 67520 CreateTime: 1497127647001 isvalid: true size: 3081 magic: 2 
> compresscodec: NONE crc: 712707513
> baseOffset: 5774 lastOffset: 5774 baseSequence: -1 lastSequence: -1 
> producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
> position: 70601 CreateTime: 1497127647248 isvalid: true size: 78 magic: 2 
> compresscodec: NONE crc: 1043637525
> baseOffset: 5775 lastOffset: 5775 baseSequence: -1 lastSequence: -1 
> producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: 
> true position: 70679 CreateTime: 1497127647352 isvalid: true size: 78 magic: 
> 2 compresscodec: 

[jira] [Created] (KAFKA-5436) NullPointerException when loading producer snapshot

2017-06-12 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5436:
---

 Summary: NullPointerException when loading producer snapshot
 Key: KAFKA-5436
 URL: https://issues.apache.org/jira/browse/KAFKA-5436
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta
Assignee: Apurva Mehta
Priority: Blocker
 Fix For: 0.11.0.0


This was seen in the server.log of a recent system test failure:

{noformat}
java.lang.NullPointerException
at kafka.log.Log.updateFirstUnstableOffset(Log.scala:689)
at kafka.log.Log.loadProducerState(Log.scala:466)
at kafka.log.Log.(Log.scala:185)
at 
kafka.log.LogManager$$anonfun$loadLogs$2$$anonfun$5$$anonfun$apply$12$$anonfun$apply$1.apply$mcV$sp(LogManager.scala:172)
at kafka.utils.CoreUtils$$anon$1.run(CoreUtils.scala:57)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}

The actual snapshot file had these contents:

{noformat}
amehta-macbook-pro:kafka-data-logs apurva$ 
~/workspace/confluent/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments 
-files output-topic-1/5776.snapshot
Dumping output-topic-1/5776.snapshot
producerId: 1001 producerEpoch: 0 lastSequence: 3249 lastOffset: 6026 
offsetDelta: 0 lastTimestamp: 1497127651606 coordinatorEpoch: 2 
currentTxnFirstOffset: None
producerId: 0 producerEpoch: 0 lastSequence: 2749 lastOffset: 6025 offsetDelta: 
249 lastTimestamp: 1497127647269 coordinatorEpoch: 1 currentTxnFirstOffset: 
Some(5776)
{noformat}

The log file ended at offset 5775:
{noformat}
baseOffset: 4770 lastOffset: 4770 baseSequence: -1 lastSequence: -1 producerId: 
0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 
58134 CreateTime: 1497127646732 isvalid: true size: 78 magic: 2 compresscodec: 
NONE crc: 3476040058
baseOffset: 4771 lastOffset: 5020 baseSequence: 2750 lastSequence: 2999 
producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
position: 58212 CreateTime: 1497127646719 isvalid: true size: 2997 magic: 2 
compresscodec: NONE crc: 1010131725
baseOffset: 5021 lastOffset: 5021 baseSequence: -1 lastSequence: -1 producerId: 
1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 
61209 CreateTime: 1497127646923 isvalid: true size: 78 magic: 2 compresscodec: 
NONE crc: 4001115747
baseOffset: 5022 lastOffset: 5271 baseSequence: 2000 lastSequence: 2249 
producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
position: 61287 CreateTime: 1497127646760 isvalid: true size: 2997 magic: 2 
compresscodec: NONE crc: 2903697477
baseOffset: 5272 lastOffset: 5272 baseSequence: -1 lastSequence: -1 producerId: 
0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 
64284 CreateTime: 1497127646982 isvalid: true size: 78 magic: 2 compresscodec: 
NONE crc: 2599716061
baseOffset: 5273 lastOffset: 5522 baseSequence: 3000 lastSequence: 3249 
producerId: 1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
position: 64362 CreateTime: 1497127646948 isvalid: true size: 3080 magic: 2 
compresscodec: NONE crc: 3176858298
baseOffset: 5523 lastOffset: 5523 baseSequence: -1 lastSequence: -1 producerId: 
1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 
67442 CreateTime: 1497127647152 isvalid: true size: 78 magic: 2 compresscodec: 
NONE crc: 3464913172
baseOffset: 5524 lastOffset: 5773 baseSequence: 2250 lastSequence: 2499 
producerId: 0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true 
position: 67520 CreateTime: 1497127647001 isvalid: true size: 3081 magic: 2 
compresscodec: NONE crc: 712707513
baseOffset: 5774 lastOffset: 5774 baseSequence: -1 lastSequence: -1 producerId: 
0 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 
70601 CreateTime: 1497127647248 isvalid: true size: 78 magic: 2 compresscodec: 
NONE crc: 1043637525
baseOffset: 5775 lastOffset: 5775 baseSequence: -1 lastSequence: -1 producerId: 
1001 producerEpoch: 0 partitionLeaderEpoch: 1 isTransactional: true position: 
70679 CreateTime: 1497127647352 isvalid: true size: 78 magic: 2 compresscodec: 
NONE crc: 968257373
{noformat}

The suspicious bit is the `currentTxnFirstOffset: Some(5776)`. This indicates 
that the snapshot didn't get truncated correctly, since it retains an offset 
which doesn't exist in the truncated segment, causing an NPE.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5428) Transactional producer aborts batches incorrectly in abortable error state

2017-06-12 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5428:

Labels: exactly-once  (was: )

> Transactional producer aborts batches incorrectly in abortable error state
> --
>
> Key: KAFKA-5428
> URL: https://issues.apache.org/jira/browse/KAFKA-5428
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> We currently abort batches blindly if we are in any error state. We should 
> only do this if we are in a fatal error state. Otherwise, we risk 
> OutOfOrderSequence errors if a failed produce request had actually been 
> written successfully to the topic.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-5427) Transactional producer cannot find coordinator when trying to abort transaction after error

2017-06-12 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5427:

Labels: exactly-once  (was: )

> Transactional producer cannot find coordinator when trying to abort 
> transaction after error
> ---
>
> Key: KAFKA-5427
> URL: https://issues.apache.org/jira/browse/KAFKA-5427
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> It can happen that we receive an abortable error while we are already 
> aborting a transaction. In this case, we have an EndTxnRequest queued for 
> sending when we transition to ABORTABLE_ERROR. It could be that we need to 
> find the coordinator before sending this EndTxnRequest. The problem is that 
> we will fail even the FindCoordinatorRequest because we are in an error 
> state.  This causes the following endless loop:
> {code}
> [2017-06-10 19:29:33,436] DEBUG [TransactionalId my-first-transactional-id] 
> Enqueuing transactional request (type=FindCoordinatorRequest, 
> coordinatorKey=my-fi
> rst-transactional-id, coordinatorType=TRANSACTION) 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,436] DEBUG [TransactionalId my-first-transactional-id] 
> Enqueuing transactional request (type=EndTxnRequest, 
> transactionalId=my-first-tran
> sactional-id, producerId=1000, producerEpoch=0, result=ABORT) 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,536] TRACE [TransactionalId my-first-transactional-id] 
> Not sending transactional request (type=FindCoordinatorRequest, 
> coordinatorKey=my-
> first-transactional-id, coordinatorType=TRANSACTION) because we are in an 
> error state (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,637] TRACE [TransactionalId my-first-transactional-id] 
> Request (type=EndTxnRequest, transactionalId=my-first-transactional-id, 
> producerId
> =1000, producerEpoch=0, result=ABORT) dequeued for sending 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,637] DEBUG [TransactionalId my-first-transactional-id] 
> Enqueuing transactional request (type=FindCoordinatorRequest, 
> coordinatorKey=my-fi
> rst-transactional-id, coordinatorType=TRANSACTION) 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,637] DEBUG [TransactionalId my-first-transactional-id] 
> Enqueuing transactional request (type=EndTxnRequest, 
> transactionalId=my-first-tran
> sactional-id, producerId=1000, producerEpoch=0, result=ABORT) 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,737] TRACE [TransactionalId my-first-transactional-id] 
> Not sending transactional request (type=FindCoordinatorRequest, 
> coordinatorKey=my-
> first-transactional-id, coordinatorType=TRANSACTION) because we are in an 
> error state (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,837] TRACE [TransactionalId my-first-transactional-id] 
> Request (type=EndTxnRequest, transactionalId=my-first-transactional-id, 
> producerId
> =1000, producerEpoch=0, result=ABORT) dequeued for sending 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,838] DEBUG [TransactionalId my-first-transactional-id] 
> Enqueuing transactional request (type=FindCoordinatorRequest, 
> coordinatorKey=my-fi
> rst-transactional-id, coordinatorType=TRANSACTION) 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,838] DEBUG [TransactionalId my-first-transactional-id] 
> Enqueuing transactional request (type=EndTxnRequest, 
> transactionalId=my-first-tran
> sactional-id, producerId=1000, producerEpoch=0, result=ABORT) 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> [2017-06-10 19:29:33,938] TRACE [TransactionalId my-first-transactional-id] 
> Not sending transactional request (type=FindCoordinatorRequest, 
> coordinatorKey=my-
> first-transactional-id, coordinatorType=TRANSACTION) because we are in an 
> error state (org.apache.kafka.clients.producer.internals.TransactionManager)
> {code}
> A couple suggested improvements:
> 1. We should allow FindCoordinator requests regardless of the transaction 
> state (see the sketch after this list).
> 2. It is a bit confusing that we allow EndTxnRequest to be sent in both the 
> ABORTABLE_ERROR and the ABORTING_TRANSACTION states. Perhaps we should only 
> allow EndTxnRequest to be sent in ABORTING_TRANSACTION. If we hit an 
> abortable error and we are already aborting, then we 
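
A sketch of suggested improvement 1 (enum and method names are illustrative, 
not the actual TransactionManager code): always letting FindCoordinator 
through breaks the loop shown in the log, since locating the coordinator is a 
prerequisite for the EndTxnRequest that resolves the error:

{code:java}
public class RequestGateSketch {
    enum State { IN_TRANSACTION, ABORTING_TRANSACTION, ABORTABLE_ERROR, FATAL_ERROR }
    enum RequestType { FIND_COORDINATOR, END_TXN, ADD_PARTITIONS }

    static boolean canSend(RequestType type, State state) {
        if (type == RequestType.FIND_COORDINATOR) {
            return true; // improvement 1: never blocked by an error state
        }
        if (type == RequestType.END_TXN) {
            // Improvement 2 would narrow this to ABORTING_TRANSACTION only.
            return state == State.ABORTING_TRANSACTION
                    || state == State.ABORTABLE_ERROR;
        }
        return state != State.ABORTABLE_ERROR && state != State.FATAL_ERROR;
    }
}
{code}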

[jira] [Updated] (KAFKA-5342) Distinguish abortable failures in transactional producer

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5342:

Fix Version/s: 0.11.0.1  (was: 0.11.0.0)

> Distinguish abortable failures in transactional producer
> 
>
> Key: KAFKA-5342
> URL: https://issues.apache.org/jira/browse/KAFKA-5342
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Jason Gustafson
>Assignee: Jason Gustafson
> Fix For: 0.11.0.1
>
>
> The transactional producer distinguishes two classes of user-visible errors:
> 1. Abortable errors: these are errors which are fatal to the ongoing 
> transaction, but which can be successfully aborted. Essentially any error in 
> which the producer can still expect to successfully send EndTxn to the 
> transaction coordinator is abortable.
> 2. Fatal errors: any error which is not abortable is fatal. For example, a 
> transactionalId authorization error is fatal because it would also prevent 
> the TC from receiving the EndTxn request.
> At the moment, it's not clear how the user would know how they should handle 
> a given failure. One option is to add an exception type to indicate which 
> errors are abortable (e.g. AbortableKafkaException). Then any other exception 
> could be considered fatal.
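
A hypothetical sketch of that option; the class name AbortableKafkaException 
comes from the ticket text and does not exist in the client library:

{code:java}
import org.apache.kafka.common.KafkaException;

// Callers could then branch on the exception type:
//
//   try { producer.commitTransaction(); }
//   catch (AbortableKafkaException e) { producer.abortTransaction(); }
//   catch (KafkaException e) { producer.close(); } // fatal: cannot abort
public class AbortableKafkaException extends KafkaException {
    public AbortableKafkaException(String message, Throwable cause) {
        super(message, cause);
    }
}
{code}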



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5286) Producer should await transaction completion in close

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5286:

Fix Version/s: 0.11.0.1  (was: 0.11.0.0)

> Producer should await transaction completion in close
> -
>
> Key: KAFKA-5286
> URL: https://issues.apache.org/jira/browse/KAFKA-5286
> Project: Kafka
>  Issue Type: Sub-task
>  Components: clients, core, producer 
>Reporter: Jason Gustafson
> Fix For: 0.11.0.1
>
>
> We should wait at least as long as the timeout for a transaction which has 
> begun completion (commit or abort) to be finished. The tricky thing is whether 
> we should abort a transaction which is still in progress. That seems 
> reasonable, since it is what will happen anyway: the coordinator will either 
> time out and abort the transaction, or the next producer using the same 
> transactionalId will fence this producer and abort the transaction. In any 
> case, the transaction will be aborted, so perhaps we should do it proactively.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5415) TransactionCoordinator doesn't complete transition to PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5415:

Status: Patch Available  (was: Open)

> TransactionCoordinator doesn't complete transition to PrepareCommit state
> -
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-5415) TransactionCoordinator doesn't complete transition to PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045162#comment-16045162
 ] 

Apurva Mehta edited comment on KAFKA-5415 at 6/9/17 10:48 PM:
--

The last successful metadata update was the following. The update timestamp was 
1496957141444.

{noformat}
[2017-06-08 21:25:41,449] DEBUG TransactionalId my-first-transactional-id 
complete transition from Ongoing to TxnTransitMetadata(producerId=2000, 
producerEpoch=0, txnTimeoutMs=6, txnState=Ongoing, 
topicPartitions=Set(output-topic-2, __consumer_offsets-47, output-topic-0, 
output-topic-1), txnStartTimestamp=1496957141430, 
txnLastUpdateTimestamp=1496957141444) 
(kafka.coordinator.transaction.TransactionMetadata)
{noformat}

then the system clock rolled back by a couple of hundred milliseconds, and the 
'prepare transition' to 'PrepareCommit' had this transition metadata, with an 
update time of 1496957141285

{noformat}
[2017-06-08 21:25:41,285] DEBUG TransactionalId my-first-transactional-id 
prepare transition from Ongoing to TxnTransitMetadata(producerId=2000, 
producerEpoch=0, txnTimeoutMs=6, txnState=PrepareCommit, 
topicPartitions=Set(output-topic-2, __consumer_offsets-47, output-topic-0, 
output-topic-1), txnStartTimestamp=1496957141430, 
txnLastUpdateTimestamp=1496957141285) 
(kafka.coordinator.transaction.TransactionMetadata)
{noformat}

So when it came time to complete the transition, the timestamp check would fail 
because the new update timestamp was older than the previous one. We would 
throw an IllegalStateException, which would be caught and swallowed in the 
delayed fetch operation, hence leaving the transaction hanging with a 
pendingState of PrepareCommit.
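
A minimal sketch of the failure mode, using the timestamps from the log 
excerpts above: a completion check that requires the update timestamp to move 
forward breaks when the wall clock rolls backwards between the prepare and 
complete steps:

{code:java}
public class ClockRollbackPitfall {
    public static void main(String[] args) {
        long lastUpdateMs = 1496957141444L; // completed Ongoing transition
        long newUpdateMs  = 1496957141285L; // PrepareCommit after clock rollback
        if (newUpdateMs < lastUpdateMs) {
            // Swallowed inside the delayed fetch operation, leaving the
            // pendingState set forever.
            throw new IllegalStateException("update timestamp moved backwards: "
                    + newUpdateMs + " < " + lastUpdateMs);
        }
    }
}
{code}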




was (Author: apurva):
The last successful metadata update was the following. The update timestamp was 
1496957141444.

{noformat}
[2017-06-08 21:25:41,449] DEBUG TransactionalId my-first-transactional-id 
complete transition from Ongoing to TxnTransitMetadata(producerId=2000, 
producerEpoch=0, txnTimeoutMs=6, txnState=Ongoing, 
topicPartitions=Set(output-topic-2, __consumer_offsets-47, output-topic-0, 
output-topic-1), txnStartTimestamp=1496957141430, 
txnLastUpdateTimestamp=1496957141444) 
(kafka.coordinator.transaction.TransactionMetadata)
{noformat}

then the system clock rolled back by a couple of hundred milliseconds, and the 
'prepare transition' to 'PrepareCommit' had this transition metadata 

{noformat}
[2017-06-08 21:25:41,285] DEBUG TransactionalId my-first-transactional-id 
prepare transition from Ongoing to TxnTransitMetadata(producerId=2000, 
producerEpoch=0, txnTimeoutMs=6, txnState=PrepareCommit, 
topicPartitions=Set(output-topic-2, __consumer_offsets-47, output-topic-0, 
output-topic-1), txnStartTimestamp=1496957141430, 
txnLastUpdateTimestamp=1496957141285) 
(kafka.coordinator.transaction.TransactionMetadata)
{noformat}

So when it came time to complete the transition, the timestamp check would fail 
because the new update timestamp was older than the previous one. We would 
throw an IllegalStateException, which would be caught and swallowed in the 
delayed fetch operation, hence leaving the transaction hanging with a 
pendingState of PrepareCommit.



> TransactionCoordinator doesn't complete transition to PrepareCommit state
> -
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5415) TransactionCoordinator doesn't complete transition to PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045162#comment-16045162
 ] 

Apurva Mehta commented on KAFKA-5415:
-

The last successful metadata update was the following. The update timestamp was 
1496957141444.

{noformat}
[2017-06-08 21:25:41,449] DEBUG TransactionalId my-first-transactional-id 
complete transition from Ongoing to TxnTransitMetadata(producerId=2000, 
producerEpoch=0, txnTimeoutMs=6, txnState=Ongoing, 
topicPartitions=Set(output-topic-2, __consumer_offsets-47, output-topic-0, 
output-topic-1), txnStartTimestamp=1496957141430, 
txnLastUpdateTimestamp=1496957141444) 
(kafka.coordinator.transaction.TransactionMetadata)
{noformat}

then the system clock rolled back by a couple of hundred milliseconds, and the 
'prepare transition' to 'PrepareCommit' had this transition metadata 

{noformat}
[2017-06-08 21:25:41,285] DEBUG TransactionalId my-first-transactional-id 
prepare transition from Ongoing to TxnTransitMetadata(producerId=2000, 
producerEpoch=0, txnTimeoutMs=6, txnState=PrepareCommit, 
topicPartitions=Set(output-topic-2, __consumer_offsets-47, output-topic-0, 
output-topic-1), txnStartTimestamp=1496957141430, 
txnLastUpdateTimestamp=1496957141285) 
(kafka.coordinator.transaction.TransactionMetadata)
{noformat}

So when it came time to complete the transition, the timestamp check would fail 
because the new update timestamp was older than the previous one. We would 
throw an IllegalStateException, which would be caught and swallowed in the 
delayed fetch operation, hence leaving the transaction hanging with a 
pendingState of PrepareCommit.



> TransactionCoordinator doesn't complete transition to PrepareCommit state
> -
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5415) TransactionCoordinator doesn't complete transition to PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045151#comment-16045151
 ] 

Apurva Mehta commented on KAFKA-5415:
-

The problem here seems to be this check: 

https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/coordinator/transaction/TransactionMetadata.scala#L288

I noticed that the system clock time rolled back slightly during the processing 
of this transaction, and [~hachikuji] noticed this check.

What this amounts to is that the operation to complete the state transition 
will fail in the DelayedFetch operation, and the exception would be swallowed. 
The Pending state would not be cleared, and future EndTxnRequests would fail 
with a CONCURRENT_TRANSACTIONS exception.

> TransactionCoordinator doesn't complete transition to PrepareCommit state
> -
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5422) Multiple produce request failures causes invalid state transition in TransactionManager

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5422:

Status: Patch Available  (was: Open)

> Multiple produce request failures causes invalid state transition in 
> TransactionManager
> ---
>
> Key: KAFKA-5422
> URL: https://issues.apache.org/jira/browse/KAFKA-5422
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> When multiple produce requests fail (for instance when all inflight batches 
> are expired), each will try to transition to ABORTABLE_ERROR. 
> However, only the first transition will succeed, the rest will fail with the 
> following 'invalid transition from ABORTABLE_ERROR to ABORTABLE_ERROR'. 
> This will be caught in the sender thread and things will continue. However, 
> the correct thing to do is to allow multiple transitions to ABORTABLE_ERROR 
> (see the sketch after the log below).
> {noformat}
> [2017-06-09 01:22:39,327] WARN Connection to node 3 could not be established. 
> Broker may not be available. (org.apache.kafka.clients.NetworkClient)
> [2017-06-09 01:22:39,958] TRACE Expired 2 batches in accumulator 
> (org.apache.kafka.clients.producer.internals.Sender)
> [2017-06-09 01:22:39,958] DEBUG [TransactionalId my-first-transactional-id] 
> Transition from state COMMITTING_TRANSACTION to error state ABORTABLE_ERROR 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
> output-topic-0: 30099 ms has passed since batch creation plus linger time
> [2017-06-09 01:22:39,960] TRACE Produced messages to topic-partition 
> output-topic-0 with base offset offset -1 and error: {}. 
> (org.apache.kafka.clients.producer.internals.ProducerBatch)
> org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
> output-topic-0: 30099 ms has passed since batch creation plus linger time
> [2017-06-09 01:22:39,960] ERROR Uncaught error in kafka producer I/O thread:  
> (org.apache.kafka.clients.producer.internals.Sender)
> org.apache.kafka.common.KafkaException: Invalid transition attempted from 
> state ABORTABLE_ERROR to state ABORTABLE_ERROR
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:475)
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.transitionToAbortableError(TransactionManager.java:288)
> at 
> org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:602)
> at 
> org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:271)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:221)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
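
A sketch of the fix described above (illustrative, not the actual 
TransactionManager code): ABORTABLE_ERROR -> ABORTABLE_ERROR is treated as a 
legal self-transition so that every failed produce request can report the 
error without tripping an "invalid transition" exception:

{code:java}
public class TransitionSketch {
    enum State { READY, COMMITTING_TRANSACTION, ABORTABLE_ERROR, FATAL_ERROR }

    static boolean isValidTransition(State current, State target) {
        if (target == State.ABORTABLE_ERROR) {
            // Allow re-entry: multiple expired batches may race to set this.
            return current != State.FATAL_ERROR;
        }
        return false; // remaining transition rules elided in this sketch
    }

    public static void main(String[] args) {
        // Both calls report the error; the second one no longer throws.
        System.out.println(
                isValidTransition(State.COMMITTING_TRANSACTION, State.ABORTABLE_ERROR));
        System.out.println(
                isValidTransition(State.ABORTABLE_ERROR, State.ABORTABLE_ERROR));
    }
}
{code}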



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5422) Multiple produce request failures causes invalid state transition in TransactionManager

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5422:

Fix Version/s: 0.11.0.0

> Multiple produce request failures causes invalid state transition in 
> TransactionManager
> ---
>
> Key: KAFKA-5422
> URL: https://issues.apache.org/jira/browse/KAFKA-5422
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> When multiple produce requests fail (for instance when all inflight batches 
> are expired), each will try to transition to ABORTABLE_ERROR. 
> However, only the first transition will succeed, the rest will fail with the 
> following 'invalid transition from ABORTABLE_ERROR to ABORTABLE_ERROR'. 
> This will be caught in the sender thread and things will continue. However, 
> the correct thing to do is to allow multiple transitions to ABORTABLE_ERROR.
> {noformat}
> [2017-06-09 01:22:39,327] WARN Connection to node 3 could not be established. 
> Broker may not be available. (org.apache.kafka.clients.NetworkClient)
> [2017-06-09 01:22:39,958] TRACE Expired 2 batches in accumulator 
> (org.apache.kafka.clients.producer.internals.Sender)
> [2017-06-09 01:22:39,958] DEBUG [TransactionalId my-first-transactional-id] 
> Transition from state COMMITTING_TRANSACTION to error state ABORTABLE_ERROR 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
> output-topic-0: 30099 ms has passed since batch creation plus linger time
> [2017-06-09 01:22:39,960] TRACE Produced messages to topic-partition 
> output-topic-0 with base offset offset -1 and error: {}. 
> (org.apache.kafka.clients.producer.internals.ProducerBatch)
> org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
> output-topic-0: 30099 ms has passed since batch creation plus linger time
> [2017-06-09 01:22:39,960] ERROR Uncaught error in kafka producer I/O thread:  
> (org.apache.kafka.clients.producer.internals.Sender)
> org.apache.kafka.common.KafkaException: Invalid transition attempted from 
> state ABORTABLE_ERROR to state ABORTABLE_ERROR
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:475)
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.transitionToAbortableError(TransactionManager.java:288)
> at 
> org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:602)
> at 
> org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:271)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:221)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
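
A minimal sketch of the fix suggested above (invented names, not the actual 
TransactionManager code): make the transition to ABORTABLE_ERROR idempotent, so 
every failed produce request can report its error without tripping the 
invalid-transition check.

{code:java}
// Hypothetical, simplified model of the producer's error-state handling; the
// real logic lives in org.apache.kafka.clients.producer.internals.TransactionManager.
public class AbortableErrorSketch {

    enum State { COMMITTING_TRANSACTION, ABORTABLE_ERROR }

    private State state = State.COMMITTING_TRANSACTION;
    private RuntimeException firstError;

    // Idempotent transition: only the first failure records its cause; later
    // failures return quietly instead of throwing an invalid-transition error.
    synchronized void transitionToAbortableError(RuntimeException cause) {
        if (state == State.ABORTABLE_ERROR)
            return;
        firstError = cause;
        state = State.ABORTABLE_ERROR;
    }

    public static void main(String[] args) {
        AbortableErrorSketch txn = new AbortableErrorSketch();
        // Two expired batches both fail; with the idempotent transition the
        // second call is a no-op rather than an uncaught KafkaException.
        txn.transitionToAbortableError(new RuntimeException("batch 1 expired"));
        txn.transitionToAbortableError(new RuntimeException("batch 2 expired"));
        System.out.println(txn.state + ": " + txn.firstError.getMessage());
    }
}
{code}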



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5415) TransactionCoordinator doesn't complete transition to PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5415:

Summary: TransactionCoordinator doesn't complete transition to 
PrepareCommit state  (was: TransactionCoordinator gets stuck in PrepareCommit 
state)

> TransactionCoordinator doesn't complete transition to PrepareCommit state
> -
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state 
> (see the sketch below).
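
To make the failure mode concrete, here is a minimal, hypothetical sketch of 
the loop described above (the real coordinator logic lives in Scala in 
kafka.coordinator.transaction.TransactionCoordinator; names here are invented): 
once the coordinator is stranded in a PrepareXX state, retried EndTxn requests 
are answered with CONCURRENT_TRANSACTION and the expiration path never 
considers the transaction, so nothing ever unsticks it.

{code:java}
// Hypothetical model of the stuck-coordinator behaviour, not the actual code.
public class StuckPrepareCommitSketch {

    enum TxnState { ONGOING, PREPARE_COMMIT, PREPARE_ABORT, COMPLETE_COMMIT }

    // The response to the original EndTxn was lost, so the coordinator is here.
    private TxnState state = TxnState.PREPARE_COMMIT;

    // A retried EndTxn is rejected while completion is still "in flight".
    String handleEndTxn() {
        if (state == TxnState.PREPARE_COMMIT || state == TxnState.PREPARE_ABORT)
            return "CONCURRENT_TRANSACTION";
        return "NONE";
    }

    // Expiration skips PrepareXX states, so nothing ever cleans this up.
    boolean eligibleForExpiration() {
        return state != TxnState.PREPARE_COMMIT && state != TxnState.PREPARE_ABORT;
    }

    public static void main(String[] args) {
        StuckPrepareCommitSketch coordinator = new StuckPrepareCommitSketch();
        System.out.println(coordinator.handleEndTxn());          // CONCURRENT_TRANSACTION, forever
        System.out.println(coordinator.eligibleForExpiration()); // false, forever
    }
}
{code}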



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16045051#comment-16045051
 ] 

Apurva Mehta commented on KAFKA-5415:
-

[~guozhang]'s explanation above actually applies to KAFKA-5416.

> TransactionCoordinator gets stuck in PrepareCommit state
> 
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reassigned KAFKA-5415:
---

Assignee: Apurva Mehta  (was: Guozhang Wang)

> TransactionCoordinator gets stuck in PrepareCommit state
> 
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5416) TransactionCoordinator doesn't complete transition to CompleteCommit

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Fix Version/s: 0.11.0.0

> TransactionCoordinator doesn't complete transition to CompleteCommit
> 
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Guozhang Wang
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> 

[jira] [Reopened] (KAFKA-5416) TransactionCoordinator doesn't complete transition to CompleteCommit

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reopened KAFKA-5416:
-
  Assignee: Guozhang Wang

This is not a dup of KAFKA-5415. Instead, the cause of KAFKA-5415 explained by 
[~guozhang] in 
https://issues.apache.org/jira/browse/KAFKA-5415?focusedCommentId=16044885=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16044885
actually applies to this ticket. 

> TransactionCoordinator doesn't complete transition to CompleteCommit
> 
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Guozhang Wang
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 

[jira] [Assigned] (KAFKA-5422) Multiple produce request failures causes invalid state transition in TransactionManager

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reassigned KAFKA-5422:
---

Assignee: Apurva Mehta

> Multiple produce request failures causes invalid state transition in 
> TransactionManager
> ---
>
> Key: KAFKA-5422
> URL: https://issues.apache.org/jira/browse/KAFKA-5422
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>  Labels: exactly-once
>
> When multiple produce requests fail (for instance when all inflight batches 
> are expired), each will try to transition to ABORTABLE_ERROR. 
> However, only the first transition will succeed; the rest will fail with the 
> following 'invalid transition from ABORTABLE_ERROR to ABORTABLE_ERROR' error. 
> This will be caught in the sender thread and things will continue. However, 
> the correct thing to do is to allow multiple transitions to 
> ABORTABLE_ERROR.
> {noformat}
> [2017-06-09 01:22:39,327] WARN Connection to node 3 could not be established. 
> Broker may not be available. (org.apache.kafka.clients.NetworkClient)
> [2017-06-09 01:22:39,958] TRACE Expired 2 batches in accumulator 
> (org.apache.kafka.clients.producer.internals.Sender)
> [2017-06-09 01:22:39,958] DEBUG [TransactionalId my-first-transactional-id] 
> Transition from state COMMITTING_TRANSACTION to error state ABORTABLE_ERROR 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
> output-topic-0: 30099 ms has passed since batch creation plus linger time
> [2017-06-09 01:22:39,960] TRACE Produced messages to topic-partition 
> output-topic-0 with base offset offset -1 and error: {}. 
> (org.apache.kafka.clients.producer.internals.ProducerBatch)
> org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
> output-topic-0: 30099 ms has passed since batch creation plus linger time
> [2017-06-09 01:22:39,960] ERROR Uncaught error in kafka producer I/O thread:  
> (org.apache.kafka.clients.producer.internals.Sender)
> org.apache.kafka.common.KafkaException: Invalid transition attempted from 
> state ABORTABLE_ERROR to state ABORTABLE_ERROR
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:475)
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.transitionToAbortableError(TransactionManager.java:288)
> at 
> org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:602)
> at 
> org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:271)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:221)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5422) Multiple produce request failures causes invalid state transition in TransactionManager

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5422:

Summary: Multiple produce request failures causes invalid state transition 
in TransactionManager  (was: Multiple expired batches causes invalid state 
transition in TransactionManager)

> Multiple produce request failures causes invalid state transition in 
> TransactionManager
> ---
>
> Key: KAFKA-5422
> URL: https://issues.apache.org/jira/browse/KAFKA-5422
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>  Labels: exactly-once
>
> When multiple produce requests fail (for instance when all inflight batches 
> are expired), each will try to transition to ABORTABLE_ERROR. 
> However, only the first transition will succeed; the rest will fail with the 
> following 'invalid transition from ABORTABLE_ERROR to ABORTABLE_ERROR' error. 
> This will be caught in the sender thread and things will continue. However, 
> the correct thing to do is to allow multiple transitions to 
> ABORTABLE_ERROR.
> {noformat}
> [2017-06-09 01:22:39,327] WARN Connection to node 3 could not be established. 
> Broker may not be available. (org.apache.kafka.clients.NetworkClient)
> [2017-06-09 01:22:39,958] TRACE Expired 2 batches in accumulator 
> (org.apache.kafka.clients.producer.internals.Sender)
> [2017-06-09 01:22:39,958] DEBUG [TransactionalId my-first-transactional-id] 
> Transition from state COMMITTING_TRANSACTION to error state ABORTABLE_ERROR 
> (org.apache.kafka.clients.producer.internals.TransactionManager)
> org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
> output-topic-0: 30099 ms has passed since batch creation plus linger time
> [2017-06-09 01:22:39,960] TRACE Produced messages to topic-partition 
> output-topic-0 with base offset offset -1 and error: {}. 
> (org.apache.kafka.clients.producer.internals.ProducerBatch)
> org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
> output-topic-0: 30099 ms has passed since batch creation plus linger time
> [2017-06-09 01:22:39,960] ERROR Uncaught error in kafka producer I/O thread:  
> (org.apache.kafka.clients.producer.internals.Sender)
> org.apache.kafka.common.KafkaException: Invalid transition attempted from 
> state ABORTABLE_ERROR to state ABORTABLE_ERROR
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:475)
> at 
> org.apache.kafka.clients.producer.internals.TransactionManager.transitionToAbortableError(TransactionManager.java:288)
> at 
> org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:602)
> at 
> org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:271)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:221)
> at 
> org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KAFKA-5422) Multiple expired batches causes invalid state transition in TransactionManager

2017-06-09 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5422:
---

 Summary: Multiple expired batches causes invalid state transition 
in TransactionManager
 Key: KAFKA-5422
 URL: https://issues.apache.org/jira/browse/KAFKA-5422
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta


When multiple produce requests fail (for instance when all inflight batches are 
expired), each will try to transition to ABORTABLE_ERROR. 

However, only the first transition will succeed; the rest will fail with the 
following 'invalid transition from ABORTABLE_ERROR to ABORTABLE_ERROR' error. 

This will be caught in the sender thread and things will continue. However, the 
correct thing to do is to allow multiple transitions to ABORTABLE_ERROR.

{noformat}

[2017-06-09 01:22:39,327] WARN Connection to node 3 could not be established. 
Broker may not be available. (org.apache.kafka.clients.NetworkClient)
[2017-06-09 01:22:39,958] TRACE Expired 2 batches in accumulator 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-09 01:22:39,958] DEBUG [TransactionalId my-first-transactional-id] 
Transition from state COMMITTING_TRANSACTION to error state ABORTABLE_ERROR 
(org.apache.kafka.clients.producer.internals.TransactionManager)
org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
output-topic-0: 30099 ms has passed since batch creation plus linger time
[2017-06-09 01:22:39,960] TRACE Produced messages to topic-partition 
output-topic-0 with base offset offset -1 and error: {}. 
(org.apache.kafka.clients.producer.internals.ProducerBatch)
org.apache.kafka.common.errors.TimeoutException: Expiring 250 record(s) for 
output-topic-0: 30099 ms has passed since batch creation plus linger time
[2017-06-09 01:22:39,960] ERROR Uncaught error in kafka producer I/O thread:  
(org.apache.kafka.clients.producer.internals.Sender)
org.apache.kafka.common.KafkaException: Invalid transition attempted from state 
ABORTABLE_ERROR to state ABORTABLE_ERROR
at 
org.apache.kafka.clients.producer.internals.TransactionManager.transitionTo(TransactionManager.java:475)
at 
org.apache.kafka.clients.producer.internals.TransactionManager.transitionToAbortableError(TransactionManager.java:288)
at 
org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:602)
at 
org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:271)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:221)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:162)
at java.lang.Thread.run(Thread.java:745)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5416) TransactionCoordinator doesn't complete transition to CompleteCommit

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Summary: TransactionCoordinator doesn't complete transition to 
CompleteCommit  (was: TransactionCoordinator: TransactionMarkerChannelManager 
seems not to retry failed writes.)

> TransactionCoordinator doesn't complete transition to CompleteCommit
> 
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> 

[jira] [Assigned] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta reassigned KAFKA-5415:
---

Assignee: Guozhang Wang  (was: Apurva Mehta)

> TransactionCoordinator gets stuck in PrepareCommit state
> 
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Guozhang Wang
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-09 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044861#comment-16044861
 ] 

Apurva Mehta commented on KAFKA-5415:
-

[~guozhang] has found the bug and will file the PR. Assigning to him.

> TransactionCoordinator gets stuck in PrepareCommit state
> 
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Guozhang Wang
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTION` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5416) TransactionCoordinator: TransactionMarkerChannelManager seems not to retry failed writes.

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Fix Version/s: (was: 0.11.0.0)

> TransactionCoordinator: TransactionMarkerChannelManager seems not to retry 
> failed writes.
> -
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:27320:[2017-06-09 01:16:50,748] 

[jira] [Resolved] (KAFKA-5416) TransactionCoordinator: TransactionMarkerChannelManager seems not to retry failed writes.

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta resolved KAFKA-5416.
-
Resolution: Duplicate

After more digging, this appears to be a duplicate of KAFKA-5415, but the extra 
logging from https://github.com/apache/kafka/pull/3278/files didn't help 
identify the root cause. 

> TransactionCoordinator: TransactionMarkerChannelManager seems not to retry 
> failed writes.
> -
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is 

[jira] [Commented] (KAFKA-5416) TransactionCoordinator: TransactionMarkerChannelManager seems not to retry failed writes.

2017-06-09 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044081#comment-16044081
 ] 

Apurva Mehta commented on KAFKA-5416:
-

It appears the 'CompleteCommit' was written to the log:

{noformat}
offset: 85 position: 14531 CreateTime: 1496971014082 isvalid: true keysize: 30 
valuesize: 37 magic: 2 compresscodec: NONE producerId: -1 sequence: -1 
isTransactional: false headerKeys: [] key: 
transactionalId=my-second-transactional-id payload: 
producerId:1,producerEpoch:0,state=CompleteCommit,partitions=Set(),lastUpdateTimestamp=1496971010776
{noformat}

And the high watermark was updated: 

{noformat}
[2017-06-09 01:16:54,132] DEBUG Partition [__transaction_state,37] on broker 1: 
High watermark for partition [__transaction_state,37] updated to 86 [0 : 14667] 
(kafka.cluster.Partition)
{noformat}

So, very similar to KAFKA-5415, the cache callback didn't seem to complete the 
state transition (a sketch of this append-then-callback pattern follows below).
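
A rough, hypothetical sketch of that append-then-callback pattern (the real 
code is kafka.coordinator.transaction.TransactionStateManager, in Scala; names 
here are invented): the log write and the high watermark advance can both 
succeed while the in-memory move to CompleteCommit happens only inside the 
callback, so a dropped callback leaves the cache at PrepareCommit even though 
the log already says CompleteCommit.

{code:java}
import java.util.function.Consumer;

// Hypothetical illustration of the pattern discussed above, not the actual
// Kafka code: the append replicates the record, and only the callback moves
// the cached transaction metadata forward.
public class AppendCallbackSketch {

    private String cachedState = "PrepareCommit";

    void appendTransitMetadata(String newState, Consumer<String> cacheCallback) {
        // 1. The CompleteCommit record is appended and replicated, and the high
        //    watermark advances (both were observed in the logs above).
        // 2. Only the callback moves the cached metadata forward; if it is
        //    never invoked, cachedState stays at PrepareCommit even though the
        //    log already says CompleteCommit.
        cacheCallback.accept(newState);
    }

    public static void main(String[] args) {
        AppendCallbackSketch manager = new AppendCallbackSketch();
        manager.appendTransitMetadata("CompleteCommit", s -> manager.cachedState = s);
        System.out.println(manager.cachedState); // CompleteCommit only if the callback fired
    }
}
{code}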

> TransactionCoordinator: TransactionMarkerChannelManager seems not to retry 
> failed writes.
> -
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> 

[jira] [Commented] (KAFKA-5416) TransactionCoordinator: TransactionMarkerChannelManager seems not to retry failed writes.

2017-06-09 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044073#comment-16044073
 ] 

Apurva Mehta commented on KAFKA-5416:
-

Actually, my grep was bad: the 'retrying:' message was printed, but I didn't 
capture it. Here is the relevant portion. We do retry, and do call 
`replicaManager.appendToLog` (we print the 'Appended new metadata ... ' message 
after calling the append). A small sketch of this retry loop follows the excerpt.

But then we get no message in the cacheCallback, which is the mystery. 

{noformat}
[2017-06-09 01:16:54,081] DEBUG retrying: 1 transaction log appends 
(kafka.coordinator.transaction.TransactionMarkerChannelManager)
[2017-06-09 01:16:54,089] DEBUG [Transaction Log Manager 1]: Appended new 
metadata TxnTransitMetadata(producerId=1, producerEpoch=0, txnTimeoutMs=6, 
txnState=CompleteCommit, topicPartitions=Set(), 
txnStartTimestamp=1496971010483, txnLastUpdateTimestamp=1496971010776) for 
transaction id my-second-transactional-id with coordinator epoch 3 to the local 
transaction log (kafka.coordinator.transaction.TransactionStateManager)
[2017-06-09 01:16:54,091] DEBUG retrying: 0 transaction log appends 
(kafka.coordinator.transaction.TransactionMarkerChannelManager)
{noformat}
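
For context, a tiny sketch (invented names) of the retry loop behind the 
'retrying: N transaction log appends' lines above: failed appends are queued 
and re-driven on the next pass, which is why the count drops from 1 to 0 once 
the append goes through.

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the retry behaviour suggested by the DEBUG lines above;
// the real logic lives in kafka.coordinator.transaction.TransactionMarkerChannelManager.
public class RetryQueueSketch {

    private final Queue<Runnable> pendingAppends = new ArrayDeque<>();

    void enqueueFailedAppend(Runnable append) {
        pendingAppends.add(append);
    }

    void drainQueuedAppends() {
        System.out.println("retrying: " + pendingAppends.size() + " transaction log appends");
        Runnable append;
        while ((append = pendingAppends.poll()) != null)
            append.run(); // a failed append would be re-enqueued for the next pass
    }

    public static void main(String[] args) {
        RetryQueueSketch manager = new RetryQueueSketch();
        manager.enqueueFailedAppend(() -> System.out.println("Appended new metadata ..."));
        manager.drainQueuedAppends(); // prints "retrying: 1 ..." then performs the append
        manager.drainQueuedAppends(); // prints "retrying: 0 ..."
    }
}
{code}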

> TransactionCoordinator: TransactionMarkerChannelManager seems not to retry 
> failed writes.
> -
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)

[jira] [Comment Edited] (KAFKA-5416) TransactionCoordinator: TransactionMarkerChannelManager seems not to retry failed writes.

2017-06-09 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044073#comment-16044073
 ] 

Apurva Mehta edited comment on KAFKA-5416 at 6/9/17 7:15 AM:
-

Actually, my grep was bad: the 'retrying:' message was printed, but I didn't 
capture it. Here is the relevant portion. We do retry, and do call 
`replicaManager.appendToLog` (we print the 'Appended new metadata ... ' message 
after calling the append).

But then we get no message in the cacheCallback, which is the mystery. 

{noformat}
[2017-06-09 01:16:54,081] DEBUG retrying: 1 transaction log appends 
(kafka.coordinator.transaction.TransactionMarkerChannelManager)
[2017-06-09 01:16:54,089] DEBUG [Transaction Log Manager 1]: Appended new 
metadata TxnTransitMetadata(producerId=1, producerEpoch=0, txnTimeoutMs=6, 
txnState=CompleteCommit, topicPartitions=Set(), 
txnStartTimestamp=1496971010483, txnLastUpdateTimestamp=1496971010776) for 
transaction id my-second-transactional-id with coordinator epoch 3 to the local 
transaction log (kafka.coordinator.transaction.TransactionStateManager)
[2017-06-09 01:16:54,091] DEBUG retrying: 0 transaction log appends 
(kafka.coordinator.transaction.TransactionMarkerChannelManager)
{noformat}


was (Author: apurva):
Actually, my grep was bad. the 'retrying:' message was printed, but I didn't 
capture it in my grep. Here is the relevant portion. We do retry, and do call 
`replicaManager.appendToLog` (we print the 'Appended new metadata ... ' message 
after calling the append. 

But then we get no message in the cacheCallback, which is the mystery. 

{noformat}
[2017-06-09 01:16:54,081] DEBUG retrying: 1 transaction log appends 
(kafka.coordinator.transaction.TransactionMarkerChannelManager)
[2017-06-09 01:16:54,089] DEBUG [Transaction Log Manager 1]: Appended new 
metadata TxnTransitMetadata(producerId=1, producerEpoch=0, txnTimeoutMs=6, 
txnState=CompleteCommit, topicPartitions=Set(), 
txnStartTimestamp=1496971010483, txnLastUpdateTimestamp=1496971010776) for 
transaction id my-second-transactional-id with coordinator epoch 3 to the local 
transaction log (kafka.coordinator.transaction.TransactionStateManager)
[2017-06-09 01:16:54,091] DEBUG retrying: 0 transaction log appends 
(kafka.coordinator.transaction.TransactionMarkerChannelManager)
{noformat}

> TransactionCoordinator: TransactionMarkerChannelManager seems not to retry 
> failed writes.
> -
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower 

[jira] [Updated] (KAFKA-5416) TransactionCoordinator seems to not return NOT_COORDINATOR error

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Labels: exactly-once  (was: )

> TransactionCoordinator seems to not return NOT_COORDINATOR error
> 
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:27320:[2017-06-09 01:16:50,748] INFO [Transaction 
> 

[jira] [Updated] (KAFKA-5416) TransactionCoordinator: TransactionMarkerChannelManager seems not to retry failed writes.

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Summary: TransactionCoordinator: TransactionMarkerChannelManager seems not 
to retry failed writes.  (was: TransactionCoordinator seems to not return 
NOT_COORDINATOR error)

> TransactionCoordinator: TransactionMarkerChannelManager seems not to retry 
> failed writes.
> -
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing 

[jira] [Updated] (KAFKA-5416) TransactionCoordinator seems to not return NOT_COORDINATOR error

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Description: 
In regard to this system test: 
http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz

Here are the ownership changes for __transaction_state-37: 
{noformat}
./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction Log 
Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
Log Manager 2]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
Log Manager 2]: Finished loading 1 transaction metadata from 
__transaction_state-37 in 15 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
Log Manager 2]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
Log Manager 2]: Finished loading 1 transaction metadata from 
__transaction_state-37 in 6 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction Log 
Manager 1]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
Log Manager 1]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27320:[2017-06-09 01:16:50,748] INFO [Transaction 
Log Manager 1]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27331:[2017-06-09 01:16:50,775] INFO [Transaction 
Log Manager 1]: Finished loading 1 transaction metadata from 
__transaction_state-37 in 27 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:167429:[2017-06-09 01:19:08,857] INFO [Transaction 
Log Manager 1]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 

[jira] [Updated] (KAFKA-5416) TransactionCoordinator seems to not return NOT_COORDINATOR error

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Priority: Blocker  (was: Major)

> TransactionCoordinator seems to not return NOT_COORDINATOR error
> 
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Priority: Blocker
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:27320:[2017-06-09 01:16:50,748] INFO [Transaction 
> Log Manager 1]: Loading transaction 

[jira] [Updated] (KAFKA-5416) TransactionCoordinator seems to not return NOT_COORDINATOR error

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Fix Version/s: 0.11.0.0

> TransactionCoordinator seems to not return NOT_COORDINATOR error
> 
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
> Fix For: 0.11.0.0
>
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> Here are the ownership changes for __transaction_state-37: 
> {noformat}
> ./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 15 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-37 in 6 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata for 
> __transaction_state-37 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-37 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:27320:[2017-06-09 01:16:50,748] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-37 
> 

[jira] [Updated] (KAFKA-5416) TransactionCoordinator seems to not return NOT_COORDINATOR error

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Description: 
In regard to this system test: 
http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz

Here are the ownership changes for __transaction_state-37: 
{noformat}
./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction Log 
Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
Log Manager 2]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
Log Manager 2]: Finished loading 1 transaction metadata from 
__transaction_state-37 in 15 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
Log Manager 2]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
Log Manager 2]: Finished loading 1 transaction metadata from 
__transaction_state-37 in 6 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction Log 
Manager 1]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
Log Manager 1]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27320:[2017-06-09 01:16:50,748] INFO [Transaction 
Log Manager 1]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27331:[2017-06-09 01:16:50,775] INFO [Transaction 
Log Manager 1]: Finished loading 1 transaction metadata from 
__transaction_state-37 in 27 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:167429:[2017-06-09 01:19:08,857] INFO [Transaction 
Log Manager 1]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 

[jira] [Updated] (KAFKA-5416) TransactionCoordinator seems to not return NOT_COORDINATOR error

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Description: 
In regard to this system test: 
http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz

Here are the ownership changes for __transaction_state-37: 
{noformat}
./worker1/debug/server.log:3623:[2017-06-09 01:16:36,551] INFO [Transaction Log 
Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:16174:[2017-06-09 01:16:39,963] INFO [Transaction 
Log Manager 2]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:16194:[2017-06-09 01:16:39,978] INFO [Transaction 
Log Manager 2]: Finished loading 1 transaction metadata from 
__transaction_state-37 in 15 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:35633:[2017-06-09 01:16:45,308] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:40353:[2017-06-09 01:16:52,218] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:40664:[2017-06-09 01:16:52,683] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:146044:[2017-06-09 01:19:08,844] INFO [Transaction 
Log Manager 2]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:146048:[2017-06-09 01:19:08,850] INFO [Transaction 
Log Manager 2]: Finished loading 1 transaction metadata from 
__transaction_state-37 in 6 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:150147:[2017-06-09 01:19:10,995] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:3413:[2017-06-09 01:16:36,109] INFO [Transaction Log 
Manager 1]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:20889:[2017-06-09 01:16:39,991] INFO [Transaction 
Log Manager 1]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:26084:[2017-06-09 01:16:46,682] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:26322:[2017-06-09 01:16:47,153] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-37 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries has 
just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27320:[2017-06-09 01:16:50,748] INFO [Transaction 
Log Manager 1]: Loading transaction metadata from __transaction_state-37 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27331:[2017-06-09 01:16:50,775] INFO [Transaction 
Log Manager 1]: Finished loading 1 transaction metadata from 
__transaction_state-37 in 27 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:167429:[2017-06-09 01:19:08,857] INFO [Transaction 
Log Manager 1]: Removed 1 cached transaction metadata for 
__transaction_state-37 on follower transition 

[jira] [Commented] (KAFKA-5416) TransactionCoordinator seems to not return NOT_COORDINATOR error

2017-06-09 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16044045#comment-16044045
 ] 

Apurva Mehta commented on KAFKA-5416:
-

cc [~guozhang] [~damianguy] 

A second pair of eyes making sense of these logs would help a lot.

> TransactionCoordinator seems to not return NOT_COORDINATOR error
> 
>
> Key: KAFKA-5416
> URL: https://issues.apache.org/jira/browse/KAFKA-5416
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>
> In regard to this system test: 
> http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz
> There are two issues:
> First, a coordinator who is not the owner for a given partition of the 
> transaction log does not return NOT_COORDINATOR, but CONCURRENT_TRANSACTIONS 
> instead.
> Here are the ownership changes for __transaction_state-41: 
> {noformat}
> ./worker1/debug/server.log:3559:[2017-06-09 01:16:36,244] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-41 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:38471:[2017-06-09 01:16:45,910] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-41 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:40226:[2017-06-09 01:16:51,821] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-41 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:41233:[2017-06-09 01:16:53,332] INFO [Transaction 
> Log Manager 2]: Trying to remove cached transaction metadata for 
> __transaction_state-41 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:42486:[2017-06-09 01:16:59,584] INFO [Transaction 
> Log Manager 2]: Loading transaction metadata from __transaction_state-41 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:42515:[2017-06-09 01:16:59,611] INFO [Transaction 
> Log Manager 2]: Finished loading 1 transaction metadata from 
> __transaction_state-41 in 27 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker1/debug/server.log:153029:[2017-06-09 01:19:11,484] INFO [Transaction 
> Log Manager 2]: Removed 1 cached transaction metadata for 
> __transaction_state-41 on follower transition 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:3537:[2017-06-09 01:16:36,441] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-41 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:25957:[2017-06-09 01:16:46,309] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-41 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:26618:[2017-06-09 01:16:48,164] INFO [Transaction 
> Log Manager 1]: Trying to remove cached transaction metadata for 
> __transaction_state-41 on follower transition but there is no entries 
> remaining; it is likely that another process for removing the cached entries 
> has just executed earlier before 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:27951:[2017-06-09 01:16:51,398] INFO [Transaction 
> Log Manager 1]: Loading transaction metadata from __transaction_state-41 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:27958:[2017-06-09 01:16:51,407] INFO [Transaction 
> Log Manager 1]: Finished loading 1 transaction metadata from 
> __transaction_state-41 in 9 milliseconds 
> (kafka.coordinator.transaction.TransactionStateManager)
> ./worker2/debug/server.log:80970:[2017-06-09 01:16:59,608] INFO [Transaction 
> Log Manager 1]: Removed 1 cached transaction metadata 

[jira] [Updated] (KAFKA-5416) TransactionCoordinator seems to not return NOT_COORDINATOR error

2017-06-09 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5416:

Description: 
In regard to this system test: 
http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz

There are two issues:
First, a coordinator who is not the owner for a given partition of the 
transaction log does not return NOT_COORDINATOR, but CONCURRENT_TRANSACTIONS 
instead.
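
For illustration, a minimal sketch of the check one would expect here, with 
hypothetical names rather than the real TransactionStateManager code: a request 
for a transactional id whose __transaction_state partition is not locally owned 
should be answered with NOT_COORDINATOR, so that the client re-discovers its 
coordinator instead of retrying in place on CONCURRENT_TRANSACTIONS.

{code:java}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical sketch of the expected ownership check.
public class CoordinatorLookup {
    enum Errors { NONE, NOT_COORDINATOR, CONCURRENT_TRANSACTIONS }

    private final Set<Integer> ownedPartitions = ConcurrentHashMap.newKeySet();
    private final Map<String, Object> txnMetadataCache = new ConcurrentHashMap<>();
    private final int numPartitions = 50; // __transaction_state partition count

    int partitionFor(String transactionalId) {
        return (transactionalId.hashCode() & 0x7fffffff) % numPartitions;
    }

    Errors validate(String transactionalId) {
        // Not the owner of this partition of the transaction log: redirect
        // the client to the new coordinator rather than telling it to retry.
        if (!ownedPartitions.contains(partitionFor(transactionalId)))
            return Errors.NOT_COORDINATOR;
        // Owned partition but no cached metadata yet (still loading): the
        // logs suggest CONCURRENT_TRANSACTIONS leaks out somewhere on this
        // path, which makes the producer retry against the wrong broker.
        if (!txnMetadataCache.containsKey(transactionalId))
            return Errors.NOT_COORDINATOR;
        return Errors.NONE;
    }
}
{code}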

Here are the ownership changes for __transaction_state-41: 
{noformat}
./worker1/debug/server.log:3559:[2017-06-09 01:16:36,244] INFO [Transaction Log 
Manager 2]: Loading transaction metadata from __transaction_state-41 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:38471:[2017-06-09 01:16:45,910] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-41 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:40226:[2017-06-09 01:16:51,821] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:41233:[2017-06-09 01:16:53,332] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:42486:[2017-06-09 01:16:59,584] INFO [Transaction 
Log Manager 2]: Loading transaction metadata from __transaction_state-41 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:42515:[2017-06-09 01:16:59,611] INFO [Transaction 
Log Manager 2]: Finished loading 1 transaction metadata from 
__transaction_state-41 in 27 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:153029:[2017-06-09 01:19:11,484] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-41 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:3537:[2017-06-09 01:16:36,441] INFO [Transaction Log 
Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:25957:[2017-06-09 01:16:46,309] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:26618:[2017-06-09 01:16:48,164] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27951:[2017-06-09 01:16:51,398] INFO [Transaction 
Log Manager 1]: Loading transaction metadata from __transaction_state-41 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27958:[2017-06-09 01:16:51,407] INFO [Transaction 
Log Manager 1]: Finished loading 1 transaction metadata from 
__transaction_state-41 in 9 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:80970:[2017-06-09 01:16:59,608] INFO [Transaction 
Log Manager 1]: Removed 1 cached transaction metadata for 
__transaction_state-41 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker7/debug/server.log:2626:[2017-06-09 01:16:36,882] INFO [Transaction Log 
Manager 3]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker7/debug/server.log:14291:[2017-06-09 01:16:45,909] INFO [Transaction 
Log Manager 3]: 

[jira] [Created] (KAFKA-5416) TransactionCoordinator seems to not return NOT_COORDINATOR error

2017-06-09 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5416:
---

 Summary: TransactionCoordinator seems to not return 
NOT_COORDINATOR error
 Key: KAFKA-5416
 URL: https://issues.apache.org/jira/browse/KAFKA-5416
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta


In regard to this system test: 
http://confluent-kafka-branch-builder-system-test-results.s3-us-west-2.amazonaws.com/2017-06-09--001.1496974430--apurvam--MINOR-add-logging-to-transaction-coordinator-in-all-failure-cases--02b3816/TransactionsTest/test_transactions/failure_mode=clean_bounce.bounce_target=brokers/4.tgz

There are two issues:
# A coordinator who is not the owner for a given partition of the transaction 
log does not return NOT_COORDINATOR, but CONCURRENT_TRANSACTIONS instead.

Here are the ownership changes for __transaction_state-41: 
{noformat}
./worker1/debug/server.log:3559:[2017-06-09 01:16:36,244] INFO [Transaction Log 
Manager 2]: Loading transaction metadata from __transaction_state-41 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:38471:[2017-06-09 01:16:45,910] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-41 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:40226:[2017-06-09 01:16:51,821] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:41233:[2017-06-09 01:16:53,332] INFO [Transaction 
Log Manager 2]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:42486:[2017-06-09 01:16:59,584] INFO [Transaction 
Log Manager 2]: Loading transaction metadata from __transaction_state-41 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:42515:[2017-06-09 01:16:59,611] INFO [Transaction 
Log Manager 2]: Finished loading 1 transaction metadata from 
__transaction_state-41 in 27 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker1/debug/server.log:153029:[2017-06-09 01:19:11,484] INFO [Transaction 
Log Manager 2]: Removed 1 cached transaction metadata for 
__transaction_state-41 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:3537:[2017-06-09 01:16:36,441] INFO [Transaction Log 
Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:25957:[2017-06-09 01:16:46,309] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:26618:[2017-06-09 01:16:48,164] INFO [Transaction 
Log Manager 1]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27951:[2017-06-09 01:16:51,398] INFO [Transaction 
Log Manager 1]: Loading transaction metadata from __transaction_state-41 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:27958:[2017-06-09 01:16:51,407] INFO [Transaction 
Log Manager 1]: Finished loading 1 transaction metadata from 
__transaction_state-41 in 9 milliseconds 
(kafka.coordinator.transaction.TransactionStateManager)
./worker2/debug/server.log:80970:[2017-06-09 01:16:59,608] INFO [Transaction 
Log Manager 1]: Removed 1 cached transaction metadata for 
__transaction_state-41 on follower transition 
(kafka.coordinator.transaction.TransactionStateManager)
./worker7/debug/server.log:2626:[2017-06-09 01:16:36,882] INFO [Transaction Log 
Manager 3]: Trying to remove cached transaction metadata for 
__transaction_state-41 on follower transition but there is no entries 
remaining; it is likely that another process for removing the cached entries 
has just executed earlier before 

[jira] [Commented] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-08 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043758#comment-16043758
 ] 

Apurva Mehta commented on KAFKA-5415:
-

The interesting thing is that it reproduces consistently on Jenkins only when 
it is part of a multi-test run. Running this test in isolation makes it look 
stable.

> TransactionCoordinator gets stuck in PrepareCommit state
> 
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTIONS` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.
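
A minimal sketch of why the transaction then hangs, with hypothetical names 
rather than the actual coordinator code: if the expiration scan only considers 
transactions in the Ongoing state, anything parked in PrepareCommit or 
PrepareAbort is never reclaimed, so a coordinator that fails between writing 
PrepareXX and writing the markers blocks that transactional id forever.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a timeout scan that cannot expire PrepareXX states.
public class TxnExpirationScan {
    enum State { ONGOING, PREPARE_COMMIT, PREPARE_ABORT, COMPLETE_COMMIT, COMPLETE_ABORT }

    static class TxnMetadata {
        State state;
        long lastUpdateMs;
        long timeoutMs;
    }

    private final Map<String, TxnMetadata> transactions = new ConcurrentHashMap<>();

    void expireTimedOutTransactions(long nowMs) {
        for (Map.Entry<String, TxnMetadata> entry : transactions.entrySet()) {
            TxnMetadata txn = entry.getValue();
            // Only ONGOING transactions are eligible for expiration here.
            // One stuck in PREPARE_COMMIT (coordinator bounced before the
            // markers were written) is never reclaimed, and every later
            // EndTxn from the client sees CONCURRENT_TRANSACTIONS.
            if (txn.state == State.ONGOING && nowMs - txn.lastUpdateMs > txn.timeoutMs)
                abortTransaction(entry.getKey(), txn);
        }
    }

    private void abortTransaction(String transactionalId, TxnMetadata txn) {
        txn.state = State.PREPARE_ABORT; // begin a coordinator-side abort
        txn.lastUpdateMs = System.currentTimeMillis();
    }
}
{code}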



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-08 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043741#comment-16043741
 ] 

Apurva Mehta edited comment on KAFKA-5415 at 6/9/17 12:59 AM:
--

Incriminating evidence from the logs.

1. Tail of the transaction log is PrepareCommit at offset 277
{noformat}
offset: 276 position: 46886 CreateTime: 1496957141444 isvalid: true keysize: 29 
valuesize: 95 magic: 2 compresscodec: NONE producerId: -1 sequence: -1 
isTransactional: false headerKeys: [] key: 
transactionalId=my-first-transactional-id payload: 
producerId:2000,producerEpoch:0,state=Ongoing,partitions=Set(output-topic-2, 
__consumer_offsets-47, output-topic-0, 
output-topic-1),lastUpdateTimestamp=1496957141444
offset: 277 position: 47080 CreateTime: 1496957141285 isvalid: true keysize: 29 
valuesize: 95 magic: 2 compresscodec: NONE producerId: -1 sequence: -1 
isTransactional: false headerKeys: [] key: 
transactionalId=my-first-transactional-id payload: 
producerId:2000,producerEpoch:0,state=PrepareCommit,partitions=Set(output-topic-2,
 __consumer_offsets-47, output-topic-0, 
output-topic-1),lastUpdateTimestamp=1496957141285
{noformat}
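
Worth remembering when reading these dumps: __transaction_state is a compacted 
topic keyed by transactionalId, so the coordinator's view of a transaction is 
whatever the latest record for that key says; the PrepareCommit record at 
offset 277 supersedes the Ongoing record at offset 276. A minimal sketch of 
that materialization (illustrative names only):

{code:java}
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the last record per transactionalId wins.
public class TxnStateMaterializer {
    static class Record {
        final long offset;
        final String transactionalId; // the record key
        final String payload;         // producerId, epoch, state, partitions, ...
        Record(long offset, String transactionalId, String payload) {
            this.offset = offset;
            this.transactionalId = transactionalId;
            this.payload = payload;
        }
    }

    // Replaying the log in offset order leaves only the newest state per key.
    static Map<String, Record> materialize(Iterable<Record> logInOffsetOrder) {
        Map<String, Record> latest = new HashMap<>();
        for (Record r : logInOffsetOrder)
            latest.put(r.transactionalId, r);
        return latest;
    }

    public static void main(String[] args) {
        Map<String, Record> state = materialize(List.of(
                new Record(276, "my-first-transactional-id", "state=Ongoing"),
                new Record(277, "my-first-transactional-id", "state=PrepareCommit")));
        // Prints state=PrepareCommit: the tail of the log is the truth.
        System.out.println(state.get("my-first-transactional-id").payload);
    }
}
{code}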

2. The client disconnected from the coordinator after sending the EndTxn 
request, and got 'CONCURRENT_TRANSACTIONS' on retry.
{noformat}
[2017-06-08 21:25:41,474] TRACE Produced messages to topic-partition 
output-topic-2 with base offset offset 28618 and error: null. 
(org.apache.kafka.clients.producer.internals.ProducerBatch)
[2017-06-08 21:25:41,474] TRACE [TransactionalId my-first-transactional-id] 
Request (type=EndTxnRequest, transactionalId=my-first-transactional-id, 
producerId=2000, producerEpoch=0, result=COMMIT) dequeued for sending 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:25:41,474] DEBUG [TransactionalId my-first-transactional-id] 
Sending transactional request (type=EndTxnRequest, 
transactionalId=my-first-transactional-id, producerId=2000, producerEpoch=0, 
result=COMMIT) to node worker12:9092 (id: 3 rack: null) 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-08 21:26:11,533] DEBUG [TransactionalId my-first-transactional-id] 
Disconnected from 3. Will retry. 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,533] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=FindCoordinatorRequest, 
coordinatorKey=my-first-transactional-id, coordinatorType=TRANSACTION) 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,533] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=EndTxnRequest, 
transactionalId=my-first-transactional-id, producerId=2000, producerEpoch=0, 
result=COMMIT) (org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,533] TRACE [TransactionalId my-first-transactional-id] 
Request (type=FindCoordinatorRequest, coordinatorKey=my-first-transactional-id, 
coordinatorType=TRANSACTION) dequeued for sending 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,534] DEBUG [TransactionalId my-first-transactional-id] 
Sending transactional request (type=FindCoordinatorRequest, 
coordinatorKey=my-first-transactional-id, coordinatorType=TRANSACTION) to node 
worker5:9092 (id: 2 rack: null) 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-08 21:26:11,535] TRACE [TransactionalId my-first-transactional-id] 
Received transactional response FindCoordinatorResponse(throttleTimeMs=0, 
errorMessage='null', error=NONE, node=worker12:9092 (id: 3 rack: null)) for 
request (type=FindCoordinatorRequest, coordinatorKey=my-first-transactional-id, 
coordinatorType=TRANSACTION) 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,535] TRACE [TransactionalId my-first-transactional-id] 
Request (type=EndTxnRequest, transactionalId=my-first-transactional-id, 
producerId=2000, producerEpoch=0, result=COMMIT) dequeued for sending 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,535] DEBUG [TransactionalId my-first-transactional-id] 
Disconnect from worker12:9092 (id: 3 rack: null) while trying to send request 
(type=EndTxnRequest, transactionalId=my-first-transactional-id, 
producerId=2000, producerEpoch=0, result=COMMIT). Going to back off and retry 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-08 21:26:11,535] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=FindCoordinatorRequest, 
coordinatorKey=my-first-transactional-id, coordinatorType=TRANSACTION) 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,535] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=EndTxnRequest, 
transactionalId=my-first-transactional-id, 
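
A minimal sketch of the client-side loop visible in the trace above, with 
hypothetical names rather than the actual TransactionManager code: a disconnect 
while an EndTxn is in flight enqueues a coordinator lookup followed by the same 
EndTxn, so if every retry is answered with CONCURRENT_TRANSACTIONS the commit 
can neither complete nor be aborted.

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the reconnect-and-retry loop in the TRACE lines.
public class EndTxnRetryLoop {
    interface Request { String describe(); }

    static class FindCoordinatorRequest implements Request {
        public String describe() { return "FindCoordinatorRequest"; }
    }

    static class EndTxnRequest implements Request {
        public String describe() { return "EndTxnRequest(result=COMMIT)"; }
    }

    private final Deque<Request> pendingRequests = new ArrayDeque<>();

    // On disconnect: rediscover the coordinator first, then retry the
    // original EndTxn against whatever node the lookup returns.
    void onDisconnect(EndTxnRequest inFlight) {
        pendingRequests.addLast(new FindCoordinatorRequest());
        pendingRequests.addLast(inFlight);
    }

    // If the coordinator answers every retried EndTxn with
    // CONCURRENT_TRANSACTIONS (transaction parked in PrepareCommit),
    // this enqueue/send cycle repeats forever.
    void drainPending() {
        while (!pendingRequests.isEmpty()) {
            Request request = pendingRequests.pollFirst();
            System.out.println("Sending transactional request " + request.describe());
            // network send elided in this sketch
        }
    }
}
{code}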

[jira] [Updated] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommit state

2017-06-08 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5415:

Summary: TransactionCoordinator gets stuck in PrepareCommit state  (was: 
TransactionCoordinator gets stuck in PrepareCommits state.)

> TransactionCoordinator gets stuck in PrepareCommit state
> 
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTIONS` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
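
The failure mode above can be made concrete with a small state-machine sketch. 
This is a hypothetical, simplified Java model, not Kafka's actual (Scala) 
TransactionCoordinator; the class, state names, and return values are 
illustrative assumptions. It shows why a retried EndTxn loops forever once the 
coordinator is parked in PrepareCommit and expiration skips PrepareXX states.

{code:java}
// Hypothetical, simplified model of the stuck-state bug; the names are
// illustrative and do not match Kafka's actual (Scala) coordinator code.
public class StuckPrepareCommitSketch {

    enum TxnState { ONGOING, PREPARE_COMMIT, COMPLETE_COMMIT }

    // The first EndTxn was accepted, but the response was lost on disconnect.
    private TxnState state = TxnState.PREPARE_COMMIT;

    String handleEndTxn() {
        switch (state) {
            case ONGOING:
                state = TxnState.PREPARE_COMMIT;
                return "NONE"; // commit begins; markers would be written next
            case PREPARE_COMMIT:
                // A retried EndTxn while the commit is still "in flight" is
                // rejected, and nothing else ever advances the state.
                return "CONCURRENT_TRANSACTIONS";
            default:
                return "NONE";
        }
    }

    boolean isExpirable() {
        // In this sketch, expiration only considers ONGOING transactions,
        // so a PREPARE_COMMIT entry can never be cleaned up.
        return state == TxnState.ONGOING;
    }

    public static void main(String[] args) {
        StuckPrepareCommitSketch coordinator = new StuckPrepareCommitSketch();
        for (int attempt = 0; attempt < 3; attempt++) {
            // Every reconnect-and-retry gets the same answer, forever.
            System.out.println("EndTxn retry -> " + coordinator.handleEndTxn()
                    + ", expirable=" + coordinator.isExpirable());
        }
    }
}
{code}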


[jira] [Comment Edited] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommits state.

2017-06-08 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043741#comment-16043741
 ] 

Apurva Mehta edited comment on KAFKA-5415 at 6/9/17 12:51 AM:
--

Incriminating evidence from the logs.

1. Tail of the transaction log is PrepareCommit at offset 277
{noformat}
offset: 276 position: 46886 CreateTime: 1496957141444 isvalid: true keysize: 29 
valuesize: 95 magic: 2 compresscodec: NONE producerId: -1 sequence: -1 
isTransactional: false headerKeys: [] key: 
transactionalId=my-first-transactional-id payload: 
producerId:2000,producerEpoch:0,state=Ongoing,partitions=Set(output-topic-2, 
__consumer_offsets-47, output-topic-0, 
output-topic-1),lastUpdateTimestamp=1496957141444
offset: 277 position: 47080 CreateTime: 1496957141285 isvalid: true keysize: 29 
valuesize: 95 magic: 2 compresscodec: NONE producerId: -1 sequence: -1 
isTransactional: false headerKeys: [] key: 
transactionalId=my-first-transactional-id payload: 
producerId:2000,producerEpoch:0,state=PrepareCommit,partitions=Set(output-topic-2,
 __consumer_offsets-47, output-topic-0, 
output-topic-1),lastUpdateTimestamp=1496957141285
{noformat}

2. The client disconnected from the coordinator after sending the EndTxn 
request, and got 'CONCURRENT_TRANSACTIONS' on retry.
{noformat}
[2017-06-08 21:25:41,474] TRACE Produced messages to topic-partition 
output-topic-2 with base offset offset 28618 and error: null. 
(org.apache.kafka.clients.producer.internals.ProducerBatch)
[2017-06-08 21:25:41,474] TRACE [TransactionalId my-first-transactional-id] 
Request (type=EndTxnRequest, transactionalId=my-first-transactional-id, 
producerId=2000, producerEpoch=0, result=COMMIT) dequeued for sending 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:25:41,474] DEBUG [TransactionalId my-first-transactional-id] 
Sending transactional request (type=EndTxnRequest, 
transactionalId=my-first-transactional-id, producerId=2000, producerEpoch=0, 
result=COMMIT) to node worker12:9092 (id: 3 rack: null) 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-08 21:26:11,533] DEBUG [TransactionalId my-first-transactional-id] 
Disconnected from 3. Will retry. 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,533] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=FindCoordinatorRequest, 
coordinatorKey=my-first-transactional-id, coordinatorType=TRANSACTION) 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,533] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=EndTxnRequest, 
transactionalId=my-first-transactional-id, producerId=2000, producerEpoch=0, 
result=COMMIT) (org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,533] TRACE [TransactionalId my-first-transactional-id] 
Request (type=FindCoordinatorRequest, coordinatorKey=my-first-transactional-id, 
coordinatorType=TRANSACTION) dequeued for sending 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,534] DEBUG [TransactionalId my-first-transactional-id] 
Sending transactional request (type=FindCoordinatorRequest, 
coordinatorKey=my-first-transactional-id, coordinatorType=TRANSACTION) to node 
worker5:9092 (id: 2 rack: null) 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-08 21:26:11,535] TRACE [TransactionalId my-first-transactional-id] 
Received transactional response FindCoordinatorResponse(throttleTimeMs=0, 
errorMessage='null', error=NONE, node=worker12:9092 (id: 3 rack: null)) for 
request (type=FindCoordinatorRequest, coordinatorKey=my-first-transactional-id, 
coordinatorType=TRANSACTION) 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,535] TRACE [TransactionalId my-first-transactional-id] 
Request (type=EndTxnRequest, transactionalId=my-first-transactional-id, 
producerId=2000, producerEpoch=0, result=COMMIT) dequeued for sending 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,535] DEBUG [TransactionalId my-first-transactional-id] 
Disconnect from worker12:9092 (id: 3 rack: null) while trying to send request 
(type=EndTxnRequest, transactionalId=my-first-transactional-id, 
producerId=2000, producerEpoch=0, result=COMMIT). Going to back off and retry 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-08 21:26:11,535] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=FindCoordinatorRequest, 
coordinatorKey=my-first-transactional-id, coordinatorType=TRANSACTION) 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,535] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=EndTxnRequest, 
transactionalId=my-first-transactional-id, 
{noformat}
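
The log above shows the producer cycling through the same disconnect, 
FindCoordinator, EndTxn sequence. A hypothetical condensation of that loop is 
sketched below; the method names and the simulated disconnect are illustrative 
stand-ins, not the real TransactionManager API.

{code:java}
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical condensation of the retry loop visible in the log above;
// the names are illustrative stand-ins, not the real TransactionManager API.
public class EndTxnRetryLoopSketch {

    public static void main(String[] args) throws InterruptedException {
        boolean acked = false;
        while (!acked) {
            // FindCoordinatorRequest: re-discover the transaction coordinator.
            String coordinator = findCoordinator("my-first-transactional-id");
            // EndTxnRequest: false here simulates a disconnect before the response.
            acked = sendEndTxn(coordinator);
            if (!acked) {
                System.out.println("Disconnected from " + coordinator + ". Will retry.");
                Thread.sleep(100); // "Going to back off and retry"
            }
        }
    }

    static String findCoordinator(String transactionalId) {
        return "worker12:9092"; // the node reported by FindCoordinatorResponse
    }

    static boolean sendEndTxn(String coordinator) {
        // Simulate the flaky network seen in the log: most attempts disconnect.
        return ThreadLocalRandom.current().nextInt(10) == 0;
    }
}
{code}

Even when an EndTxn does get through in the incident above, the coordinator 
answers CONCURRENT_TRANSACTIONS from PrepareCommit (see the earlier sketch), so 
the real loop never terminates either way.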

[jira] [Commented] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommits state.

2017-06-08 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043741#comment-16043741
 ] 

Apurva Mehta commented on KAFKA-5415:
-

Incriminating evidence from the logs.

1. Tail of the transaction log is PrepareCommit at offset 277
{noformat}
offset: 276 position: 46886 CreateTime: 1496957141444 isvalid: true keysize: 29 
valuesize: 95 magic: 2 compresscodec: NONE producerId: -1 sequence: -1 
isTransactional: false headerKeys: [] key: 
transactionalId=my-first-transactional-id payload: 
producerId:2000,producerEpoch:0,state=Ongoing,partitions=Set(output-topic-2, 
__consumer_offsets-47, output-topic-0, 
output-topic-1),lastUpdateTimestamp=1496957141444
offset: 277 position: 47080 CreateTime: 1496957141285 isvalid: true keysize: 29 
valuesize: 95 magic: 2 compresscodec: NONE producerId: -1 sequence: -1 
isTransactional: false headerKeys: [] key: 
transactionalId=my-first-transactional-id payload: 
producerId:2000,producerEpoch:0,state=PrepareCommit,partitions=Set(output-topic-2,
 __consumer_offsets-47, output-topic-0, 
output-topic-1),lastUpdateTimestamp=1496957141285
{noformat}

2. The client disconnected from the coordinator after sending the EndTxn request
{noformat}
[2017-06-08 21:25:41,474] TRACE Produced messages to topic-partition 
output-topic-2 with base offset offset 28618 and error: null. 
(org.apache.kafka.clients.producer.internals.ProducerBatch)
[2017-06-08 21:25:41,474] TRACE [TransactionalId my-first-transactional-id] 
Request (type=EndTxnRequest, transactionalId=my-first-transactional-id, 
producerId=2000, producerEpoch=0, result=COMMIT) dequeued for sending 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:25:41,474] DEBUG [TransactionalId my-first-transactional-id] 
Sending transactional request (type=EndTxnRequest, 
transactionalId=my-first-transactional-id, producerId=2000, producerEpoch=0, 
result=COMMIT) to node worker12:9092 (id: 3 rack: null) 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-08 21:26:11,533] DEBUG [TransactionalId my-first-transactional-id] 
Disconnected from 3. Will retry. 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,533] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=FindCoordinatorRequest, 
coordinatorKey=my-first-transactional-id, coordinatorType=TRANSACTION) 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,533] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=EndTxnRequest, 
transactionalId=my-first-transactional-id, producerId=2000, producerEpoch=0, 
result=COMMIT) (org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,533] TRACE [TransactionalId my-first-transactional-id] 
Request (type=FindCoordinatorRequest, coordinatorKey=my-first-transactional-id, 
coordinatorType=TRANSACTION) dequeued for sending 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,534] DEBUG [TransactionalId my-first-transactional-id] 
Sending transactional request (type=FindCoordinatorRequest, 
coordinatorKey=my-first-transactional-id, coordinatorType=TRANSACTION) to node 
worker5:9092 (id: 2 rack: null) 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-08 21:26:11,535] TRACE [TransactionalId my-first-transactional-id] 
Received transactional response FindCoordinatorResponse(throttleTimeMs=0, 
errorMessage='null', error=NONE, node=worker12:9092 (id: 3 rack: null)) for 
request (type=FindCoordinatorRequest, coordinatorKey=my-first-transactional-id, 
coordinatorType=TRANSACTION) 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,535] TRACE [TransactionalId my-first-transactional-id] 
Request (type=EndTxnRequest, transactionalId=my-first-transactional-id, 
producerId=2000, producerEpoch=0, result=COMMIT) dequeued for sending 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,535] DEBUG [TransactionalId my-first-transactional-id] 
Disconnect from worker12:9092 (id: 3 rack: null) while trying to send request 
(type=EndTxnRequest, transactionalId=my-first-transactional-id, 
producerId=2000, producerEpoch=0, result=COMMIT). Going to back off and retry 
(org.apache.kafka.clients.producer.internals.Sender)
[2017-06-08 21:26:11,535] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=FindCoordinatorRequest, 
coordinatorKey=my-first-transactional-id, coordinatorType=TRANSACTION) 
(org.apache.kafka.clients.producer.internals.TransactionManager)
[2017-06-08 21:26:11,535] DEBUG [TransactionalId my-first-transactional-id] 
Enqueuing transactional request (type=EndTxnRequest, 
transactionalId=my-first-transactional-id, producerId=2000, producerEpoch=0, 
result=COMMIT) 
{noformat}

[jira] [Commented] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommits state.

2017-06-08 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16043731#comment-16043731
 ] 

Apurva Mehta commented on KAFKA-5415:
-

Logs from one such incident: 
https://issues.apache.org/jira/secure/attachment/12872179/6.tgz

> TransactionCoordinator gets stuck in PrepareCommits state.
> --
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTIONS` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommits state.

2017-06-08 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5415:

Attachment: 6.tgz

> TransactionCoordinator gets stuck in PrepareCommits state.
> --
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTIONS` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommits state.

2017-06-08 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5415:

Labels: exactly-once  (was: )

> TransactionCoordinator gets stuck in PrepareCommits state.
> --
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
>  Labels: exactly-once
> Fix For: 0.11.0.0
>
> Attachments: 6.tgz
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTIONS` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5415) TransactionCoordinator gets stuck in PrepareCommits state.

2017-06-08 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5415:

Summary: TransactionCoordinator gets stuck in PrepareCommits state.  (was: 
TransactionCoordinator get stuck in PrepareCommits state.)

> TransactionCoordinator gets stuck in PrepareCommits state.
> --
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
> Fix For: 0.11.0.0
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTIONS` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5415) TransactionCoordinator get stuck in PrepareCommits state.

2017-06-08 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5415:

Summary: TransactionCoordinator get stuck in PrepareCommits state.  (was: 
TransactionCoordinator )

> TransactionCoordinator get stuck in PrepareCommits state.
> -
>
> Key: KAFKA-5415
> URL: https://issues.apache.org/jira/browse/KAFKA-5415
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
>Priority: Blocker
> Fix For: 0.11.0.0
>
>
> This has been revealed by the system test failures on Jenkins. 
> The transaction coordinator seems to get into a path during the handling of 
> the EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
> COORDINATOR_NOT_AVAILABLE error, to be revealed by 
> https://github.com/apache/kafka/pull/3278) to the client. However, due to 
> network instability, the producer is disconnected before it receives this 
> error.
> As a result, the transaction remains in a `PrepareXX` state, and future 
> `EndTxn` requests sent by the client after reconnecting result in a 
> `CONCURRENT_TRANSACTIONS` error code. Hence the client gets stuck and the 
> transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (KAFKA-5415) TransactionCoordinator

2017-06-08 Thread Apurva Mehta (JIRA)
Apurva Mehta created KAFKA-5415:
---

 Summary: TransactionCoordinator 
 Key: KAFKA-5415
 URL: https://issues.apache.org/jira/browse/KAFKA-5415
 Project: Kafka
  Issue Type: Bug
Reporter: Apurva Mehta
Assignee: Apurva Mehta
Priority: Blocker
 Fix For: 0.11.0.0


This has been revealed by the system test failures on Jenkins. 

The transaction coordinator seems to get into a path during the handling of the 
EndTxnRequest where it returns an error (possibly a NOT_COORDINATOR or 
COORDINATOR_NOT_AVAILABLE error, to be revealed by 
https://github.com/apache/kafka/pull/3278) to the client. However, due to 
network instability, the producer is disconnected before it receives this error.

As a result, the transaction remains in a `PrepareXX` state, and future 
`EndTxn` requests sent by the client after reconnecting result in a 
`CONCURRENT_TRANSACTIONS` error code. Hence the client gets stuck and the 
transaction never finishes, as expiration isn't done from a PrepareXX state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5403) Transactions system test should dedup consumed messages by offset

2017-06-07 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5403:

Status: Patch Available  (was: Open)

> Transactions system test should dedup consumed messages by offset
> -
>
> Key: KAFKA-5403
> URL: https://issues.apache.org/jira/browse/KAFKA-5403
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
> Fix For: 0.11.0.1
>
>
> In KAFKA-5396, we saw that the consumers which verify the data in multiple 
> topics could read the same offsets multiple times, for instance when a 
> rebalance happens. 
> This would detect spurious duplicates, causing the test to fail. We should 
> dedup the consumed messages by offset and only fail the test if we have 
> duplicate values for a unique set of offsets.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
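
As a sketch of the proposed dedup, the check below treats a re-read of the same 
(topic, partition, offset) as benign and only flags a value that appears at two 
distinct offsets. The tuple encoding and names are assumptions for 
illustration; the actual test lives in Kafka's ducktape-based Python system 
tests, so this Java sketch only illustrates the rule.

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Minimal sketch of the proposed dedup, assuming each consumed record is a
// (topic, partition, offset, value) tuple; the names are illustrative.
public class DedupByOffsetSketch {

    /** Returns true only if the same value shows up at two distinct offsets. */
    static boolean hasTrueDuplicates(List<String[]> records) {
        Set<String> seenPositions = new HashSet<>();
        Map<String, Integer> countByValue = new HashMap<>();
        for (String[] r : records) {
            String position = r[0] + "-" + r[1] + "@" + r[2]; // topic-partition@offset
            // Re-reading the same offset (e.g. after a rebalance) is not a duplicate.
            if (!seenPositions.add(position)) continue;
            countByValue.merge(r[3], 1, Integer::sum);
        }
        return countByValue.values().stream().anyMatch(c -> c > 1);
    }

    public static void main(String[] args) {
        // The second record re-reads offset 5 after a rebalance: not a duplicate.
        List<String[]> records = List.of(
                new String[] {"output-topic", "0", "5", "v42"},
                new String[] {"output-topic", "0", "5", "v42"},
                new String[] {"output-topic", "0", "6", "v43"});
        System.out.println(hasTrueDuplicates(records)); // prints: false
    }
}
{code}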


[jira] [Updated] (KAFKA-5403) Transactions system test should dedup consumed messages by offset

2017-06-07 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5403:

Fix Version/s: (was: 0.11.0.0)
   0.11.0.1

> Transactions system test should dedup consumed messages by offset
> -
>
> Key: KAFKA-5403
> URL: https://issues.apache.org/jira/browse/KAFKA-5403
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
> Fix For: 0.11.0.1
>
>
> In KAFKA-5396, we saw that the consumers which verify the data in multiple 
> topics could read the same offsets multiple times, for instance when a 
> rebalance happens. 
> This would detect spurious duplicates, causing the test to fail. We should 
> dedup the consumed messages by offset and only fail the test if we have 
> duplicate values for a unique set of offsets.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (KAFKA-5403) Transactions system test should dedup consumed messages by offset

2017-06-07 Thread Apurva Mehta (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-5403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apurva Mehta updated KAFKA-5403:

Affects Version/s: 0.11.0.0

> Transactions system test should dedup consumed messages by offset
> -
>
> Key: KAFKA-5403
> URL: https://issues.apache.org/jira/browse/KAFKA-5403
> Project: Kafka
>  Issue Type: Bug
>Affects Versions: 0.11.0.0
>Reporter: Apurva Mehta
>Assignee: Apurva Mehta
> Fix For: 0.11.0.1
>
>
> In KAFKA-5396, we saw that the consumers which verify the data in multiple 
> topics could read the same offsets multiple times, for instance when a 
> rebalance happens. 
> This would detect spurious duplicates, causing the test to fail. We should 
> dedup the consumed messages by offset and only fail the test if we have 
> duplicate values for a unique set of offsets.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

