[ 
https://issues.apache.org/jira/browse/KAFKA-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762361#comment-15762361
 ] 

Apurva Mehta commented on KAFKA-4526:
-------------------------------------

I had a look at the logs from one of the failures, and here is the problem: 

# The test has two phases: one bulk producer phase, which seeds the topic with 
large enough quantities of data so that we can actually test throttled 
reassignment. The other phase is the regular produce-consume-validate loop. 
# We start the reassignment, and then run the produce-consume-validate loop to 
ensure that no new messages are lost during reassignment.
# Because the produce-consume-validate pattern uses structured (integer) data 
in phase two, we require that the consumer start from the end of the log and 
also start before the producer begins producing messages. If both conditions 
hold, the consumer will read and validate all the messages sent by the 
producer. The test has a `wait_until` block, but that only checks for the 
existence of the consumer process, not that the consumer is ready to fetch. 
# What is seen in the logs is that the producer starts and begins producing 
messages _before_ the consumer fetches the metadata for all the partitions. As 
a result, the consumer misses the initial messages, which is consistent 
across all test failures. 
# This can be explained by the recent changes in ducktape: thanks to paramiko, 
running commands on worker machines is much faster since ssh connections are 
reused. Hence, the producer starts much faster than before, causing the initial 
set of messages to be missed by the consumer some of the time.
# The fix is to avoid using the PID of the consumer as a proxy for 'the 
consumer is ready'. Something like 'partitions assigned' would be a more 
reliable proxy of the consumer being ready. Note that the original PR of the 
test had a timeout between consumer and producer start since there was no more 
robust method to ensure that the consumer was initialized before the producer 
started. But since the use of timeouts is (rightly!) discouraged, it was 
removed. Adding suitable metrics would be a step in the right direction. 
# The next step is to use a suitable metric (such as partitions assigned, if 
it exists), or add one to the console consumer, to ensure that the consumer is 
initialized before starting the producer.
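
As a rough illustration of the ordering described above, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: `wait_until` is a local stand-in modelled loosely on the ducktape helper of the same name, and `FakeConsumer`, `has_partitions_assigned`, and `start_consumer_then_producer` are hypothetical names, not part of kafkatest or the console consumer. It only shows why polling a readiness signal is a stronger gate than polling for process existence.

```python
import time


def wait_until(condition, timeout_sec, backoff_sec=0.05, err_msg=""):
    """Local stand-in (assumption): poll `condition` until it returns True,
    raising TimeoutError if `timeout_sec` elapses first."""
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(err_msg)
        time.sleep(backoff_sec)


class FakeConsumer:
    """Hypothetical consumer: the process is 'alive' as soon as it starts,
    but partitions are only assigned after a delay."""

    def __init__(self, assign_after_sec):
        self._assign_after_sec = assign_after_sec
        self._started_at = None

    def start(self):
        self._started_at = time.monotonic()

    def alive(self):
        # Analogous to checking that the consumer PID exists.
        return self._started_at is not None

    def has_partitions_assigned(self):
        # Analogous to a 'partitions assigned' readiness signal.
        if self._started_at is None:
            return False
        return time.monotonic() - self._started_at >= self._assign_after_sec


def start_consumer_then_producer(consumer, start_producer, timeout_sec=5):
    """Start the consumer, gate on readiness (not on liveness), then
    start the producer."""
    consumer.start()
    wait_until(consumer.has_partitions_assigned,
               timeout_sec=timeout_sec,
               err_msg="Consumer was not assigned partitions in time")
    start_producer()
```

With this gate, `start_producer` is only invoked once the hypothetical readiness check passes; gating on `consumer.alive` instead would return immediately after `start()`, which is the race described in the list above.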

> Transient failure in ThrottlingTest.test_throttled_reassignment
> ---------------------------------------------------------------
>
>                 Key: KAFKA-4526
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4526
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Ewen Cheslack-Postava
>            Assignee: Apurva Mehta
>              Labels: system-test-failure, system-tests
>             Fix For: 0.10.2.0
>
>
> This test is seeing transient failures
> {quote}
> Module: kafkatest.tests.core.throttling_test
> Class:  ThrottlingTest
> Method: test_throttled_reassignment
> Arguments:
> {
>   "bounce_brokers": false
> }
> {quote}
> This happens with both bounce_brokers = true and false. Fails with
> {quote}
> AssertionError: 1646 acked message did not make it to the Consumer. They are: 
> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19...plus 
> 1626 more. Total Acked: 174799, Total Consumed: 173153. We validated that the 
> first 1000 of these missing messages correctly made it into Kafka's data 
> files. This suggests they were lost on their way to the consumer.
> {quote}
> See 
> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-12--001.1481535295--apache--trunk--62e043a/report.html
>  for an example.
> Note that there are a number of similar bug reports for different tests: 
> https://issues.apache.org/jira/issues/?jql=text%20~%20%22acked%20message%20did%20not%20make%20it%20to%20the%20Consumer%22%20and%20project%20%3D%20Kafka
>  I am wondering if we have a wrong ack setting somewhere that we should be 
> specifying as acks=all but is only defaulting to 0?
> It also seems interesting that the missing messages in these recent failures 
> seem to always start at 0...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
