[
https://issues.apache.org/jira/browse/KAFKA-4574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15905778#comment-15905778
]
Apurva Mehta commented on KAFKA-4574:
-------------------------------------
It appears than different messages wound up at the same offset:
{noformat}
amehta-macbook-pro:50 apurva$ grep 4738
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep
producer_send_success
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"4738","time_ms":1489036862544,"offset":1496,"key":null}
amehta-macbook-pro:50 apurva$ grep 5614
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep
producer_send_success
{"topic":"test_topic","partition":2,"name":"producer_send_success","value":"5614","time_ms":1489036872096,"offset":1496,"key":null}
{noformat}
Note that two unique messages were acked at offset 1496 in partition 2. In the
log, only the second message (by timestamp) persists. So the first message was
lost. What happened during the test is that the consumer read offset 1496 which
had the first value at that time, and then moved on. So the second message was
never read, and was reported missing.
This may be a scenario where KIP-101 would solve the problem? The brokers are
being continually bounced in the test, so it is possible that once in a while
the wrong truncation happens and messages are lost.
Another interesting finding: the producer received acks for 6425 messages. But
there are 6427 unique messages in the log, and the consumer also read 6427
unique messages. So some unacked messages seem to have made it into the log.
There were three messages which were not acked, but are in the log, and all of
them have this error:
{noformat}
amehta-macbook-pro:50 apurva$ grep 4728
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep
producer_send_error
{"topic":"test_topic","message":"This server is not the leader for that
topic-partition.","exception":"class
org.apache.kafka.common.errors.NotLeaderForPartitionException","name":"producer_send_error","value":"4728","time_ms":1489036862629,"key":null}
amehta-macbook-pro:50 apurva$ grep 2844
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep
producer_send_error
{"topic":"test_topic","message":"Messages are written to the log, but to fewer
in-sync replicas than required.","exception":"class
org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException","name":"producer_send_error","value":"2844","time_ms":1489036843590,"key":null}
amehta-macbook-pro:50 apurva$ grep 5588
VerifiableProducer-0-140193439058000/worker5/verifiable_producer.stdout | grep
producer_send_error
{"topic":"test_topic","message":"Messages are written to the log, but to fewer
in-sync replicas than required.","exception":"class
org.apache.kafka.common.errors.NotEnoughReplicasAfterAppendException","name":"producer_send_error","value":"5588","time_ms":1489036871054,"key":null}
{noformat}
This also seems problematic, since the producer config shows acks=-1, and
retries=0. So the producer considered this a failure, but the messages wound up
in the log. This also seems like a problem, but it is not the cause of this
test failure.
> Transient failure in ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade
> with security_protocol = SASL_PLAINTEXT, SSL
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: KAFKA-4574
> URL: https://issues.apache.org/jira/browse/KAFKA-4574
> Project: Kafka
> Issue Type: Test
> Components: system tests
> Reporter: Shikhar Bhushan
> Assignee: Apurva Mehta
>
> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2016-12-29--001.1483003056--apache--trunk--dc55025/report.html
> {{ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade}} failed with these
> {{security_protocol}} parameters
> {noformat}
> ====================================================================================================
> test_id:
> kafkatest.tests.core.zookeeper_security_upgrade_test.ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade.security_protocol=SASL_PLAINTEXT
> status: FAIL
> run time: 3 minutes 44.094 seconds
> 1 acked message did not make it to the Consumer. They are: [5076]. We
> validated that the first 1 of these missing messages correctly made it into
> Kafka's data files. This suggests they were lost on their way to the consumer.
> Traceback (most recent call last):
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
> line 123, in run
> data = self.run_test()
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
> line 176, in run_test
> return self.test_context.function(self.test)
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
> line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/zookeeper_security_upgrade_test.py",
> line 117, in test_zk_security_upgrade
> self.run_produce_consume_validate(self.run_zk_migration)
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
> line 101, in run_produce_consume_validate
> self.validate()
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
> line 163, in validate
> assert success, msg
> AssertionError: 1 acked message did not make it to the Consumer. They are:
> [5076]. We validated that the first 1 of these missing messages correctly
> made it into Kafka's data files. This suggests they were lost on their way to
> the consumer.
> {noformat}
> {noformat}
> ====================================================================================================
> test_id:
> kafkatest.tests.core.zookeeper_security_upgrade_test.ZooKeeperSecurityUpgradeTest.test_zk_security_upgrade.security_protocol=SSL
> status: FAIL
> run time: 3 minutes 50.578 seconds
> 1 acked message did not make it to the Consumer. They are: [3559]. We
> validated that the first 1 of these missing messages correctly made it into
> Kafka's data files. This suggests they were lost on their way to the consumer.
> Traceback (most recent call last):
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
> line 123, in run
> data = self.run_test()
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
> line 176, in run_test
> return self.test_context.function(self.test)
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
> line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/zookeeper_security_upgrade_test.py",
> line 117, in test_zk_security_upgrade
> self.run_produce_consume_validate(self.run_zk_migration)
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
> line 101, in run_produce_consume_validate
> self.validate()
> File
> "/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
> line 163, in validate
> assert success, msg
> AssertionError: 1 acked message did not make it to the Consumer. They are:
> [3559]. We validated that the first 1 of these missing messages correctly
> made it into Kafka's data files. This suggests they were lost on their way to
> the consumer.
> {noformat}
> Previously: KAFKA-3985
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)