[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-03-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15890202#comment-15890202
 ] 

ASF GitHub Bot commented on KAFKA-4779:
---

Github user rajinisivaram closed the pull request at:

https://github.com/apache/kafka/pull/2608


> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Rajini Sivaram
> Fix For: 0.10.3.0, 0.10.2.1
>
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a lot of messages like the following for different request types on 
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in 
> produce request on partition test_topic-0. The topic/partition may not exist 
> or the user may not have Describe access to it 
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887978#comment-15887978
 ] 

ASF GitHub Bot commented on KAFKA-4779:
---

GitHub user rajinisivaram opened a pull request:

https://github.com/apache/kafka/pull/2608

KAFKA-4779: Request metadata in consumer if partitions unavailable

If leader node of one more more partitions in a consumer subscription are 
temporarily unavailable, request metadata refresh so that partitions skipped 
for assignment dont have to wait for metadata expiry before reassignment.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rajinisivaram/kafka 
KAFKA-4779-consumer-metadata

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2608.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2608


commit af5ef2b98314653de5ae098e8c45eb84b81685e7
Author: Rajini Sivaram 
Date:   2017-02-28T12:09:44Z

KAFKA-4779: Request metadata in consumer if partitions unavailable




> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Rajini Sivaram
> Fix For: 0.10.3.0, 0.10.2.1
>
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a lot of messages like the following for different request types on 
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in 
> produce request on partition test_topic-0. The topic/partition may not exist 
> or the user may not have Describe access to it 
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-28 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887934#comment-15887934
 ] 

Ismael Juma commented on KAFKA-4779:


Thanks [~rsivaram]! Hopefully this is the last issue uncovered by this system 
test. :)

> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Rajini Sivaram
> Fix For: 0.10.3.0, 0.10.2.1
>
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a lot of messages like the following for different request types on 
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in 
> produce request on partition test_topic-0. The topic/partition may not exist 
> or the user may not have Describe access to it 
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-28 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15887913#comment-15887913
 ] 

Rajini Sivaram commented on KAFKA-4779:
---

The new failure looks like a very different issue. The consumer logs show:

{quote}
[2017-02-26 05:14:28,833] WARN Error while fetching metadata with correlation 
id 10712 : test_topic=LEADER_NOT_AVAILABLE 
(org.apache.kafka.clients.NetworkClient)

[2017-02-26 05:14:29,655] DEBUG Received successful JoinGroup response for 
group group: org.apache.kafka.common.requests.JoinGroupResponse@7802468d 
(org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
[2017-02-26 05:14:29,655] DEBUG Performing assignment for group group using 
strategy range with subscriptions 
console-consumer-810cd1cc-eb5b-4d6d-bfb5-6e094e107d14=Subscription(topics=[test_topic])(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
[2017-02-26 05:14:29,655] DEBUG Skipping assignment for topic test_topic since 
no metadata is available 
(org.apache.kafka.clients.consumer.internals.AbstractPartitionAssignor)
[2017-02-26 05:14:29,655] WARN The following subscribed topics are not assigned 
to any members in the group group : [test_topic]  
(org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
{quote}

Even though partitions of the subscribed topic could not be assigned because no 
metadata is available due to a transient error resulting from broker restart, 
no metadata refresh is requested. The consumer times out after a minute. 
Metadata expiry is 5 minutes, so rebalancing will be delayed for 5 minutes. 
Consumers currently request metadata when leader is not known during fetching, 
but not if assignment cannot be performed. I think we should request metadata 
sooner for this case. Will submit a PR.


> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Rajini Sivaram
> Fix For: 0.10.3.0, 0.10.2.1
>
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a 

[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-27 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15885716#comment-15885716
 ] 

Ismael Juma commented on KAFKA-4779:


The test failed again, this time with a different message:

{code}

test_id:
kafkatest.tests.core.security_rolling_upgrade_test.TestSecurityRollingUpgrade.test_rolling_upgrade_phase_two.broker_protocol=SASL_PLAINTEXT.client_protocol=SSL
status: FAIL
run time:   4 minutes 32.586 seconds


1152 acked message did not make it to the Consumer. They are: 12288, 12289, 
12290, 12291, 12292, 12293, 12294, 12295, 12296, 12297, 12298, 12299, 12300, 
12301, 12302, 12303, 12304, 12305, 12306, 12307...plus 1132 more. Total Acked: 
12184, Total Consumed: 11032. We validated that the first 1000 of these missing 
messages correctly made it into Kafka's data files. This suggests they were 
lost on their way to the consumer.
Traceback (most recent call last):
  File 
"/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
 line 123, in run
data = self.run_test()
  File 
"/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
 line 176, in run_test
return self.test_context.function(self.test)
  File 
"/var/lib/jenkins/workspace/system-test-kafka/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
 line 321, in wrapper
return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File 
"/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
 line 148, in test_rolling_upgrade_phase_two
self.run_produce_consume_validate(self.roll_in_secured_settings, 
client_protocol, broker_protocol)
  File 
"/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
 line 117, in run_produce_consume_validate
self.validate()
  File 
"/var/lib/jenkins/workspace/system-test-kafka/kafka/tests/kafkatest/tests/produce_consume_validate.py",
 line 179, in validate
assert success, msg
AssertionError: 1152 acked message did not make it to the Consumer. They are: 
12288, 12289, 12290, 12291, 12292, 12293, 12294, 12295, 12296, 12297, 12298, 
12299, 12300, 12301, 12302, 12303, 12304, 12305, 12306, 12307...plus 1132 more. 
Total Acked: 12184, Total Consumed: 11032. We validated that the first 1000 of 
these missing messages correctly made it into Kafka's data files. This suggests 
they were lost on their way to the consumer.

{code}

http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-02-26--001.1488103947--apache--trunk--5b682ba/report.html
http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-02-26--001.1488103947--apache--trunk--5b682ba/TestSecurityRollingUpgrade/test_rolling_upgrade_phase_two/broker_protocol=SASL_PLAINTEXT.client_protocol=SSL/62.tgz


> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Rajini Sivaram
> Fix For: 0.10.3.0, 0.10.2.1
>
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> 

[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15883144#comment-15883144
 ] 

ASF GitHub Bot commented on KAFKA-4779:
---

Github user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2589


> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Rajini Sivaram
> Fix For: 0.10.3.0, 0.10.2.1
>
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a lot of messages like the following for different request types on 
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in 
> produce request on partition test_topic-0. The topic/partition may not exist 
> or the user may not have Describe access to it 
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-23 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880684#comment-15880684
 ] 

Rajini Sivaram commented on KAFKA-4779:
---

I couldn't recreate the failure, but the code for phase_two of security upgrade 
with different client_protocol and broker_protocol is currently a disruptive 
upgrade that stops produce and consume during the upgrade. As a result, 
consumer can timeout if the upgrade takes slightly longer than expected.

Non-disruptive upgrade of cluster to enable new security protocols is described 
in the docs (http://kafka.apache.org/documentation/#security_rolling_upgrade). 
The new protocols must be enabled first with incremental bounce. And then the 
inter-broker protocol is updated with incremental bounce. And finally, the old 
protocol is removed. When client_protocol (SASL_PLAINTEXT) and broker_protocol 
(SSL) are being updated to different protocols starting with PLAINTEXT, both 
SASL_PLAINTEXT and SSL must be enabled first before inter-broker protocol is 
changed to SSL. The test was enabling only SASL_PLAINTEXT. As a result 
inter-broker communication was broken during the upgrade, causing produce and 
consume to fail until the cluster got back to a good state. Since the purpose 
of the test is to verify non-disruptive upgrade, I have changed the test to 
enable both SASL_PLAINTEXT and SSL first so that the upgrade is performed 
without disrupting producers or consumers.

> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Rajini Sivaram
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a lot of messages like the following for different request types on 
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in 
> produce request on partition test_topic-0. The topic/partition may not exist 
> or the user may not have Describe access to it 
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}


[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880642#comment-15880642
 ] 

ASF GitHub Bot commented on KAFKA-4779:
---

GitHub user rajinisivaram opened a pull request:

https://github.com/apache/kafka/pull/2589

KAFKA-4779: Fix security upgrade system test to be non-disruptive

The phase_two security upgrade test verifies upgrading inter-broker and 
client protocols to the same value as well as different values. The second case 
currently changes inter-broker protocol without first enabling the protocol, 
disrupting produce/consume until the whole cluster is updated. This commit 
changes the test to be a non-disruptive upgrade test that enables protocols 
first (simulating phase one of upgrade).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rajinisivaram/kafka KAFKA-4779

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2589.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2589


commit 71aa4c4fa8035695ac460797b84a6754576dc99f
Author: Rajini Sivaram 
Date:   2017-02-23T14:57:45Z

KAFKA-4779: Fix security upgrade system test to be non-disruptive




> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>Assignee: Rajini Sivaram
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a lot of messages like the following for different request types on 
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in 
> produce request on partition test_topic-0. The topic/partition may not exist 
> or the user may not have Describe access to it 
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-21 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877047#comment-15877047
 ] 

Apurva Mehta commented on KAFKA-4779:
-

[~rsivaram] here you go: 
http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-02-15--001.1487153551--apache--trunk--4db048d/TestSecurityRollingUpgrade/test_rolling_upgrade_phase_two/broker_protocol=SSL.client_protocol=SASL_PLAINTEXT/66.tgz



> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a lot of messages like the following for different request types on 
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in 
> produce request on partition test_topic-0. The topic/partition may not exist 
> or the user may not have Describe access to it 
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-21 Thread Rajini Sivaram (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877030#comment-15877030
 ] 

Rajini Sivaram commented on KAFKA-4779:
---

[~apurva] Do you by any chance have the full logs of a failure? Can you attach 
a zip to the JIRA if they are not too big? I tried running the tests locally, 
but they passed for me. Can't think of anything that changed recently that 
could result in this failure.

> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a lot of messages like the following for different request types on 
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in 
> produce request on partition test_topic-0. The topic/partition may not exist 
> or the user may not have Describe access to it 
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4779) Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py

2017-02-17 Thread Apurva Mehta (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872862#comment-15872862
 ] 

Apurva Mehta commented on KAFKA-4779:
-

[~rsivaram] do you have an idea about what could be going wrong here? these 
tests have been failing for maybe 4 weeks now on and off.

> Failure in kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py
> 
>
> Key: KAFKA-4779
> URL: https://issues.apache.org/jira/browse/KAFKA-4779
> Project: Kafka
>  Issue Type: Bug
>Reporter: Apurva Mehta
>
> This test failed on 01/29, on both trunk and 0.10.2, error message:
> {noformat}
> The consumer has terminated, or timed out, on node ubuntu@worker3.
> Traceback (most recent call last):
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 123, in run
> data = self.run_test()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/tests/runner_client.py",
>  line 176, in run_test
> return self.test_context.function(self.test)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/venv/local/lib/python2.7/site-packages/ducktape-0.6.0-py2.7.egg/ducktape/mark/_mark.py",
>  line 321, in wrapper
> return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/core/security_rolling_upgrade_test.py",
>  line 148, in test_rolling_upgrade_phase_two
> self.run_produce_consume_validate(self.roll_in_secured_settings, 
> client_protocol, broker_protocol)
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 100, in run_produce_consume_validate
> self.stop_producer_and_consumer()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 87, in stop_producer_and_consumer
> self.check_alive()
>   File 
> "/var/lib/jenkins/workspace/system-test-kafka-0.10.2/kafka/tests/kafkatest/tests/produce_consume_validate.py",
>  line 79, in check_alive
> raise Exception(msg)
> Exception: The consumer has terminated, or timed out, on node ubuntu@worker3.
> {noformat}
> Looks like the console consumer times out: 
> {noformat}
> [2017-01-30 04:56:00,972] ERROR Error processing message, terminating 
> consumer process:  (kafka.tools.ConsoleConsumer$)
> kafka.consumer.ConsumerTimeoutException
> at kafka.consumer.NewShinyConsumer.receive(BaseConsumer.scala:90)
> at kafka.tools.ConsoleConsumer$.process(ConsoleConsumer.scala:120)
> at kafka.tools.ConsoleConsumer$.run(ConsoleConsumer.scala:75)
> at kafka.tools.ConsoleConsumer$.main(ConsoleConsumer.scala:50)
> at kafka.tools.ConsoleConsumer.main(ConsoleConsumer.scala)
> {noformat}
> A bunch of these security_rolling_upgrade tests failed, and in all cases, the 
> producer produced ~15k messages, of which ~7k were acked, and the consumer 
> only got around ~2600 before timing out. 
> There are a lot of messages like the following for different request types on 
> the producer and consumer:
> {noformat}
> [2017-01-30 05:13:35,954] WARN Received unknown topic or partition error in 
> produce request on partition test_topic-0. The topic/partition may not exist 
> or the user may not have Describe access to it 
> (org.apache.kafka.clients.producer.internals.Sender)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)