[
https://issues.apache.org/jira/browse/KAFKA-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17869699#comment-17869699
]
Kirk True commented on KAFKA-17219:
-----------------------------------
Thanks for tracking this down, [~dongnuolyu].
What perplexes me is that we'd uncovered many similar issues in our initial
migration of the system tests to support the new consumer. We'd found the same
root issue before, namely that the new consumer takes some time to stabilize
its groups. Our fix was to switch checks like these:
{code:python}
assert something_is_true()
{code}
To something that allows a few seconds for that check to become true:
{code:python}
# wait_until is ducktape's polling helper (ducktape.utils.util)
wait_until(lambda: something_is_true(), timeout_sec=15)
{code}
So it's really odd to me that we're still seeing these. Or maybe those changes
were reverted? :(
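
For anyone unfamiliar with the pattern, here is a rough, self-contained sketch of what that kind of wait does. It is not the actual system test code: {{wait_for}}, {{FakeGroup}} and {{all_partitions_owned}} are hypothetical names made up for illustration, and the real tests would use the framework's own helper instead.

```python
import time


def wait_for(condition, timeout_sec, backoff_sec=0.1):
    """Poll condition() until it returns truthy or timeout_sec elapses.

    Returns True if the condition became true in time, False otherwise.
    """
    deadline = time.monotonic() + timeout_sec
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(backoff_sec)


class FakeGroup:
    """Hypothetical stand-in for a consumer group that needs a few polls
    before every partition has an owner."""

    def __init__(self, polls_until_stable):
        self.polls_left = polls_until_stable

    def all_partitions_owned(self):
        self.polls_left -= 1
        return self.polls_left <= 0


group = FakeGroup(polls_until_stable=3)

# A bare assert on the first check would fail here, because the group has
# not stabilized yet; polling with a timeout gives it time to settle.
assert wait_for(group.all_partitions_owned, timeout_sec=5, backoff_sec=0.01)
```

The point is only that the check is retried until a deadline rather than evaluated once, which is what lets a group that is still reconciling pass.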
> Adjust system test framework for new protocol consumer
> ------------------------------------------------------
>
> Key: KAFKA-17219
> URL: https://issues.apache.org/jira/browse/KAFKA-17219
> Project: Kafka
> Issue Type: Task
> Components: clients, consumer, system tests
> Reporter: Dongnuo Lyu
> Priority: Major
>
> The current test framework doesn't work well with the existing tests using
> the new consumer protocol. There are two main issues I've seen.
>
> First, we sometimes assume there is no rebalance triggered, for instance in
> {{consumer_test.py::test_consumer_failure}}
> {code:python}
> # verify that there were no rebalances on failover
> assert num_rebalances == consumer.num_rebalances(), "Broker failure should not cause a rebalance"
> {code}
> The current framework calculates {{num_rebalances}} by incrementing it by one
> every time a new assignment is received, so if a reconciliation happens
> during the failover, {{num_rebalances}} is also incremented. For the new
> protocol we need a new way to update {{num_rebalances}}.
>
> Second, for the new protocol, we need a way to make sure all members have
> joined *and stabilized*. Currently we only make sure all members have
> joined (the event handlers are all in the Joined state), at which point some
> partitions may not have been assigned yet and more time is needed for
> reconciliation. This can cause failures such as timeouts while waiting for
> consumption, and failed assertions like
> {code:python}
> partition_owner = consumer.owner(partition)
> assert partition_owner is not None
> {code}
>
> As a short-term solution, we can make the tests pass by adding
> {{time.sleep}} calls or by skipping the {{num_rebalances}} check. To truly fix
> them, we should adjust
> {{tools/src/main/java/org/apache/kafka/tools/VerifiableConsumer.java}} to
> work well with the new protocol.