[ 
https://issues.apache.org/jira/browse/KAFKA-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15519348#comment-15519348
 ] 

Apurva Mehta commented on KAFKA-4215:
-------------------------------------

Here is [~junrao]'s theory for what is going on:

{quote}
For the 2nd problem that Apurva discovered, i.e., consumer loses messages 
during reassignment with bounce and replication factor 1, the problem seems 
likely due to an existing limitation. Let's say replication factor is 1 and we 
want to move the replica to a different broker. When the reassignment tool 
starts, the controller will add both the old and the new replica to assigned 
replicas and let the new replica be the follower to catch up from the old 
replica (which is the leader). Now if we stop the broker where the old replica 
is in, the controller will select the new replica as the new leader even though 
it's not fully caught up yet since by default we allow unclean leader election. 
So in the end, some of the messages may only be in the old replica but not in 
the new replica, and some other messages may only be in the new replica, but 
not the old replica, due to the unclean leader election. Then the consumer will 
be missing some of the messages since it only reads from the leader. When 
verifying the log data, we simply accumulate messages from both replicas. So, 
all messages can still be found. We can verify if this is the case. If so, we 
can file a jira to track this but don't have to fix it immediately since it 
only happens when replication factor is 1, unclean leader election is enabled, 
and there are bounces during partition reassignment. So, it should be rare. As 
an immediate fix for the test, we can either use replication factor 2 or 
disable unclean leader election. It would be useful to improve the log 
verification part as well. Instead of accumulating messages from all replicas, 
we probably want to verify that all replicas are identical.
{quote}
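
To make the sequence in the quote concrete, here is a toy model of it in Python. It is only a sketch: the function name, the message ids, and the counts are made up for illustration, and it assumes the consumer reads everything from whichever replica ends up as leader. It does not use any Kafka API.

```python
# Toy model of reassignment with replication factor 1 plus a broker bounce.
# All names and numbers here are hypothetical, chosen only for illustration.

def simulate_unclean_election(total_acked=10, caught_up=6, written_after_failover=3):
    """Return (acked, old_replica, new_replica, consumed) as lists of message ids."""
    # The old replica is the leader and holds every message acked so far.
    old_replica = list(range(total_acked))
    # The new replica is a follower that has fetched only a prefix of the log.
    new_replica = old_replica[:caught_up]
    # The old broker is stopped. With unclean leader election enabled, the
    # lagging new replica becomes leader and accepts further writes.
    later_writes = list(range(total_acked, total_acked + written_after_failover))
    new_replica = new_replica + later_writes
    # Every message the producer saw acknowledged.
    acked = old_replica + later_writes
    # The consumer reads only from the current leader (the new replica).
    consumed = list(new_replica)
    return acked, old_replica, new_replica, consumed


acked, old, new, consumed = simulate_unclean_election()
missing = sorted(set(acked) - set(consumed))
in_union = set(acked) <= (set(old) | set(new))
print("missing from consumer:", missing)          # [6, 7, 8, 9]
print("all acked found across replicas:", in_union)  # True
```

In this model the consumer misses the messages that existed only on the old replica, while a check over the union of both replicas still finds every acked message.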

The facts back up this hypothesis:
# unclean leader election was enabled in this test.
# the log verification aggregates the data in all the replicas for a partition 
into a single set, and is not representative of the data the consumers actually 
see.
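
The difference between the two verification strategies can be sketched as follows. This is not the actual system test code: `union_check` and `replicas_identical` are hypothetical helpers, and each replica log is assumed to be available as a plain list of message ids.

```python
# Hypothetical helpers contrasting the two log verification strategies.

def union_check(acked, replicas):
    """Pass if every acked message appears in at least one replica's log."""
    seen = set()
    for log in replicas.values():
        seen.update(log)
    return set(acked) <= seen


def replicas_identical(replicas):
    """Pass only if every replica's log has the same messages in the same order."""
    logs = list(replicas.values())
    return all(log == logs[0] for log in logs[1:])


# Diverged logs, as could result from an unclean leader election.
replicas = {"old": [0, 1, 2, 3], "new": [0, 1, 4]}
acked = [0, 1, 2, 3, 4]

print(union_check(acked, replicas))   # True: the divergence is hidden
print(replicas_identical(replicas))   # False: the divergence is detected
```

The union-based check passes even though no single replica, and therefore no consumer, can see all acked messages, whereas the stricter check flags the divergence.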


> Consumers miss messages during partition reassignment
> -----------------------------------------------------
>
>                 Key: KAFKA-4215
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4215
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Apurva Mehta
>            Assignee: Apurva Mehta
>
> In the specific case where the replication-factor of a topic is 1, when 
> partition reassignment is ongoing, and when a broker is bounced, consumers 
> reliably lose some messages in the stream. 
> This can be reproduced in system tests where the following error message is 
> observed:
> {noformat}
> AssertionError: 737 acked message did not make it to the Consumer. They are: 
> 22530, 45059, 22534, 45063, 22538, 45067, 22542, 45071, 22546, 45075, 22550, 
> 45079, 22554, 45083, 22558, 45087, 22562, 45091, 22566, 45095, ...plus 717 
> more. Total Acked: 51809, Total Consumed: 51073. We validated that the first 
> 737 of these missing messages correctly made it into Kafka's data files. This 
> suggests they were lost on their way to the consumer.
> {noformat}



