[ 
https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884085#comment-15884085
 ] 

James Cheng commented on KAFKA-1120:
------------------------------------

I'm not sure if this helps, but I figured out how to trivially reproduce this 
problem.

1. Start 2 brokers.
2. Put 10000 partitions on each of them.
3. Do a controlled shutdown of one of them. It will take its normal 3 attempts 
and then do an uncontrolled shutdown.
4. Once it exits, start it back up immediately.
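
The steps above could be sketched roughly as follows, assuming a stock Kafka 0.10.x distribution with ZooKeeper on localhost:2181 (script paths, config file names, and the topic name are illustrative, not from the original report):

```
# 1. Start 2 brokers (server-1/2.properties differ in broker.id, port, log.dirs).
bin/kafka-server-start.sh -daemon config/server-1.properties
bin/kafka-server-start.sh -daemon config/server-2.properties

# 2. Put 10000 partitions on each broker; replication-factor 2 places a
#    replica of every partition on both brokers.
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --topic repro --partitions 10000 --replication-factor 2

# 3. Controlled shutdown: SIGTERM triggers the broker's shutdown hook.
kill <pid-of-broker-2>

# 4. Once the process exits, start it back up immediately.
bin/kafka-server-start.sh -daemon config/server-2.properties

# Observe the stuck state after the cluster settles:
bin/kafka-topics.sh --describe --zookeeper localhost:2181 \
  --under-replicated-partitions
```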

When the brokers settle back down, you will see that you still have a bunch of 
under-replicated partitions. The brokers will be sitting there idly, with no 
strange behavior in their logs.

I've tested this on 0.10.0.0, 0.10.1.1, and 0.10.2.

What happens in step 4 is that the controller is still busy processing the 
shutdown from step 3. You can see this by looking at all the messages that are 
being written to controller.log. If the broker starts back up before the 
controller is done processing the controlled shutdown, then you will encounter 
this problem.

/cc [~junrao]

In order to make this repro happen faster, I set 
controlled.shutdown.max.retries=1. And, in order to not fill up my hard drive, 
I set log.index.size.max.bytes=100000.
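
As a server.properties fragment, the two settings from the comment above (values as stated; everything else left at defaults):

```
# Give up on controlled shutdown after a single attempt, so the broker
# falls back to uncontrolled shutdown sooner and the repro runs faster.
controlled.shutdown.max.retries=1

# Cap each offset index file at ~100 KB so 10000 partitions do not
# preallocate enough index space to fill the disk.
log.index.size.max.bytes=100000
```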


> Controller could miss a broker state change 
> --------------------------------------------
>
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>              Labels: reliability
>
> When the controller is in the middle of processing a task (e.g., preferred 
> leader election, broker change), it holds a controller lock. During this 
> time, a broker could have de-registered and re-registered itself in ZK. After 
> the controller finishes processing the current task, it will start processing 
> the logic in the broker change listener. However, it will see no broker 
> change and therefore won't do anything to the restarted broker. This broker 
> will be in a weird state since the controller doesn't inform it to become the 
> leader of any partition. Yet, the cached metadata in other brokers could 
> still list that broker as the leader for some partitions. Client requests 
> routed to that broker will then get a TopicOrPartitionNotExistException. This 
> broker will continue to be in this bad state until it's restarted again.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
