[ https://issues.apache.org/jira/browse/KAFKA-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13467380#comment-13467380 ]
Joel Koshy commented on KAFKA-340:
----------------------------------

This may not be too difficult to implement, but it is tricky to use, so I would like to put down a more detailed description, as that may help come up with more subtleties and corner cases to account for. The actual invocation of shutdown (described at the end) needs to be thought through.

The existing shutdown hook provides a conventional shutdown approach and is convenient from an operations perspective in that it can be easily scripted (just kill -s SIGTERM pid). The problem with the existing shutdown logic, which this jira is meant to address, is that it simply brings the broker down, which takes partitions offline abruptly. This can lead to message loss during rolling bounces - e.g., consider a rolling bounce of brokers B1, B2, where a producer is configured with num-request-acks -1, T1-P1 is on (B1, B2), and B2 is leader. If B1 is bounced, it may fall out of the ISR for T1-P1. B2 continues to receive messages to T1-P1 and commits them immediately. After B1 comes back up, if B2 is bounced, then B1 would become the new leader, but its log would be truncated to the last HW, which is smaller than the committed (and ack'd) offsets on B2.

The original proposal (to wait until there is at least one other broker in ISR before shutting down) is also not without issues. It could lead to long waits during clean shutdown. Furthermore, it involves moving all the leaders on the broker that is being shut down to other brokers. Partitions that are moved later will experience longer outages than the partitions earlier in the list. In addition to avoiding message loss, another goal in clean shutdown should be to keep partitions as available as possible.

Jun suggested another possible approach: before shutting down a broker, pre-emptively relinquish and move the leadership of partitions for which it is currently leader to another broker in ISR.
One benefit of this approach is that leader transition can be done serially, one partition at a time. So each partition will only be unavailable for the period it takes to establish a new leader.

The actual process to shut down broker B may look something like this. (This assumes the command is issued to the current controller but can be adapted for the other options outlined at the end.)

- For each partition T-P that B is leader for:
  - Move leadership to another broker that is in ISR. (So if |ISR| == 1 - meaning only B - and replication factor >= 2 then we have to wait.)
  - Issue the leaderAndISR request to the updated list of brokers in leaderAndISR.
- For all partitions for which B is a follower:
  - Issue a "stopFetcher" for T-P. (Explicitly stopping the fetchers helps reduce the possibility of timeouts on producer requests, since not doing so would cause B to remain in ISR longer.)
  - Issue a leaderAndISR request to the updated list of brokers in leaderAndISR.

Here are a few options for the actual clean-shutdown invocation. These are not mutually exclusive, so we can pick one for now and implement the others later if they make sense.

- Option 1: JMX operation on the Kafka controller to cleanly shut down a specified broker. (If the controller fails during its execution, then the operation will simply fail.) This could also be a JMX operation on individual brokers, but that would require individual brokers to forward the "relinquishLeadership" request to the controller.
- Option 2: The broker's shutdown handler sends a "relinquishLeadership" request to the controller and awaits a response from the controller. See note below on the effect of a failure in this step.
- Option 3: Expose a new command port to issue a "cleanShutdown brokerid" command to the controller. As above, this could be on individual brokers as well. The command port would also be useful for other commands such as "print broker config", "dump stats", etc.
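
To make the per-partition steps of the shutdown process above concrete, here is a minimal Python sketch of what the controller might do. It is not Kafka code: the Partition class, clean_shutdown, HandoffFailed and the request tuples are all hypothetical names used for illustration. It also assumes that the operation fails (rather than waits) when no other broker is in ISR, that the first other ISR member becomes leader, and that the follower step removes B from ISR and also covers the partitions B has just handed off.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str       # e.g. "T1-P1"
    leader: int     # broker id of the current leader
    replicas: list  # broker ids assigned to this partition
    isr: list       # broker ids currently in sync (includes the leader)


class HandoffFailed(Exception):
    """Raised when no other broker in ISR can take over leadership."""


def clean_shutdown(broker, partitions):
    """Return the (request, partition, recipients) tuples a controller might send.

    Hypothetical model only; request names follow the comment's wording.
    """
    sent = []
    # Step 1: hand off leadership serially, one partition at a time.
    for p in partitions:
        if p.leader != broker:
            continue
        candidates = [b for b in p.isr if b != broker]
        if not candidates:
            # |ISR| == 1 (only `broker`): assumed to fail instead of waiting.
            raise HandoffFailed(f"{p.name}: no other broker in ISR")
        p.leader = candidates[0]  # assumption: pick the first other ISR member
        sent.append(("leaderAndISR", p.name, tuple(p.isr)))
    # Step 2: `broker` is now only a follower; stop its fetchers and shrink ISR.
    for p in partitions:
        if broker not in p.replicas:
            continue
        sent.append(("stopFetcher", p.name, (broker,)))
        p.isr = [b for b in p.isr if b != broker]
        sent.append(("leaderAndISR", p.name, tuple(p.isr)))
    return sent


if __name__ == "__main__":
    parts = [Partition("T1-P1", leader=2, replicas=[1, 2], isr=[1, 2])]
    for request in clean_shutdown(2, parts):
        print(request)
```

Note that if the handoff fails partway through, partitions processed earlier have already moved; a retry would simply pick up the remaining ones.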

The main disadvantage with Option 2 is that if the leader transition fails then you cannot really retry - well, you could wait/retry until the transition is successful, but then you would need to introduce a timeout. Introducing a timeout has its own issues - e.g., what is a reasonable timeout; also you would be forced to wait for the timeout during a full cluster shutdown. With Options 1/3 the operation would just fail. (And if you *know* that you are doing a full cluster shutdown you can go ahead and use the SIGTERM-based shutdown.)

Typical scenarios:

- Rolling bounce. After a broker is bounced and comes back up it can take some time for it to return to ISR (if it fell out). In this case, the relinquishLeadership operation of the next broker in the bounce sequence may fail but would succeed after the preceding broker comes back up and returns to ISR.
- Full cluster shutdown. In this case, relinquishLeadership would likely fail for brokers that come later in the shutdown sequence. However, the user would know this is a full cluster shutdown and just use the usual SIGTERM-based shutdown.
- Replication factor of 1: similar to full cluster shutdown.

> ensure ISR has enough brokers during clean shutdown of a broker
> ---------------------------------------------------------------
>
> Key: KAFKA-340
> URL: https://issues.apache.org/jira/browse/KAFKA-340
> Project: Kafka
> Issue Type: Sub-task
> Components: core
> Affects Versions: 0.8
> Reporter: Jun Rao
> Assignee: Joel Koshy
> Priority: Blocker
> Labels: bugs
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> If we are shutting down a broker when the ISR of a partition includes only
> that broker, we could lose some messages that have been previously committed.
> For clean shutdown, we need to guarantee that there is at least 1 other
> broker in ISR after the broker is shut down.

--
This message is automatically generated by JIRA.