Hi Colin,

Which error code in particular, though? As far as I'm aware, there's no existing error code that really captures this situation, and creating a new one would not be backward compatible.
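A minimal sketch of the behavior Colin describes in the quoted reply: a fenced broker rejects client requests with a retriable error, and the client refreshes its metadata, which no longer lists the fenced broker. The error name `FENCED_BROKER` and every function and class name here are hypothetical. Picking a real error code is the open question above. The sketch also simplifies routing: real Kafka clients choose a broker by partition leadership, not "the lowest-numbered live broker".

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical error name: no existing error code clearly fits this case,
# and adding a new one raises a backward-compatibility question.
FENCED_BROKER_ERROR = "FENCED_BROKER"
NO_ERROR = "NONE"
RETRIABLE_ERRORS = {FENCED_BROKER_ERROR}


@dataclass
class Broker:
    broker_id: int
    fenced: bool = False

    def handle_client_request(self) -> str:
        # A fenced broker may be missing data or config changes it should have,
        # so it rejects client requests instead of serving possibly stale state.
        return FENCED_BROKER_ERROR if self.fenced else NO_ERROR


def fetch_metadata(brokers: Dict[int, Broker]) -> List[int]:
    # Fenced brokers no longer appear in the metadata returned to clients.
    return sorted(b.broker_id for b in brokers.values() if not b.fenced)


def send_with_retry(brokers: Dict[int, Broker], target: int,
                    max_attempts: int = 3) -> Tuple[int, str]:
    for attempt in range(max_attempts):
        error = brokers[target].handle_client_request()
        # Stop on success, on a non-retriable error, or when out of attempts.
        if error not in RETRIABLE_ERRORS or attempt == max_attempts - 1:
            return target, error
        # Retriable error: refresh metadata and retry against a broker that is
        # still listed (here simply the lowest-numbered live broker).
        live = fetch_metadata(brokers)
        if not live:
            return target, error
        target = live[0]
    raise ValueError("max_attempts must be positive")
```

For example, with broker 1 fenced and broker 2 running, a request first sent to broker 1 gets the retriable error, the client re-fetches metadata (which now lists only broker 2), and the retry to broker 2 succeeds.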
Cheers,
Tom

On Sat, Oct 24, 2020 at 12:20 AM Jun Rao <j...@confluent.io> wrote:
> Hi, Colin,
>
> Thanks for the reply. A few more comments.
>
> 55. There is still text that favors new broker registration. "When a broker first starts up, when it is in the INITIAL state, it will always "win" broker ID conflicts. However, once it is granted a lease, it transitions out of the INITIAL state. Thereafter, it may lose subsequent conflicts if its broker epoch is stale. (See KIP-380 for some background on broker epoch.) The reason for favoring new processes is to accommodate the common case where a process is killed with kill -9 and then restarted. We want it to be able to reclaim its old ID quickly in this case."
>
> 80.1 Sounds good. Could you document that listeners is a required config now? It would also be useful to annotate other required configs. For example, controller.connect should be required.
>
> 80.2 Could you list all deprecated existing configs? Another one is control.plane.listener.name since the controller no longer sends LeaderAndIsr, UpdateMetadata and StopReplica requests.
>
> 83.1 It seems that the broker can transition from FENCED to RUNNING without registering for a new broker epoch. I am not sure how this works. Once the controller fences a broker, there is no need for the controller to keep the broker epoch around. So, if the fenced broker's heartbeat request with the existing broker epoch will be rejected, leading the broker back to the FENCED state again.
>
> 83.5 Good point on KIP-590. Then should we expose the controller for debugging purposes? If not, we should deprecate the controllerID field in MetadataResponse?
>
> 90. We rejected the shared ID with just one reason "This is not a good idea because NetworkClient assumes a single ID space. So if there is both a controller 1 and a broker 1, we don't have a way of picking the "right" one." This doesn't seem to be a strong reason.
> For example, we could address the NetworkClient issue with the node type as you pointed out or using the negative value of a broker ID as the controller ID.
>
> 100. In KIP-589 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller>, the broker reports all offline replicas due to a disk failure to the controller. It seems this information needs to be persisted to the metadata log. Do we have a corresponding record for that?
>
> 101. Currently, StopReplica request has 2 modes, without deletion and with deletion. The former is used for controlled shutdown and handling disk failure, and causes the follower to stop. The latter is for topic deletion and partition reassignment, and causes the replica to be deleted. Since we are deprecating StopReplica, could we document what triggers the stopping of a follower and the deleting of a replica now?
>
> 102. Should we include the metadata topic in the MetadataResponse? If so, when it will be included and what will the metadata response look like?
>
> 103. "The active controller assigns the broker a new broker epoch, based on the latest committed offset in the log." This seems inaccurate since the latest committed offset doesn't always advance on every log append.
>
> 104. REGISTERING(1) : It says "Otherwise, the broker moves into the FENCED state.". It seems this should be RUNNING?
>
> 105. RUNNING: Should we require the broker to catch up to the metadata log to get into this state?
>
> Thanks,
>
> Jun
>
> On Fri, Oct 23, 2020 at 1:20 PM Colin McCabe <cmcc...@apache.org> wrote:
>
> > On Wed, Oct 21, 2020, at 05:51, Tom Bentley wrote:
> > > Hi Colin,
> > >
> > > > On Mon, Oct 19, 2020, at 08:59, Ron Dagostino wrote:
> > > > > Hi Colin. Thanks for the hard work on this KIP.
> > > > >
> > > > > I have some questions about what happens to a broker when it becomes fenced (e.g.
> > > > > because it can't send a heartbeat request to keep its lease). The KIP says "When a broker is fenced, it cannot process any client requests. This prevents brokers which are not receiving metadata updates or that are not receiving and processing them fast enough from causing issues to clients." And in the description of the FENCED(4) state it likewise says "While in this state, the broker does not respond to client requests." It makes sense that a fenced broker should not accept producer requests -- I assume any such requests would result in NotLeaderOrFollowerException. But what about KIP-392 (fetch from follower) consumer requests? It is conceivable that these could continue. Related to that, would a fenced broker continue to fetch data for partitions where it thinks it is a follower? Even if it rejects consumer requests it might still continue to fetch as a follower. Might it be helpful to clarify both decisions here?
> > > >
> > > > Hi Ron,
> > > >
> > > > Good question. I think a fenced broker should continue to fetch on partitions it was already fetching before it was fenced, unless it hits a problem. At that point it won't be able to continue, since it doesn't have the new metadata. For example, it won't know about leadership changes in the partitions it's fetching. The rationale for continuing to fetch is to try to avoid disruptions as much as possible.
> > > >
> > > > I don't think fenced brokers should accept client requests. The issue is that the fenced broker may or may not have any data it is supposed to have. It may or may not have applied any configuration changes, etc. that it is supposed to have applied. So it could get pretty confusing, and also potentially waste the client's time.
> > >
> > > When fenced, how would the broker reply to a client which did make a request?
> > >
> >
> > Hi Tom,
> >
> > The broker will respond with a retryable error in that case. Once the client has re-fetched its metadata, it will no longer see the fenced broker as part of the cluster. I added a note to the KIP.
> >
> > best,
> > Colin
> >
> > >
> > > Thanks,
> > >
> > > Tom
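A minimal sketch of the broker lifecycle discussed in this thread, assuming only the four states it names (INITIAL, REGISTERING, RUNNING, FENCED). It models two points from the discussion rather than the KIP's text at the time. First, it follows Jun's suggestion in 83.1 that a fenced broker must register again for a new broker epoch instead of going straight from FENCED to RUNNING. Second, it follows Colin's answer to Ron that a fenced broker stops serving client requests but keeps fetching as a follower. The class, method, and table names are hypothetical.

```python
from enum import Enum, auto
from typing import Optional


class BrokerState(Enum):
    INITIAL = auto()
    REGISTERING = auto()
    RUNNING = auto()
    FENCED = auto()


# Allowed transitions. FENCED -> REGISTERING (rather than FENCED -> RUNNING)
# follows point 83.1: once the controller fences a broker it drops the old
# broker epoch, so the broker must register again to obtain a new one.
TRANSITIONS = {
    BrokerState.INITIAL: {BrokerState.REGISTERING},
    BrokerState.REGISTERING: {BrokerState.RUNNING},
    BrokerState.RUNNING: {BrokerState.FENCED},
    BrokerState.FENCED: {BrokerState.REGISTERING},
}


class BrokerLifecycle:
    def __init__(self) -> None:
        self.state = BrokerState.INITIAL
        self.epoch: Optional[int] = None  # broker epoch granted by the controller

    def _move(self, new_state: BrokerState) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state

    def register(self) -> None:
        self._move(BrokerState.REGISTERING)

    def registration_accepted(self, epoch: int) -> None:
        # Validate the transition first so a rejected call leaves the epoch untouched.
        self._move(BrokerState.RUNNING)
        self.epoch = epoch

    def lease_expired(self) -> None:
        # The controller no longer tracks the old epoch, so forget it locally too.
        self._move(BrokerState.FENCED)
        self.epoch = None

    def serves_client_requests(self) -> bool:
        # A fenced broker answers client requests with a retriable error instead.
        return self.state is BrokerState.RUNNING

    def keeps_follower_fetching(self) -> bool:
        # Per Colin's reply: a fenced broker keeps fetching partitions it was
        # already following, until it hits a problem it cannot resolve without
        # fresh metadata (e.g. a leadership change it does not know about).
        return self.state in (BrokerState.RUNNING, BrokerState.FENCED)
```

Registration failures, controlled shutdown, ID-conflict handling (point 55), and the catch-up requirement raised in 105 are deliberately left out, so the transition table is only as complete as the points above.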