Re: [DISCUSS] KIP-631: The Quorum-based Kafka Controller

Tom Bentley Sat, 24 Oct 2020 01:08:28 -0700

Hi Colin,

Which error code in particular, though? As far as I'm aware, there's no
existing error code that really captures this situation, and creating a
new one would not be backward compatible.
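
To make the question concrete, here is a minimal Python sketch of the
behaviour as I understand it from your reply: a fenced broker answers
client requests with a retryable error, and once the client re-fetches its
metadata the fenced broker is no longer listed. All names here (Broker,
Cluster, send, RETRYABLE_PLACEHOLDER) are hypothetical and are not Kafka
APIs. In particular, the placeholder stands in for whichever error code
ends up being chosen.

```python
from dataclasses import dataclass, field
from enum import Enum


class BrokerState(Enum):
    RUNNING = "RUNNING"
    FENCED = "FENCED"


# Hypothetical placeholder: the thread has not settled on a real error code.
RETRYABLE_PLACEHOLDER = "SOME_RETRYABLE_ERROR"


@dataclass
class Broker:
    broker_id: int
    state: BrokerState = BrokerState.RUNNING

    def handle_client_request(self, request: str) -> str:
        # A fenced broker does not serve client requests; it replies with
        # a retryable error so the client knows to refresh its metadata.
        if self.state is BrokerState.FENCED:
            return RETRYABLE_PLACEHOLDER
        return f"OK:{request}"


@dataclass
class Cluster:
    brokers: dict = field(default_factory=dict)

    def metadata(self) -> list:
        # Assumption: fenced brokers are left out of the metadata view
        # that clients receive.
        return sorted(
            b.broker_id
            for b in self.brokers.values()
            if b.state is BrokerState.RUNNING
        )


def send(cluster, cached_broker_ids, request, max_attempts=2):
    """Client loop: on a retryable error, re-fetch metadata and retry."""
    for _ in range(max_attempts):
        if not cached_broker_ids:
            break
        target = cached_broker_ids[0]
        response = cluster.brokers[target].handle_client_request(request)
        if response != RETRYABLE_PLACEHOLDER:
            return target, response
        # The cached view was stale; replace it with fresh metadata.
        cached_broker_ids = cluster.metadata()
    raise RuntimeError("no usable broker after retries")
```

With broker 1 fenced and broker 2 running, a client whose cached view still
lists broker 1 first gets the placeholder error, refreshes, and then
succeeds against broker 2. What the sketch cannot answer is which existing
code the placeholder should map to.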


Cheers,

Tom

On Sat, Oct 24, 2020 at 12:20 AM Jun Rao <j...@confluent.io> wrote:

> Hi, Colin,
>
> Thanks for the reply. A few more comments.
>
> 55. There is still text that favors new broker registration. "When a broker
> first starts up, when it is in the INITIAL state, it will always "win"
> broker ID conflicts.  However, once it is granted a lease, it transitions
> out of the INITIAL state.  Thereafter, it may lose subsequent conflicts if
> its broker epoch is stale.  (See KIP-380 for some background on broker
> epoch.)  The reason for favoring new processes is to accommodate the common
> case where a process is killed with kill -9 and then restarted.  We want it
> to be able to reclaim its old ID quickly in this case."
>
> 80.1 Sounds good. Could you document that listeners is a required config
> now? It would also be useful to annotate other required configs. For
> example, controller.connect should be required.
>
> 80.2 Could you list all deprecated existing configs? Another one is
> control.plane.listener.name since the controller no longer sends
> LeaderAndIsr, UpdateMetadata and StopReplica requests.
>
> 83.1 It seems that the broker can transition from FENCED to RUNNING without
> registering for a new broker epoch. I am not sure how this works. Once the
> controller fences a broker, there is no need for the controller to keep the
> broker epoch around. So the fenced broker's heartbeat request with the
> existing broker epoch will be rejected, leading the broker back to the
> FENCED state again.
>
> 83.5 Good point on KIP-590. Then should we expose the controller for
> debugging purposes? If not, we should deprecate the controllerID field in
> MetadataResponse?
>
> 90. We rejected the shared ID for just one reason: "This is not a good idea
> because NetworkClient assumes a single ID space.  So if there is both a
> controller 1 and a broker 1, we don't have a way of picking the "right"
> one." This doesn't seem to be a strong reason. For example, we could
> address the NetworkClient issue with the node type as you pointed out or
> using the negative value of a broker ID as the controller ID.
>
> 100. In KIP-589
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-589+Add+API+to+update+Replica+state+in+Controller>,
> the broker reports all offline replicas due to a disk failure to the
> controller. It seems this information needs to be persisted to the metadata
> log. Do we have a corresponding record for that?
>
> 101. Currently, StopReplica request has 2 modes, without deletion and with
> deletion. The former is used for controlled shutdown and handling disk
> failure, and causes the follower to stop. The latter is for topic deletion
> and partition reassignment, and causes the replica to be deleted. Since we
> are deprecating StopReplica, could we document what triggers the stopping
> of a follower and the deleting of a replica now?
>
> 102. Should we include the metadata topic in the MetadataResponse? If so,
> when will it be included and what will the metadata response look like?
>
> 103. "The active controller assigns the broker a new broker epoch, based on
> the latest committed offset in the log." This seems inaccurate since the
> latest committed offset doesn't always advance on every log append.
>
> 104. REGISTERING(1) : It says "Otherwise, the broker moves into the FENCED
> state.". It seems this should be RUNNING?
>
> 105. RUNNING: Should we require the broker to catch up to the metadata log
> to get into this state?
>
> Thanks,
>
> Jun
>
>
>
> On Fri, Oct 23, 2020 at 1:20 PM Colin McCabe <cmcc...@apache.org> wrote:
>
> > On Wed, Oct 21, 2020, at 05:51, Tom Bentley wrote:
> > > Hi Colin,
> > >
> > > On Mon, Oct 19, 2020, at 08:59, Ron Dagostino wrote:
> > > > > Hi Colin.  Thanks for the hard work on this KIP.
> > > > >
> > > > > > I have some questions about what happens to a broker when it
> > > > > > becomes fenced (e.g. because it can't send a heartbeat request
> > > > > > to keep its lease).  The KIP says "When a broker is fenced, it
> > > > > > cannot process any client requests.  This prevents brokers
> > > > > > which are not receiving metadata updates or that are not
> > > > > > receiving and processing them fast enough from causing issues
> > > > > > to clients." And in the description of the FENCED(4) state it
> > > > > > likewise says "While in this state, the broker does not respond
> > > > > > to client requests."  It makes sense that a fenced broker
> > > > > > should not accept producer requests -- I assume any such
> > > > > > requests would result in NotLeaderOrFollowerException.  But
> > > > > > what about KIP-392 (fetch from follower) consumer requests?  It
> > > > > > is conceivable that these could continue.  Related to that,
> > > > > > would a fenced broker continue to fetch data for partitions
> > > > > > where it thinks it is a follower?  Even if it rejects consumer
> > > > > > requests it might still continue to fetch as a follower.  Might
> > > > > > it be helpful to clarify both decisions here?
> > > >
> > > > Hi Ron,
> > > >
> > > > > Good question.  I think a fenced broker should continue to fetch on
> > > > > partitions it was already fetching before it was fenced, unless it
> > > > > hits a problem.  At that point it won't be able to continue, since
> > > > > it doesn't have the new metadata.  For example, it won't know about
> > > > > leadership changes in the partitions it's fetching.  The rationale
> > > > > for continuing to fetch is to try to avoid disruptions as much as
> > > > > possible.
> > > > >
> > > > > I don't think fenced brokers should accept client requests.  The
> > > > > issue is that the fenced broker may or may not have any data it is
> > > > > supposed to have.  It may or may not have applied any configuration
> > > > > changes, etc. that it is supposed to have applied.  So it could get
> > > > > pretty confusing, and also potentially waste the client's time.
> > > >
> > > >
> > > When fenced, how would the broker reply to a client which did make a
> > > request?
> > >
> >
> > Hi Tom,
> >
> > The broker will respond with a retryable error in that case.  Once the
> > client has re-fetched its metadata, it will no longer see the fenced
> > broker as part of the cluster.  I added a note to the KIP.
> >
> > best,
> > Colin
> >
> > >
> > > Thanks,
> > >
> > > Tom
> > >
> >
>
