Re: [DISCUSS] KIP-631: The Quorum-based Kafka Controller

Colin McCabe Mon, 28 Sep 2020 09:24:43 -0700

On Fri, Sep 25, 2020, at 17:35, Jun Rao wrote:
> Hi, Colin,
> 
> Thanks for the reply.
> 
> 62. Thinking about this more, I am wondering what's our overall strategy
> for configs shared between the broker and the controller. For example, both
> the broker and the controller have to define listeners. So both need
> configs like listeners/advertised.listeners. Both the new controller and
> the broker replicate data. So both need to define some replication related
> configurations (replica.fetch.min.bytes, replica.fetch.wait.max.ms, etc).
> Should we let the new controller share those configs with brokers or should
> we define new configs just for the controller? It seems that we want to
> apply the same strategy consistently for all shared configs?
>


Hi Jun,

That's a good question.  I think that we should share as many configurations as 
possible.  There will be a few cases where this isn't possible, and we need to 
create a new configuration key that is controller-only, but I think that will 
be relatively rare.

>
> 63. If a node has process.roles set to controller, does one still need to
> set broker.id on this node?
> 

No, broker.id does not need to be set in that case.

best,
Colin

>
> Thanks,
> 
> Jun
> 
> 
> 
> On Fri, Sep 25, 2020 at 2:17 PM Colin McCabe <[email protected]> wrote:
> 
> > On Fri, Sep 25, 2020, at 10:17, Jun Rao wrote:
> > > Hi, Colin,
> > >
> > > Thanks for the reply.
> > >
> > > 60. Yes, I think you are right. We probably need the controller id when a
> > > broker starts up. A broker only stores the Raft leader id in the metadata
> > > file. To do the initial fetch to the Raft leader, it needs to know the
> > > host/port associated with the leader id.
> > >
> > > 62. It seems there are 2 parts to this : (1) which listener a client
> > should
> > > use to initiate a connection to the controller and (2) which listener
> > > should a controller use to accept client requests. For (1), at any point
> > of
> > > time, a client only needs to use one listener. I think
> > > controller.listener.name is meant for the client.
> >
> > Hi Jun,
> >
> > controller.listener.names is also used by the controllers.  In the case
> > where we have a broker and a controller in the same JVM, we have a single
> > config file.  Then we need to know which listeners belong to the controller
> > and which belong to the broker component.  That's why it's a list.
> >
> > > So, a single value seems
> > > to make more sense. Currently, we don't have a configuration for (2). We
> > > could add a new one for that and support a list. I am wondering how
> > useful
> > > it will be. One example that I can think of is that we can reject
> > > non-controller related requests if accepted on controller-only listeners.
> > > However, we already support separate authentication for the controller
> > > listener. So, not sure how useful it is.
> >
> > The controller always has a separate listener and does not share listeners
> > with the broker.  The main reason for this is to avoid giving clients the
> > ability to launch a denial-of-service attack on the controller by flooding
> > a broker port.  A lot of times, these attacks are made unintentionally by
> > poorly coded or configured clients.  Additionally, broker ports can also be
> > very busy with large fetch requests, and so on.  It's just a bad
> > configuration in general to try to overload the same port for the
> > controller and broker, and we don't want to allow it.  It would be a
> > regression to go from the current system where control requests are safely
> > isolated on a separate port, to one where they're not.  It also makes the
> > code and configuration a lot messier.
> >
> > >
> > > 63. (a) I think most users won't know controller.id defaults to
> > broker.id +
> > > 3000. So, it can be confusing for them to set up controller.connect. If
> > > this is truly needed, it seems that it's less confusing to make
> > > controller.id required.
> > > (b) I am still trying to understand if we truly need to expose a
> > > controller.id. What if we only expose broker.id and let
> > controller.connect
> > > just use broker.id? What will be missing?
> >
> > The controller has a separate ID from the broker, so knowing broker.id is
> > not helpful here.
> >
> > best,
> > Colin
> >
> > >
> > > Thanks,
> > >
> > > Jun
> > >
> > > On Thu, Sep 24, 2020 at 10:55 PM Colin McCabe <[email protected]>
> > wrote:
> > >
> > > > On Thu, Sep 24, 2020, at 16:24, Jun Rao wrote:
> > > > > Hi, Colin,
> > > > >
> > > > > Thanks for the reply and the updated KIP. A few more comments below.
> > > > >
> > > >
> > > > Hi Jun,
> > > >
> > > > >
> > > > > 53. It seems that you already incorporated the changes in KIP-516.
> > With
> > > > > topic ids, we don't need to wait for the topic's data to be deleted
> > > > before
> > > > > removing the topic metadata. If the topic is recreated, we can still
> > > > delete
> > > > > the data properly based on the topic id. So, it seems that we can
> > remove
> > > > > TopicRecord.Deleting.
> > > > >
> > > >
> > > > Thanks for the reply.  What I was thinking of doing here was using
> > topic
> > > > IDs internally, but still using names externally.  So the topic UUIDs
> > are
> > > > only for the purpose of associating topics with partitions -- from the
> > > > user's point of view, topics are still identified by names.
> > > >
> > > > You're right that KIP-516 will simplify things, but I'm not sure when
> > that
> > > > will land, so I wanted to avoid blocking the initial implementation of
> > this
> > > > KIP on it.
> > > >
> > > > >
> > > > > 55. It seems to me that the current behavior where we favor the
> > current
> > > > > broker registration is better. This is because uncontrolled broker
> > > > shutdown
> > > > > is rare and its impact is less severe since one just needs to wait
> > for
> > > > the
> > > > > session timeout before restarting the broker. If a mis-configured
> > broker
> > > > > replaces an existing broker, the consequence is more severe since it
> > can
> > > > > cause the leader to be unavailable, a replica to be out of ISR, and
> > add
> > > > > more load on the leaders etc.
> > > > >
> > > >
> > > > Hmm, that's a good point.  Let me check this a bit more before I change
> > > > it, though.
> > > >
> > > > >
> > > > > 60. [email protected]:9093,
> > > > > [email protected]:9093,[email protected] : Do we
> > need to
> > > > > include the controller id before @? It seems that the host/port is
> > enough
> > > > > for establishing the connection. It would also be useful to make it
> > clear
> > > > > that controller.connect replaces quorum.voters in KIP-595.
> > > > >
> > > >
> > > > I discussed this with Jason earlier, and he felt that the controller
> > IDs
> > > > were needed in this configuration key.  It is certainly needed when
> > > > configuring the controllers themselves, since they need to know each
> > > > others' IDs.
> > > >
> > > > >
> > > > > 61. I am not sure that we need both controller.listeners and
> > > > > controller.connect.security.protocol since the former implies the
> > > > security
> > > > > protocol. The reason that we have both inter.broker.listener.name
> > and
> > > > > inter.broker.security.protocol is because we had the latter first and
> > > > later
> > > > > realized that the former is more general.
> > > > >
> > > >
> > > > That's a good point.  I've removed this from the KIP.
> > > >
> > > > >
> > > > > 62. I am still not sure that you need controller.listeners to be a
> > list.
> > > > > All listeners are already defined in listeners. To migrate from
> > plaintext
> > > > > to SSL, one can configure listeners to have both plaintext and SSL.
> > After
> > > > > that, one can just change controller.listeners from plaintext to SSL.
> > > > This
> > > > > is similar to how to change the listener for inter broker
> > connections.
> > > > > Also, controller.listener.name may be a more accurate name?
> > > > >
> > > >
> > > > The issue that I see here is that if you are running with the
> > controller
> > > > and broker in the same JVM, if you define a few listeners in
> > "listeners"
> > > > they will get used as regular broker listeners, unless you put them in
> > > > controller.listeners.  Therefore, controller.listeners needs to be a
> > list.
> > > >
> > > > controller.listener.names does sound like a better name, though... I've
> > > > updated it to that.
> > > >
> > > > >
> > > > > 63. Regarding controller.id, I am trying to understand whether it's
> > a
> > > > > required configuration or an optional one. From the KIP, it sounds
> > like
> > > > > controller.id is optional. Then, if this is unset, it's not clear
> > how
> > > > the
> > > > > user will obtain the controller id for setting controller.connect.
> > > > >
> > > >
> > > > If you specify broker.id but not controller.id, then controller.id
> > will
> > > > just be set to broker.id + 3000.  This was intended to make some
> > > > configurations a little bit more concise to write.  You still do have
> > to
> > > > know the controller IDs when configuring the brokers, though.  If this
> > > > seems confusing then I can just make it mandatory.
> > > >
> > > > >
> > > > > 64. KIP-516 adds a flag in LeaderAndIsrRequest to indicate whether it
> > > > > includes all partitions or not. This will be used to clean up strayed
> > > > logs.
> > > > > I was thinking how we will do the same thing once the controller
> > moves to
> > > > > Raft based. One way to do that is on broker startup, it gets the HWM
> > in
> > > > the
> > > > > metadata log from the Raft leader and waits until its metadata
> > catches up
> > > > > to HWM. At that point, any local log not seen in the metadata can be
> > > > > removed. Since the Fetch response returns the HWM, there seems to be
> > > > enough
> > > > > APIs to achieve this.
> > > > >
> > > >
> > > > That's a very good point.  I added a note about this under Broker
> > Startup.
> > > >
> > > > best,
> > > > Colin
> > > >
> > > > >
> > > > > Jun
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Sep 24, 2020 at 1:07 PM Colin McCabe <[email protected]>
> > wrote:
> > > > >
> > > > > > On Mon, Sep 21, 2020, at 18:13, Jun Rao wrote:
> > > > > > > Hi, Colin,
> > > > > > >
> > > > > > > Sorry for the late reply. A few more comments below.
> > > > > > >
> > > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > Thanks for taking another look.
> > > > > >
> > > > > > >
> > > > > > > 50. Configurations
> > > > > > > 50.1 controller.listeners: It seems that a controller just needs
> > one
> > > > > > > listener. Why do we need to have a list here? Also, could you
> > > > provide an
> > > > > > > example of how this is set and what's its relationship with
> > existing
> > > > > > > configs such as "security.inter.broker.protocol" and "
> > > > > > > inter.broker.listener.name"?
> > > > > > >
> > > > > >
> > > > > > I agree that most administrators will want to run with only one
> > > > controller
> > > > > > listener.  However, just as with brokers, it is nice to have the
> > > > option to
> > > > > > expose multiple ports.
> > > > > >
> > > > > > One reason why you might want multiple ports is if you were doing a
> > > > > > migration from plaintext to SSL.  You could add new SSL listeners
> > but
> > > > > > continue to expose the PLAINTEXT listeners.  Then you could add the
> > > > new SSL
> > > > > > controller config to each broker and roll each broker.  Then you
> > could
> > > > > > migrate from PLAINTEXT to SSL with no downtime.
> > > > > >
> > > > > > Here's an example configuration for the controller:
> > > > > > [email protected]:9093,
> > > > > > [email protected]:9093,[email protected]
> > > > > > controller.listeners=CONTROLLER
> > > > > > listeners=CONTROLLER://controller0.example.com:9093
> > > > > > listener.security.protocol.map=CONTROLLER:SSL
> > > > > >
> > > > > > Here's an example configuration for the broker:
> > > > > > [email protected]:9093,
> > > > > > [email protected]:9093,[email protected]
> > > > > > controller.connect.security.protocol=SSL
> > > > > >
> > > > > > security.inter.broker.protocol or inter.broker.listener.name do
> > not
> > > > > > affect how the broker communicates with the controller.  Those
> > > > > > configurations relate to how brokers communicate with each other,
> > but
> > > > the
> > > > > > controller is not a broker.
> > > > > >
> > > > > > (Note that I just added controller.connect.security.protocol to the
> > > > KIP --
> > > > > > I had forgotten to put it in earlier)
> > > > > >
> > > > > > >
> > > > > > > 50.2 registration.heartbeat.interval.ms and
> > > > > > registration.lease.timeout.ms.
> > > > > > > Should we match their default value with the corresponding
> > default
> > > > for
> > > > > > ZK?
> > > > > > >
> > > > > >
> > > > > > Fair enough.  I'll set them to the values of the
> > > > zookeeper.sync.time.ms
> > > > > > and zookeeper.connection.timeout.ms configurations.  I do think we
> > > > should
> > > > > > experiment later on to see what works well here, but the ZK values
> > are
> > > > at
> > > > > > least a starting point.
> > > > > >
> > > > > > >
> > > > > > > 50.3 controller.connect: Could you provide an example? I am
> > > > wondering how
> > > > > > > it differs from quorum.voters=1@kafka-1:9092, 2@kafka-2:9092,
> > > > 3@kafka-3
> > > > > > > :9092.
> > > > > > >
> > > > > >
> > > > > > controller.connect is intended to be the new name for
> > quorum.voters.
> > > > > > During the vote for KIP-595, we sort of agreed to defer the
> > discussion
> > > > > > about what this configuration should be called.  I proposed this
> > new
> > > > name
> > > > > > because it makes it clear what the configuration is for (how to
> > > > connect to
> > > > > > the controllers).
> > > > > >
> > > > > > >
> > > > > > > 50.4 controller.id: I am still not sure how this is being used.
> > > > Could
> > > > > > you
> > > > > > > explain this in more detail?
> > > > > > >
> > > > > >
> > > > > > The controller ID needs to be set on each controller.  It also
> > appears
> > > > in
> > > > > > controller.connect as the thing before the "at" sign.  Its
> > function is
> > > > > > pretty similar to broker.id.
> > > > > >
> > > > > > Broker IDs and controller IDs exist in the same ID space, so you
> > can't
> > > > > > have a broker and a controller that use the same ID.
> > > > > >
> > > > > > >
> > > > > > > 51. BrokerHeartbeat: It seems a bit wasteful to include
> > Listeners in
> > > > > > every
> > > > > > > heartbeat request since it typically doesn't change. Could we
> > make
> > > > that
> > > > > > an
> > > > > > > optional field?
> > > > > > >
> > > > > >
> > > > > > Ok.
> > > > > >
> > > > > > >
> > > > > > > 52. KIP-584 adds a new ZK node /features. Should we add a
> > > > corresponding
> > > > > > > metadata record?
> > > > > > >
> > > > > >
> > > > > > Good point.  I added FeatureLevelRecord.
> > > > > >
> > > > > > >
> > > > > > > 53. TopicRecord and DeleteTopic: Both DeleteTopic and
> > > > > > TopicRecord.Deleting
> > > > > > > indicate topic deletion. Could we outline the flow when each
> > will be
> > > > set?
> > > > > > > In particular, which one indicates the intention to delete and
> > which
> > > > one
> > > > > > > indicates the completion of the deletion.
> > > > > > >
> > > > > >
> > > > > > TopicRecord.Deleting is set when we intend to delete the topic.
> > > > > >
> > > > > > DeleteTopic removes the topic completely.  I will rename
> > DeleteTopic to
> > > > > > RemoveTopic to make this clearer.
> > > > > >
> > > > > > >
> > > > > > > 54. "The controller can generate a new broker epoch by using the
> > > > latest
> > > > > > log
> > > > > > > offset." Which offset is that? Is it the offset of the metadata
> > > > topic for
> > > > > > > the corresponding BrokerRecord? Is it guaranteed to be different
> > on
> > > > each
> > > > > > > broker restart?
> > > > > > >
> > > > > >
> > > > > > Yes.... a new broker epoch implies that there has been a new record
> > > > > > created in the metadata log.  Therefore the last committed offset
> > must
> > > > be
> > > > > > different.
> > > > > >
> > > > > > >
> > > > > > > 55. "Thereafter, it may lose subsequent conflicts if its broker
> > > > epoch is
> > > > > > > stale.  (See KIP-380 for some background on broker epoch.)  The
> > > > reason
> > > > > > for
> > > > > > > favoring new processes is to accommodate the common case where a
> > > > process
> > > > > > is
> > > > > > > killed with kill -9 and then restarted. " Are you saying that if
> > > > there is
> > > > > > > an active broker registered in the controller, a new broker
> > heartbeat
> > > > > > > request with the INITIAL state will fence the current broker
> > > > session? Not
> > > > > > > sure about this. For example, this will allow a broker with
> > > > incorrectly
> > > > > > > configured broker id to kill an existing valid broker.
> > > > > > >
> > > > > >
> > > > > > Yes, a new broker with an incorrectly configured broker id could
> > fence
> > > > an
> > > > > > existing valid broker.
> > > > > >
> > > > > > The goal of the conflict resolution process described here is to
> > avoid
> > > > > > having two brokers go back and forth in claiming the broker id.
> > > > Choosing
> > > > > > the newest was just an easy way to do that.
> > > > > >
> > > > > > Choosing the oldest is another possibility, and more inline with
> > what
> > > > > > ZooKeeper does today.  The problem with that is that it would not
> > be
> > > > > > possible to re-create the broker if the old instance died suddenly,
> > > > except
> > > > > > by waiting for the full lease timeout.
> > > > > >
> > > > > > It might be possible to do a little better here by breaking the
> > lease
> > > > if
> > > > > > the TCP connection from the broker drops.  But that requires
> > crossing
> > > > some
> > > > > > layers of abstraction in the Kafka networking stack.  There's also
> > the
> > > > > > possibility of false negatives or positives.  For example, TCP
> > > > connections
> > > > > > sometimes just drop even if the node is still there.  Or sometimes
> > the
> > > > > > networking stack keeps a TCP connection alive for a while when the
> > > > other
> > > > > > end is gone.
> > > > > >
> > > > > > >
> > > > > > > 56. kafka-storage.sh:
> > > > > > > 56.1 In the info mode, what other information does it show in
> > > > addition to
> > > > > > > "kip.500.mode=enabled"?
> > > > > > >
> > > > > >
> > > > > > I think it should show a list of all the storage directories, and
> > > > whether
> > > > > > each one is formatted.  In addition, it should show whether
> > > > kip.500.mode is
> > > > > > enabled, and what the cluster id is.
> > > > > >
> > > > > > >
> > > > > > > 56.2 Should the format mode take the config file as the input too
> > > > like
> > > > > > the
> > > > > > > info mode?
> > > > > > >
> > > > > >
> > > > > > Yes, that's a good idea.
> > > > > >
> > > > > > best,
> > > > > > Colin
> > > > > >
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Jun
> > > > > > >
> > > > > > > On Thu, Sep 17, 2020 at 4:12 PM Colin McCabe <[email protected]
> > >
> > > > wrote:
> > > > > > >
> > > > > > > > Hi Unmesh,
> > > > > > > >
> > > > > > > > That's a fair point.  I have moved the lease duration to the
> > broker
> > > > > > > > heartbeat response.  That way lease durations can be changed
> > just
> > > > be
> > > > > > > > reconfiguring the controllers.
> > > > > > > >
> > > > > > > > best,
> > > > > > > > Colin
> > > > > > > >
> > > > > > > > On Wed, Sep 16, 2020, at 07:40, Unmesh Joshi wrote:
> > > > > > > > > Thanks Colin, the changes look good to me. One small thing.
> > > > > > > > > registration.lease.timeout.ms is the configuration on the
> > > > controller
> > > > > > > > side.
> > > > > > > > > It will be good to comment on how brokers know about it, to
> > be
> > > > able
> > > > > > to
> > > > > > > > > send LeaseDurationMs
> > > > > > > > > in the heartbeat request,
> > > > > > > > > or else it can be added in the heartbeat response for
> > brokers to
> > > > know
> > > > > > > > about
> > > > > > > > > it.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Unmesh
> > > > > > > > >
> > > > > > > > > On Fri, Sep 11, 2020 at 10:32 PM Colin McCabe <
> > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Hi Unmesh,
> > > > > > > > > >
> > > > > > > > > > I think you're right that we should use a duration here
> > rather
> > > > > > than a
> > > > > > > > > > time.  As you said, the clock on the controller will
> > probably
> > > > not
> > > > > > > > match the
> > > > > > > > > > one on the broker.  I have updated the KIP.
> > > > > > > > > >
> > > > > > > > > > > > It's important to keep in mind that messages may be
> > > > delayed in
> > > > > > the
> > > > > > > > > > > > network, or arrive out of order.  When this happens, we
> > > > will
> > > > > > use
> > > > > > > > the
> > > > > > > > > > start
> > > > > > > > > > > > time specified in the request to determine if the
> > request
> > > > is
> > > > > > stale.
> > > > > > > > > > > I am assuming that there will be a single TCP connection
> > > > > > maintained
> > > > > > > > > > between
> > > > > > > > > > > broker and active controller. So, there won't be any out
> > of
> > > > order
> > > > > > > > > > requests?
> > > > > > > > > > > There will be a scenario of broker GC pause, which might
> > > > cause
> > > > > > > > connection
> > > > > > > > > > > timeout and broker might need to reestablish the
> > connection.
> > > > If
> > > > > > the
> > > > > > > > pause
> > > > > > > > > > > is too long, lease will expire and the heartbeat sent
> > after
> > > > the
> > > > > > pause
> > > > > > > > > > will
> > > > > > > > > > > be treated as a new registration (similar to restart
> > case),
> > > > and
> > > > > > a new
> > > > > > > > > > epoch
> > > > > > > > > > > number will be assigned to the broker.
> > > > > > > > > >
> > > > > > > > > > I agree with the end of this paragraph, but not with the
> > start
> > > > :)
> > > > > > > > > >
> > > > > > > > > > There can be out-of-order requests, since the broker will
> > > > simply
> > > > > > use a
> > > > > > > > new
> > > > > > > > > > TCP connection if the old one has problems.  This can
> > happen
> > > > for a
> > > > > > > > variety
> > > > > > > > > > of reasons.  I don't think GC pauses are the most common
> > > > reason for
> > > > > > > > this to
> > > > > > > > > > happen.  It's more common to see issues issues in the
> > network
> > > > > > itself
> > > > > > > > that
> > > > > > > > > > result connections getting dropped from time to time.
> > > > > > > > > >
> > > > > > > > > > So we have to assume that messages may arrive out of
> > order, and
> > > > > > > > possibly
> > > > > > > > > > be delayed.  I added a note that heartbeat requests should
> > be
> > > > > > ignored
> > > > > > > > if
> > > > > > > > > > the metadata log offset they contain is smaller than a
> > previous
> > > > > > > > heartbeat.
> > > > > > > > > >
> > > > > > > > > > > When the active controller fails, the new active
> > controller
> > > > > > needs to
> > > > > > > > be
> > > > > > > > > > > sure that it considers all the known brokers as alive
> > till
> > > > the
> > > > > > lease
> > > > > > > > > > > expiration interval.  Because
> > registration.lease.timeout.ms,
> > > > is
> > > > > > > > > > configured
> > > > > > > > > > > on the controller, the new active controller will extend
> > all
> > > > the
> > > > > > > > leases
> > > > > > > > > > by
> > > > > > > > > > > registration.lease.timeout.ms. I see that it won't need
> > last
> > > > > > > > heartbeat
> > > > > > > > > > > time.
> > > > > > > > > >
> > > > > > > > > > Agreed.
> > > > > > > > > >
> > > > > > > > > > best,
> > > > > > > > > > Colin
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Unmesh
> > > > > > > > > > >
> > > > > > > > > > > On Sat, Sep 5, 2020 at 1:28 AM Colin McCabe <
> > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > > Colin wrote:
> > > > > > > > > > > > > > The reason for including LeaseStartTimeMs in the
> > > > request
> > > > > > is to
> > > > > > > > > > ensure
> > > > > > > > > > > > > > that the time required to communicate with the
> > > > controller
> > > > > > gets
> > > > > > > > > > > > included in
> > > > > > > > > > > > > > the lease time.  Since requests can potentially be
> > > > delayed
> > > > > > in
> > > > > > > > the
> > > > > > > > > > > > network
> > > > > > > > > > > > > > for a long time, this is important.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, Aug 31, 2020, at 05:58, Unmesh Joshi wrote:
> > > > > > > > > > > > > The network time will be added anyway, because the
> > lease
> > > > > > timer
> > > > > > > > on the
> > > > > > > > > > > > > active controller will start only after the heartbeat
> > > > request
> > > > > > > > > > reaches the
> > > > > > > > > > > > > server.
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Unmesh,
> > > > > > > > > > > >
> > > > > > > > > > > > If the start time is not specified in the request,
> > then the
> > > > > > network
> > > > > > > > > > time
> > > > > > > > > > > > is excluded from the heartbeat time.
> > > > > > > > > > > >
> > > > > > > > > > > > Here's an example:
> > > > > > > > > > > > Let's say broker A sends a heartbeat at time 100, and
> > it
> > > > > > arrives
> > > > > > > > on the
> > > > > > > > > > > > controller at time 200, and the lease duration is 1000.
> > > > > > > > > > > >
> > > > > > > > > > > > The controller looks at the start time in the request,
> > > > which is
> > > > > > > > 100,
> > > > > > > > > > and
> > > > > > > > > > > > adds 1000 to it, getting 1100.  On the other hand, if
> > start
> > > > > > time
> > > > > > > > is not
> > > > > > > > > > > > specified in the request, then the expiration will be
> > at
> > > > time
> > > > > > 1200.
> > > > > > > > > > > > That is what I mean by "the network time is included
> > in the
> > > > > > > > expiration
> > > > > > > > > > > > time."
> > > > > > > > > > > >
> > > > > > > > > > > > > And I think, some assumption about network round trip
> > > > time is
> > > > > > > > > > > > > needed anyway to decide on the frequency of the
> > > > heartbeat (
> > > > > > > > > > > > > registration.heartbeat.interval.ms), and lease
> > timeout (
> > > > > > > > > > > > > registration.lease.timeout.ms). So I think just
> > having a
> > > > > > > > leaseTTL
> > > > > > > > > > in the
> > > > > > > > > > > > > request is easier to understand and implement.
> > > > > > > > > > > >
> > > > > > > > > > > > It's important to keep in mind that messages may be
> > > > delayed in
> > > > > > the
> > > > > > > > > > > > network, or arrive out of order.  When this happens, we
> > > > will
> > > > > > use
> > > > > > > > the
> > > > > > > > > > start
> > > > > > > > > > > > time specified in the request to determine if the
> > request
> > > > is
> > > > > > stale.
> > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, I agree that the lease timeout on the
> > controller
> > > > side
> > > > > > > > should
> > > > > > > > > > be
> > > > > > > > > > > > > > reset in the case of controller failover.  The
> > > > alternative
> > > > > > > > would
> > > > > > > > > > be to
> > > > > > > > > > > > > > track the lease as hard state rather than soft
> > state,
> > > > but I
> > > > > > > > think
> > > > > > > > > > that
> > > > > > > > > > > > > > is not really needed, and would result in more log
> > > > entries.
> > > > > > > > > > > > > My interpretation of the mention of BrokerRecord in
> > the
> > > > KIP
> > > > > > was
> > > > > > > > that
> > > > > > > > > > this
> > > > > > > > > > > > > record exists in the Raft log.
> > > > > > > > > > > >
> > > > > > > > > > > > BrokerRecord does exist in the Raft log, but does not
> > > > include
> > > > > > the
> > > > > > > > last
> > > > > > > > > > > > heartbeat time.
> > > > > > > > > > > >
> > > > > > > > > > > > > By soft state, do you mean the broker
> > > > > > > > > > > > > records exist only on the active leader and will not
> > be
> > > > > > > > replicated
> > > > > > > > > > in the
> > > > > > > > > > > > > raft log? If the live brokers list is maintained
> > only on
> > > > the
> > > > > > > > active
> > > > > > > > > > > > > controller (raft leader), then, in case of leader
> > > > failure,
> > > > > > there
> > > > > > > > > > will be
> > > > > > > > > > > > a
> > > > > > > > > > > > > window where the new leader does not know about the
> > live
> > > > > > brokers,
> > > > > > > > > > till
> > > > > > > > > > > > the
> > > > > > > > > > > > > brokers establish the leases again.
> > > > > > > > > > > > > I think it will be safer to have leases as a hard
> > state
> > > > > > managed
> > > > > > > > by
> > > > > > > > > > > > standard
> > > > > > > > > > > > > Raft replication.
> > > > > > > > > > > >
> > > > > > > > > > > > Leases are short, so the need to re-establish them
> > after a
> > > > > > > > controller
> > > > > > > > > > > > failover doesn't seem like a big problem.  But this is
> > > > > > something
> > > > > > > > we can
> > > > > > > > > > > > tweak if it becomes an issue.  One option would be to
> > have
> > > > a
> > > > > > > > separate
> > > > > > > > > > log
> > > > > > > > > > > > which is only used by the controller nodes for this
> > (since,
> > > > > > after
> > > > > > > > all,
> > > > > > > > > > > > brokers don't care about registration renewals).
> > > > > > > > > > > >
> > > > > > > > > > > > > Or am I misunderstanding something? (I assume that
> > with
> > > > soft
> > > > > > > > state,
> > > > > > > > > > you
> > > > > > > > > > > > > mean something like zookeeper local sessions
> > > > > > > > > > > > > https://issues.apache.org/jira/browse/ZOOKEEPER-1147
> > .)
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Our code is single threaded as well.  I think it
> > makes
> > > > > > sense
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > > > controller, since otherwise locking becomes very
> > messy.
> > > > > > I'm
> > > > > > > > not
> > > > > > > > > > sure I
> > > > > > > > > > > > > > understand your question about duplicate broker ID
> > > > > > detection,
> > > > > > > > > > though.
> > > > > > > > > > > > > There's a section in the KIP about this -- is there a
> > > > detail
> > > > > > we
> > > > > > > > > > should
> > > > > > > > > > > > add
> > > > > > > > > > > > > there?
> > > > > > > > > > > >
> > > > > > > > > > > > This is an implementation detail that doesn't need to
> > be
> > > > in the
> > > > > > > > KIP.
> > > > > > > > > > > >
> > > > > > > > > > > > best,
> > > > > > > > > > > > Colin
> > > > > > > > > > > >
> > > > > > > > > > > > > I assumed broker leases are implemented as a hard
> > state.
> > > > In
> > > > > > that
> > > > > > > > > > case, to
> > > > > > > > > > > > > check for broker id conflict, we need to check the
> > broker
> > > > > > ids at
> > > > > > > > two
> > > > > > > > > > > > places
> > > > > > > > > > > > > 1. Pending broker registrations (which are yet to be
> > > > > > committed)
> > > > > > > > 2.
> > > > > > > > > > > > Already
> > > > > > > > > > > > > committed broker registrations.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Unmesh
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, Aug 31, 2020 at 5:42 PM Colin McCabe <
> > > > > > [email protected]
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Sat, Aug 29, 2020, at 01:12, Unmesh Joshi wrote:
> > > > > > > > > > > > > > > >>>Can you repeat your questions about broker
> > leases?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >>>>The LeaseStartTimeMs is expected to be the
> > > > broker's
> > > > > > > > > > > > > > > 'System.currentTimeMillis()' at the point of the
> > > > > > request. The
> > > > > > > > > > active
> > > > > > > > > > > > > > > controller will add its lease period to this in
> > order
> > > > > > >>>>to
> > > > > > > > > > compute
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > LeaseEndTimeMs.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I think the use of LeaseStartTimeMs and
> > > > LeaseEndTimeMs
> > > > > > in the
> > > > > > > > > > KIP is
> > > > > > > > > > > > a
> > > > > > > > > > > > > > > bit
> > > > > > > > > > > > > > > confusing.  Monotonic Clock (System.nanoTime) on
> > the
> > > > > > active
> > > > > > > > > > > > controller
> > > > > > > > > > > > > > > should be used to track leases.
> > > > > > > > > > > > > > > (For example,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://issues.apache.org/jira/browse/ZOOKEEPER-1616https://github.com/etcd-io/etcd/pull/6888/commits/e7f4010ccaf28b6ce64fe514d25a4b2fa459d114
> > > > > > > > > > > > > > > )
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Then we will not need LeaseStartTimeMs?
> > > > > > > > > > > > > > > Instead of LeaseStartTimeMs, can we call it
> > > > LeaseTTL? The
> > > > > > > > active
> > > > > > > > > > > > > > controller
> > > > > > > > > > > > > > > can then calculate LeaseEndTime =
> > System.nanoTime() +
> > > > > > > > LeaseTTL.
> > > > > > > > > > > > > > > In this case we might just drop LeaseEndTimeMs
> > from
> > > > the
> > > > > > > > > > response, as
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > broker already knows about the TTL and can send
> > > > > > heartbeats at
> > > > > > > > > > some
> > > > > > > > > > > > > > fraction
> > > > > > > > > > > > > > > of TTL, say every TTL/4 milliseconds.(elapsed
> > time
> > > > on the
> > > > > > > > broker
> > > > > > > > > > > > measured
> > > > > > > > > > > > > > > by System.nanoTime)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Unmesh,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I agree that the monotonic clock is probably a
> > better
> > > > idea
> > > > > > > > here.
> > > > > > > > > > It is
> > > > > > > > > > > > > > good to be robust against wall clock changes,
> > although
> > > > I
> > > > > > think
> > > > > > > > a
> > > > > > > > > > > > cluster
> > > > > > > > > > > > > > which had them might suffer other issues.  I will
> > > > change
> > > > > > it to
> > > > > > > > > > specify
> > > > > > > > > > > > a
> > > > > > > > > > > > > > monotonic clock.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The reason for including LeaseStartTimeMs in the
> > > > request
> > > > > > is to
> > > > > > > > > > ensure
> > > > > > > > > > > > that
> > > > > > > > > > > > > > the time required to communicate with the
> > controller
> > > > gets
> > > > > > > > included
> > > > > > > > > > in
> > > > > > > > > > > > the
> > > > > > > > > > > > > > lease time.  Since requests can potentially be
> > delayed
> > > > in
> > > > > > the
> > > > > > > > > > network
> > > > > > > > > > > > for a
> > > > > > > > > > > > > > long time, this is important.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I have a prototype built to demonstrate this as
> > > > > > following:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/main/scala/com/dist/simplekafka/kip500/Kip631Controller.scala
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The Kip631Controller itself depends on a
> > Consensus
> > > > > > module, to
> > > > > > > > > > > > demonstrate
> > > > > > > > > > > > > > > how possible interactions with the consensus
> > module
> > > > will
> > > > > > look
> > > > > > > > > > like
> > > > > > > > > > > > > > >  (The Consensus can be pluggable really, with an
> > API
> > > > to
> > > > > > allow
> > > > > > > > > > reading
> > > > > > > > > > > > > > > replicated log upto HighWaterMark)
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It has an implementation of LeaseTracker
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/main/scala/com/dist/simplekafka/kip500/LeaderLeaseTracker.scala
> > > > > > > > > > > > > > > to demonstrate LeaseTracker's interaction with
> > the
> > > > > > consensus
> > > > > > > > > > module.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The implementation has the following aspects:
> > > > > > > > > > > > > > > 1. The lease tracking happens only on the active
> > > > > > controller
> > > > > > > > (raft
> > > > > > > > > > > > > > > leader)
> > > > > > > > > > > > > > > 2. Once the lease expires, it needs to propose
> > and
> > > > > > commit a
> > > > > > > > > > > > FenceBroker
> > > > > > > > > > > > > > > record for that lease.
> > > > > > > > > > > > > > > 3. In case of active controller failure, the
> > lease
> > > > will
> > > > > > be
> > > > > > > > > > tracked by
> > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > newly raft leader. The new raft leader starts the
> > > > lease
> > > > > > timer
> > > > > > > > > > again,
> > > > > > > > > > > > (as
> > > > > > > > > > > > > > > implemented in onBecomingLeader method of
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/main/scala/com/dist/simplekafka/kip500/Kip631Controller.scala
> > > > > > > > > > > > > > > )
> > > > > > > > > > > > > > > in effect extending the lease by the time spent
> > in
> > > > the
> > > > > > leader
> > > > > > > > > > > > election
> > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > whatever time was elapsed on the old leader.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yes, I agree that the lease timeout on the
> > controller
> > > > side
> > > > > > > > should
> > > > > > > > > > be
> > > > > > > > > > > > reset
> > > > > > > > > > > > > > in the case of controller failover.  The
> > alternative
> > > > would
> > > > > > be
> > > > > > > > to
> > > > > > > > > > track
> > > > > > > > > > > > the
> > > > > > > > > > > > > > lease as hard state rather than soft state, but I
> > think
> > > > > > that
> > > > > > > > is not
> > > > > > > > > > > > really
> > > > > > > > > > > > > > needed, and would result in more log entries.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > There are working tests for this implementation
> > here.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/test/scala/com/dist/simplekafka/kip500/Kip631ControllerTest.scala
> > > > > > > > > > > > > > > and an end to end test here
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/test/scala/com/dist/simplekafka/ProducerConsumerKIP500Test.scala
> > > > > > > > > > > > > > > <
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/test/scala/com/dist/simplekafka/kip500/Kip631ControllerTest.scala
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >>'m not sure what you mean by "de-duplication
> > of the
> > > > > > > > broker."
> > > > > > > > > > Can
> > > > > > > > > > > > you
> > > > > > > > > > > > > > > give a little more context?
> > > > > > > > > > > > > > > Apologies for using the confusing term
> > > > deduplication. I
> > > > > > meant
> > > > > > > > > > broker
> > > > > > > > > > > > id
> > > > > > > > > > > > > > > conflict.
> > > > > > > > > > > > > > > As you can see in the prototype handleRequest of
> > > > > > > > KIP631Controller
> > > > > > > > > > > > > > > <
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://github.com/unmeshjoshi/distrib-broker/blob/master/src/main/scala/com/dist/simplekafka/kip500/Kip631Controller.scala
> > > > > > > > > > > > > > >,
> > > > > > > > > > > > > > > the duplicate broker id needs to be detected
> > before
> > > > the
> > > > > > > > > > BrokerRecord
> > > > > > > > > > > > is
> > > > > > > > > > > > > > > submitted to the raft module.
> > > > > > > > > > > > > > > Also as implemented in the prototype, the
> > > > > > KIP631Controller is
> > > > > > > > > > single
> > > > > > > > > > > > > > > threaded, handling requests one at a time. (an
> > > > example of
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > https://martinfowler.com/articles/patterns-of-distributed-systems/singular-update-queue.html
> > > > > > > > > > > > > > > )
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Our code is single threaded as well.  I think it
> > makes
> > > > > > sense
> > > > > > > > for
> > > > > > > > > > the
> > > > > > > > > > > > > > controller, since otherwise locking becomes very
> > messy.
> > > > > > I'm
> > > > > > > > not
> > > > > > > > > > sure I
> > > > > > > > > > > > > > understand your question about duplicate broker ID
> > > > > > detection,
> > > > > > > > > > though.
> > > > > > > > > > > > > > There's a section in the KIP about this -- is
> > there a
> > > > > > detail we
> > > > > > > > > > should
> > > > > > > > > > > > add
> > > > > > > > > > > > > > there?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > best,
> > > > > > > > > > > > > > Colin
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > Unmesh
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Sat, Aug 29, 2020 at 10:49 AM Colin McCabe <
> > > > > > > > [email protected]
> > > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Fri, Aug 28, 2020, at 19:36, Unmesh Joshi
> > wrote:
> > > > > > > > > > > > > > > > > Hi Colin,
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > There were a few of questions I had..
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi Unmesh,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for the response.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 1. Were my comments on the broker lease
> > > > > > implementation
> > > > > > > > (and
> > > > > > > > > > > > > > corresponding
> > > > > > > > > > > > > > > > > prototype) appropriate and do we need to
> > change
> > > > the
> > > > > > KIP
> > > > > > > > > > > > > > > > > description accordingly?.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Can you repeat your questions about broker
> > leases?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 2. How will broker epochs be generated? I am
> > > > > > assuming it
> > > > > > > > can
> > > > > > > > > > be
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > committed log offset (like zxid?)
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > There isn't any need to use a log offset.  We
> > can
> > > > just
> > > > > > > > look at
> > > > > > > > > > an
> > > > > > > > > > > > > > > > in-memory hash table and see what the latest
> > > > number is,
> > > > > > > > and add
> > > > > > > > > > > > one, to
> > > > > > > > > > > > > > > > generate a new broker epoch.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 3. How will producer registration happen? I
> > am
> > > > > > assuming
> > > > > > > > it
> > > > > > > > > > > > should be
> > > > > > > > > > > > > > > > > similar to broker registration, with a
> > similar
> > > > way to
> > > > > > > > > > generate
> > > > > > > > > > > > > > producer
> > > > > > > > > > > > > > > > id.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > For the EOS stuff, we will need a few new RPCs
> > to
> > > > the
> > > > > > > > > > controller.
> > > > > > > > > > > > I
> > > > > > > > > > > > > > think
> > > > > > > > > > > > > > > > we should do that in a follow-on KIP, though,
> > since
> > > > > > this
> > > > > > > > one is
> > > > > > > > > > > > already
> > > > > > > > > > > > > > > > pretty big.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > 4. Because we expose Raft log to all the
> > > > brokers, any
> > > > > > > > > > > > de-duplication
> > > > > > > > > > > > > > of
> > > > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > broker needs to happen before the requests
> > are
> > > > > > proposed
> > > > > > > > to
> > > > > > > > > > Raft.
> > > > > > > > > > > > For
> > > > > > > > > > > > > > this
> > > > > > > > > > > > > > > > > the controller needs to be single threaded,
> > and
> > > > > > should do
> > > > > > > > > > > > validation
> > > > > > > > > > > > > > > > > against the in-process or pending requests
> > and
> > > > the
> > > > > > final
> > > > > > > > > > state. I
> > > > > > > > > > > > > > read a
> > > > > > > > > > > > > > > > > mention of this, in the responses in this
> > > > > > thread.Will it
> > > > > > > > be
> > > > > > > > > > > > useful to
> > > > > > > > > > > > > > > > > mention this in the KIP?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I'm not sure what you mean by "de-duplication
> > of
> > > > the
> > > > > > > > broker."
> > > > > > > > > > Can
> > > > > > > > > > > > you
> > > > > > > > > > > > > > > > give a little more context?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > best,
> > > > > > > > > > > > > > > > Colin
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > Unmesh
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Sat, Aug 29, 2020 at 4:50 AM Colin McCabe
> > <
> > > > > > > > > > [email protected]
> > > > > > > > > > > > >
> > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I'm thinking of calling a vote on KIP-631
> > on
> > > > > > Monday.
> > > > > > > > Let
> > > > > > > > > > me
> > > > > > > > > > > > know
> > > > > > > > > > > > > > if
> > > > > > > > > > > > > > > > > > there's any more comments I should address
> > > > before I
> > > > > > > > start
> > > > > > > > > > the
> > > > > > > > > > > > vote.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > cheers,
> > > > > > > > > > > > > > > > > > Colin
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > On Tue, Aug 11, 2020, at 05:39, Unmesh
> > Joshi
> > > > wrote:
> > > > > > > > > > > > > > > > > > > >>Hi Unmesh,
> > > > > > > > > > > > > > > > > > > >>Thanks, I'll take a look.
> > > > > > > > > > > > > > > > > > > Thanks. I will be adding more to the
> > > > prototype
> > > > > > and
> > > > > > > > will
> > > > > > > > > > be
> > > > > > > > > > > > happy
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > help
> > > > > > > > > > > > > > > > > > > and collaborate.
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > > > > > > > Unmesh
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > On Tue, Aug 11, 2020 at 12:28 AM Colin
> > > > McCabe <
> > > > > > > > > > > > > > [email protected]>
> > > > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hi Jose,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > That'a s good point that I hadn't
> > > > considered.
> > > > > > It's
> > > > > > > > > > > > probably
> > > > > > > > > > > > > > worth
> > > > > > > > > > > > > > > > > > having
> > > > > > > > > > > > > > > > > > > > a separate leader change message, as
> > you
> > > > > > mentioned.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Hi Unmesh,
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > Thanks, I'll take a look.
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > best,
> > > > > > > > > > > > > > > > > > > > Colin
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > On Fri, Aug 7, 2020, at 11:56, Jose
> > Garcia
> > > > > > Sancio
> > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > > > > > Hi Unmesh,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Very cool prototype!
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Hi Colin,
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > The KIP proposes a record called
> > > > IsrChange
> > > > > > which
> > > > > > > > > > > > includes the
> > > > > > > > > > > > > > > > > > > > > partition, topic, isr, leader and
> > leader
> > > > > > epoch.
> > > > > > > > > > During
> > > > > > > > > > > > normal
> > > > > > > > > > > > > > > > > > > > > operation ISR changes do not result
> > in
> > > > leader
> > > > > > > > > > changes.
> > > > > > > > > > > > > > Similarly,
> > > > > > > > > > > > > > > > > > > > > leader changes do not necessarily
> > > > involve ISR
> > > > > > > > > > changes.
> > > > > > > > > > > > The
> > > > > > > > > > > > > > > > controller
> > > > > > > > > > > > > > > > > > > > > implementation that uses ZK modeled
> > them
> > > > > > together
> > > > > > > > > > because
> > > > > > > > > > > > > > > > > > > > > 1. All of this information is stored
> > in
> > > > one
> > > > > > > > znode.
> > > > > > > > > > > > > > > > > > > > > 2. ZK's optimistic lock requires
> > that you
> > > > > > specify
> > > > > > > > > > the new
> > > > > > > > > > > > > > value
> > > > > > > > > > > > > > > > > > > > completely
> > > > > > > > > > > > > > > > > > > > > 3. The change to that znode was being
> > > > > > performed
> > > > > > > > by
> > > > > > > > > > both
> > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > controller
> > > > > > > > > > > > > > > > > > > > > and the leader.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > None of these reasons are true in
> > > > KIP-500.
> > > > > > Have
> > > > > > > > we
> > > > > > > > > > > > considered
> > > > > > > > > > > > > > > > having
> > > > > > > > > > > > > > > > > > > > > two different records? For example
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > 1. IsrChange record which includes
> > topic,
> > > > > > > > partition,
> > > > > > > > > > isr
> > > > > > > > > > > > > > > > > > > > > 2. LeaderChange record which includes
> > > > topic,
> > > > > > > > > > partition,
> > > > > > > > > > > > > > leader
> > > > > > > > > > > > > > > > and
> > > > > > > > > > > > > > > > > > > > leader epoch.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > I suspect that making this change
> > will
> > > > also
> > > > > > > > require
> > > > > > > > > > > > changing
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > > > > > > > > message AlterIsrRequest introduced in
> > > > > > KIP-497:
> > > > > > > > Add
> > > > > > > > > > > > > > inter-broker
> > > > > > > > > > > > > > > > API
> > > > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > > > > > > alter ISR.
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > > > Thanks
> > > > > > > > > > > > > > > > > > > > > -Jose
> > > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-631: The Quorum-based Kafka Controller

Reply via email to