Hi, Stan,

Thanks for the explanation.

10. If those new fields in LeaderAndIsr are only needed for future work,
perhaps they should be added when we do the future work instead of now?

Jun


On Wed, Jul 31, 2019 at 2:30 AM Stanislav Kozlovski <stanis...@confluent.io>
wrote:

> Hey Jun,
>
> I think I can answer some of your questions on behalf of Colin -- he can
> confirm if I'm correct
>
> > 10. The KIP adds two new fields (AddingReplicas and RemovingReplicas) to
> LeaderAndIsr request. Could you explain how these 2 fields will be used?
> Sorry for not explaining this in the KIP - those fields won't be used by
> the non-controller brokers yet. Our plans for them are outlined in the
> Future Work section of the KIP - namely "Reassignment Quotas that only
> throttle reassignment traffic" and "Add reassignment metrics".
>
> > Should we include those two fields in UpdateMetadata and potentially
> Metadata requests too?
> I recall this was discussed in the beginning by Colin and Jason, so I'll
> let Colin answer that question.
>
> 11 & 12. Correct, we need to send StopReplica requests. The implementation
> does this (
>
> https://github.com/apache/kafka/pull/7128/files#diff-ed90e8ecc5439a5ede5e362255d11be1R651
> )
> -- I'll update the KIP to mention it as well.
> I tried to document the algorithm here
>
> https://github.com/apache/kafka/pull/7128/files#diff-ed90e8ecc5439a5ede5e362255d11be1R521
> .
>
> 13. I think so. (
> https://github.com/apache/kafka/pull/7128#discussion_r308866206). I'll
> reflect this in the KIP
>
> Here is the updated KIP diff -
>
> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=112820260&selectedPageVersions=36&selectedPageVersions=35
>
> Thanks,
> Stanislav
>
> On Tue, Jul 30, 2019 at 10:18 PM Jun Rao <j...@confluent.io> wrote:
>
> > Hi, Colin,
> >
> > Thanks for the KIP. Sorry for the late reply. LGTM overall. A few
> detailed
> > comments below.
> >
> > 10. The KIP adds two new fields (AddingReplicas and RemovingReplicas) to
> > LeaderAndIsr request. Could you explain how these 2 fields will be used?
> > Should we include those two fields in UpdateMetadata and potentially
> > Metadata requests too?
> >
> > 11. "If a new reassignment is issued during an on-going one, we cancel
> the
> > current one by emptying out both AR and RR, constructing them from (the
> > updated from the last-reassignment) R and TR, and starting anew." In this
> > case, it seems that the controller needs to issue a StopReplica request
> to
> > remove those unneeded replicas.
> >
> > 12. "Essentially, once a cancellation is called we subtract AR from R,
> > empty out both AR and RR, and send LeaderAndIsr requests to cancel the
> > replica movements that have not yet completed." Similar to the above, it
> > seems the controller needs to issue a StopReplica request to remove those
> > unneeded replicas.
> >
> > 13. Since we changed the format of the topics/[topic] zNode, should we
> bump
> > up the version number in the json value?
> >
> > Jun
> >
> > On Mon, Jul 22, 2019 at 8:38 AM Colin McCabe <cmcc...@apache.org> wrote:
> >
> > > Hi all,
> > >
> > > With three non-binding +1 votes from Viktor Somogyi-Vass, Robert
> Barrett,
> > > and George Li, and 3 binding +1 votes from Gwen Shapira, Jason
> Gustafson,
> > > and myself, the vote passes.  Thanks, everyone!
> > >
> > > best,
> > > Colin
> > >
> > > On Fri, Jul 19, 2019, at 08:55, Robert Barrett wrote:
> > > > +1 (non-binding). Thanks for the KIP!
> > > >
> > > > On Thu, Jul 18, 2019 at 5:59 PM George Li <sql_consult...@yahoo.com
> > > .invalid>
> > > > wrote:
> > > >
> > > > >  +1 (non-binding)
> > > > >
> > > > >
> > > > >
> > > > > Thanks for addressing the comments.
> > > > > George
> > > > >
> > > > >     On Thursday, July 18, 2019, 05:03:58 PM PDT, Gwen Shapira <
> > > > > g...@confluent.io> wrote:
> > > > >
> > > > >  Renewing my +1, thank you Colin and Stan for working through all
> the
> > > > > questions, edge cases, requests and alternatives. We ended up with
> a
> > > > > great protocol.
> > > > >
> > > > > On Thu, Jul 18, 2019 at 4:54 PM Jason Gustafson <
> ja...@confluent.io>
> > > > > wrote:
> > > > > >
> > > > > > +1 Thanks for the KIP. Really looking forward to this!
> > > > > >
> > > > > > -Jason
> > > > > >
> > > > > > On Wed, Jul 17, 2019 at 1:41 PM Colin McCabe <cmcc...@apache.org
> >
> > > wrote:
> > > > > >
> > > > > > > Thanks, Stanislav.  Let's restart the vote to reflect the fact
> > that
> > > > > we've
> > > > > > > made significant changes.  The new vote will go for 3 days as
> > > usual.
> > > > > > >
> > > > > > > I'll start with my +1 (binding).
> > > > > > >
> > > > > > > best,
> > > > > > > Colin
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Jul 17, 2019, at 08:56, Stanislav Kozlovski wrote:
> > > > > > > > Hey everybody,
> > > > > > > >
> > > > > > > > We have further iterated on the KIP in the accompanying
> > > discussion
> > > > > thread
> > > > > > > > and I'd like to propose we resume the vote.
> > > > > > > >
> > > > > > > > Some notable changes:
> > > > > > > > - we will store reassignment information in the
> > > > > `/brokers/topics/[topic]`
> > > > > > > > - we will internally use two collections to represent a
> > > reassignment
> > > > > -
> > > > > > > > "addingReplicas" and "removingReplicas". LeaderAndIsr has
> been
> > > > > updated
> > > > > > > > accordingly
> > > > > > > > - the Alter API will still use the "targetReplicas"
> collection,
> > > but
> > > > > the
> > > > > > > > List API will now return three separate collections - the
> full
> > > > > replica
> > > > > > > set,
> > > > > > > > the replicas we are adding as part of this reassignment
> > > > > > > ("addingReplicas")
> > > > > > > > and the replicas we are removing ("removingReplicas")
> > > > > > > > - cancellation of a reassignment now means a proper rollback
> of
> > > the
> > > > > > > > assignment to its original state prior to the API call
> > > > > > > >
> > > > > > > > As always, you can re-read the KIP here
> > > > > > > >
> > > > > > >
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Stanislav
> > > > > > > >
> > > > > > > > On Wed, May 22, 2019 at 6:12 PM Colin McCabe <
> > cmcc...@apache.org
> > > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi George,
> > > > > > > > >
> > > > > > > > > Thanks for taking a look.  I am working on getting a PR
> done
> > > as a
> > > > > > > > > proof-of-concept.  I'll post it soon.  Then we'll finish up
> > the
> > > > > vote.
> > > > > > > > >
> > > > > > > > > best,
> > > > > > > > > Colin
> > > > > > > > >
> > > > > > > > > On Tue, May 21, 2019, at 17:33, George Li wrote:
> > > > > > > > > >  Hi Colin,
> > > > > > > > > >
> > > > > > > > > >  Great! Looking forward to these features.    +1
> > > (non-binding)
> > > > > > > > > >
> > > > > > > > > > What is the estimated timeline to have this implemented?
> > If
> > > any
> > > > > help
> > > > > > > > > > is needed in the implementation of cancelling
> > > reassignments,  I
> > > > > can
> > > > > > > > > > help if there is spare cycle.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > George
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >    On Thursday, May 16, 2019, 9:48:56 AM PDT, Colin
> McCabe
> > > > > > > > > > <cmcc...@apache.org> wrote:
> > > > > > > > > >
> > > > > > > > > >  Hi George,
> > > > > > > > > >
> > > > > > > > > > Yes, KIP-455 allows the reassignment of individual
> > > partitions to
> > > > > be
> > > > > > > > > > cancelled.  I think it's very important for these
> > operations
> > > to
> > > > > be at
> > > > > > > > > > the partition level.
> > > > > > > > > >
> > > > > > > > > > best,
> > > > > > > > > > Colin
> > > > > > > > > >
> > > > > > > > > > On Tue, May 14, 2019, at 16:34, George Li wrote:
> > > > > > > > > > >  Hi Colin,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for the updated KIP.  It has very good
> > improvements
> > > of
> > > > > Kafka
> > > > > > > > > > > reassignment operations.
> > > > > > > > > > >
> > > > > > > > > > > One question, looks like the KIP includes the
> > Cancellation
> > > of
> > > > > > > > > > > individual pending reassignments as well when the
> > > > > > > > > > > AlterPartitionReasisgnmentRequest has empty replicas
> for
> > > the
> > > > > > > > > > > topic/partition. Will you also be implementing the the
> > > > > partition
> > > > > > > > > > > cancellation/rollback in the PR ?    If yes,  it will
> > make
> > > > > KIP-236
> > > > > > > (it
> > > > > > > > > > > has PR already) trivial, since the cancel all pending
> > > > > > > reassignments,
> > > > > > > > > > > one just needs to do a ListPartitionRessignmentRequest,
> > > then
> > > > > submit
> > > > > > > > > > > empty replicas for all those topic/partitions in
> > > > > > > > > > > one AlterPartitionReasisgnmentRequest.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > George
> > > > > > > > > > >
> > > > > > > > > > >    On Friday, May 10, 2019, 8:44:31 PM PDT, Colin
> McCabe
> > > > > > > > > > > <cmcc...@apache.org> wrote:
> > > > > > > > > > >
> > > > > > > > > > >  On Fri, May 10, 2019, at 17:34, Colin McCabe wrote:
> > > > > > > > > > > > On Fri, May 10, 2019, at 16:43, Jason Gustafson
> wrote:
> > > > > > > > > > > > > Hi Colin,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think storing reassignment state at the partition
> > > level
> > > > > is
> > > > > > > the
> > > > > > > > > right move
> > > > > > > > > > > > > and I also agree that replicas should understand
> that
> > > > > there is
> > > > > > > a
> > > > > > > > > > > > > reassignment in progress. This makes KIP-352 a
> > trivial
> > > > > > > follow-up
> > > > > > > > > for
> > > > > > > > > > > > > example. The only doubt I have is whether the
> leader
> > > and
> > > > > isr
> > > > > > > znode
> > > > > > > > > is the
> > > > > > > > > > > > > right place to store the target reassignment. It
> is a
> > > bit
> > > > > odd
> > > > > > > to
> > > > > > > > > keep the
> > > > > > > > > > > > > target assignment in a separate place from the
> > current
> > > > > > > assignment,
> > > > > > > > > right? I
> > > > > > > > > > > > > assume the thinking is probably that although the
> > > current
> > > > > > > > > assignment should
> > > > > > > > > > > > > probably be in the leader and isr znode as well, it
> > is
> > > > > hard to
> > > > > > > > > move the
> > > > > > > > > > > > > state in a compatible way. Is that right? But if we
> > > have no
> > > > > > > plan
> > > > > > > > > to remove
> > > > > > > > > > > > > the assignment znode, do you see a downside to
> > storing
> > > the
> > > > > > > target
> > > > > > > > > > > > > assignment there as well?
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Jason,
> > > > > > > > > > > >
> > > > > > > > > > > > That's a good point -- it's probably better to keep
> the
> > > > > target
> > > > > > > > > > > > assignment in the same znode as the current
> assignment,
> > > for
> > > > > > > > > > > > consistency.  I'll change the KIP.
> > > > > > > > > > >
> > > > > > > > > > > Hi Jason,
> > > > > > > > > > >
> > > > > > > > > > > Thanks again for the review.
> > > > > > > > > > >
> > > > > > > > > > > I took another look at this, and I think we should
> stick
> > > with
> > > > > the
> > > > > > > > > > > initial proposal of putting the reassignment state into
> > > > > > > > > > > /brokers/topics/[topic]/partitions/[partitionId]/state.
> > > The
> > > > > > > reason is
> > > > > > > > > > > because we'll want to bump the leader epoch for the
> > > partition
> > > > > when
> > > > > > > > > > > changing the reassignment state, and the leader epoch
> > > resides
> > > > > in
> > > > > > > that
> > > > > > > > > > > znode anyway.  I agree there is some inconsistency
> here,
> > > but
> > > > > so be
> > > > > > > it:
> > > > > > > > > > > if we were to greenfield these zookeeper data
> structures,
> > > we
> > > > > might
> > > > > > > do
> > > > > > > > > > > it differently, but the proposed scheme will work fine
> > and
> > > be
> > > > > > > > > > > extensible for the future.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > A few additional questions:
> > > > > > > > > > > > >
> > > > > > > > > > > > > 1. Should `alterPartitionReassignments` be
> > > > > > > > > `alterPartitionAssignments`?
> > > > > > > > > > > > > It's the current assignment we're altering, right?
> > > > > > > > > > > >
> > > > > > > > > > > > That's fair.  AlterPartitionAssigments reads a little
> > > > > better, and
> > > > > > > > > I'll
> > > > > > > > > > > > change it to that.
> > > > > > > > > > >
> > > > > > > > > > > +1.  I've changed the RPC and API name in the wiki.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 2. Does this change affect the Metadata API? In
> other
> > > > > words,
> > > > > > > are
> > > > > > > > > clients
> > > > > > > > > > > > > aware of reassignments? If so, then we probably
> need
> > a
> > > > > change
> > > > > > > to
> > > > > > > > > > > > > UpdateMetadata as well. The only alternative I can
> > > think of
> > > > > > > would
> > > > > > > > > be to
> > > > > > > > > > > > > represent the replica set in the Metadata request
> as
> > > the
> > > > > union
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > > > > > current and target replicas, but I can't think of
> any
> > > > > benefit
> > > > > > > to
> > > > > > > > > hiding
> > > > > > > > > > > > > reassignments. Note that if we did this, we
> probably
> > > > > wouldn't
> > > > > > > need
> > > > > > > > > a
> > > > > > > > > > > > > separate API to list reassignments.
> > > > > > > > > > > >
> > > > > > > > > > > > I thought about this a bit... and I think on balance,
> > > you're
> > > > > > > right.
> > > > > > > > > We
> > > > > > > > > > > > should keep this information together with the
> replica
> > > > > nodes, isr
> > > > > > > > > > > > nodes, and offline replicas, and that information is
> > > > > available in
> > > > > > > > > the
> > > > > > > > > > > > MetadataResponse.
> > > > > > > > > > > >  However, I do think in order to do this, we'll need
> a
> > > flag
> > > > > in
> > > > > > > the
> > > > > > > > > > > > MetadataRequest that specifiies "only show me
> > reassigning
> > > > > > > > > partitions".
> > > > > > > > > > > > I'll add this.
> > > > > > > > > > >
> > > > > > > > > > > I revisited this, and I think we should stick with the
> > > original
> > > > > > > > > > > proposal of having a separate
> ListPartitionReassignments
> > > API.
> > > > > > > There
> > > > > > > > > > > really is no use case where the Producer or Consumer
> > needs
> > > to
> > > > > know
> > > > > > > > > > > about a reassignment.  They should just be notified
> when
> > > the
> > > > > set of
> > > > > > > > > > > partitions changes, which doesn't require changes to
> > > > > > > > > > > MetadataRequest/Response.  The Admin client only cares
> if
> > > > > someone
> > > > > > > is
> > > > > > > > > > > managing the reassignment.  So adding this state to the
> > > > > > > > > > > MetadataResponse adds overhead for no real benefit.  In
> > the
> > > > > common
> > > > > > > > > case
> > > > > > > > > > > where there is no ongoing reassignment, it would be 4
> > > bytes per
> > > > > > > > > > > partition of extra overhead in the MetadataResponse.
> > > > > > > > > > >
> > > > > > > > > > > In general, I think we have a problem of oversharing in
> > the
> > > > > > > > > > > MetadataRequest/Response.  As we 10x or 100x the number
> > of
> > > > > > > partitions
> > > > > > > > > > > we support, we'll need to get stricter about giving
> > clients
> > > > > only
> > > > > > > the
> > > > > > > > > > > information they actually need, about the partitions
> they
> > > > > actually
> > > > > > > > > care
> > > > > > > > > > > about.  Reassignment state clearly falls in the
> category
> > of
> > > > > state
> > > > > > > that
> > > > > > > > > > > isn't needed by clients (except very specialized
> > > rebalancing
> > > > > > > programs).
> > > > > > > > > > >
> > > > > > > > > > > Another important consideration here is that someone
> > > managing
> > > > > an
> > > > > > > > > > > ongoing reassignment wants the most up-to-date
> > information,
> > > > > which
> > > > > > > is
> > > > > > > > > to
> > > > > > > > > > > be found on the controller.  Therefore adding this
> state
> > to
> > > > > > > listTopics
> > > > > > > > > > > or describeTopics, which could contact any node in the
> > > > > cluster, is
> > > > > > > > > > > sub-optimal.
> > > > > > > > > > >
> > > > > > > > > > > Finally, adding this to listTopics or describeTopics
> > feels
> > > > > like a
> > > > > > > > > warty
> > > > > > > > > > > API.  It's an extra boolean which interacts with other
> > > extra
> > > > > > > booleans
> > > > > > > > > > > like "show internal", etc. in weird ways.  I think a
> > > separate
> > > > > API
> > > > > > > is
> > > > > > > > > > > cleaner.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > 3. As replicas come into sync, they will join the
> > ISR.
> > > > > Will we
> > > > > > > > > await all
> > > > > > > > > > > > > target replicas joining the ISR before taking the
> > > replica
> > > > > out
> > > > > > > of
> > > > > > > > > the target
> > > > > > > > > > > > > replicas set? Also, I assume that target replicas
> can
> > > > > still be
> > > > > > > > > elected as
> > > > > > > > > > > > > leader?
> > > > > > > > > > > >
> > > > > > > > > > > > We'll take a replica out of the target replicas set
> as
> > > soon
> > > > > as
> > > > > > > that
> > > > > > > > > > > > replica is in the ISR.  Let me clarify this in the
> KIP.
> > > > > > > > > > > >
> > > > > > > > > > > > > 4. Probably useful to mention permissions for the
> new
> > > APIs.
> > > > > > > > > > > >
> > > > > > > > > > > > Good point.  I think alterPartitionAssignments should
> > > require
> > > > > > > ALTER
> > > > > > > > > on
> > > > > > > > > > > > CLUSTER.  MetadataRequest permissions will be
> > unchanged.
> > > > > > > > > > >
> > > > > > > > > > > I added permission information.
> > > > > > > > > > >
> > > > > > > > > > > best,
> > > > > > > > > > > Colin
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > best,
> > > > > > > > > > > > Colin
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Jason
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Fri, May 10, 2019 at 9:30 AM Gwen Shapira <
> > > > > > > g...@confluent.io>
> > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > +1 (binding)
> > > > > > > > > > > > > > Looks great, and will be awesome to have this new
> > > > > capability.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Wed, May 8, 2019 at 10:23 PM Colin McCabe <
> > > > > > > cmcc...@apache.org>
> > > > > > > > > wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I'd like to start the vote for KIP-455: Create
> an
> > > > > > > > > Administrative API for
> > > > > > > > > > > > > > > Replica Reassignment.  I think this KIP is
> > > important
> > > > > since
> > > > > > > it
> > > > > > > > > will unlock
> > > > > > > > > > > > > > > many follow-on improvements to Kafka
> reassignment
> > > (see
> > > > > the
> > > > > > > > > "Future work"
> > > > > > > > > > > > > > > section, plus a lot of the other discussions
> > we've
> > > had
> > > > > > > > > recently about
> > > > > > > > > > > > > > > reassignment).  It also furthers the important
> > > KIP-4
> > > > > goal
> > > > > > > of
> > > > > > > > > removing
> > > > > > > > > > > > > > > direct access to ZK.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I made a few changes based on the discussion in
> > the
> > > > > > > [DISCUSS]
> > > > > > > > > thread.  As
> > > > > > > > > > > > > > > Robert suggested, I removed the need to
> > explicitly
> > > > > cancel a
> > > > > > > > > reassignment
> > > > > > > > > > > > > > > for a partition before setting up a different
> > > > > reassignment
> > > > > > > for
> > > > > > > > > that
> > > > > > > > > > > > > > > specific partition.  I also simplified the API
> a
> > > bit by
> > > > > > > adding
> > > > > > > > > a
> > > > > > > > > > > > > > > PartitionReassignment class which is used by
> both
> > > the
> > > > > alter
> > > > > > > > > and list
> > > > > > > > > > > > > > APIs.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I modified the proposal so that we now
> deprecate
> > > the
> > > > > old
> > > > > > > > > znode-based API
> > > > > > > > > > > > > > > rather than removing it completely.  That
> should
> > > give
> > > > > > > external
> > > > > > > > > > > > > > rebalancing
> > > > > > > > > > > > > > > tools some time to transition to the new API.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > To clarify a question Viktor asked, I added a
> > note
> > > > > that the
> > > > > > > > > > > > > > > kafka-reassign-partitions.sh will now use a
> > > > > > > --bootstrap-server
> > > > > > > > > argument
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > contact the admin APIs.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > thanks,
> > > > > > > > > > > > > > > Colin
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > *Gwen Shapira*
> > > > > > > > > > > > > > Product Manager | Confluent
> > > > > > > > > > > > > > 650.450.2760 | @gwenshap
> > > > > > > > > > > > > > Follow us: Twitter <
> > https://twitter.com/ConfluentInc>
> > > |
> > > > > blog
> > > > > > > > > > > > > > <http://www.confluent.io/blog>
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Gwen Shapira
> > > > > Product Manager | Confluent
> > > > > 650.450.2760 | @gwenshap
> > > > > Follow us: Twitter | blog
> > > > >
> > > >
> > >
> >
>

Reply via email to