Hi, Colin,

Thanks for the KIP. Sorry for the late reply. LGTM overall. A few detailed
comments below.

10. The KIP adds two new fields (AddingReplicas and RemovingReplicas) to
LeaderAndIsr request. Could you explain how these 2 fields will be used?
Should we include those two fields in UpdateMetadata and potentially
Metadata requests too?

11. "If a new reassignment is issued during an on-going one, we cancel the
current one by emptying out both AR and RR, constructing them from (the
updated from the last-reassignment) R and TR, and starting anew." In this
case, it seems that the controller needs to issue a StopReplica request to
remove those unneeded replicas.

12. "Essentially, once a cancellation is called we subtract AR from R,
empty out both AR and RR, and send LeaderAndIsr requests to cancel the
replica movements that have not yet completed." Similar to the above, it
seems the controller needs to issue a StopReplica request to remove those
unneeded replicas.

13. Since we changed the format of the topics/[topic] zNode, should we bump
up the version number in the json value?

Jun

On Mon, Jul 22, 2019 at 8:38 AM Colin McCabe <cmcc...@apache.org> wrote:

> Hi all,
>
> With three non-binding +1 votes from Viktor Somogyi-Vass, Robert Barrett,
> and George Li, and 3 binding +1 votes from Gwen Shapira, Jason Gustafson,
> and myself, the vote passes.  Thanks, everyone!
>
> best,
> Colin
>
> On Fri, Jul 19, 2019, at 08:55, Robert Barrett wrote:
> > +1 (non-binding). Thanks for the KIP!
> >
> > On Thu, Jul 18, 2019 at 5:59 PM George Li <sql_consult...@yahoo.com
> .invalid>
> > wrote:
> >
> > >  +1 (non-binding)
> > >
> > >
> > >
> > > Thanks for addressing the comments.
> > > George
> > >
> > >     On Thursday, July 18, 2019, 05:03:58 PM PDT, Gwen Shapira <
> > > g...@confluent.io> wrote:
> > >
> > >  Renewing my +1, thank you Colin and Stan for working through all the
> > > questions, edge cases, requests and alternatives. We ended up with a
> > > great protocol.
> > >
> > > On Thu, Jul 18, 2019 at 4:54 PM Jason Gustafson <ja...@confluent.io>
> > > wrote:
> > > >
> > > > +1 Thanks for the KIP. Really looking forward to this!
> > > >
> > > > -Jason
> > > >
> > > > On Wed, Jul 17, 2019 at 1:41 PM Colin McCabe <cmcc...@apache.org>
> wrote:
> > > >
> > > > > Thanks, Stanislav.  Let's restart the vote to reflect the fact that
> > > we've
> > > > > made significant changes.  The new vote will go for 3 days as
> usual.
> > > > >
> > > > > I'll start with my +1 (binding).
> > > > >
> > > > > best,
> > > > > Colin
> > > > >
> > > > >
> > > > > On Wed, Jul 17, 2019, at 08:56, Stanislav Kozlovski wrote:
> > > > > > Hey everybody,
> > > > > >
> > > > > > We have further iterated on the KIP in the accompanying
> discussion
> > > thread
> > > > > > and I'd like to propose we resume the vote.
> > > > > >
> > > > > > Some notable changes:
> > > > > > - we will store reassignment information in the
> > > `/brokers/topics/[topic]`
> > > > > > - we will internally use two collections to represent a
> reassignment
> > > -
> > > > > > "addingReplicas" and "removingReplicas". LeaderAndIsr has been
> > > updated
> > > > > > accordingly
> > > > > > - the Alter API will still use the "targetReplicas" collection,
> but
> > > the
> > > > > > List API will now return three separate collections - the full
> > > replica
> > > > > set,
> > > > > > the replicas we are adding as part of this reassignment
> > > > > ("addingReplicas")
> > > > > > and the replicas we are removing ("removingReplicas")
> > > > > > - cancellation of a reassignment now means a proper rollback of
> the
> > > > > > assignment to its original state prior to the API call
> > > > > >
> > > > > > As always, you can re-read the KIP here
> > > > > >
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-455%3A+Create+an+Administrative+API+for+Replica+Reassignment
> > > > > >
> > > > > > Best,
> > > > > > Stanislav
> > > > > >
> > > > > > On Wed, May 22, 2019 at 6:12 PM Colin McCabe <cmcc...@apache.org
> >
> > > wrote:
> > > > > >
> > > > > > > Hi George,
> > > > > > >
> > > > > > > Thanks for taking a look.  I am working on getting a PR done
> as a
> > > > > > > proof-of-concept.  I'll post it soon.  Then we'll finish up the
> > > vote.
> > > > > > >
> > > > > > > best,
> > > > > > > Colin
> > > > > > >
> > > > > > > On Tue, May 21, 2019, at 17:33, George Li wrote:
> > > > > > > >  Hi Colin,
> > > > > > > >
> > > > > > > >  Great! Looking forward to these features.    +1
> (non-binding)
> > > > > > > >
> > > > > > > > What is the estimated timeline to have this implemented?  If
> any
> > > help
> > > > > > > > is needed in the implementation of cancelling
> reassignments,  I
> > > can
> > > > > > > > help if there is spare cycle.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > George
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >    On Thursday, May 16, 2019, 9:48:56 AM PDT, Colin McCabe
> > > > > > > > <cmcc...@apache.org> wrote:
> > > > > > > >
> > > > > > > >  Hi George,
> > > > > > > >
> > > > > > > > Yes, KIP-455 allows the reassignment of individual
> partitions to
> > > be
> > > > > > > > cancelled.  I think it's very important for these operations
> to
> > > be at
> > > > > > > > the partition level.
> > > > > > > >
> > > > > > > > best,
> > > > > > > > Colin
> > > > > > > >
> > > > > > > > On Tue, May 14, 2019, at 16:34, George Li wrote:
> > > > > > > > >  Hi Colin,
> > > > > > > > >
> > > > > > > > > Thanks for the updated KIP.  It has very good improvements
> of
> > > Kafka
> > > > > > > > > reassignment operations.
> > > > > > > > >
> > > > > > > > > One question, looks like the KIP includes the Cancellation
> of
> > > > > > > > > individual pending reassignments as well when the
> > > > > > > > > AlterPartitionReasisgnmentRequest has empty replicas for
> the
> > > > > > > > > topic/partition. Will you also be implementing the the
> > > partition
> > > > > > > > > cancellation/rollback in the PR ?    If yes,  it will make
> > > KIP-236
> > > > > (it
> > > > > > > > > has PR already) trivial, since the cancel all pending
> > > > > reassignments,
> > > > > > > > > one just needs to do a ListPartitionRessignmentRequest,
> then
> > > submit
> > > > > > > > > empty replicas for all those topic/partitions in
> > > > > > > > > one AlterPartitionReasisgnmentRequest.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > George
> > > > > > > > >
> > > > > > > > >    On Friday, May 10, 2019, 8:44:31 PM PDT, Colin McCabe
> > > > > > > > > <cmcc...@apache.org> wrote:
> > > > > > > > >
> > > > > > > > >  On Fri, May 10, 2019, at 17:34, Colin McCabe wrote:
> > > > > > > > > > On Fri, May 10, 2019, at 16:43, Jason Gustafson wrote:
> > > > > > > > > > > Hi Colin,
> > > > > > > > > > >
> > > > > > > > > > > I think storing reassignment state at the partition
> level
> > > is
> > > > > the
> > > > > > > right move
> > > > > > > > > > > and I also agree that replicas should understand that
> > > there is
> > > > > a
> > > > > > > > > > > reassignment in progress. This makes KIP-352 a trivial
> > > > > follow-up
> > > > > > > for
> > > > > > > > > > > example. The only doubt I have is whether the leader
> and
> > > isr
> > > > > znode
> > > > > > > is the
> > > > > > > > > > > right place to store the target reassignment. It is a
> bit
> > > odd
> > > > > to
> > > > > > > keep the
> > > > > > > > > > > target assignment in a separate place from the current
> > > > > assignment,
> > > > > > > right? I
> > > > > > > > > > > assume the thinking is probably that although the
> current
> > > > > > > assignment should
> > > > > > > > > > > probably be in the leader and isr znode as well, it is
> > > hard to
> > > > > > > move the
> > > > > > > > > > > state in a compatible way. Is that right? But if we
> have no
> > > > > plan
> > > > > > > to remove
> > > > > > > > > > > the assignment znode, do you see a downside to storing
> the
> > > > > target
> > > > > > > > > > > assignment there as well?
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Jason,
> > > > > > > > > >
> > > > > > > > > > That's a good point -- it's probably better to keep the
> > > target
> > > > > > > > > > assignment in the same znode as the current assignment,
> for
> > > > > > > > > > consistency.  I'll change the KIP.
> > > > > > > > >
> > > > > > > > > Hi Jason,
> > > > > > > > >
> > > > > > > > > Thanks again for the review.
> > > > > > > > >
> > > > > > > > > I took another look at this, and I think we should stick
> with
> > > the
> > > > > > > > > initial proposal of putting the reassignment state into
> > > > > > > > > /brokers/topics/[topic]/partitions/[partitionId]/state.
> The
> > > > > reason is
> > > > > > > > > because we'll want to bump the leader epoch for the
> partition
> > > when
> > > > > > > > > changing the reassignment state, and the leader epoch
> resides
> > > in
> > > > > that
> > > > > > > > > znode anyway.  I agree there is some inconsistency here,
> but
> > > so be
> > > > > it:
> > > > > > > > > if we were to greenfield these zookeeper data structures,
> we
> > > might
> > > > > do
> > > > > > > > > it differently, but the proposed scheme will work fine and
> be
> > > > > > > > > extensible for the future.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > A few additional questions:
> > > > > > > > > > >
> > > > > > > > > > > 1. Should `alterPartitionReassignments` be
> > > > > > > `alterPartitionAssignments`?
> > > > > > > > > > > It's the current assignment we're altering, right?
> > > > > > > > > >
> > > > > > > > > > That's fair.  AlterPartitionAssigments reads a little
> > > better, and
> > > > > > > I'll
> > > > > > > > > > change it to that.
> > > > > > > > >
> > > > > > > > > +1.  I've changed the RPC and API name in the wiki.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 2. Does this change affect the Metadata API? In other
> > > words,
> > > > > are
> > > > > > > clients
> > > > > > > > > > > aware of reassignments? If so, then we probably need a
> > > change
> > > > > to
> > > > > > > > > > > UpdateMetadata as well. The only alternative I can
> think of
> > > > > would
> > > > > > > be to
> > > > > > > > > > > represent the replica set in the Metadata request as
> the
> > > union
> > > > > of
> > > > > > > the
> > > > > > > > > > > current and target replicas, but I can't think of any
> > > benefit
> > > > > to
> > > > > > > hiding
> > > > > > > > > > > reassignments. Note that if we did this, we probably
> > > wouldn't
> > > > > need
> > > > > > > a
> > > > > > > > > > > separate API to list reassignments.
> > > > > > > > > >
> > > > > > > > > > I thought about this a bit... and I think on balance,
> you're
> > > > > right.
> > > > > > > We
> > > > > > > > > > should keep this information together with the replica
> > > nodes, isr
> > > > > > > > > > nodes, and offline replicas, and that information is
> > > available in
> > > > > > > the
> > > > > > > > > > MetadataResponse.
> > > > > > > > > >  However, I do think in order to do this, we'll need a
> flag
> > > in
> > > > > the
> > > > > > > > > > MetadataRequest that specifiies "only show me reassigning
> > > > > > > partitions".
> > > > > > > > > > I'll add this.
> > > > > > > > >
> > > > > > > > > I revisited this, and I think we should stick with the
> original
> > > > > > > > > proposal of having a separate ListPartitionReassignments
> API.
> > > > > There
> > > > > > > > > really is no use case where the Producer or Consumer needs
> to
> > > know
> > > > > > > > > about a reassignment.  They should just be notified when
> the
> > > set of
> > > > > > > > > partitions changes, which doesn't require changes to
> > > > > > > > > MetadataRequest/Response.  The Admin client only cares if
> > > someone
> > > > > is
> > > > > > > > > managing the reassignment.  So adding this state to the
> > > > > > > > > MetadataResponse adds overhead for no real benefit.  In the
> > > common
> > > > > > > case
> > > > > > > > > where there is no ongoing reassignment, it would be 4
> bytes per
> > > > > > > > > partition of extra overhead in the MetadataResponse.
> > > > > > > > >
> > > > > > > > > In general, I think we have a problem of oversharing in the
> > > > > > > > > MetadataRequest/Response.  As we 10x or 100x the number of
> > > > > partitions
> > > > > > > > > we support, we'll need to get stricter about giving clients
> > > only
> > > > > the
> > > > > > > > > information they actually need, about the partitions they
> > > actually
> > > > > > > care
> > > > > > > > > about.  Reassignment state clearly falls in the category of
> > > state
> > > > > that
> > > > > > > > > isn't needed by clients (except very specialized
> rebalancing
> > > > > programs).
> > > > > > > > >
> > > > > > > > > Another important consideration here is that someone
> managing
> > > an
> > > > > > > > > ongoing reassignment wants the most up-to-date information,
> > > which
> > > > > is
> > > > > > > to
> > > > > > > > > be found on the controller.  Therefore adding this state to
> > > > > listTopics
> > > > > > > > > or describeTopics, which could contact any node in the
> > > cluster, is
> > > > > > > > > sub-optimal.
> > > > > > > > >
> > > > > > > > > Finally, adding this to listTopics or describeTopics feels
> > > like a
> > > > > > > warty
> > > > > > > > > API.  It's an extra boolean which interacts with other
> extra
> > > > > booleans
> > > > > > > > > like "show internal", etc. in weird ways.  I think a
> separate
> > > API
> > > > > is
> > > > > > > > > cleaner.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 3. As replicas come into sync, they will join the ISR.
> > > Will we
> > > > > > > await all
> > > > > > > > > > > target replicas joining the ISR before taking the
> replica
> > > out
> > > > > of
> > > > > > > the target
> > > > > > > > > > > replicas set? Also, I assume that target replicas can
> > > still be
> > > > > > > elected as
> > > > > > > > > > > leader?
> > > > > > > > > >
> > > > > > > > > > We'll take a replica out of the target replicas set as
> soon
> > > as
> > > > > that
> > > > > > > > > > replica is in the ISR.  Let me clarify this in the KIP.
> > > > > > > > > >
> > > > > > > > > > > 4. Probably useful to mention permissions for the new
> APIs.
> > > > > > > > > >
> > > > > > > > > > Good point.  I think alterPartitionAssignments should
> require
> > > > > ALTER
> > > > > > > on
> > > > > > > > > > CLUSTER.  MetadataRequest permissions will be unchanged.
> > > > > > > > >
> > > > > > > > > I added permission information.
> > > > > > > > >
> > > > > > > > > best,
> > > > > > > > > Colin
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > best,
> > > > > > > > > > Colin
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Jason
> > > > > > > > > > >
> > > > > > > > > > > On Fri, May 10, 2019 at 9:30 AM Gwen Shapira <
> > > > > g...@confluent.io>
> > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > +1 (binding)
> > > > > > > > > > > > Looks great, and will be awesome to have this new
> > > capability.
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, May 8, 2019 at 10:23 PM Colin McCabe <
> > > > > cmcc...@apache.org>
> > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > Hi all,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I'd like to start the vote for KIP-455: Create an
> > > > > > > Administrative API for
> > > > > > > > > > > > > Replica Reassignment.  I think this KIP is
> important
> > > since
> > > > > it
> > > > > > > will unlock
> > > > > > > > > > > > > many follow-on improvements to Kafka reassignment
> (see
> > > the
> > > > > > > "Future work"
> > > > > > > > > > > > > section, plus a lot of the other discussions we've
> had
> > > > > > > recently about
> > > > > > > > > > > > > reassignment).  It also furthers the important
> KIP-4
> > > goal
> > > > > of
> > > > > > > removing
> > > > > > > > > > > > > direct access to ZK.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I made a few changes based on the discussion in the
> > > > > [DISCUSS]
> > > > > > > thread.  As
> > > > > > > > > > > > > Robert suggested, I removed the need to explicitly
> > > cancel a
> > > > > > > reassignment
> > > > > > > > > > > > > for a partition before setting up a different
> > > reassignment
> > > > > for
> > > > > > > that
> > > > > > > > > > > > > specific partition.  I also simplified the API a
> bit by
> > > > > adding
> > > > > > > a
> > > > > > > > > > > > > PartitionReassignment class which is used by both
> the
> > > alter
> > > > > > > and list
> > > > > > > > > > > > APIs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I modified the proposal so that we now deprecate
> the
> > > old
> > > > > > > znode-based API
> > > > > > > > > > > > > rather than removing it completely.  That should
> give
> > > > > external
> > > > > > > > > > > > rebalancing
> > > > > > > > > > > > > tools some time to transition to the new API.
> > > > > > > > > > > > >
> > > > > > > > > > > > > To clarify a question Viktor asked, I added a note
> > > that the
> > > > > > > > > > > > > kafka-reassign-partitions.sh will now use a
> > > > > --bootstrap-server
> > > > > > > argument
> > > > > > > > > > > > to
> > > > > > > > > > > > > contact the admin APIs.
> > > > > > > > > > > > >
> > > > > > > > > > > > > thanks,
> > > > > > > > > > > > > Colin
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > *Gwen Shapira*
> > > > > > > > > > > > Product Manager | Confluent
> > > > > > > > > > > > 650.450.2760 | @gwenshap
> > > > > > > > > > > > Follow us: Twitter <https://twitter.com/ConfluentInc>
> |
> > > blog
> > > > > > > > > > > > <http://www.confluent.io/blog>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> > >
> > >
> > > --
> > > Gwen Shapira
> > > Product Manager | Confluent
> > > 650.450.2760 | @gwenshap
> > > Follow us: Twitter | blog
> > >
> >
>

Reply via email to