Re: [DISCUSS] KIP-1335: Bounded concurrency for partition reassignment via kafka-reassign-partitions.sh

Manan Gupta Mon, 08 Jun 2026 02:09:37 -0700

Hey Luke

Was I able to answer your questions? LMK if you need any additional info.


Regards,
Manan Gupta

On Tue, Jun 2, 2026 at 11:37 AM Manan Gupta <[email protected]> wrote:

> Hey Luke
>
> Thanks, that is a fair concern when the reassignment tool is embedded in
> something that assumes kafka-reassign-partitions.sh returns quickly (for
> example a short-lived script or a controller reconcile loop that blocks on
> one subprocess).
> A few clarifications on what is going on today:
>
> Where the “long run” lives
> The pacing loops run inside the tool process (or an in-process Admin if
> someone calls the command entry point from Java). They do not change the
> broker contract: each alterPartitionReassignments call is still bounded and
> already returns per-partition futures for acceptance of the reassignment.
> What is long-running is the optional wait between steps (non-incremental)
> or the pipeline driver (incremental), which repeatedly uses the normal read
> APIs (listPartitionReassignments, metadata/describe-style reads) until
> replicas match the target. That “wait for replication” work cannot be
> turned into a single future today; the cluster does not expose
> “reassignment fully complete” as one shot per partition on the alter result
> itself, so any implementation—tool or operator—must poll or re-check state
> unless it exits and delegates that to something else (as with separate
> --verify).
>
> Relationship to the classic execute vs verify split
> The non-blocking pattern you describe is already the legacy model:
> --execute submits and returns; --verify (or another process) observes
> progress. This KIP adds optional blocking in the tool on purpose so
> operators who want pacing do not have to hand-chunk JSON and orchestrate
> waves themselves. If a deployment must not hold a process open, they can
> still use --reassignment-batch-size 0 (legacy one-shot execute + verify),
> or external automation that submits smaller JSON files and sleeps between
> runs—same traffic shape, more moving parts for the operator.
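
For the external-automation route described above, hand-chunking a plan could be sketched roughly as follows. This is only an illustration: it assumes the usual reassignment JSON layout (a top-level "version" plus a "partitions" list), and chunk_plan and the sample plan are hypothetical names, not part of the tool or the KIP.

```python
import json


def chunk_plan(plan, batch_size):
    """Split a reassignment plan into plans of at most batch_size partitions.

    Assumption: batch_size <= 0 means "do not split", loosely mirroring the
    one-shot behaviour the thread describes for --reassignment-batch-size 0.
    """
    if batch_size <= 0:
        return [plan]
    parts = plan["partitions"]
    return [
        {"version": plan.get("version", 1), "partitions": parts[i:i + batch_size]}
        for i in range(0, len(parts), batch_size)
    ]


# Hypothetical five-partition plan, used only to exercise the helper.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "test-1", "partition": p, "replicas": [1, 2, 3]}
        for p in range(5)
    ],
}

chunks = chunk_plan(plan, 2)
for i, chunk in enumerate(chunks, start=1):
    # Each chunk could be written to its own file and submitted separately.
    print(f"chunk {i}: {json.dumps(chunk)}")
```

Each chunk would then be submitted with its own --execute run, with --verify or a sleep in between, which is exactly the orchestration the paced mode is meant to take off the operator's hands.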
>
> > Return futures so the admin client can check
> For the submit step, the client already gets futures per partition from
> alterPartitionReassignments. For completion, there is no single future to
> return that replaces polling; you would either keep polling inside the
> client library (same duration, different API shape) or push that
> responsibility to the caller. Refactoring the shell tool into a stateful
> “resume” CLI or a library API that streams progress events could be useful,
> but it is a larger follow-up (new UX, persistence, idempotency) rather than
> a small tweak to this KIP.
>
> Practical guidance for K8s / operators
> For controllers that cannot block, the intended pattern is to not wrap the
> blocking paced mode in the reconcile path: run reassignment as a Job, a
> sidecar, or use Admin directly with your own bounded reconcile loop and
> timeouts. Paced mode targets interactive or batch maintenance workflows
> where holding one client open is acceptable.
> Paced --execute only blocks inside the tool process; broker and Admin RPC
> semantics are unchanged. --verify already polls for completion, so a
> long-lived client for observation is not new—this KIP adds optional waits
> between submits so operators are not forced to hand-chunk JSON.
>
> If you want a non-blocking paced mode (e.g. “submit only this step and
> exit” with a marker file), that would be worth a separate discussion or KIP
> so we do not overload this one.
>
> Regards
> Manan Gupta
>
> On Tue, Jun 2, 2026 at 8:10 AM Luke Chen <[email protected]> wrote:
>
>> Hi Manan,
>>
>> LC3: Thanks for updating the KIP to make it clear.
>>
>> LC4: Thanks for the explanation.
>> But that makes me realize that the batch mode (incremental or
>> non-incremental) is a long-running admin client process.
>> If I remember correctly, in admin client, we try not to make each
>> operation a long-running process, so we can see there are operations that
>> return futures to the admin client, or like the "--execute" and "--verify"
>> example in reassignment operations.
>> Making it a long-running operation will block other operations if it's run
>> within a script or K8S operator.
>> Could we change that?
>> For example, we return a list of futures for each partition, and the admin
>> client can check the future status to know whether the specific partition
>> has been submitted or not?
>>
>> Thanks,
>> Luke
>>
>> On Mon, Jun 1, 2026 at 6:18 PM Manan Gupta <[email protected]> wrote:
>>
>> > Hey Luke
>> >
>> > LC1: Sure, I have updated the KIP now with the example.
>> >
>> > > LC3: How does the batch mode know that all N partitions are completed?
>> >
>> > Batch mode does poll. After each alterPartitionReassignments call for a
>> > step, the tool does not infer completion from that RPC alone: the alter
>> > returns when the controller has accepted the reassignment, not when
>> > replication has fully caught up.
>> > Between steps, the tool enters a wait loop: it uses the Admin client to
>> > read the cluster’s current reassignment and replica state for the
>> > partitions in that step, applies the same completion check the
>> > reassignment tool already uses for verification (partition no longer in
>> > an active reassignment and the live replica set matches the target in
>> > the JSON), sleeps for --reassignment-poll-interval-ms, and repeats until
>> > every partition in that step satisfies that condition. Only then does it
>> > submit the next step.
>> > So “wait until complete” is implemented as repeated observation + sleep,
>> > not a single blocking call that magically completes when replication
>> > finishes. The KIP text has been updated to spell this out so it is not
>> > mistaken for a passive wait with no polling.
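
The observe-and-sleep loop described above can be sketched in a few lines of Python. This is a simplified model, not the tool's code: read_state is a hypothetical callable standing in for the Admin reads, and comparing replicas as ordered lists is an assumption made for the illustration.

```python
import time


def is_complete(tp, target_replicas, active_reassignments, current_replicas):
    """Completion condition from the thread: the partition is no longer in an
    active reassignment AND its live replicas match the target in the JSON."""
    return (tp not in active_reassignments
            and current_replicas.get(tp) == target_replicas)


def wait_for_step(step, read_state, poll_interval_ms=1000, sleep=time.sleep):
    """Poll until every partition in `step` (partition -> target replicas)
    satisfies is_complete. Returns the number of polls performed.

    read_state() is hypothetical and returns a pair:
    (set of partitions with an active reassignment, partition -> replicas).
    There is deliberately no deadline, matching the behaviour described.
    """
    polls = 0
    while True:
        active, current = read_state()
        polls += 1
        if all(is_complete(tp, target, active, current)
               for tp, target in step.items()):
            return polls
        sleep(poll_interval_ms / 1000.0)


# Fake cluster: the first read shows the move in progress, the second shows it done.
states = iter([
    ({("test-1", 0)}, {("test-1", 0): [1, 2, 3, 4]}),
    (set(), {("test-1", 0): [2, 3, 4]}),
])
polls = wait_for_step({("test-1", 0): [2, 3, 4]},
                      lambda: next(states),
                      sleep=lambda seconds: None)  # skip real sleeping in the demo
print(polls)
```

Only after wait_for_step returns would the next step be submitted, which is why the slowest partition in a step gates the one after it.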
>> >
>> >
>> > > LC4: What will it show when some partitions are still waiting to be
>> > > progressed?
>> >
>> > We can separate two things: stdout from --execute, and --verify (a
>> > separate command).
>> > Non-incremental batch (--reassignment-batch-size without --incremental)
>> > The tool prints how many batches there will be, then for each step lines
>> > such as "starting batch i of n" and "waiting for batch i to complete
>> > before the next." That matches what we saw in testing, for example:
>> >
>> > ```
>> > Submitting partition reassignments in 6 batches of up to 2 partitions each.
>> > Starting reassignment batch 1 of 6 (2 partitions)...
>> > Waiting for reassignment batch 1 of 6 to complete before starting the next batch.
>> > ```
>> > then the same pattern for batch 2, and so on.
>> >
>> > During the “Waiting …” phase there is no per-partition line item for
>> > “still copying” or for partitions not yet submitted in later batches;
>> > those partitions are simply not in flight until their batch starts. If
>> > someone needs partition-level status during that time, they can run
>> > --verify in another terminal or use cluster metrics; --verify still only
>> > distinguishes completed vs in progress for partitions that are part of
>> > the plan and reflected in metadata / reassignment state, not “waiting in
>> > a future batch” as a distinct label.
>> >
>> > Incremental (--incremental)
>> > After the one-line mode banner, the tool emits a line each time a
>> > partition finishes and the next is submitted, for example:
>> >
>> > ```
>> > Incremental mode: keeping up to 2 partition reassignments in flight until all have been submitted.
>> > Partition test-1-0 finished reassignment; submitting next from queue if any.
>> > ```
>> > (and similarly for test-1-1, test-10-1, test-10-0, …)
>> > So incremental mode already gives clearer liveness than batch-only
>> > waits: you see completions as they happen, which helps distinguish
>> > “working” from “stuck” better than the batch wait lines alone.
>> >
>> >
>> > > LC5: indefinite polling
>> > Today there is no maximum wait time on the batch-completion loops: the
>> > tool keeps periodically re-reading cluster state until every partition
>> > in the current step satisfies the completion condition, or the operator
>> > stops the process. If reassignments are slow rather than stuck (which is
>> > common when strict inter-broker or replica throttles are applied), the
>> > wait can legitimately take a long time; that is expected and not by
>> > itself a sign of a hang.
>> > Because there is no built-in deadline yet, operators who need to stop
>> > should interrupt the tool and use the supported cancel path (--cancel
>> > with an appropriate JSON) if they want to back out active reassignments,
>> > then reassess throttles, plan size, or pacing. Adding a dedicated
>> > reassignment wait timeout would be a follow-up: it needs clear semantics
>> > (what happens on expiry, how that interacts with partial plans and the
>> > existing --timeout flag used for log directory moves), which is why this
>> > KIP does not introduce that knob yet.
>> >
>> >
>> > > LC6: Default poll interval
>> >
>> > Agreed that a 500 ms default is aggressive from a controller-load
>> > perspective for clusters that already list reassignments often. The
>> > implementation default has been raised to 1000 ms (1 second) for both
>> > the inter-step wait path and the incremental loop, and the KIP documents
>> > that default accordingly. Operators who want less Admin traffic can set
>> > --reassignment-poll-interval-ms higher (for example 3–5 seconds); the
>> > flag exists so that trade-off is explicit and tunable per environment.
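
To put rough numbers on that trade-off, the arithmetic below counts poll iterations per hour of waiting at the intervals mentioned in this thread. It is a back-of-the-envelope sketch: it assumes one iteration per interval and ignores the time each read takes, and a single iteration may issue more than one Admin request.

```python
def polls_per_hour(poll_interval_ms):
    """Poll iterations in one hour of waiting, ignoring per-request latency."""
    return 3_600_000 // poll_interval_ms


for interval_ms in (500, 1000, 3000, 5000):
    print(f"{interval_ms:>5} ms -> {polls_per_hour(interval_ms)} polls/hour")
# 500 ms  -> 7200 polls/hour
# 1000 ms -> 3600 polls/hour
# 3000 ms -> 1200 polls/hour
# 5000 ms -> 720 polls/hour
```

Moving from 500 ms to the new 1000 ms default halves the polling rate, and 5 seconds cuts it by a factor of ten, at the cost of noticing completion up to one interval later.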
>> >
>> > Regards,
>> > Manan Gupta
>> >
>> > On Mon, Jun 1, 2026 at 1:16 PM Luke Chen <[email protected]> wrote:
>> >
>> > > Hi Manan,
>> > >
>> > > LC1: Thanks for the explanation. It's clear to me now.
>> > > I think we should also put this example and the "How to choose" part
>> > > in the KIP.
>> > >
>> > > Some more questions:
>> > > LC3. How does the batch mode know that all N partitions are completed
>> > > and then start the next batch?
>> > > It looks like we don't poll the status when in batch mode. How do we
>> > > know that?
>> > >
>> > > LC4. What will it show when some partitions are still waiting to be
>> > > progressed?
>> > > Currently, the --verify only shows "is completed" or "is still in
>> > > progress".
>> > > Should we have an output for the partitions that are sitting in the
>> > > batch queue?
>> > >
>> > > LC5. As you've pointed out, there could be a possibility that it will
>> > > poll indefinitely.
>> > > Why can't we set a timer for it?
>> > > Any concerns about it?
>> > >
>> > > LC6. "reassignment-poll-interval-ms" default to 500ms is too
>> > > aggressive.
>> > > I think from users' perspective, any interval < 3 seconds or 5
>> > > seconds is considered acceptable.
>> > > So could we increase it to at least 1 second?
>> > >
>> > > Thank you,
>> > > Luke
>> > >
>> > > On Mon, Jun 1, 2026 at 3:50 PM Manan Gupta <[email protected]> wrote:
>> > >
>> > > > Hey Luke
>> > > > Thank you for reviewing the proposal.
>> > > >
>> > > > LC1:
>> > > > Please excuse me if my explanation of the two different modes was
>> > > > unclear.
>> > > >
>> > > > In non-incremental mode the tool walks the plan in steps. Each step
>> > > > submits up to N partition reassignments, then waits until every
>> > > > partition in that step has finished before it opens the next step.
>> > > > The slowest partition in the current step holds up the entire next
>> > > > step.
>> > > >
>> > > > In incremental mode N is not “how big each step is.” It is how many
>> > > > partition reassignments from this plan may be active at the same
>> > > > time. The tool keeps refilling up to N: whenever any single partition
>> > > > completes, it can start the next one from the queue. There is no rule
>> > > > that the whole group of N must finish together before new work
>> > > > starts.
>> > > >
>> > > > Example: 10 partitions in sorted order P1 through P10, N equals 3.
>> > > >
>> > > > Non-incremental: Step one submits P1 P2 P3 and waits until all three
>> > > > are done. Step two submits P4 P5 P6 and waits until all three are
>> > > > done. Step three submits P7 P8 P9 and waits until all three are done.
>> > > > Step four submits P10 only. If P3 is slow, P4 cannot start until P3
>> > > > finishes, even if P1 and P2 are already done.
>> > > >
>> > > > Incremental: The tool first submits P1 P2 P3 so three reassignments
>> > > > are active. If P2 finishes first, it can submit P4 while P1 and P3
>> > > > are still running, still keeping three active when possible. It
>> > > > continues that way until every partition in the plan has been
>> > > > submitted and the in-flight work has drained. If P3 is slow, P4 can
>> > > > still start as soon as some other slot frees up.
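
The P1 through P10 example above can be replayed with a small simulation. This is a toy model rather than the tool's implementation: the durations are made up (every partition takes 1 time unit except a slow P3 at 10), and batch_schedule and incremental_schedule are hypothetical helper names.

```python
import heapq


def batch_schedule(durations, n):
    """Non-incremental: submit up to n, wait for all of them, then open the next step."""
    names = list(durations)
    start, clock = {}, 0
    for i in range(0, len(names), n):
        step = names[i:i + n]
        for p in step:
            start[p] = clock
        clock += max(durations[p] for p in step)  # slowest partition gates the step
    return start, clock


def incremental_schedule(durations, n):
    """Incremental: keep up to n in flight and refill as soon as any one finishes."""
    queue = list(durations)
    in_flight = []  # min-heap of (finish_time, partition)
    start, clock = {}, 0
    while queue or in_flight:
        while queue and len(in_flight) < n:
            p = queue.pop(0)
            start[p] = clock
            heapq.heappush(in_flight, (clock + durations[p], p))
        clock, _ = heapq.heappop(in_flight)  # advance to the next completion
    return start, clock


# Made-up durations: every partition takes 1 unit except a slow P3.
durations = {f"P{i}": 1 for i in range(1, 11)}
durations["P3"] = 10

batch_start, batch_total = batch_schedule(durations, 3)
incr_start, incr_total = incremental_schedule(durations, 3)
print("non-incremental: P4 starts at", batch_start["P4"], "total", batch_total)
print("incremental:     P4 starts at", incr_start["P4"], "total", incr_total)
# non-incremental: P4 starts at 10 total 13
# incremental:     P4 starts at 1 total 10
```

With these made-up numbers, P4 waits for P3 in non-incremental mode but starts as soon as P1 frees a slot in incremental mode, and the whole plan finishes when P3 does.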
>> > > >
>> > > > How to choose: use non-incremental if you want clear steps and a
>> > > > strict “this whole batch finished before the next batch starts”
>> > > > story. Use incremental if you want steadier utilization when finish
>> > > > times differ and you do not want one slow partition to block starting
>> > > > unrelated partitions beyond the cap of N at once.
>> > > >
>> > > > LC2:
>> > > > Both these values are the same; I have updated the KIP to reflect
>> > > > that now.
>> > > >
>> > > > Regards
>> > > > Manan Gupta
>> > > >
>> > > >
>> > > > On Mon, Jun 1, 2026 at 9:52 AM Luke Chen <[email protected]> wrote:
>> > > >
>> > > > > Hi Manan,
>> > > > >
>> > > > > Thanks for the KIP.
>> > > > > This is a good improvement.
>> > > > >
>> > > > > Questions:
>> > > > > 1. After reading the KIP, I still don't understand the difference
>> > > > > between "incremental mode" and "non-incremental mode".
>> > > > > From what I can see, they both run with reassignment-batch-size
>> > > > > partitions at a time.
>> > > > > What's the difference between them?
>> > > > > Could you explain more?
>> > > > > Maybe some examples would help users understand the difference
>> > > > > and how to choose between them.
>> > > > >
>> > > > >
>> > > > > 2. I see there are "INCREMENTAL_REASSIGNMENT_POLL_INTERVAL_MS" and
>> > > > > "reassignment-poll-interval-ms".
>> > > > > What's the difference between them?
>> > > > >
>> > > > >
>> > > > > Thank you,
>> > > > > Luke
>> > > > >
>> > > > >
>> > > > > On Mon, May 25, 2026 at 11:06 PM Manan Gupta <[email protected]> wrote:
>> > > > >
>> > > > > > Hey TaiJuWu
>> > > > > >
>> > > > > > Thank you for reviewing the KIP, my response is inline.
>> > > > > >
>> > > > > > > TJ00: If we have multiple batch requests, how do you handle
>> > > > > > > single batch failure?
>> > > > > > - If a submit step fails, the tool returns immediately with
>> > > > > > errors and does not enqueue the rest; partitions already
>> > > > > > submitted stay under the controller’s reassignment as they do
>> > > > > > today.
>> > > > > > - The process exits with a TerseException listing the failed
>> > > > > > partitions and the error message from the broker/controller (the
>> > > > > > same pattern as a single-shot execute when some alters fail).
>> > > > > >
>> > > > > > > TJ01: If there is a long-running operation, how can users know
>> > > > > > > it is still running rather than hung?
>> > > > > > - Controller / cluster side: ongoing reassignments and
>> > > > > > replication (metrics, kafka-reassign-partitions --list, Admin /
>> > > > > > JMX).
>> > > > > > - --verify in another terminal shows progress toward the target.
>> > > > > > Batch wait is mostly quiet; incremental is a bit chattier; true
>> > > > > > progress is best observed from cluster state or --verify, not
>> > > > > > only from stdout during the wait loop.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Manan Gupta
>> > > > > >
>> > > > > > On Mon, May 25, 2026 at 6:06 PM TaiJu Wu <[email protected]> wrote:
>> > > > > >
>> > > > > > > Hi Manan,
>> > > > > > >
>> > > > > > > Thanks for the KIP. Just a few questions.
>> > > > > > >
>> > > > > > > TJ00: If we have multiple batch requests, how do you handle
>> > > > > > > single batch failure?
>> > > > > > >
>> > > > > > > TJ01: If there is a long-running operation, how can users know
>> > > > > > > it is still running rather than hung?
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > > TaiJuWu
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > On Mon, May 18, 2026 at 6:09 PM Manan Gupta <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Hey Kamal
>> > > > > > > >
>> > > > > > > > Thank you for your comments.
>> > > > > > > >
>> > > > > > > > > Should we have a configurable list poll interval?
>> > > > > > > > The current fixed interval of 500ms should not degrade the
>> > > > > > > > controller, but I agree that operators should have an option
>> > > > > > > > to change this value. I have updated the KIP to add another
>> > > > > > > > parameter, reassignment-poll-interval-ms, to override the
>> > > > > > > > default value of 500 ms.
>> > > > > > > >
>> > > > > > > > > Shall we extend the batching logic to also
>> > > > > > > > > kafka-leader-election script?
>> > > > > > > > Good point, I will pick this up as a separate KIP as a
>> > > > > > > > follow-up to this KIP.
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Manan
>> > > > > > > >
>> > > > > > > > On Mon, May 18, 2026 at 2:52 PM Kamal Chandraprakash <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > > Hi Manan,
>> > > > > > > > >
>> > > > > > > > > Thanks for improving the user-facing tools! Overall LGTM.
>> > > > > > > > > A few questions:
>> > > > > > > > >
>> > > > > > > > > 1. Should we have a configurable list poll interval? With
>> > > > > > > > > 500ms, does it poll the controller often to list the
>> > > > > > > > > currently running reassignments for large partitions?
>> > > > > > > > > 2. Shall we extend the batching logic to also
>> > > > > > > > > kafka-leader-election script?
>> > > > > > > > > It will be useful when running with --all-topic-partitions.
>> > > > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > > Kamal
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Mon, May 11, 2026 at 8:55 AM Manan Gupta <[email protected]> wrote:
>> > > > > > > > >
>> > > > > > > > > > Hello
>> > > > > > > > > >
>> > > > > > > > > > Gentle reminder to review the KIP.
>> > > > > > > > > >
>> > > > > > > > > > Thanks,
>> > > > > > > > > > Manan
>> > > > > > > > > >
>> > > > > > > > > > On Wed, May 6, 2026 at 7:52 PM Manan Gupta <[email protected]> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi all,
>> > > > > > > > > > >
>> > > > > > > > > > > This email starts the discussion thread for *KIP-1335:
>> > > > > > > > > > > Bounded concurrency for partition reassignment via
>> > > > > > > > > > > kafka-reassign-partitions.sh*.
>> > > > > > > > > > > The proposal adds optional reassignment-batch-size and
>> > > > > > > > > > > incremental parameters to kafka-reassign-partitions.sh
>> > > > > > > > > > > so operators can cap how many partition reassignments
>> > > > > > > > > > > are submitted or kept in flight at once using the
>> > > > > > > > > > > existing Admin API.
>> > > > > > > > > > >
>> > > > > > > > > > > I would appreciate your initial thoughts and feedback
>> > > > > > > > > > > on the proposal.
>> > > > > > > > > > >
>> > > > > > > > > > > https://cwiki.apache.org/confluence/x/8ZAmGQ
>> > > > > > > > > > >
>> > > > > > > > > > > Thanks,
>> > > > > > > > > > > Manan
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
