Re: [DISCUSS] KIP-1335: Bounded concurrency for partition reassignment via kafka-reassign-partitions.sh

Manan Gupta Fri, 26 Jun 2026 01:52:09 -0700

Hey
Gentle reminder on this.

Regards
Manan Gupta


On Mon, Jun 8, 2026 at 2:38 PM Manan Gupta <[email protected]> wrote:

> Hey Luke
>
> Was I able to answer your questions? Let me know if you need any additional info.
>
> Regards,
> Manan Gupta
>
> On Tue, Jun 2, 2026 at 11:37 AM Manan Gupta <[email protected]> wrote:
>
>> Hey Luke
>>
>> Thanks, that is a fair concern when the reassignment tool is embedded in
>> something that assumes kafka-reassign-partitions.sh returns quickly (for
>> example a short-lived script or a controller reconcile loop that blocks on
>> one subprocess).
>> A few clarifications on what is going on today:
>>
>> Where the “long run” lives
>> The pacing loops run inside the tool process (or an in-process Admin if
>> someone calls the command entry point from Java). They do not change the
>> broker contract: each alterPartitionReassignments call is still bounded and
>> already returns per-partition futures for acceptance of the reassignment.
>> What is long-running is the optional wait between steps (non-incremental)
>> or the pipeline driver (incremental), which repeatedly uses the normal read
>> APIs (listPartitionReassignments, metadata/describe-style reads) until
>> replicas match the target. That “wait for replication” work cannot be
>> turned into a single future today; the cluster does not expose
>> “reassignment fully complete” as one shot per partition on the alter result
>> itself, so any implementation—tool or operator—must poll or re-check state
>> unless it exits and delegates that to something else (as with separate
>> --verify).
>>
>> Relationship to the classic execute vs verify split
>> The non-blocking pattern you describe is already the legacy model:
>> --execute submits and returns; --verify (or another process) observes
>> progress. This KIP adds optional blocking in the tool on purpose so
>> operators who want pacing do not have to hand-chunk JSON and orchestrate
>> waves themselves. If a deployment must not hold a process open, they can
>> still use --reassignment-batch-size 0 (legacy one-shot execute + verify),
>> or external automation that submits smaller JSON files and sleeps between
>> runs—same traffic shape, more moving parts for the operator.
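For the "external automation that submits smaller JSON files" alternative, hand-chunking a plan could look roughly like the sketch below. It assumes the standard reassignment JSON layout (`version` plus a `partitions` list); the function name `chunk_plan` is made up for illustration and is not part of the tool.

```python
import json

def chunk_plan(plan, batch_size):
    """Split one reassignment plan into smaller plans of at most batch_size partitions."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    parts = plan["partitions"]
    return [
        {"version": plan.get("version", 1), "partitions": parts[i:i + batch_size]}
        for i in range(0, len(parts), batch_size)
    ]

# Illustrative plan: five partitions of a hypothetical topic "test-1".
plan = {
    "version": 1,
    "partitions": [
        {"topic": "test-1", "partition": p, "replicas": [1, 2, 3]} for p in range(5)
    ],
}

for i, chunk in enumerate(chunk_plan(plan, 2), start=1):
    # Each chunk could be written to its own file and passed to a separate --execute run.
    print(f"chunk {i}: {json.dumps(chunk)}")
```

The operator would still have to wait for each chunk to finish (for example with --verify) before submitting the next one, which is the "more moving parts" cost mentioned above.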
>>
>> > Return futures so the admin client can check
>> For the submit step, the client already gets futures per partition from
>> alterPartitionReassignments. For completion, there is no single future to
>> return that replaces polling; you would either keep polling inside the
>> client library (same duration, different API shape) or push that
>> responsibility to the caller. Refactoring the shell tool into a stateful
>> “resume” CLI or a library API that streams progress events could be useful,
>> but it is a larger follow-up (new UX, persistence, idempotency) rather than
>> a small tweak to this KIP.
>>
>> Practical guidance for K8s / operators
>> For controllers that cannot block, the intended pattern is to not wrap
>> the blocking paced mode in the reconcile path: run reassignment as a Job, a
>> sidecar, or use Admin directly with your own bounded reconcile loop and
>> timeouts. Paced mode targets interactive or batch maintenance workflows
>> where holding one client open is acceptable.
>> Paced --execute only blocks inside the tool process; broker and Admin RPC
>> semantics are unchanged. --verify already polls for completion, so a
>> long-lived client for observation is not new—this KIP adds optional waits
>> between submits so operators are not forced to hand-chunk JSON.
>>
>> If you want a non-blocking paced mode (e.g. “submit only this step and
>> exit” with a marker file), that would be worth a separate discussion or KIP
>> so we do not overload this one.
>>
>> Regards
>> Manan Gupta
>>
>> On Tue, Jun 2, 2026 at 8:10 AM Luke Chen <[email protected]> wrote:
>>
>>> Hi Manan,
>>>
>>> LC3: Thanks for updating the KIP to make it clear.
>>>
>>> LC4: Thanks for the explanation.
>>> But that makes me realize that the batch mode (incremental or
>>> non-incremental) is a long-running admin client process.
>>> If I remember correctly, in admin client, we try not to make each
>>> operation a long-running process, so we can see there are operations that
>>> return futures to the admin client, or like the "--execute" and
>>> "--verify"
>>> example in reassignment operations.
>>> Making it a long-running operation will block other operations if it's
>>> run
>>> within a script or K8S operator.
>>> Could we change that?
>>> For example, we return a list of futures for each partition, and the
>>> admin
>>> client can check the future status to know if the specific partition has
>>> submitted or not?
>>>
>>> Thanks,
>>> Luke
>>>
>>> On Mon, Jun 1, 2026 at 6:18 PM Manan Gupta <[email protected]> wrote:
>>>
>>> > Hey Luke
>>> >
>>> > LC1: Sure, I have updated the KIP now with the example.
>>> >
>>> > > LC3: How does the batch mode know that all N partitions are
>>> completed?
>>> >
>>> > Batch mode does poll. After each alterPartitionReassignments call for a
>>> > step, the tool does not infer completion from that RPC alone—the alter
>>> > returns when the controller has accepted the reassignment, not when
>>> > replication has fully caught up.
>>> > Between steps, the tool enters a wait loop: it uses the Admin client to
>>> > read the cluster’s current reassignment and replica state for the
>>> > partitions in that step, applies the same completion idea the
>>> reassignment
>>> > tool already uses for verification (partition no longer in an active
>>> > reassignment and the live replica set matches the target in the JSON),
>>> > sleeps for --reassignment-poll-interval-ms, and repeats until every
>>> > partition in that step satisfies that condition. Only then does it
>>> submit
>>> > the next step.
>>> > So “wait until complete” is implemented as repeated observation +
>>> sleep,
>>> > not a single blocking call that magically completes when replication
>>> > finishes. The KIP text has been updated to spell this out so it is not
>>> > mistaken for a passive wait with no polling.
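The wait loop described above can be sketched in a few lines of Python. This is only an illustration of "repeated observation + sleep", not the tool's code: `read_state` stands in for whatever Admin reads the tool performs, the state dictionary shape is invented for the example, and comparing replicas as ordered lists is an assumption of this sketch.

```python
import time

def is_complete(state, target_replicas):
    """Done when the partition is no longer reassigning and replicas match the target.

    Assumption of this sketch: replicas are compared as ordered lists.
    """
    return (not state["reassigning"]) and list(state["replicas"]) == list(target_replicas)

def wait_for_step(step, read_state, poll_interval_s=1.0, sleep=time.sleep):
    """Poll until every partition in `step` is complete; return the number of polls.

    `step` maps a partition key to its target replica list.
    `read_state` is a hypothetical callable returning the current state of one partition.
    """
    pending = dict(step)
    polls = 0
    while True:
        polls += 1
        for tp in list(pending):
            if is_complete(read_state(tp), pending[tp]):
                del pending[tp]
        if not pending:
            return polls
        sleep(poll_interval_s)  # corresponds to --reassignment-poll-interval-ms
```

Only after `wait_for_step` returns would the next step be submitted. Note that the loop has no deadline, which matches the LC5 discussion further down.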
>>> >
>>> >
>>> > > LC4: What will it show when some partitions are still waiting to be
>>> > progressed?
>>> >
>>> > We can separate two things: stdout from --execute, and --verify
>>> (separate
>>> > command).
>>> > Non-incremental batch (--reassignment-batch-size without --incremental)
>>> > The tool prints how many batches there will be, then for each step
>>> lines
>>> > such as "starting batch i of n" and "waiting for batch i to complete
>>> before
>>> > the next." That matches what we saw in testing, for example:
>>> >
>>> > ```
>>> > Submitting partition reassignments in 6 batches of up to 2 partitions each.
>>> > Starting reassignment batch 1 of 6 (2 partitions)...
>>> > Waiting for reassignment batch 1 of 6 to complete before starting the next batch.
>>> > ```
>>> > then the same pattern for batch 2, and so on.
>>> >
>>> > During the “Waiting …” phase there is no per-partition line item for
>>> “still
>>> > copying” or for partitions not yet submitted in later batches; those
>>> > partitions are simply not in flight until their batch starts. If
>>> someone
>>> > needs partition-level status during that time, they can run --verify in
>>> > another terminal or use cluster metrics; --verify still only
>>> distinguishes
>>> > completed vs in progress for partitions that are part of the plan and
>>> > reflectable in metadata / reassignment state, not “waiting in a future
>>> > batch” as a distinct label.
>>> >
>>> > Incremental (--incremental)
>>> > After the one-line mode banner, the tool emits a line each time a
>>> partition
>>> > finishes and the next is submitted, for example:
>>> >
>>> > ```
>>> > Incremental mode: keeping up to 2 partition reassignments in flight until all have been submitted.
>>> > Partition test-1-0 finished reassignment; submitting next from queue if any.
>>> > ```
>>> > (and similarly for test-1-1, test-10-1, test-10-0, …)
>>> > So incremental mode already gives clearer liveness than batch-only
>>> waits:
>>> > you see completions as they happen, which helps distinguish “working”
>>> from
>>> > “stuck” better than the batch wait lines alone.
>>> >
>>> >
>>> > > LC5: indefinite polling
>>> > Today there is no maximum wait time on the batch-completion loops: the
>>> tool
>>> > keeps periodically re-reading cluster state until every partition in
>>> the
>>> > current step satisfies the completion condition, or the operator stops
>>> the
>>> > process. If reassignments are slow rather than stuck—which is common
>>> when
>>> > strict inter-broker or replica throttles are applied—the wait can
>>> > legitimately take a long time; that is expected and not by itself a
>>> sign of
>>> > a hang.
>>> > Because there is no built-in deadline yet, operators who need to stop
>>> > should interrupt the tool and use the supported cancel path (--cancel
>>> with
>>> > an appropriate JSON) if they want to back out active reassignments,
>>> then
>>> > reassess throttles, plan size, or pacing. Adding a dedicated
>>> reassignment
>>> > wait timeout would be a follow-up: it needs clear semantics (what
>>> happens
>>> > on expiry, how that interacts with partial plans and the existing
>>> --timeout
>>> > flag used for log directory moves), which is why this KIP does not
>>> > introduce that knob yet.
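Until such a knob exists, automation that drives its own polling can add a deadline on the caller side. The sketch below is one possible shape, under loud assumptions: `all_done` is a hypothetical callable that checks cluster state, and what to do on expiry (cancel, alert, retry) is deliberately left to the caller because the thread does not define it.

```python
import time

def wait_with_deadline(all_done, timeout_s, poll_interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Return True if all_done() became true before the deadline, else False.

    This only reports the outcome; it does not cancel any reassignment.
    """
    deadline = clock() + timeout_s
    while True:
        if all_done():
            return True
        if clock() >= deadline:
            return False
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, max(0.0, deadline - clock())))
```

A caller that gets `False` back could then decide whether to run --cancel or simply keep waiting with a fresh deadline.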
>>> >
>>> >
>>> > > LC6: Default poll interval
>>> >
>>> > Agreed that a 500 ms default is aggressive from a controller-load
>>> > perspective for clusters that already list reassignments often. The
>>> > implementation default has been raised to 1000 ms (1 second) for both
>>> the
>>> > inter-step wait path and the incremental loop, and the KIP documents
>>> that
>>> > default accordingly. Operators who want less Admin traffic can set
>>> > --reassignment-poll-interval-ms higher (for example 3–5 seconds); the
>>> flag
>>> > exists so that trade-off is explicit and tunable per environment.
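As a rough way to compare intervals, the number of polling rounds per hour of waiting is simply the hour divided by the interval. The helper below is illustrative only; it ignores RPC latency and assumes one round of reads per interval.

```python
def polls_per_hour(poll_interval_ms):
    """Polling rounds in one hour of waiting, ignoring RPC latency."""
    if poll_interval_ms <= 0:
        raise ValueError("poll_interval_ms must be positive")
    return 3_600_000 // poll_interval_ms

for ms in (500, 1000, 3000, 5000):
    print(f"{ms} ms -> {polls_per_hour(ms)} polling rounds per hour")
```

Moving from 500 ms to 1000 ms halves the rounds (7200 to 3600 per hour), and 5000 ms brings it down to 720.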
>>> >
>>> > Regards,
>>> > Manan Gupta
>>> >
>>> > On Mon, Jun 1, 2026 at 1:16 PM Luke Chen <[email protected]> wrote:
>>> >
>>> > > Hi Manan,
>>> > >
>>> > > LC1: Thanks for the explanation. It's clear to me now.
>>> > > I think we should also put this example and the "How to choose" part
>>> in
>>> > the
>>> > > KIP.
>>> > >
>>> > > Some more questions:
>>> > > LC3. How does the batch mode know that all N partitions are
>>> completed and
>>> > > then start the next batch?
>>> > > It looks like we don't poll the status when in batch mode. How do we
>>> know
>>> > > that?
>>> > >
>>> > > LC4. What will it show when some partitions are still waiting to be
>>> > > progressed?
>>> > > Currently, the --verify only shows "is completed" or "is still in
>>> > > progress".
>>> > > Should we have an output for the partitions that are sitting in the
>>> batch
>>> > > queue?
>>> > >
>>> > > LC5. As you've pointed out, there could be a possibility that it will
>>> > poll
>>> > > indefinitely.
>>> > > Why can't we set a timer for it?
>>> > > Any concerns about it?
>>> > >
>>> > > LC6. "reassignment-poll-interval-ms" default to 500ms is too
>>> aggressive.
>>> > > I think from users' perspective, any interval < 3 seconds or 5
>>> seconds is
>>> > > considered acceptable.
>>> > > So could we increase it to at least 1 second?
>>> > >
>>> > > Thank you,
>>> > > Luke
>>> > >
>>> > > On Mon, Jun 1, 2026 at 3:50 PM Manan Gupta <[email protected]>
>>> wrote:
>>> > >
>>> > > > Hey Luke
>>> > > > Thank you for reviewing the proposal.
>>> > > >
>>> > > > LC1:
>>> > > > Please excuse me if my explanation of the two different modes was
>>> > > unclear.
>>> > > >
>>> > > > In non-incremental mode the tool walks the plan in steps. Each step
>>> > > submits
>>> > > > up to N partition reassignments, then waits until every partition
>>> in
>>> > that
>>> > > > step has finished before it opens the next step. The slowest
>>> partition
>>> > in
>>> > > > the current step holds up the entire next step.
>>> > > >
>>> > > > In incremental mode N is not “how big each step is.” It is how many
>>> > > > partition reassignments from this plan may be active at the same
>>> time.
>>> > > The
>>> > > > tool keeps refilling up to N: whenever any single partition
>>> completes,
>>> > it
>>> > > > can start the next one from the queue. There is no rule that the
>>> whole
>>> > > > group of N must finish together before new work starts.
>>> > > >
>>> > > > Example: 10 partitions in sorted order P1 through P10, N equals 3.
>>> > > >
>>> > > > Non-incremental: Step one submits P1 P2 P3 and waits until all
>>> three
>>> > are
>>> > > > done. Step two submits P4 P5 P6 and waits until all three are done.
>>> > Step
>>> > > > three submits P7 P8 P9 and waits until all three are done. Step
>>> four
>>> > > > submits P10 only. If P3 is slow, P4 cannot start until P3 finishes,
>>> > even
>>> > > if
>>> > > > P1 and P2 are already done.
>>> > > >
>>> > > > Incremental: The tool first submits P1 P2 P3 so three reassignments
>>> are
>>> > > > active. If P2 finishes first, it can submit P4 while P1 and P3 are
>>> > still
>>> > > > running, still keeping three active when possible. It continues
>>> that
>>> > way
>>> > > > until every partition in the plan has been submitted and the
>>> in-flight
>>> > > work
>>> > > > drains according to the tool semantics. If P3 is slow, P4 can still
>>> > start
>>> > > > as soon as some other slot frees up.
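The P1 through P10 example can be replayed with a small Python simulation. This is a toy model, not the tool's scheduler: durations are invented (every partition takes 1 time unit except a slow P3 at 10), and partitions are assumed to start the instant a slot is available.

```python
import heapq

def non_incremental_finish(durations, n):
    """Finish time per partition when the plan runs in steps of n partitions."""
    finish, clock = {}, 0
    names = list(durations)
    for i in range(0, len(names), n):
        step = names[i:i + n]
        for p in step:
            finish[p] = clock + durations[p]
        clock = max(finish[p] for p in step)  # slowest partition gates the next step
    return finish

def incremental_finish(durations, n):
    """Finish time per partition when up to n reassignments are kept in flight."""
    finish, clock = {}, 0
    in_flight = []  # min-heap of (finish_time, partition)
    for p in durations:
        if len(in_flight) == n:
            clock, _ = heapq.heappop(in_flight)  # earliest completion frees a slot
        finish[p] = clock + durations[p]
        heapq.heappush(in_flight, (finish[p], p))
    return finish

durations = {f"P{i}": 1 for i in range(1, 11)}
durations["P3"] = 10  # the slow partition

batch = non_incremental_finish(durations, 3)
incr = incremental_finish(durations, 3)
print("non-incremental: P4 finishes at", batch["P4"], "plan ends at", max(batch.values()))
print("incremental:     P4 finishes at", incr["P4"], "plan ends at", max(incr.values()))
```

In this toy run, non-incremental mode holds P4 back until P3 is done at time 10, while incremental mode starts P4 as soon as P1 frees a slot at time 1.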
>>> > > >
>>> > > > How to choose: use non-incremental if you want clear steps and a
>>> strict
>>> > > > “this whole batch finished before the next batch starts” story. Use
>>> > > > incremental if you want steadier utilization when finish times
>>> differ
>>> > and
>>> > > > you do not want one slow partition to block starting unrelated
>>> > partitions
>>> > > > beyond the cap of N at once.
>>> > > >
>>> > > > LC2:
>>> > > > Both these values are the same, I have updated the KIP to reflect
>>> that
>>> > > now.
>>> > > >
>>> > > > Regards
>>> > > > Manan Gupta
>>> > > >
>>> > > >
>>> > > > On Mon, Jun 1, 2026 at 9:52 AM Luke Chen <[email protected]>
>>> wrote:
>>> > > >
>>> > > > > Hi Manan,
>>> > > > >
>>> > > > > Thanks for the KIP.
>>> > > > > This is a good improvement.
>>> > > > >
>>> > > > > Questions:
>>> > > > > 1. After reading the KIP, I still don't understand the difference
>>> > > between
>>> > > > > "incremental mode" and "non-incremental mode".
>>> > > > > From what I can see, they both run with reassignment-batch-size
>>> > > > > at a time.
>>> > > > > What's the difference between them?
>>> > > > > Could you explain more?
>>> > > > > Maybe some examples would be helpful to help users know the
>>> > difference
>>> > > > and
>>> > > > > how they choose them.
>>> > > > >
>>> > > > >
>>> > > > > 2. I see there are "INCREMENTAL_REASSIGNMENT_POLL_INTERVAL_MS"
>>> and
>>> > > > > "reassignment-poll-interval-ms".
>>> > > > > What's the difference between them?
>>> > > > >
>>> > > > >
>>> > > > > Thank you,
>>> > > > > Luke
>>> > > > >
>>> > > > >
>>> > > > > On Mon, May 25, 2026 at 11:06 PM Manan Gupta <
>>> [email protected]>
>>> > > > wrote:
>>> > > > >
>>> > > > > > Hey TaiJuWu
>>> > > > > >
>>> > > > > > Thank you for reviewing the KIP, my response is inline.
>>> > > > > >
>>> > > > > > > TJ00: If we have multiple batch requests, how do you handle a
>>> > > > > > > single batch failure?
>>> > > > > > - If a submit step fails, the tool returns immediately with
>>> errors
>>> > > and
>>> > > > > does
>>> > > > > > not enqueue the rest; partitions already submitted stay under
>>> the
>>> > > > > > controller’s reassignment as they do today.
>>> > > > > > - The process exits with a TerseException listing the failed
>>> > > partitions
>>> > > > > and
>>> > > > > > the error message from the broker/controller (the same pattern
>>> as a
>>> > > > > > single-shot execute when some alters fail).
>>> > > > > >
>>> > > > > > > TJ01: If an operation runs for a long time, how can users tell that
>>> > > > > > > it is still running rather than hung?
>>> > > > > > - Controller / cluster side: ongoing reassignments and
>>> replication
>>> > > > > > (metrics, kafka-reassign-partitions --list, Admin / JMX).
>>> > > > > > - --verify in another terminal shows progress toward the target.
>>> > > > > > Batch wait is mostly quiet; incremental is a bit chattier; true
>>> > > > progress
>>> > > > > is
>>> > > > > > best observed from cluster state or --verify, not only from
>>> stdout
>>> > > > during
>>> > > > > > the wait loop.
>>> > > > > >
>>> > > > > > Thanks,
>>> > > > > > Manan Gupta
>>> > > > > >
>>> > > > > > On Mon, May 25, 2026 at 6:06 PM TaiJu Wu <[email protected]>
>>> > wrote:
>>> > > > > >
>>> > > > > > > Hi Manan,
>>> > > > > > >
>>> > > > > > > Thanks for the KIP. I just have a couple of questions.
>>> > > > > > >
>>> > > > > > > TJ00: If we have multiple batch requests, how do you handle a
>>> > > > > > > single batch failure?
>>> > > > > > >
>>> > > > > > > TJ01: If an operation runs for a long time, how can users tell that
>>> > > > > > > it is still running rather than hung?
>>> > > > > > >
>>> > > > > > > Thanks,
>>> > > > > > > TaiJuWu
>>> > > > > > >
>>> > > > > > >
>>> > > > > > >
>>> > > > > > > On Mon, May 18, 2026 at 6:09 PM Manan Gupta <[email protected]> wrote:
>>> > > > > > >
>>> > > > > > > > Hey Kamal
>>> > > > > > > >
>>> > > > > > > > Thank you for your comments.
>>> > > > > > > >
>>> > > > > > > > > Should we have a configurable list poll interval?
>>> > > > > > > > The current fixed interval of 500 ms should not degrade the
>>> > > > > > > > controller, but I agree that operators should be able to change
>>> > > > > > > > this value. I have updated the KIP to add another parameter,
>>> > > > > > > > reassignment-poll-interval-ms, that overrides the 500 ms default.
>>> > > > > > > >
>>> > > > > > > > > Shall we extend the batching logic to also
>>> > > kafka-leader-election
>>> > > > > > > script?
>>> > > > > > > > Good point, I will pick this up as a separate KIP as a
>>> followup
>>> > > to
>>> > > > > this
>>> > > > > > > > KIP.
>>> > > > > > > >
>>> > > > > > > > Thanks,
>>> > > > > > > > Manan
>>> > > > > > > >
>>> > > > > > > > On Mon, May 18, 2026 at 2:52 PM Kamal Chandraprakash <
>>> > > > > > > > [email protected]> wrote:
>>> > > > > > > >
>>> > > > > > > > > Hi Manan,
>>> > > > > > > > >
>>> > > > > > > > > Thanks for improving the user-facing tools! Overall
>>> LGTM. Few
>>> > > > > > > questions:
>>> > > > > > > > >
>>> > > > > > > > > 1. Should we have a configurable list poll interval? With
>>> > > 500ms,
>>> > > > > does
>>> > > > > > > it
>>> > > > > > > > > poll the controller often to list the currently running
>>> > > > > reassignments
>>> > > > > > > for
>>> > > > > > > > > large partitions?
>>> > > > > > > > > 2. Shall we extend the batching logic to also
>>> > > > kafka-leader-election
>>> > > > > > > > script?
>>> > > > > > > > > It will be useful when running with
>>> --all-topic-partitions.
>>> > > > > > > > >
>>> > > > > > > > > Thanks,
>>> > > > > > > > > Kamal
>>> > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > > > On Mon, May 11, 2026 at 8:55 AM Manan Gupta <
>>> > > > [email protected]>
>>> > > > > > > > wrote:
>>> > > > > > > > >
>>> > > > > > > > > > Hello
>>> > > > > > > > > >
>>> > > > > > > > > > Gentle reminder to review the KIP.
>>> > > > > > > > > >
>>> > > > > > > > > > Thanks,
>>> > > > > > > > > > Manan
>>> > > > > > > > > >
>>> > > > > > > > > > On Wed, May 6, 2026 at 7:52 PM Manan Gupta <
>>> > > > [email protected]
>>> > > > > >
>>> > > > > > > > wrote:
>>> > > > > > > > > >
>>> > > > > > > > > > > Hi all,
>>> > > > > > > > > > >
>>> > > > > > > > > > > This email starts the discussion thread for
>>> *KIP-1335:
>>> > > > Bounded
>>> > > > > > > > > > > concurrency for partition reassignment via
>>> > > > > > > > > kafka-reassign-partitions.sh*.
>>> > > > > > > > > > > The proposal adds optional reassignment-batch-size
>>> and
>>> > > > > > incremental
>>> > > > > > > > > > > parameters to kafka-reassign-partitions.sh so
>>> operators
>>> > can
>>> > > > cap
>>> > > > > > how
>>> > > > > > > > > many
>>> > > > > > > > > > > partition reassignments are submitted or kept in
>>> flight
>>> > at
>>> > > > once
>>> > > > > > > using
>>> > > > > > > > > > > existing Admin API.
>>> > > > > > > > > > >
>>> > > > > > > > > > > I will appreciate your initial thoughts and feedback
>>> on
>>> > the
>>> > > > > > > proposal.
>>> > > > > > > > > > >
>>> > > > > > > > > > > https://cwiki.apache.org/confluence/x/8ZAmGQ
>>> > > > > > > > > > >
>>> > > > > > > > > > > Thanks,
>>> > > > > > > > > > > Manan
>>> > > > > > > > > > >
>>> > > > > > > > > >
>>> > > > > > > > >
>>> > > > > > > >
>>> > > > > > >
>>> > > > > >
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
