Hey, gentle reminder on this. Regards, Manan Gupta
On Mon, Jun 8, 2026 at 2:38 PM Manan Gupta <[email protected]> wrote: > Hey Luke > > Am I able to answer your questions? LMK if you need any additional info. > > Regards, > Manan Gupta > > On Tue, Jun 2, 2026 at 11:37 AM Manan Gupta <[email protected]> wrote: > >> Hey Luke >> >> Thanks, that is a fair concern when the reassignment tool is embedded in >> something that assumes kafka-reassign-partitions.sh returns quickly (for >> example a short-lived script or a controller reconcile loop that blocks on >> one subprocess). >> A few clarifications on what is going on today: >> >> Where the “long run” lives >> The pacing loops run inside the tool process (or an in-process Admin if >> someone calls the command entry point from Java). They do not change the >> broker contract: each alterPartitionReassignments call is still bounded and >> already returns per-partition futures for acceptance of the reassignment. >> What is long-running is the optional wait between steps (non-incremental) >> or the pipeline driver (incremental), which repeatedly uses the normal read >> APIs (listPartitionReassignments, metadata/describe-style reads) until >> replicas match the target. That “wait for replication” work cannot be >> turned into a single future today; the cluster does not expose >> “reassignment fully complete” as one shot per partition on the alter result >> itself, so any implementation—tool or operator—must poll or re-check state >> unless it exits and delegates that to something else (as with separate >> --verify). >> >> Relationship to the classic execute vs verify split >> The non-blocking pattern you describe is already the legacy model: >> --execute submits and returns; --verify (or another process) observes >> progress. This KIP adds optional blocking in the tool on purpose so >> operators who want pacing do not have to hand-chunk JSON and orchestrate >> waves themselves. 
If a deployment must not hold a process open, they can >> still use --reassignment-batch-size 0 (legacy one-shot execute + verify), >> or external automation that submits smaller JSON files and sleeps between >> runs—same traffic shape, more moving parts for the operator. >> >> > Return futures so the admin client can check >> For the submit step, the client already gets futures per partition from >> alterPartitionReassignments. For completion, there is no single future to >> return that replaces polling; you would either keep polling inside the >> client library (same duration, different API shape) or push that >> responsibility to the caller. Refactoring the shell tool into a stateful >> “resume” CLI or a library API that streams progress events could be useful, >> but it is a larger follow-up (new UX, persistence, idempotency) rather than >> a small tweak to this KIP. >> >> Practical guidance for K8s / operators >> For controllers that cannot block, the intended pattern is to not wrap >> the blocking paced mode in the reconcile path: run reassignment as a Job, a >> sidecar, or use Admin directly with your own bounded reconcile loop and >> timeouts. Paced mode targets interactive or batch maintenance workflows >> where holding one client open is acceptable. >> Paced --execute only blocks inside the tool process; broker and Admin RPC >> semantics are unchanged. --verify already polls for completion, so a >> long-lived client for observation is not new—this KIP adds optional waits >> between submits so operators are not forced to hand-chunk JSON. >> >> If you want a non-blocking paced mode (e.g. “submit only this step and >> exit” with a marker file), that would be worth a separate discussion or KIP >> so we do not overload this one. >> >> Regards >> Manan Gupta >> >> On Tue, Jun 2, 2026 at 8:10 AM Luke Chen <[email protected]> wrote: >> >>> Hi Manan, >>> >>> LC3: Thanks for updating the KIP to make it clear. >>> >>> LC4: Thanks for the explanation. 
>>> But that makes me realize that the batch mode (incremental or >>> non-incremental) is a long-running admin client process. >>> If I remember correctly, in admin client, we try not to make each >>> operation a long-running process, so we can see there are operations that >>> return futures to the admin client, or like the "--execute" and >>> "--verify" >>> example in reassignment operations. >>> Making it a long-running operation will block other operations if it's >>> run >>> within a script or K8S operator. >>> Could we change that? >>> For example, we return a list of futures for each partition, and the >>> admin >>> client can check the future status to know if the specific partition has >>> submitted or not? >>> >>> Thanks, >>> Luke >>> >>> On Mon, Jun 1, 2026 at 6:18 PM Manan Gupta <[email protected]> wrote: >>> >>> > Hey Luke >>> > >>> > LC1: Sure, I have updated the KIP now with the example. >>> > >>> > > LC3: How does the batch mode know that all N partitions are >>> completed? >>> > >>> > Batch mode does poll. After each alterPartitionReassignments call for a >>> > step, the tool does not infer completion from that RPC alone—the alter >>> > returns when the controller has accepted the reassignment, not when >>> > replication has fully caught up. >>> > Between steps, the tool enters a wait loop: it uses the Admin client to >>> > read the cluster’s current reassignment and replica state for the >>> > partitions in that step, applies the same completion idea the >>> reassignment >>> > tool already uses for verification (partition no longer in an active >>> > reassignment and the live replica set matches the target in the JSON), >>> > sleeps for --reassignment-poll-interval-ms, and repeats until every >>> > partition in that step satisfies that condition. Only then does it >>> submit >>> > the next step. 
>>> > So “wait until complete” is implemented as repeated observation + >>> sleep, >>> > not a single blocking call that magically completes when replication >>> > finishes. The KIP text has been updated to spell this out so it is not >>> > mistaken for a passive wait with no polling. >>> > >>> > >>> > > LC4: What will it show when some partitions are still waiting to be >>> > progressed? >>> > >>> > We can separate two things: stdout from --execute, and --verify >>> (separate >>> > command). >>> > Non-incremental batch (--reassignment-batch-size without --incremental) >>> > The tool prints how many batches there will be, then for each step >>> lines >>> > such as "starting batch i of n" and "waiting for batch i to complete >>> before >>> > the next." That matches what we saw in testing, for example: >>> > >>> > ```Submitting partition reassignments in 6 batches of up to 2 >>> partitions >>> > each. >>> > Starting reassignment batch 1 of 6 (2 partitions)... >>> > Waiting for reassignment batch 1 of 6 to complete before starting the >>> next >>> > batch. >>> > then the same pattern for batch 2, and so on.``` >>> > >>> > During the “Waiting …” phase there is no per-partition line item for >>> “still >>> > copying” or for partitions not yet submitted in later batches; those >>> > partitions are simply not in flight until their batch starts. If >>> someone >>> > needs partition-level status during that time, they can run --verify in >>> > another terminal or use cluster metrics; --verify still only >>> distinguishes >>> > completed vs in progress for partitions that are part of the plan and >>> > reflectable in metadata / reassignment state, not “waiting in a future >>> > batch” as a distinct label. 
>>> > >>> > Incremental (--incremental) >>> > After the one-line mode banner, the tool emits a line each time a >>> partition >>> > finishes and the next is submitted, for example: >>> > >>> > ```Incremental mode: keeping up to 2 partition reassignments in flight >>> > until all have been submitted. >>> > Partition test-1-0 finished reassignment; submitting next from queue if >>> > any. >>> > (and similarly for test-1-1, test-10-1, test-10-0, …)``` >>> > So incremental mode already gives clearer liveness than batch-only >>> waits: >>> > you see completions as they happen, which helps distinguish “working” >>> from >>> > “stuck” better than the batch wait lines alone. >>> > >>> > >>> > > LC5: indefinite polling >>> > Today there is no maximum wait time on the batch-completion loops: the >>> tool >>> > keeps periodically re-reading cluster state until every partition in >>> the >>> > current step satisfies the completion condition, or the operator stops >>> the >>> > process. If reassignments are slow rather than stuck—which is common >>> when >>> > strict inter-broker or replica throttles are applied—the wait can >>> > legitimately take a long time; that is expected and not by itself a >>> sign of >>> > a hang. >>> > Because there is no built-in deadline yet, operators who need to stop >>> > should interrupt the tool and use the supported cancel path (--cancel >>> with >>> > an appropriate JSON) if they want to back out active reassignments, >>> then >>> > reassess throttles, plan size, or pacing. Adding a dedicated >>> reassignment >>> > wait timeout would be a follow-up: it needs clear semantics (what >>> happens >>> > on expiry, how that interacts with partial plans and the existing >>> --timeout >>> > flag used for log directory moves), which is why this KIP does not >>> > introduce that knob yet. 
>>> > >>> > >>> > > LC6: Default poll interval >>> > >>> > Agreed that a 500 ms default is aggressive from a controller-load >>> > perspective for clusters that already list reassignments often. The >>> > implementation default has been raised to 1000 ms (1 second) for both >>> the >>> > inter-step wait path and the incremental loop, and the KIP documents >>> that >>> > default accordingly. Operators who want less Admin traffic can set >>> > --reassignment-poll-interval-ms higher (for example 3–5 seconds); the >>> flag >>> > exists so that trade-off is explicit and tunable per environment. >>> > >>> > Regards, >>> > Manan Gupta >>> > >>> > On Mon, Jun 1, 2026 at 1:16 PM Luke Chen <[email protected]> wrote: >>> > >>> > > Hi Manan, >>> > > >>> > > LC1: Thanks for the explanation. It's clear to me now. >>> > > I think we should also put this example and the "How to choose" part >>> in >>> > the >>> > > KIP. >>> > > >>> > > Some more questions: >>> > > LC3. How does the batch mode know that all N partitions are >>> completed and >>> > > then start the next batch? >>> > > It looks like we don't poll the status when in batch mode. How do we >>> know >>> > > that? >>> > > >>> > > LC4. What will it show when some partitions are still waiting to be >>> > > progressed? >>> > > Currently, the --verify only shows "is completed" or "is still in >>> > > progress". >>> > > Should we have an output for the partitions that are sitting in the >>> batch >>> > > queue? >>> > > >>> > > LC5. As you've pointed out, there could be a possibility that it will >>> > poll >>> > > indefinitely. >>> > > Why can't we set a timer for it? >>> > > Any concerns about it? >>> > > >>> > > LC6. "reassignment-poll-interval-ms" default to 500ms is too >>> aggressive. >>> > > I think from users' perspective, any interval < 3 seconds or 5 >>> seconds is >>> > > considered acceptable. >>> > > So could we increase it to at least 1 second? 
>>> > > >>> > > Thank you, >>> > > Luke >>> > > >>> > > On Mon, Jun 1, 2026 at 3:50 PM Manan Gupta <[email protected]> >>> wrote: >>> > > >>> > > > Hey Luke >>> > > > Thank you for reviewing the proposal. >>> > > > >>> > > > LC1: >>> > > > Please excuse me if my explanation of the two different modes was >>> > > unclear. >>> > > > >>> > > > In non-incremental mode the tool walks the plan in steps. Each step >>> > > submits >>> > > > up to N partition reassignments, then waits until every partition >>> in >>> > that >>> > > > step has finished before it opens the next step. The slowest >>> partition >>> > in >>> > > > the current step holds up the entire next step. >>> > > > >>> > > > In incremental mode N is not “how big each step is.” It is how many >>> > > > partition reassignments from this plan may be active at the same >>> time. >>> > > The >>> > > > tool keeps refilling up to N: whenever any single partition >>> completes, >>> > it >>> > > > can start the next one from the queue. There is no rule that the >>> whole >>> > > > group of N must finish together before new work starts. >>> > > > >>> > > > Example: 10 partitions in sorted order P1 through P10, N equals 3. >>> > > > >>> > > > Non-incremental: Step one submits P1 P2 P3 and waits until all >>> three >>> > are >>> > > > done. Step two submits P4 P5 P6 and waits until all three are done. >>> > Step >>> > > > three submits P7 P8 P9 and waits until all three are done. Step >>> four >>> > > > submits P10 only. If P3 is slow, P4 cannot start until P3 finishes, >>> > even >>> > > if >>> > > > P1 and P2 are already done. >>> > > > >>> > > > Incremental: The tool first submits P1 P2 P3 so three reassignments >>> are >>> > > > active. If P2 finishes first, it can submit P4 while P1 and P3 are >>> > still >>> > > > running, still keeping three active when possible. 
It continues >>> that >>> > way >>> > > > until every partition in the plan has been submitted and the >>> in-flight >>> > > work >>> > > > drains according to the tool semantics. If P3 is slow, P4 can still >>> > start >>> > > > as soon as some other slot frees up. >>> > > > >>> > > > How to choose: use non-incremental if you want clear steps and a >>> strict >>> > > > “this whole batch finished before the next batch starts” story. Use >>> > > > incremental if you want steadier utilization when finish times >>> differ >>> > and >>> > > > you do not want one slow partition to block starting unrelated >>> > partitions >>> > > > beyond the cap of N at once. >>> > > > >>> > > > LC2: >>> > > > Both these values are the same, I have updated the KIP to reflect >>> that >>> > > now. >>> > > > >>> > > > Regards >>> > > > Manan Gupta >>> > > > >>> > > > >>> > > > On Mon, Jun 1, 2026 at 9:52 AM Luke Chen <[email protected]> >>> wrote: >>> > > > >>> > > > > Hi Manan, >>> > > > > >>> > > > > Thanks for the KIP. >>> > > > > This is a good improvement. >>> > > > > >>> > > > > Questions: >>> > > > > 1. After reading the KIP, I still don't understand the difference >>> > > between >>> > > > > "incremental mode" and "non-incremental mode". >>> > > > > From what I can see is that they both run with >>> > reassignment-batch-size >>> > > > once >>> > > > > time. >>> > > > > What's the difference between them? >>> > > > > Could you explain more? >>> > > > > Maybe some examples would be helpful to help users know the >>> > difference >>> > > > and >>> > > > > how they choose them. >>> > > > > >>> > > > > >>> > > > > 2. I see there are "INCREMENTAL_REASSIGNMENT_POLL_INTERVAL_MS" >>> and >>> > > > > "reassignment-poll-interval-ms". >>> > > > > What's the difference between them? 
>>> > > > > >>> > > > > >>> > > > > Thank you, >>> > > > > Luke >>> > > > > >>> > > > > >>> > > > > On Mon, May 25, 2026 at 11:06 PM Manan Gupta < >>> [email protected]> >>> > > > wrote: >>> > > > > >>> > > > > > Hey TaiJuWu >>> > > > > > >>> > > > > > Thank you for reviewing the KIP, my response is inline. >>> > > > > > >>> > > > > > > TJ00: If we have multiple batch requests, how do you handle >>> > single >>> > > > > batch >>> > > > > > failure? >>> > > > > > - If a submit step fails, the tool returns immediately with >>> errors >>> > > and >>> > > > > does >>> > > > > > not enqueue the rest; partitions already submitted stay under >>> the >>> > > > > > controller’s reassignment as they do today. >>> > > > > > - The process exits with a TerseException listing the failed >>> > > partitions >>> > > > > and >>> > > > > > the error message from the broker/controller (the same pattern >>> as a >>> > > > > > single-shot execute when some alters fail). >>> > > > > > >>> > > > > > > TJ01: If there is a long time operation, how can the users >>> know >>> > it >>> > > > > still >>> > > > > > running instead of hang? >>> > > > > > - Controller / cluster side: ongoing reassignments and >>> replication >>> > > > > > (metrics, kafka-reassign-partitions --list, Admin / JMX). >>> > > > > > - --verify in another terminal shows progress toward the target. >>> > > > > > Batch wait is mostly quiet; incremental is a bit chattier; true >>> > > > progress >>> > > > > is >>> > > > > > best observed from cluster state or --verify, not only from >>> stdout >>> > > > during >>> > > > > > the wait loop. >>> > > > > > >>> > > > > > Thanks, >>> > > > > > Manan Gupta >>> > > > > > >>> > > > > > On Mon, May 25, 2026 at 6:06 PM TaiJu Wu <[email protected]> >>> > wrote: >>> > > > > > >>> > > > > > > Hi Manan, >>> > > > > > > >>> > > > > > > Thanks for the KIP, I just have a few questions. 
>>> > > > > > > >>> > > > > > > TJ00: If we have multiple batch requests, how do you handle >>> > single >>> > > > > batch >>> > > > > > > failure? >>> > > > > > > >>> > > > > > > TJ01: If there is a long time operation, how can the users >>> know >>> > it >>> > > > > still >>> > > > > > > running instead of hang? >>> > > > > > > >>> > > > > > > Thanks, >>> > > > > > > TaiJuWu >>> > > > > > > >>> > > > > > > >>> > > > > > > >>> > > > > > > Manan Gupta <[email protected]> 於 2026年5月18日週一 下午6:09寫道: >>> > > > > > > >>> > > > > > > > Hey Kamal >>> > > > > > > > >>> > > > > > > > Thank you for your comments. >>> > > > > > > > >>> > > > > > > > > Should we have a configurable list poll interval? >>> > > > > > > > The current fixed interval of 500ms should not degrade the >>> > > > controller >>> > > > > > > but I >>> > > > > > > > agree that operators should have an option to change this >>> > value, >>> > > > > > updated >>> > > > > > > > the KIP to also take another parameter >>> > > > reassignment-poll-interval-ms >>> > > > > to >>> > > > > > > > update the default value from 500 ms. >>> > > > > > > > >>> > > > > > > > > Shall we extend the batching logic to also >>> > > kafka-leader-election >>> > > > > > > script? >>> > > > > > > > Good point, I will pick this up as a separate KIP as a >>> followup >>> > > to >>> > > > > this >>> > > > > > > > KIP. >>> > > > > > > > >>> > > > > > > > Thanks, >>> > > > > > > > Manan >>> > > > > > > > >>> > > > > > > > On Mon, May 18, 2026 at 2:52 PM Kamal Chandraprakash < >>> > > > > > > > [email protected]> wrote: >>> > > > > > > > >>> > > > > > > > > Hi Manan, >>> > > > > > > > > >>> > > > > > > > > Thanks for improving the user-facing tools! Overall >>> LGTM. Few >>> > > > > > > questions: >>> > > > > > > > > >>> > > > > > > > > 1. Should we have a configurable list poll interval? 
With >>> > > 500ms, >>> > > > > does >>> > > > > > > it >>> > > > > > > > > poll the controller often to list the currently running >>> > > > > reassignments >>> > > > > > > for >>> > > > > > > > > large partitions? >>> > > > > > > > > 2. Shall we extend the batching logic to also >>> > > > kafka-leader-election >>> > > > > > > > script? >>> > > > > > > > > It will be useful when running with >>> --all-topic-partitions. >>> > > > > > > > > >>> > > > > > > > > Thanks, >>> > > > > > > > > Kamal >>> > > > > > > > > >>> > > > > > > > > >>> > > > > > > > > On Mon, May 11, 2026 at 8:55 AM Manan Gupta < >>> > > > [email protected]> >>> > > > > > > > wrote: >>> > > > > > > > > >>> > > > > > > > > > Hello >>> > > > > > > > > > >>> > > > > > > > > > Gentle reminder to review the KIP. >>> > > > > > > > > > >>> > > > > > > > > > Thanks, >>> > > > > > > > > > Manan >>> > > > > > > > > > >>> > > > > > > > > > On Wed, May 6, 2026 at 7:52 PM Manan Gupta < >>> > > > [email protected] >>> > > > > > >>> > > > > > > > wrote: >>> > > > > > > > > > >>> > > > > > > > > > > Hi all, >>> > > > > > > > > > > >>> > > > > > > > > > > This email starts the discussion thread for >>> *KIP-1335: >>> > > > Bounded >>> > > > > > > > > > > concurrency for partition reassignment via >>> > > > > > > > > kafka-reassign-partitions.sh*. >>> > > > > > > > > > > The proposal adds optional reassignment-batch-size >>> and >>> > > > > > incremental >>> > > > > > > > > > > parameters to kafka-reassign-partitions.sh so >>> operators >>> > can >>> > > > cap >>> > > > > > how >>> > > > > > > > > many >>> > > > > > > > > > > partition reassignments are submitted or kept in >>> flight >>> > at >>> > > > once >>> > > > > > > using >>> > > > > > > > > > > the existing Admin API. >>> > > > > > > > > > > >>> > > > > > > > > > > I would appreciate your initial thoughts and feedback >>> on >>> > the >>> > > > > > > proposal. 
>>> > > > > > > > > > > >>> > > > > > > > > > > https://cwiki.apache.org/confluence/x/8ZAmGQ >>> > > > > > > > > > > >>> > > > > > > > > > > Thanks, >>> > > > > > > > > > > Manan >>> > > > > > > > > > > >>> > > > > > > > > > >>> > > > > > > > > >>> > > > > > > > >>> > > > > > > >>> > > > > > >>> > > > > >>> > > > >>> > > >>> > >>> >>
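For readers skimming the thread, the LC1 example (P1 through P10, N equals 3, with P3 slow) can be replayed as a small simulation. This is only an illustrative sketch, not code from the KIP or the tool: the function names `batch_schedule` and `incremental_schedule` and the per-partition durations are hypothetical, and real completion is detected by polling cluster state rather than by known durations.

```python
import heapq


def batch_schedule(durations, n):
    """Non-incremental pacing: submit up to n partitions per step and wait
    for the whole step before opening the next one (hypothetical model)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    names = list(durations)
    starts, clock = {}, 0
    for i in range(0, len(names), n):
        step = names[i:i + n]
        for p in step:
            starts[p] = clock
        # The slowest partition in the step gates the next step.
        clock += max(durations[p] for p in step)
    return starts, clock


def incremental_schedule(durations, n):
    """Incremental pacing: keep up to n partitions in flight and refill a
    slot as soon as any single partition finishes (hypothetical model)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    starts, clock = {}, 0
    in_flight = []  # min-heap of (finish_time, partition)
    for p in durations:
        if len(in_flight) == n:
            # Wait for the earliest in-flight partition to finish.
            clock, _ = heapq.heappop(in_flight)
        starts[p] = clock
        heapq.heappush(in_flight, (clock + durations[p], p))
    total = max((t for t, _ in in_flight), default=clock)
    return starts, total


if __name__ == "__main__":
    # Assumed durations in arbitrary time units: every partition takes 1,
    # except P3, which takes 5.
    durations = {f"P{i}": 1 for i in range(1, 11)}
    durations["P3"] = 5

    b_starts, b_total = batch_schedule(durations, 3)
    i_starts, i_total = incremental_schedule(durations, 3)
    print("batch:       P4 starts at", b_starts["P4"], "total", b_total)
    print("incremental: P4 starts at", i_starts["P4"], "total", i_total)
```

With these assumed durations, the batch model holds P4 until P3 finishes (start at 5, plan done at 8), while the incremental model starts P4 as soon as P1 frees a slot (start at 1, plan done at 5). That matches the "How to choose" guidance in the thread: batch mode gives strict step boundaries, incremental mode gives steadier utilization when finish times differ.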
