Hey Luke,

Was I able to answer your questions? Let me know if you need any additional info.

Regards,
Manan Gupta

On Tue, Jun 2, 2026 at 11:37 AM Manan Gupta <[email protected]> wrote:

> Hey Luke
>
> Thanks, that is a fair concern when the reassignment tool is embedded in something that assumes kafka-reassign-partitions.sh returns quickly (for example a short-lived script or a controller reconcile loop that blocks on one subprocess).
> A few clarifications on what is going on today:
>
> Where the “long run” lives
> The pacing loops run inside the tool process (or an in-process Admin if someone calls the command entry point from Java). They do not change the broker contract: each alterPartitionReassignments call is still bounded and already returns per-partition futures for acceptance of the reassignment.
> What is long-running is the optional wait between steps (non-incremental) or the pipeline driver (incremental), which repeatedly uses the normal read APIs (listPartitionReassignments, metadata/describe-style reads) until replicas match the target. That “wait for replication” work cannot be turned into a single future today; the cluster does not expose “reassignment fully complete” as one shot per partition on the alter result itself, so any implementation (tool or operator) must poll or re-check state unless it exits and delegates that to something else (as with separate --verify).
>
> Relationship to the classic execute vs verify split
> The non-blocking pattern you describe is already the legacy model: --execute submits and returns; --verify (or another process) observes progress. This KIP adds optional blocking in the tool on purpose so operators who want pacing do not have to hand-chunk JSON and orchestrate waves themselves. If a deployment must not hold a process open, they can still use --reassignment-batch-size 0 (legacy one-shot execute + verify), or external automation that submits smaller JSON files and sleeps between runs: same traffic shape, more moving parts for the operator.
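
For readers who want to picture the "external automation that submits smaller JSON files" alternative mentioned above, here is a minimal sketch. It assumes the standard reassignment JSON layout (a `version` field and a `partitions` list); the `chunk_plan` helper, the sample topic, and the replica lists are hypothetical and are not part of the KIP or the tool.

```python
import json


def chunk_plan(plan, size):
    """Split one reassignment plan into smaller plans of at most `size` partitions."""
    if size <= 0:
        raise ValueError("size must be positive")
    parts = plan["partitions"]
    return [
        {"version": plan.get("version", 1), "partitions": parts[i:i + size]}
        for i in range(0, len(parts), size)
    ]


# Illustrative plan: five partitions of a made-up topic.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "test-1", "partition": p, "replicas": [1, 2, 3]}
        for p in range(5)
    ],
}

chunks = chunk_plan(plan, 2)
for i, chunk in enumerate(chunks, 1):
    # Each chunk would be written to its own file and submitted with --execute,
    # with the operator's own wait (for example --verify) between submissions.
    print(f"chunk {i}: {json.dumps(chunk)}")
```

The chunking itself is trivial; the operator still has to sequence the submissions and decide when each wave is done, which is the "more moving parts" trade-off described above.
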
>
> Return futures so the admin client can check
> For the submit step, the client already gets futures per partition from alterPartitionReassignments. For completion, there is no single future to return that replaces polling; you would either keep polling inside the client library (same duration, different API shape) or push that responsibility to the caller. Refactoring the shell tool into a stateful “resume” CLI or a library API that streams progress events could be useful, but it is a larger follow-up (new UX, persistence, idempotency) rather than a small tweak to this KIP.
>
> Practical guidance for K8s / operators
> For controllers that cannot block, the intended pattern is to not wrap the blocking paced mode in the reconcile path: run reassignment as a Job, a sidecar, or use Admin directly with your own bounded reconcile loop and timeouts. Paced mode targets interactive or batch maintenance workflows where holding one client open is acceptable.
> Paced --execute only blocks inside the tool process; broker and Admin RPC semantics are unchanged. --verify already polls for completion, so a long-lived client for observation is not new; this KIP adds optional waits between submits so operators are not forced to hand-chunk JSON.
>
> If you want a non-blocking paced mode (e.g. “submit only this step and exit” with a marker file), that would be worth a separate discussion or KIP so we do not overload this one.
>
> Regards
> Manan Gupta
>
> On Tue, Jun 2, 2026 at 8:10 AM Luke Chen <[email protected]> wrote:
>
>> Hi Manan,
>>
>> LC3: Thanks for updating the KIP to make it clear.
>>
>> LC4: Thanks for the explanation.
>> But that makes me realize that the batch mode (incremental or non-incremental) is a long-running admin client process.
>> If I remember correctly, in the admin client we try not to make each operation a long-running process, which is why there are operations that return futures to the admin client, or the "--execute" and "--verify" split in reassignment operations.
>> Making it a long-running operation will block other operations if it's run within a script or K8S operator.
>> Could we change that?
>> For example, we return a list of futures for each partition, and the admin client can check the future status to know if the specific partition has been submitted or not?
>>
>> Thanks,
>> Luke
>>
>> On Mon, Jun 1, 2026 at 6:18 PM Manan Gupta <[email protected]> wrote:
>>
>> > Hey Luke
>> >
>> > LC1: Sure, I have updated the KIP now with the example.
>> >
>> > > LC3: How does the batch mode know that all N partitions are completed?
>> >
>> > Batch mode does poll. After each alterPartitionReassignments call for a step, the tool does not infer completion from that RPC alone: the alter returns when the controller has accepted the reassignment, not when replication has fully caught up.
>> > Between steps, the tool enters a wait loop: it uses the Admin client to read the cluster’s current reassignment and replica state for the partitions in that step, applies the same completion idea the reassignment tool already uses for verification (partition no longer in an active reassignment and the live replica set matches the target in the JSON), sleeps for --reassignment-poll-interval-ms, and repeats until every partition in that step satisfies that condition. Only then does it submit the next step.
>> > So “wait until complete” is implemented as repeated observation + sleep, not a single blocking call that magically completes when replication finishes. The KIP text has been updated to spell this out so it is not mistaken for a passive wait with no polling.
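
The "repeated observation + sleep" wait described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's code: `read_state` is a hypothetical stand-in for the Admin reads (active reassignments plus live replica sets), and the function names and sample partitions are made up for the example.

```python
import time


def is_complete(partition, target, active, live_replicas):
    """Completion idea from the thread: the partition is no longer in an active
    reassignment and its live replica set matches the target from the JSON."""
    return partition not in active and live_replicas.get(partition) == target[partition]


def wait_for_step(target, read_state, poll_interval_ms=1000, sleep=time.sleep):
    """Poll until every partition in `target` is complete; return the number of reads.

    `read_state` is a hypothetical callable returning (active, live_replicas).
    There is deliberately no deadline, mirroring the behaviour discussed under LC5.
    """
    reads = 0
    while True:
        active, live_replicas = read_state()
        reads += 1
        if all(is_complete(p, target, active, live_replicas) for p in target):
            return reads
        sleep(poll_interval_ms / 1000)


# Demo with canned snapshots instead of a real cluster; sleeping is stubbed out.
demo_target = {"test-1-0": [1, 2, 3], "test-1-1": [2, 3, 4]}
demo_snapshots = iter([
    ({"test-1-0", "test-1-1"}, {"test-1-0": [1, 2], "test-1-1": [2, 3]}),
    ({"test-1-1"}, {"test-1-0": [1, 2, 3], "test-1-1": [2, 3]}),
    (set(), {"test-1-0": [1, 2, 3], "test-1-1": [2, 3, 4]}),
])
demo_reads = wait_for_step(demo_target, lambda: next(demo_snapshots), sleep=lambda s: None)
print(f"step complete after {demo_reads} reads")
```

Passing `sleep` as a parameter is only a convenience for the sketch so the loop can be exercised without real delays.
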
>> >
>> > > LC4: What will it show when some partitions are still waiting to be progressed?
>> >
>> > We can separate two things: stdout from --execute, and --verify (separate command).
>> >
>> > Non-incremental batch (--reassignment-batch-size without --incremental)
>> > The tool prints how many batches there will be, then for each step lines such as "starting batch i of n" and "waiting for batch i to complete before the next." That matches what we saw in testing, for example:
>> >
>> > ```
>> > Submitting partition reassignments in 6 batches of up to 2 partitions each.
>> > Starting reassignment batch 1 of 6 (2 partitions)...
>> > Waiting for reassignment batch 1 of 6 to complete before starting the next batch.
>> > then the same pattern for batch 2, and so on.
>> > ```
>> >
>> > During the “Waiting …” phase there is no per-partition line item for “still copying” or for partitions not yet submitted in later batches; those partitions are simply not in flight until their batch starts. If someone needs partition-level status during that time, they can run --verify in another terminal or use cluster metrics; --verify still only distinguishes completed vs in progress for partitions that are part of the plan and reflected in metadata / reassignment state, not “waiting in a future batch” as a distinct label.
>> >
>> > Incremental (--incremental)
>> > After the one-line mode banner, the tool emits a line each time a partition finishes and the next is submitted, for example:
>> >
>> > ```
>> > Incremental mode: keeping up to 2 partition reassignments in flight until all have been submitted.
>> > Partition test-1-0 finished reassignment; submitting next from queue if any.
>> > (and similarly for test-1-1, test-10-1, test-10-0, …)
>> > ```
>> > So incremental mode already gives clearer liveness than batch-only waits: you see completions as they happen, which helps distinguish “working” from “stuck” better than the batch wait lines alone.
>> >
>> > > LC5: indefinite polling
>> >
>> > Today there is no maximum wait time on the batch-completion loops: the tool keeps periodically re-reading cluster state until every partition in the current step satisfies the completion condition, or the operator stops the process. If reassignments are slow rather than stuck (which is common when strict inter-broker or replica throttles are applied), the wait can legitimately take a long time; that is expected and not by itself a sign of a hang.
>> > Because there is no built-in deadline yet, operators who need to stop should interrupt the tool and use the supported cancel path (--cancel with an appropriate JSON) if they want to back out active reassignments, then reassess throttles, plan size, or pacing. Adding a dedicated reassignment wait timeout would be a follow-up: it needs clear semantics (what happens on expiry, how that interacts with partial plans and the existing --timeout flag used for log directory moves), which is why this KIP does not introduce that knob yet.
>> >
>> > > LC6: Default poll interval
>> >
>> > Agreed that a 500 ms default is aggressive from a controller-load perspective for clusters that already list reassignments often. The implementation default has been raised to 1000 ms (1 second) for both the inter-step wait path and the incremental loop, and the KIP documents that default accordingly. Operators who want less Admin traffic can set --reassignment-poll-interval-ms higher (for example 3–5 seconds); the flag exists so that trade-off is explicit and tunable per environment.
>> >
>> > Regards,
>> > Manan Gupta
>> >
>> > On Mon, Jun 1, 2026 at 1:16 PM Luke Chen <[email protected]> wrote:
>> >
>> > > Hi Manan,
>> > >
>> > > LC1: Thanks for the explanation. It's clear to me now.
>> > > I think we should also put this example and the "How to choose" part in the KIP.
>> > >
>> > > Some more questions:
>> > > LC3. How does the batch mode know that all N partitions are completed and then start the next batch?
>> > > It looks like we don't poll the status when in batch mode. How do we know that?
>> > >
>> > > LC4. What will it show when some partitions are still waiting to be progressed?
>> > > Currently, the --verify only shows "is completed" or "is still in progress".
>> > > Should we have an output for the partitions that are sitting in the batch queue?
>> > >
>> > > LC5. As you've pointed out, there could be a possibility that it will poll indefinitely.
>> > > Why can't we set a timer for it?
>> > > Any concerns about it?
>> > >
>> > > LC6. "reassignment-poll-interval-ms" default to 500ms is too aggressive.
>> > > I think from users' perspective, any interval < 3 seconds or 5 seconds is considered acceptable.
>> > > So could we increase it to at least 1 second?
>> > >
>> > > Thank you,
>> > > Luke
>> > >
>> > > On Mon, Jun 1, 2026 at 3:50 PM Manan Gupta <[email protected]> wrote:
>> > >
>> > > > Hey Luke
>> > > > Thank you for reviewing the proposal.
>> > > >
>> > > > LC1:
>> > > > Please excuse me if my explanation of the two different modes was unclear.
>> > > >
>> > > > In non-incremental mode the tool walks the plan in steps. Each step submits up to N partition reassignments, then waits until every partition in that step has finished before it opens the next step. The slowest partition in the current step holds up the entire next step.
>> > > >
>> > > > In incremental mode N is not “how big each step is.” It is how many partition reassignments from this plan may be active at the same time. The tool keeps refilling up to N: whenever any single partition completes, it can start the next one from the queue. There is no rule that the whole group of N must finish together before new work starts.
>> > > >
>> > > > Example: 10 partitions in sorted order P1 through P10, N equals 3.
>> > > >
>> > > > Non-incremental: Step one submits P1 P2 P3 and waits until all three are done. Step two submits P4 P5 P6 and waits until all three are done. Step three submits P7 P8 P9 and waits until all three are done. Step four submits P10 only. If P3 is slow, P4 cannot start until P3 finishes, even if P1 and P2 are already done.
>> > > >
>> > > > Incremental: The tool first submits P1 P2 P3 so three reassignments are active. If P2 finishes first, it can submit P4 while P1 and P3 are still running, still keeping three active when possible. It continues that way until every partition in the plan has been submitted and the in-flight work drains according to the tool semantics. If P3 is slow, P4 can still start as soon as some other slot frees up.
>> > > >
>> > > > How to choose: use non-incremental if you want clear steps and a strict “this whole batch finished before the next batch starts” story. Use incremental if you want steadier utilization when finish times differ and you do not want one slow partition to block starting unrelated partitions beyond the cap of N at once.
>> > > >
>> > > > LC2:
>> > > > Both of these values are the same; I have updated the KIP to reflect that now.
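
The P1 through P10 example with N = 3 can be replayed with a small simulation that computes when each partition would be submitted under the two modes. This is a sketch of the scheduling idea only: the function names and the per-partition durations (P3 taking 5 time units, everything else 1) are assumptions chosen for illustration, not measurements or tool code.

```python
import heapq


def batch_start_times(durations, n):
    """Non-incremental: submit up to n partitions, wait for all of them, then
    open the next step."""
    starts, clock = [], 0
    for i in range(0, len(durations), n):
        step = durations[i:i + n]
        starts.extend([clock] * len(step))
        clock += max(step)  # the slowest partition gates the next step
    return starts


def incremental_start_times(durations, n):
    """Incremental: keep up to n partitions in flight and refill a slot as soon
    as any single partition finishes."""
    starts, in_flight, clock = [], [], 0
    for d in durations:
        if len(in_flight) == n:
            clock = heapq.heappop(in_flight)  # earliest finish frees a slot
        starts.append(clock)
        heapq.heappush(in_flight, clock + d)
    return starts


# P1..P10 in order; P3 is the slow one (assumed durations, arbitrary time units).
durations = [1, 1, 5, 1, 1, 1, 1, 1, 1, 1]
batch = batch_start_times(durations, 3)
incremental = incremental_start_times(durations, 3)
print("non-incremental P4 start:", batch[3])
print("incremental P4 start:", incremental[3])
```

With these assumed durations, P4 waits for the slow P3 in non-incremental mode but starts as soon as P1 or P2 frees a slot in incremental mode, which is the difference the example describes.
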
>> > > >
>> > > > Regards
>> > > > Manan Gupta
>> > > >
>> > > > On Mon, Jun 1, 2026 at 9:52 AM Luke Chen <[email protected]> wrote:
>> > > >
>> > > > > Hi Manan,
>> > > > >
>> > > > > Thanks for the KIP.
>> > > > > This is a good improvement.
>> > > > >
>> > > > > Questions:
>> > > > > 1. After reading the KIP, I still don't understand the difference between "incremental mode" and "non-incremental mode".
>> > > > > From what I can see, they both run with reassignment-batch-size at one time.
>> > > > > What's the difference between them?
>> > > > > Could you explain more?
>> > > > > Maybe some examples would help users understand the difference and how to choose between them.
>> > > > >
>> > > > > 2. I see there are "INCREMENTAL_REASSIGNMENT_POLL_INTERVAL_MS" and "reassignment-poll-interval-ms".
>> > > > > What's the difference between them?
>> > > > >
>> > > > > Thank you,
>> > > > > Luke
>> > > > >
>> > > > > On Mon, May 25, 2026 at 11:06 PM Manan Gupta <[email protected]> wrote:
>> > > > >
>> > > > > > Hey TaiJuWu
>> > > > > >
>> > > > > > Thank you for reviewing the KIP, my response is inline.
>> > > > > >
>> > > > > > > TJ00: If we have multiple batch requests, how do you handle single batch failure?
>> > > > > > - If a submit step fails, the tool returns immediately with errors and does not enqueue the rest; partitions already submitted stay under the controller’s reassignment as they do today.
>> > > > > > - The process exits with a TerseException listing the failed partitions and the error message from the broker/controller (the same pattern as a single-shot execute when some alters fail).
>> > > > > >
>> > > > > > > TJ01: If there is a long time operation, how can the users know it is still running rather than hung?
>> > > > > > - Controller / cluster side: ongoing reassignments and replication (metrics, kafka-reassign-partitions --list, Admin / JMX).
>> > > > > > - Running --verify in another terminal shows progress toward the target.
>> > > > > > Batch wait is mostly quiet; incremental is a bit chattier; true progress is best observed from cluster state or --verify, not only from stdout during the wait loop.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > Manan Gupta
>> > > > > >
>> > > > > > On Mon, May 25, 2026 at 6:06 PM TaiJu Wu <[email protected]> wrote:
>> > > > > >
>> > > > > > > Hi Manan,
>> > > > > > >
>> > > > > > > Thanks for the KIP, I just have some questions.
>> > > > > > >
>> > > > > > > TJ00: If we have multiple batch requests, how do you handle single batch failure?
>> > > > > > >
>> > > > > > > TJ01: If there is a long time operation, how can the users know it is still running rather than hung?
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > > TaiJuWu
>> > > > > > >
>> > > > > > > On Mon, May 18, 2026 at 6:09 PM Manan Gupta <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > Hey Kamal
>> > > > > > > >
>> > > > > > > > Thank you for your comments.
>> > > > > > > >
>> > > > > > > > > Should we have a configurable list poll interval?
>> > > > > > > > The current fixed interval of 500ms should not degrade the controller, but I agree that operators should have an option to change this value, so I have updated the KIP to also take another parameter, reassignment-poll-interval-ms, to override the default value of 500 ms.
>> > > > > > > >
>> > > > > > > > > Shall we extend the batching logic to the kafka-leader-election script as well?
>> > > > > > > > Good point, I will pick this up as a separate KIP as a follow-up to this KIP.
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > Manan
>> > > > > > > >
>> > > > > > > > On Mon, May 18, 2026 at 2:52 PM Kamal Chandraprakash <[email protected]> wrote:
>> > > > > > > >
>> > > > > > > > > Hi Manan,
>> > > > > > > > >
>> > > > > > > > > Thanks for improving the user-facing tools! Overall LGTM. A few questions:
>> > > > > > > > >
>> > > > > > > > > 1. Should we have a configurable list poll interval? With 500ms, does it poll the controller often to list the currently running reassignments for large partitions?
>> > > > > > > > > 2. Shall we extend the batching logic to the kafka-leader-election script as well? It will be useful when running with --all-topic-partitions.
>> > > > > > > > >
>> > > > > > > > > Thanks,
>> > > > > > > > > Kamal
>> > > > > > > > >
>> > > > > > > > > On Mon, May 11, 2026 at 8:55 AM Manan Gupta <[email protected]> wrote:
>> > > > > > > > >
>> > > > > > > > > > Hello
>> > > > > > > > > >
>> > > > > > > > > > Gentle reminder to review the KIP.
>> > > > > > > > > >
>> > > > > > > > > > Thanks,
>> > > > > > > > > > Manan
>> > > > > > > > > >
>> > > > > > > > > > On Wed, May 6, 2026 at 7:52 PM Manan Gupta <[email protected]> wrote:
>> > > > > > > > > >
>> > > > > > > > > > > Hi all,
>> > > > > > > > > > >
>> > > > > > > > > > > This email starts the discussion thread for *KIP-1335: Bounded concurrency for partition reassignment via kafka-reassign-partitions.sh*.
>> > > > > > > > > > > The proposal adds optional reassignment-batch-size and incremental parameters to kafka-reassign-partitions.sh so operators can cap how many partition reassignments are submitted or kept in flight at once using the existing Admin API.
>> > > > > > > > > > >
>> > > > > > > > > > > I would appreciate your initial thoughts and feedback on the proposal.
>> > > > > > > > > > >
>> > > > > > > > > > > https://cwiki.apache.org/confluence/x/8ZAmGQ
>> > > > > > > > > > >
>> > > > > > > > > > > Thanks,
>> > > > > > > > > > > Manan
