Re: [DISCUSS] KIP-1332: Dynamic memory allocation for the Kafka producer

Lianet Magrans Fri, 12 Jun 2026 12:16:57 -0700

Thanks for the feedback Jun,

JR5: agreed, update the names (turns out we were using the "incremental"
word a lot already)


JR6: Good point. Makes sense to simply reuse the same ratio-aware
estimation that trunk already uses to determine if a batch "hasRoom" but
including the new record. I updated the KIP with the details, it's pretty
consistent with the current logic, let me know your thoughts.

Thanks!


On Thu, Jun 11, 2026 at 12:13 PM Jun Rao via dev <[email protected]>
wrote:

> Hi, Lianet,
>
> Thanks for the updated KIP. A couple of more minor comments.
>
> JR5. "buffer.memory.allocation.strategy" type String ->  “static”,
> “dynamic”
> static vs dynamic doesn't quite capture the essence of the strategy. How
> about sth like full vs incremental?
>
> JR6. "The number of chunks needed for a record is calculated based on the
> record's uncompressed upper-bound size."
> Currently, the allocated size is based on the record's estimated size using
> the compression ratio. Do we plan to change that and if so, what's the
> motivation behind it?
>
> Jun
>
> On Wed, May 27, 2026 at 12:48 PM Lianet Magrans <[email protected]>
> wrote:
>
> > Thanks for the review Jun! Fixed.
> >
> > Cheers,
> > Lianet
> >
> > On Wed, May 27, 2026 at 1:40 PM Jun Rao via dev <[email protected]>
> > wrote:
> >
> > > Hi, Lianet,
> > >
> > > Thanks for the updated KIP. Just a minor comment. "but the key
> different"
> > > should be "but the key difference".
> > >
> > > Jun
> > >
> > > On Fri, May 22, 2026 at 5:22 AM Lianet Magrans <[email protected]>
> > wrote:
> > >
> > > > Hi Jun, thanks for the feedback!  (sorry for the delay,
> > > Current/travelling)
> > > >
> > > > JR1: Agreed, I updated the example (aligned with the same mixed
> > workload
> > > > case mentioned at the beginning of the motivation)
> > > >
> > > > JR3: The "scale down" was referring to the reservation only (memory
> > held
> > > by
> > > > open batches, less during low-traffic vs high traffic period).
> > Clarified
> > > in
> > > > the KIP to make clear that it's not about pool memory scaling down,
> > just
> > > > about reservation in open batches (pool memory free for other
> > partitions
> > > if
> > > > needed).
> > > >
> > > > JR4: Yes, it was confusing indeed. The intention was just to refer to
> > the
> > > > producer thread marking the batch for closing (not the actual close).
> > > This
> > > > will all be the same as today when the batch fills up, as you
> described
> > > > (producer just "marks for close", sender does the actual close and
> > frees
> > > > memory up). I clarified it all in the KIP to be accurate.
> > > >
> > > > Thanks!
> > > > Lianet
> > > >
> > > >
> > > > On Fri, May 15, 2026 at 9:39 PM Jun Rao via dev <
> [email protected]>
> > > > wrote:
> > > >
> > > > > Hi, Lianet,
> > > > >
> > > > > Thanks for the reply.
> > > > >
> > > > > JR1. "Memory usage: under the current static strategy, a producer
> > > writing
> > > > > 10 MiB/s of aggregate throughput to a 1000-partition topic with
> > > > > RoundRobinPartitioner struggles to achieve a meaningful fraction of
> > > that
> > > > at
> > > > > the default 16384 bytes "batch.size". Each partition only sends
> 16384
> > > > bytes
> > > > > at a time over a high-latency link, so per-partition throughput is
> > > > bounded
> > > > > by "16384 bytes / RTT". Increasing "batch.size" to 4 MiB unblocks
> > > > > throughput but the producer would need 4 MiB × 1000 partitions = 4
> > GiB
> > > of
> > > > > pool memory (regardless of actual volume of data flowing per
> > > partition).
> > > > > Under the dynamic strategy and the same batch.size = 4 MiB, target
> > > > > throughput of ~10 KiB/s and linger.ms = 100ms, per-partition
> memory
> > > > > becomes
> > > > > ≈ ~1 KiB (10 KiB throughput × 100 ms linger), so total memory ≈
> 1000
> > ×
> > > 1
> > > > > KiB = ~1 MiB (orders of magnitude less than the 4 GiB used under
> the
> > > > static
> > > > > allocation). Similar savings apply to any workload where per-batch
> > data
> > > > > falls short of batch.size: hot-cold partition distributions (skewed
> > key
> > > > > traffic), bursty workloads with quiet periods, and over-provisioned
> > > > > batch.size settings."
> > > > > This example is still not very convincing. It's true that one can
> set
> > > > > batch.size=4MB without running out of memory, but it doesn't
> achieve
> > > the
> > > > > batching benefit. So, why will a user bother setting a high batch
> > size?
> > > > One
> > > > > possible example is a client that publishes to a high volume topic
> > > > without
> > > > > keys, and to a low-volume topic with keys, using the default
> > > partitioning
> > > > > strategy. When a high batch size is set, the static approach may
> > > exhaust
> > > > > the buffer pool, whereas the dynamic approach avoids exhausting the
> > > pool
> > > > > and still achieves the batching benefit for the high volume topic.
> > > > >
> > > > >
> > > > > JR3. "Dynamic uses aggregate_throughput × linger.ms, which
> operators
> > > > > control. During lower-traffic periods, static still reserves 400
> MiB
> > > > until
> > > > > batches close; dynamic scales down proportionally."
> > > > > Hmm, if the dynamic approach ever allocates 400MB worth of chunks,
> it
> > > > never
> > > > > deallocates them right? Then, how will dynamic scale down?
> > > > >
> > > > >
> > > > > JR4. "If the non-blocking acquire fails (pool exhausted), the
> > producer
> > > > will
> > > > > close the current batch (making it eligible to drain), and blocks
> on
> > > the
> > > > > pool to allocate the chunks for the new record (up to max.block.ms
> > )."
> > > > > To be precise, currently, when the buffer pool is exhausted, the
> > > producer
> > > > > doesn't close the batch directly. The background sender thread
> drains
> > > and
> > > > > closes the batch.
> > > > >
> > > > > Jun
> > > > >
> > > > > On Thu, May 14, 2026 at 2:30 PM Lianet Magrans <[email protected]
> >
> > > > wrote:
> > > > >
> > > > > > Hi Jun,
> > > > > >
> > > > > > JR1: The example's point was about the case where flow remains
> > under
> > > > the
> > > > > > batch limit (those are the cases where we would get significant
> > > memory
> > > > > > improvement/differences). But I do get your point and agree: in
> > > > scenarios
> > > > > > where the full batch is used, the dynamic strategy would end up
> > using
> > > > the
> > > > > > same amount of memory. Still, in those cases the value comes from
> > the
> > > > > > predictability/tuning of the buffer.memory (memory consumption
> > > depends
> > > > on
> > > > > > known factors, not workload-dependant ones). I clarified the
> first
> > > > > example,
> > > > > > and added a second one to showcase the case where it's not about
> > > memory
> > > > > > gains but about predictability.
> > > > > >
> > > > > > JR2: The main concern with keeping the same close-and-block as
> > trunk
> > > in
> > > > > > this case was the change it would bring into the send() blocking
> > > > pattern.
> > > > > > On trunk, send only blocks for memory for the first record of a
> > > batch,
> > > > > but
> > > > > > never mid-batch. Applying this close-and-block to the dynamic
> > > strategy
> > > > > > would change this (send() could block on any record regardless of
> > an
> > > > open
> > > > > > batch). I leaned initially toward avoiding changing the blocking
> > > > > behaviour
> > > > > > (and pay the extra direct allocation with visibility), but on
> > second
> > > > > > thoughts I agree it's cleaner to surface the situation to the API
> > > > > (blocking
> > > > > > on send, aligned with what trunk does on new batch only, and
> > dynamic
> > > > > would
> > > > > > do at the record level). It's no change to the send or
> max.bloc.ms
> > > > > > <
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__http://max.bloc.ms__;!!Ayb5sqE7!tij1l2481b8WaW403I3sX_JjzjVmH3SFTImLH03m5t8v-95dzvTWsUQaSx4vXv1j93EM6pqXPd0mg4my$
> > > > > >
> > > > > > contract really, just a different pattern that seems sensible
> given
> > > the
> > > > > > "on-demand" allocation. I updated the KIP with this, and left a
> > > > rejected
> > > > > > alternative for the record. Also, with this I opted for dropping
> > the
> > > > new
> > > > > > metric I had (which was mainly to have visbility over this new
> > > > > > direct-allocation path, now removed)
> > > > > >
> > > > > > Hi TengYao:
> > > > > >
> > > > > > TYC1: interesting point, agree that your suggested metric would
> > give
> > > > > > visibility on what's actually allocated from the pool (which is
> > > dynamic
> > > > > > now, didn't make too much sense before because it was "static",
> > > > > > ~batch.size). I believe that for some of the scenarios you
> shared,
> > we
> > > > > would
> > > > > > be covered with the metrics that already exist in trunk (e.g.,
> > > > > > bufferpool-wait-*, buffer-available/total-bytes, batch-size-avg),
> > > > still,
> > > > > > it's a fact that the new strategy allocates differently from the
> > > pool,
> > > > > > dynamically, and only a metric like you suggest would let us see
> > how
> > > > that
> > > > > > goes (batch-size-avg is the closest but is post-compression so
> not
> > > the
> > > > > > same). I just wonder if it would make more sense to represent it
> in
> > > > > bytes,
> > > > > > rather than in chunks?? (e.g, "batch-pool-bytes-avg"). It would
> > align
> > > > > > better with existing metrics in this space, all in bytes. Also I
> > > expect
> > > > > > operators probably think in bytes (not a new "chunk" concept,
> which
> > > is
> > > > > just
> > > > > > an internal implementation grouping bytes), and maybe better not
> to
> > > > > expose
> > > > > > chunk as a unit of measure to make sure the metric ages well even
> > if
> > > > the
> > > > > > internal chunk details move). What do you think? Will wait to
> hear
> > > back
> > > > > and
> > > > > > align before updating the KIP
> > > > > >
> > > > > > Thanks both!
> > > > > > Cheers,
> > > > > > Lianet
> > > > > >
> > > > > >
> > > > > > On Wed, May 13, 2026 at 4:40 PM Jun Rao via dev <
> > > [email protected]>
> > > > > > wrote:
> > > > > >
> > > > > >> Hi, Lianet,
> > > > > >>
> > > > > >> Thanks for the reply.
> > > > > >>
> > > > > >> JR1. "As an example: a producer writing 10 MiB/s of aggregate
> > > > throughput
> > > > > >> to
> > > > > >> a 1000-partition topic with RoundRobinPartitioner struggles to
> > > > achieve a
> > > > > >> meaningful fraction of that at the default 16384 bytes
> > "batch.size".
> > > > > Each
> > > > > >> partition only sends 16384 bytes at a time over a high-latency
> > link,
> > > > so
> > > > > >> per-partition throughput is bounded by "16384 bytes / RTT".
> > > Increasing
> > > > > >> "batch.size" to 4 MiB unblocks throughput but the producer would
> > > need
> > > > 4
> > > > > >> MiB
> > > > > >> × 1000 partitions = 4 GiB of pool memory to accommodate all
> > > partitions
> > > > > >> simultaneously (regardless of actual volume of data flowing per
> > > > > >> partition)."
> > > > > >> This example does not seem strong. In this case, the producer
> > still
> > > > > >> requires 4GB of memory even with the proposed KIP to achieve
> high
> > > > > >> throughput because all 1000 partitions are active.
> > > > > >>
> > > > > >> JR2. "When a new record arrives mid-batch and the pool is
> > exhausted,
> > > > it
> > > > > >> will perform direct heap allocation to allocate all the chunks
> > > > estimated
> > > > > >> needed for the record uncompressed size."
> > > > > >> Why do we need to introduce this new case for direct allocation?
> > > This
> > > > > case
> > > > > >> exists in the static allocation approach. If the buffer pool is
> > > > > exhausted,
> > > > > >> the send() call blocks but all pending batches become drainable
> to
> > > > > prevent
> > > > > >> deadlock. Is there any issue with using the same mechanism for
> > > dynamic
> > > > > >> allocation?
> > > > > >>
> > > > > >> Jun
> > > > > >>
> > > > > >>
> > > > > >> On Wed, May 13, 2026 at 8:53 AM Lianet Magrans <
> > [email protected]>
> > > > > >> wrote:
> > > > > >>
> > > > > >> > Hi Jun,
> > > > > >> >
> > > > > >> > JR1: Agreed, I updated the motivation section to clarify the
> > > > different
> > > > > >> > scenarios based on keys and partitioner, and under which
> > > situations
> > > > it
> > > > > >> > becomes problematic.
> > > > > >> >
> > > > > >> > JR2: The KIP preserves the 2 existing direct allocation
> triggers
> > > you
> > > > > >> > mentioned (compressed data exceeding allocation and batch
> > split),
> > > > and
> > > > > >> also
> > > > > >> > introduces a new one (on new record mid-batch when pool
> > exhausted,
> > > > > >> > basically due to the per-record reservation approach). To
> > > mitigate,
> > > > > >> direct
> > > > > >> > allocation is limited to one record's worth of growth per
> batch
> > > > (batch
> > > > > >> > closed right after it), and we're also introducing the new
> > metric
> > > to
> > > > > >> have
> > > > > >> > visblity and allow to tune buffer.memory. Under normal pool
> > > > > conditions,
> > > > > >> > direct allocations with the new strategy should happen less
> > often
> > > > than
> > > > > >> with
> > > > > >> > the current behaviour, mainly because of the proposed
> > improvement
> > > to
> > > > > try
> > > > > >> > the pool first, non-blocking before falling back to heap
> > > > allocation. I
> > > > > >> > clarified it all in the Internal allocation strategy section
> > > > > (extending
> > > > > >> on
> > > > > >> > new sections "Blocking behaviour" and "Direct heap
> allocation").
> > > > > Please
> > > > > >> > take a look and let me know.
> > > > > >> >
> > > > > >> > Thanks for the review!
> > > > > >> > Lianet
> > > > > >> >
> > > > > >> > PS: addressing TengYao's feedback shortly, thanks!
> > > > > >> >
> > > > > >> > On Tue, May 12, 2026 at 11:31 AM TengYao Chi <
> > > [email protected]
> > > > >
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > Hi Lianet,
> > > > > >> > >
> > > > > >> > > Thanks for this great KIP.
> > > > > >> > >
> > > > > >> > > TYC1. I have one consideration regarding observability: Do
> we
> > > > need a
> > > > > >> new
> > > > > >> > > metric for average-chunks-per-batch? With the introduction
> of
> > > the
> > > > > >> > > chunked-buffer strategy, memory usage per partition is no
> > > longer a
> > > > > >> fixed
> > > > > >> > > batch.size. While this significantly improves memory
> > efficiency,
> > > > it
> > > > > >> might
> > > > > >> > > be beneficial for operators to understand the actual "chunk
> > > > > >> utilization"
> > > > > >> > or
> > > > > >> > > fragmentation under different workloads. Specifically, I
> think
> > > > this
> > > > > >> > metric
> > > > > >> > > would be valuable when combined with the proposed
> > > > > bufferpool-overflow
> > > > > >> > > metrics: it would help operators distinguish whether memory
> > > > pressure
> > > > > >> is
> > > > > >> > > being driven by a large number of active partitions (many
> > small
> > > > > >> batches)
> > > > > >> > or
> > > > > >> > > by individual batches becoming unexpectedly large (many
> chunks
> > > per
> > > > > >> batch,
> > > > > >> > > perhaps due to large records or low compression ratios).
> What
> > do
> > > > you
> > > > > >> > think?
> > > > > >> > >
> > > > > >> > > Best,
> > > > > >> > > TengYao Chi
> > > > > >> > >
> > > > > >> > > On 2026/05/11 23:03:53 Jun Rao via dev wrote:
> > > > > >> > > > Hi, Lianet,
> > > > > >> > > >
> > > > > >> > > > Thanks for the KIP.
> > > > > >> > > >
> > > > > >> > > > JR1. It would be useful to provide a bit more motivation
> for
> > > the
> > > > > >> KIP.
> > > > > >> > The
> > > > > >> > > > batches allocated from the buffer pool are proportional to
> > the
> > > > > >> number
> > > > > >> > of
> > > > > >> > > > active partitions. For publishing records without keys,
> the
> > > > active
> > > > > >> > > > partition is 1 by default, independent of the number of
> > > > partitions
> > > > > >> in a
> > > > > >> > > > topic. It's only when publishing records with keys that
> the
> > > > active
> > > > > >> > > > partition can be the total number of partitions in a
> topic.
> > > So,
> > > > a
> > > > > >> > > possible
> > > > > >> > > > scenario is that a client publishes records without keys
> to
> > > one
> > > > > >> topic
> > > > > >> > > while
> > > > > >> > > > publishing records with keys to another.
> > > > > >> > > >
> > > > > >> > > > JR2. "Following records appended to the batch do not block
> > or
> > > > > throw.
> > > > > >> > They
> > > > > >> > > > attempt non-blocking pool allocation and fall back to
> direct
> > > > heap
> > > > > if
> > > > > >> > the
> > > > > >> > > > pool is exhausted.
> > > > > >> > > > Ensures not blocking on pool memory while already holding
> > some
> > > > > for a
> > > > > >> > > batch".
> > > > > >> > > >
> > > > > >> > > > Currently, the producer only allocates memory exceeding
> the
> > > > > >> configured
> > > > > >> > > > buffer pool size in two cases.
> > > > > >> > > > (1) Compressed data exceeding the estimated size
> > > > > >> > > > (2) When a batch is too large for the broker's
> > > max.message.bytes
> > > > > and
> > > > > >> > gets
> > > > > >> > > > split, each sub-batch is allocated via
> > > > > >> ByteBuffer.allocate(initialSize)
> > > > > >> > > > directly.
> > > > > >> > > >
> > > > > >> > > > With the KIP, are we introducing new cases in addition to
> > the
> > > > > above
> > > > > >> > two?
> > > > > >> > > >
> > > > > >> > > > Jun
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On Fri, May 1, 2026 at 6:03 AM Lianet Magrans <
> > > > [email protected]
> > > > > >
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Thanks for the feedback Jaisen! I like your proposed
> > > "static"
> > > > > for
> > > > > >> the
> > > > > >> > > > > current behaviour, it aligns nicely. All updated.
> > > > > >> > > > >
> > > > > >> > > > > Best!
> > > > > >> > > > > Lianet
> > > > > >> > > > >
> > > > > >> > > > > On Thu, Apr 30, 2026 at 4:27 PM Jaisen Mathai via dev <
> > > > > >> > > > > [email protected]>
> > > > > >> > > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > Thanks Lianet.
> > > > > >> > > > > >
> > > > > >> > > > > > I like the proposal.
> > > > > >> > > > > >
> > > > > >> > > > > > I suggest a descriptive name such as static or fixed
> > > instead
> > > > > of
> > > > > >> > > legacy
> > > > > >> > > > > for
> > > > > >> > > > > > the default configuration value. I think these will
> age
> > > > better
> > > > > >> > while
> > > > > >> > > > > still
> > > > > >> > > > > > communicating that users should strongly consider
> using
> > > the
> > > > > >> > > non-default
> > > > > >> > > > > > value of dynamic.
> > > > > >> > > > > >
> > > > > >> > > > > > Jaisen
> > > > > >> > > > > >
> > > > > >> > > > > > On Thu, Apr 30, 2026 at 8:02 AM Lianet Magrans <
> > > > > >> [email protected]
> > > > > >> > >
> > > > > >> > > > > wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > > Hi all,
> > > > > >> > > > > > >
> > > > > >> > > > > > > I would like to start a discussion on KIP-1332 that
> > > > > proposes a
> > > > > >> > > dynamic
> > > > > >> > > > > > > memory allocation strategy for the Kafka producer,
> to
> > > > unlock
> > > > > >> > > > > high-latency
> > > > > >> > > > > > > scenarios increasingly common as Kafka moves toward
> > > object
> > > > > >> > storage.
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://urldefense.com/v3/__https://cwiki.apache.org/confluence/display/KAFKA/KIP-1332*3A*Dynamic*memory*allocation*for*the*Kafka*producer__;JSsrKysrKys!!Ayb5sqE7!t4yI-C5BwMxJ6dMJC7tuQhu94KuolbgKXyEnl4GChJGLYY2eS4NXk-GZYlnVPnuw3ESrGwKjyPDr5Bjp0Gk$
> > > > > >> > > > > > >
> > > > > >> > > > > > > Thanks!
> > > > > >> > > > > > > Lianet
> > > > > >> > > > > > >
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] KIP-1332: Dynamic memory allocation for the Kafka producer

Reply via email to