Hi Giri,

Thanks for your additional context. Good to see you're running experiments!

> How to achieve workload among a different number of producers..could you
> suggest java code for this. This is an api calling from dB stored procedure
> and they can call any number of times this api with messages as payload.

This is something specific to your application or framework, so I won't be
able to give you a specific answer. You should work backwards from where the
producers are configured and instantiated in order to figure out how to
instantiate more of them, and then spread your sends across them. A rough
sketch of that idea is below.
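
To illustrate what I mean (this is only a sketch, not drop-in code: the
ProducerPool name, the pool size, and the bootstrap servers/serializer
settings are placeholders for however your API currently builds its single
producer), one common shape is a fixed pool of producers with sends spread
across them round-robin:

import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Hypothetical helper: a fixed pool of producers, one chosen per send.
public class ProducerPool implements AutoCloseable {

    private final List<KafkaProducer<String, String>> producers = new ArrayList<>();
    private final AtomicLong counter = new AtomicLong();

    public ProducerPool(int size, String bootstrapServers) {
        for (int i = 0; i < size; i++) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
            props.put(ProducerConfig.CLIENT_ID_CONFIG, "api-producer-" + i);
            // ...plus whatever acks/batching/timeout settings you already use.
            producers.add(new KafkaProducer<>(props));
        }
    }

    // Round-robin across the pool, so concurrent API calls are spread over
    // separate producer instances (and their batches and connections).
    public void send(String topic, String key, String value) {
        int index = (int) (counter.getAndIncrement() % producers.size());
        producers.get(index).send(new ProducerRecord<>(topic, key, value));
    }

    @Override
    public void close() {
        for (KafkaProducer<String, String> producer : producers) {
            producer.close();
        }
    }
}

The important parts are that the pool is created once and reused for the
lifetime of the application (creating a producer per API call would make the
timeouts worse), and that it is closed on shutdown so buffered records are
flushed. Whether more producers actually helps depends on where the
bottleneck is, which brings me to your second question.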

> How to find downstream bottleneck (network, brokers, partitions).

You should examine the metrics emitted by clients and brokers to gather more
information about the workload. Try changing configurations, reducing the
incoming workload, etc., and see how the metrics respond. Also check for
resource exhaustion or over-utilization in your operating system. As a
starting point, the snippet below reads a few producer metrics that usually
show where the time is going.
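
Again, only a sketch: the class and method are made up for illustration, but
KafkaProducer.metrics() and the metric names below are standard producer
metrics.

import java.util.Map;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.Metric;
import org.apache.kafka.common.MetricName;

public class ProducerMetricsDump {

    // Log a few producer metrics that point at common bottlenecks:
    //   record-queue-time-avg    - time records wait in the accumulator
    //   request-latency-avg      - network/broker round-trip time
    //   batch-size-avg           - how full the batches actually are
    //   records-per-request-avg  - how well batching is working
    //   buffer-available-bytes   - whether buffer.memory is nearly exhausted
    public static void dump(KafkaProducer<String, String> producer) {
        for (Map.Entry<MetricName, ? extends Metric> entry : producer.metrics().entrySet()) {
            String name = entry.getKey().name();
            if (name.equals("record-queue-time-avg")
                    || name.equals("request-latency-avg")
                    || name.equals("batch-size-avg")
                    || name.equals("records-per-request-avg")
                    || name.equals("buffer-available-bytes")) {
                System.out.printf("%s (%s) = %s%n",
                        name, entry.getKey().group(), entry.getValue().metricValue());
            }
        }
    }
}

Roughly speaking, if record-queue-time-avg grows toward your
delivery.timeout.ms, records are sitting in the producer waiting to be sent;
if request-latency-avg is high, look more closely at the network and the
brokers (which also expose their own metrics over JMX).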

These are only general points about where to start. Optimizing Kafka is a
large topic, so I would recommend searching online for more resources.

Thanks,
Greg

On Tue, Feb 25, 2025 at 10:50 AM giri mungi <girimung...@gmail.com> wrote:

> Hi Greg,
>
> Thanks for your insights! I tried increasing the timeouts to:
>
> request.timeout.ms = 60000
> delivery.timeout.ms = 900000
>
> However, the issue persists. Some messages are still failing intermittently
> with a timeout, while others are successfully delivered.
>
> props.put(ProducerConfig.ACKS_CONFIG, "all");
> props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16384); // Keep 32 KB
> props.put(ProducerConfig.LINGER_MS_CONFIG, 10); // Allow 10ms to batch messages
> props.put(ProducerConfig.RETRIES_CONFIG, 10);
> props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 5000); // Increase backoff to 5s
> props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 60000);
> props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 900000);
> props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 33554432); // Increase buffer to 32MB
> props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // Reduce payload size
>
> What I've Tried So Far:
> 1) Increasing timeouts (as mentioned above) – but this hasn't fully
> resolved the issue.
> 2) Tuning linger.ms and batch.size to optimize batching, but I haven't
> observed a significant improvement.
>
> Looking for Advice on Next Steps:
>
> you suggested we can consider dividing workload among a different number
> of producers ==> how to do this.
>
> What would you suggest as the next steps to:
>
> How to achieve workload among a different number of producers..could you
> suggest java code for this. This is an api calling from dB stored procedure
> and they can call any number of times this api with messages as payload.
>
> How to find downstream bottleneck (network, brokers, partitions).
>
> On Tue, Feb 25, 2025 at 10:25 PM Greg Harris <greg.har...@aiven.io.invalid>
> wrote:
>
> > Hi Giri,
> >
> > Since nobody with more experience has answered yet, let me give you my
> > amateur understanding of this error.
> >
> > The TimeoutException will appear whenever the load generation (code calling
> > the Producer) runs faster than all downstream components (Producer,
> > Network, Brokers, etc) can handle.
> > Records are accepted by send() but the throughput is not high enough to
> > acknowledge the record before the delivery.timeout.ms expires.
> >
> > You may be able to change the behavior of the timeout to mitigate the
> > errors without changing the throughput by configuring the
> > delivery.timeout.ms, buffer.memory, max.block.ms, request.timeout.ms, etc.
> > You may also be able to improve the throughput of the producer by
> > configuring the linger.ms, batch.size, compression, send.buffer.bytes, etc.
> > If the performance bottleneck is elsewhere downstream, you will need to
> > investigate and optimize that.
> > You can also consider dividing your workload among a different number of
> > producers, partitions, or brokers to see how the throughput behaves.
> >
> > Without knowing the details of your setup and debugging it directly, it's
> > hard to give specific tuning advice. You can try looking online for others'
> > tuning strategy.
> >
> > Thanks,
> > Greg
> >
> > On Fri, Feb 21, 2025 at 10:45 AM giri mungi <girimung...@gmail.com>
> > wrote:
> >
> > > Hi all iam encountering a TimeoutException while publishing messages to
> > > Kafka during load testing of our Kafka producer. The error message
> > > indicates that records are expiring before they can be sent successfully:
> > >
> > > org.apache.kafka.common.errors.TimeoutException: Expiring 115
> > > record(s) for ORG_LT_17_APR_2024-5:120004 ms has passed since batch
> > > creation
> > >
> > > The application is running in *6 pods*, and the current Kafka producer
> > > configurations are as follows:
> > >
> > > ProducerConfig values:
> > > acks = -1
> > > auto.include.jmx.reporter = true
> > > batch.size = 100000
> > > bootstrap.servers = [kaf-l1-01.test.com:9092, kaf-l2-02.test.com:9092,
> > > kaf-l3-03.test.com:9092, ngkaf-lt2-01.cisco.com:9092,
> > > ngkaf-lt2-02.test.com:9092, ngkaf-lt2-03.test.com:9092]
> > > buffer.memory = 67108864
> > > client.dns.lookup = use_all_dns_ips
> > > client.id = crt-pub-lt-rcdn-745cd95fc-kxplm-59152067-3214-4352-9e6d-af31a0c16489
> > > compression.type = snappy
> > > connections.max.idle.ms = 540000
> > > delivery.timeout.ms = 120000
> > > enable.idempotence = true
> > > interceptor.classes = []
> > > key.serializer = class org.apache.kafka.common.serialization.StringSerializer
> > > linger.ms = 50
> > > max.block.ms = 60000
> > > max.in.flight.requests.per.connection = 5
> > > max.request.size = 1048576
> > > metadata.max.age.ms = 300000
> > > metadata.max.idle.ms = 300000
> > > metric.reporters = []
> > > metrics.num.samples = 2
> > > metrics.recording.level = INFO
> > > metrics.sample.window.ms = 30000
> > > partitioner.adaptive.partitioning.enable = true
> > > partitioner.availability.timeout.ms = 0
> > > partitioner.class = null
> > > partitioner.ignore.keys = false
> > > receive.buffer.bytes = 32768
> > > reconnect.backoff.max.ms = 1000
> > > reconnect.backoff.ms = 50
> > > request.timeout.ms = 10000
> > > retries = 3
> > > retry.backoff.ms = 5000
> > > sasl.client.callback.handler.class = null
> > > sasl.jaas.config = [hidden]
> > > sasl.kerberos.kinit.cmd = /usr/bin/kinit
> > > sasl.kerberos.min.time.before.relogin = 60000
> > > sasl.kerberos.service.name = null
> > > sasl.kerberos.ticket.renew.jitter = 0.05
> > > sasl.kerberos.ticket.renew.window.factor = 0.8
> > > sasl.login.callback.handler.class = null
> > > sasl.login.class = null
> > > sasl.login.connect.timeout.ms = null
> > > sasl.login.read.timeout.ms = null
> > > sasl.login.refresh.buffer.seconds = 300
> > > sasl.login.refresh.min.period.seconds = 60
> > > sasl.login.refresh.window.factor = 0.8
> > > sasl.login.refresh.window.jitter = 0.05
> > > sasl.login.retry.backoff.max.ms = 10000
> > > sasl.login.retry.backoff.ms = 100
> > > sasl.mechanism = PLAIN
> > > sasl.oauthbearer.clock.skew.seconds = 30
> > > sasl.oauthbearer.expected.audience = null
> > > sasl.oauthbearer.expected.issuer = null
> > > sasl.oauthbearer.jwks.endpoint.refresh.ms = 3600000
> > > sasl.oauthbearer.jwks.endpoint.retry.backoff.max.ms = 10000
> > > sasl.oauthbearer.jwks.endpoint.retry.backoff.ms = 100
> > > sasl.oauthbearer.jwks.endpoint.url = null
> > > sasl.oauthbearer.scope.claim.name = scope
> > > sasl.oauthbearer.sub.claim.name = sub
> > > sasl.oauthbearer.token.endpoint.url = null
> > > security.protocol = SASL_SSL
> > > security.providers = null
> > > send.buffer.bytes = 131072
> > > socket.connection.setup.timeout.max.ms = 30000
> > > socket.connection.setup.timeout.ms = 10000
> > > ssl.cipher.suites = null
> > > ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
> > > ssl.endpoint.identification.algorithm = https
> > > ssl.engine.factory.class = null
> > > ssl.key.password = null
> > > ssl.keymanager.algorithm = SunX509
> > > ssl.keystore.certificate.chain = null
> > > ssl.keystore.key = null
> > > ssl.keystore.location = null
> > > ssl.keystore.password = null
> > > ssl.keystore.type = JKS
> > > ssl.protocol = TLSv1.3
> > > ssl.provider = null
> > > ssl.secure.random.implementation = null
> > > ssl.trustmanager.algorithm = PKIX
> > > ssl.truststore.certificates = null
> > > ssl.truststore.location = null
> > > ssl.truststore.password = null
> > > ssl.truststore.type = JKS
> > > transaction.timeout.ms = 60000
> > > transactional.id = null
> > > value.serializer = class org.apache.kafka.common.serialization.StringSerializer
> > >
> > > *Observed Issue:*
> > >
> > > During load testing, messages fail to be published due to TimeoutException,
> > > indicating that they are not delivered within the delivery.timeout.ms
> > > limit of *120000 ms (2 minutes)*. The batch expiration is consistently
> > > observed when high load is applied.
> > >
> > > *Request for Assistance:*
> > >
> > > 1. *Recommended Tuning Parameters:* How should we adjust our Kafka
> > > producer configurations (batch size, linger time, timeout settings, etc.)
> > > for better performance in a multi-pod environment?
> > > 2. *Scaling Considerations:* Are there any best practices for
> > > configuring Kafka producers when running in *6 pods* to handle load
> > > effectively?
> > > 3. *Broker Side Analysis:* Could you provide any insights on potential
> > > broker-side issues (e.g., high latency, under-replicated partitions,
> > > resource constraints) that might be contributing to this problem?
> > >
> > > I would appreciate your guidance on how to *optimize our producer
> > > configurations* and ensure reliable message delivery under high load.
> > >
> > > thankyou