Hi Giri,

Since nobody with more experience has answered yet, let me give you my
amateur understanding of this error.

The TimeoutException appears whenever the load generation (the code calling
the Producer) runs faster than the downstream components (producer I/O
thread, network, brokers, etc.) can handle.
Records are accepted by send(), but throughput is not high enough for them
to be sent and acknowledged by the broker before delivery.timeout.ms
expires.
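For example, the exception typically surfaces in the send() callback (or
when calling get() on the returned Future), not from send() itself, since
send() only enqueues the record into the producer's buffer. A minimal
sketch; `producer` is assumed to be your already-configured producer, and
the topic, key, and value are placeholders, not from your setup:

    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.TimeoutException;

    // send() returns as soon as the record is buffered; expiry is
    // reported asynchronously via the callback once delivery.timeout.ms
    // elapses without a broker ack.
    producer.send(new ProducerRecord<>("my-topic", "some-key", "some-value"),
        (metadata, exception) -> {
            if (exception instanceof TimeoutException) {
                // The batch expired in the accumulator or in flight:
                // delivery.timeout.ms elapsed before the broker acked it.
                System.err.println("Expired before delivery: "
                    + exception.getMessage());
            }
        });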

You may be able to change the timeout behavior to mitigate the errors,
without changing throughput, by adjusting delivery.timeout.ms,
buffer.memory, max.block.ms, request.timeout.ms, etc.
You may also be able to improve producer throughput by adjusting
linger.ms, batch.size, compression.type, send.buffer.bytes, etc.; a
config sketch follows this paragraph.
If the bottleneck is elsewhere downstream, you will need to investigate
and optimize that directly.
You can also try dividing your workload among a different number of
producers, partitions, or brokers and observe how throughput responds.
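For concreteness, here is a rough sketch of how those knobs map onto the
Java producer config. The values are illustrative only, not a
recommendation for your workload; note that Kafka requires
delivery.timeout.ms >= linger.ms + request.timeout.ms:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;

    Properties props = new Properties();
    // Timeout side: give batches longer to be delivered before expiring.
    props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 180000);
    props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 30000);
    // How long send() may block when buffer.memory is exhausted.
    props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, 60000);
    props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67108864L);
    // Throughput side: batch more per request and keep compression on.
    props.put(ProducerConfig.LINGER_MS_CONFIG, 100);
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, 131072);
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
    props.put(ProducerConfig.SEND_BUFFER_CONFIG, 131072);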

Without knowing the details of your setup and debugging it directly, it's
hard to give specific tuning advice. You can also try looking online for
others' tuning strategies.

Thanks,
Greg

On Fri, Feb 21, 2025 at 10:45 AM giri mungi <girimung...@gmail.com> wrote:

>    Hi all, I am encountering a TimeoutException while publishing messages to
> Kafka during load testing of our Kafka producer. The error message
> indicates that records are expiring before they can be sent successfully:
>
> org.apache.kafka.common.errors.TimeoutException: Expiring 115
> record(s) for ORG_LT_17_APR_2024-5:120004 ms has passed since batch
> creation
>
>
>
> The application is running in *6 pods*, and the current Kafka producer
> configurations are as follows:
> ProducerConfig values:
>         acks = -1
>         auto.include.jmx.reporter = true
>         batch.size = 100000
>         bootstrap.servers = [kaf-l1-01.test.com:9092,
> kaf-l2-02.test.com:9092, kaf-l3-03.test.com:9092,
> ngkaf-lt2-01.cisco.com:9092, ngkaf-lt2-02.test.com:9092,
> ngkaf-lt2-03.test.com:9092]
>         buffer.memory = 67108864
>         client.dns.lookup = use_all_dns_ips
>         client.id =
> crt-pub-lt-rcdn-745cd95fc-kxplm-59152067-3214-4352-9e6d-af31a0c16489
>         compression.type = snappy
>         connections.max.idle.ms = 540000
>         delivery.timeout.ms = 120000
>         enable.idempotence = true
>         interceptor.classes = []
>         key.serializer = class org.apache.kafka.common.serialization.StringSerializer
>         linger.ms = 50
>         max.block.ms = 60000
>         max.in.flight.requests.per.connection = 5
>         max.request.size = 1048576
>         metadata.max.age.ms = 300000
>         metadata.max.idle.ms = 300000
>         metric.reporters = []
>         metrics.num.samples = 2
>         metrics.recording.level = INFO
>         metrics.sample.window.ms = 30000
>         partitioner.adaptive.partitioning.enable = true
>         partitioner.availability.timeout.ms = 0
>         partitioner.class = null
>         partitioner.ignore.keys = false
>         receive.buffer.bytes = 32768
>         reconnect.backoff.max.ms = 1000
>         reconnect.backoff.ms = 50
>         request.timeout.ms = 10000
>         retries = 3
>         retry.backoff.ms = 5000
>         sasl.client.callback.handler.class = null
>         sasl.jaas.config = [hidden]
>         sasl.kerberos.kinit.cmd = /usr/bin/kinit
>         sasl.kerberos.min.time.before.relogin = 60000
>         sasl.kerberos.service.name = null
>         sasl.kerberos.ticket.renew.jitter = 0.05
>         sasl.kerberos.ticket.renew.window.factor = 0.8
>         sasl.login.callback.handler.class = null
>         sasl.login.class = null
>         sasl.login.connect.timeout.ms = null
>         sasl.login.read.timeout.ms = null
>         sasl.login.refresh.buffer.seconds = 300
>         sasl.login.refresh.min.period.seconds = 60
>         sasl.login.refresh.window.factor = 0.8
>         sasl.login.refresh.window.jitter = 0.05
>         sasl.login.retry.backoff.max.ms = 10000
>         sasl.login.retry.backoff.ms = 100
>         sasl.mechanism = PLAIN
>         sasl.oauthbearer.clock.skew.seconds = 30
>         sasl.oauthbearer.expected.audience = null
>         sasl.oauthbearer.expected.issuer = null
>         sasl.oauthbearer.jwks.endpoint.refresh.ms = 3600000
>         sasl.oauthbearer.jwks.endpoint.retry.backoff.max.ms = 10000
>         sasl.oauthbearer.jwks.endpoint.retry.backoff.ms = 100
>         sasl.oauthbearer.jwks.endpoint.url = null
>         sasl.oauthbearer.scope.claim.name = scope
>         sasl.oauthbearer.sub.claim.name = sub
>         sasl.oauthbearer.token.endpoint.url = null
>         security.protocol = SASL_SSL
>         security.providers = null
>         send.buffer.bytes = 131072
>         socket.connection.setup.timeout.max.ms = 30000
>         socket.connection.setup.timeout.ms = 10000
>         ssl.cipher.suites = null
>         ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
>         ssl.endpoint.identification.algorithm = https
>         ssl.engine.factory.class = null
>         ssl.key.password = null
>         ssl.keymanager.algorithm = SunX509
>         ssl.keystore.certificate.chain = null
>         ssl.keystore.key = null
>         ssl.keystore.location = null
>         ssl.keystore.password = null
>         ssl.keystore.type = JKS
>         ssl.protocol = TLSv1.3
>         ssl.provider = null
>         ssl.secure.random.implementation = null
>         ssl.trustmanager.algorithm = PKIX
>         ssl.truststore.certificates = null
>         ssl.truststore.location = null
>         ssl.truststore.password = null
>         ssl.truststore.type = JKS
>         transaction.timeout.ms = 60000
>         transactional.id = null
>         value.serializer = class org.apache.kafka.common.serialization.StringSerializer
>
>
> *Observed Issue:*
>
> During load testing, messages fail to be published due to a
> TimeoutException, indicating that they are not delivered within the
> delivery.timeout.ms limit of *120000 ms (2 minutes)*. The batch
> expiration is consistently observed when high load is applied.
>
>
> *Request for Assistance:*
>
>    1. *Recommended Tuning Parameters:* How should we adjust our Kafka
>    producer configurations (batch size, linger time, timeout settings,
>    etc.) for better performance in a multi-pod environment?
>    2. *Scaling Considerations:* Are there any best practices for
>    configuring Kafka producers when running in *6 pods* to handle load
>    effectively?
>    3. *Broker Side Analysis:* Could you provide any insights on potential
>    broker-side issues (e.g., high latency, under-replicated partitions,
>    resource constraints) that might be contributing to this problem?
>
> I would appreciate your guidance on how to *optimize our producer
> configurations* and ensure reliable message delivery under high load.
>
>
> Thank you
>
