Hi all,

I am encountering a TimeoutException while publishing messages to
Kafka during load testing of our Kafka producer. The error message
indicates that records are expiring before they can be sent successfully:

org.apache.kafka.common.errors.TimeoutException: Expiring 115 record(s) for ORG_LT_17_APR_2024-5:120004 ms has passed since batch creation



The application is running in *6 pods*, and the current Kafka producer
configurations are as follows:
ProducerConfig values:
        acks = -1
        auto.include.jmx.reporter = true
        batch.size = 100000
        bootstrap.servers = [kaf-l1-01.test.com:9092, kaf-l2-02.test.com:9092, kaf-l3-03.test.com:9092, ngkaf-lt2-01.cisco.com:9092, ngkaf-lt2-02.test.com:9092, ngkaf-lt2-03.test.com:9092]
        buffer.memory = 67108864
        client.dns.lookup = use_all_dns_ips
        client.id = crt-pub-lt-rcdn-745cd95fc-kxplm-59152067-3214-4352-9e6d-af31a0c16489
        compression.type = snappy
        connections.max.idle.ms = 540000
        delivery.timeout.ms = 120000
        enable.idempotence = true
        interceptor.classes = []
        key.serializer = class org.apache.kafka.common.serialization.StringSerializer
        linger.ms = 50
        max.block.ms = 60000
        max.in.flight.requests.per.connection = 5
        max.request.size = 1048576
        metadata.max.age.ms = 300000
        metadata.max.idle.ms = 300000
        metric.reporters = []
        metrics.num.samples = 2
        metrics.recording.level = INFO
        metrics.sample.window.ms = 30000
        partitioner.adaptive.partitioning.enable = true
        partitioner.availability.timeout.ms = 0
        partitioner.class = null
        partitioner.ignore.keys = false
        receive.buffer.bytes = 32768
        reconnect.backoff.max.ms = 1000
        reconnect.backoff.ms = 50
        request.timeout.ms = 10000
        retries = 3
        retry.backoff.ms = 5000
        sasl.client.callback.handler.class = null
        sasl.jaas.config = [hidden]
        sasl.kerberos.kinit.cmd = /usr/bin/kinit
        sasl.kerberos.min.time.before.relogin = 60000
        sasl.kerberos.service.name = null
        sasl.kerberos.ticket.renew.jitter = 0.05
        sasl.kerberos.ticket.renew.window.factor = 0.8
        sasl.login.callback.handler.class = null
        sasl.login.class = null
        sasl.login.connect.timeout.ms = null
        sasl.login.read.timeout.ms = null
        sasl.login.refresh.buffer.seconds = 300
        sasl.login.refresh.min.period.seconds = 60
        sasl.login.refresh.window.factor = 0.8
        sasl.login.refresh.window.jitter = 0.05
        sasl.login.retry.backoff.max.ms = 10000
        sasl.login.retry.backoff.ms = 100
        sasl.mechanism = PLAIN
        sasl.oauthbearer.clock.skew.seconds = 30
        sasl.oauthbearer.expected.audience = null
        sasl.oauthbearer.expected.issuer = null
        sasl.oauthbearer.jwks.endpoint.refresh.ms = 3600000
        sasl.oauthbearer.jwks.endpoint.retry.backoff.max.ms = 10000
        sasl.oauthbearer.jwks.endpoint.retry.backoff.ms = 100
        sasl.oauthbearer.jwks.endpoint.url = null
        sasl.oauthbearer.scope.claim.name = scope
        sasl.oauthbearer.sub.claim.name = sub
        sasl.oauthbearer.token.endpoint.url = null
        security.protocol = SASL_SSL
        security.providers = null
        send.buffer.bytes = 131072
        socket.connection.setup.timeout.max.ms = 30000
        socket.connection.setup.timeout.ms = 10000
        ssl.cipher.suites = null
        ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
        ssl.endpoint.identification.algorithm = https
        ssl.engine.factory.class = null
        ssl.key.password = null
        ssl.keymanager.algorithm = SunX509
        ssl.keystore.certificate.chain = null
        ssl.keystore.key = null
        ssl.keystore.location = null
        ssl.keystore.password = null
        ssl.keystore.type = JKS
        ssl.protocol = TLSv1.3
        ssl.provider = null
        ssl.secure.random.implementation = null
        ssl.trustmanager.algorithm = PKIX
        ssl.truststore.certificates = null
        ssl.truststore.location = null
        ssl.truststore.password = null
        ssl.truststore.type = JKS
        transaction.timeout.ms = 60000
        transactional.id = null
        value.serializer = class org.apache.kafka.common.serialization.StringSerializer


*Observed Issue:*

During load testing, publishing fails with a TimeoutException because
records are not delivered within the delivery.timeout.ms limit of
*120000 ms (2 minutes)*. The batch expiration occurs consistently under
high load.
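
To make the timing relationship concrete, here is a rough sketch of the
retry budget implied by the settings above. It assumes a fixed
retry.backoff.ms and that every attempt runs for the full
request.timeout.ms; the class and method names (DeliveryBudget,
worstCaseAttemptMs, satisfiesLowerBound) are hypothetical and only for
illustration. It does not model time a batch spends queued in the
producer before its first send.

```java
// Hypothetical helper, not part of the Kafka client API.
final class DeliveryBudget {

    // Rough upper bound on time spent on send attempts:
    // (retries + 1) attempts, each bounded by request.timeout.ms,
    // with retry.backoff.ms between consecutive attempts.
    // Assumes a fixed backoff (no exponential growth).
    static long worstCaseAttemptMs(int retries, long requestTimeoutMs, long retryBackoffMs) {
        return (retries + 1L) * requestTimeoutMs + retries * retryBackoffMs;
    }

    // Documented constraint on the producer configuration:
    // delivery.timeout.ms >= linger.ms + request.timeout.ms
    static boolean satisfiesLowerBound(long deliveryTimeoutMs, long lingerMs, long requestTimeoutMs) {
        return deliveryTimeoutMs >= lingerMs + requestTimeoutMs;
    }

    public static void main(String[] args) {
        // Values taken from the ProducerConfig dump in this post.
        int retries = 3;
        long requestTimeoutMs = 10_000;
        long retryBackoffMs = 5_000;
        long lingerMs = 50;
        long deliveryTimeoutMs = 120_000;

        long attempts = worstCaseAttemptMs(retries, requestTimeoutMs, retryBackoffMs);
        System.out.println("worst-case attempt time (ms): " + attempts);
        System.out.println("remaining delivery budget (ms): " + (deliveryTimeoutMs - attempts));
        System.out.println("lower bound satisfied: "
                + satisfiesLowerBound(deliveryTimeoutMs, lingerMs, requestTimeoutMs));
    }
}
```

With our values the attempts alone account for at most 55000 ms, well
under the 120000 ms delivery timeout. Since the error reports 120004 ms
since batch creation, this may indicate that batches spend much of that
time waiting in the producer rather than in retries, but I have not
confirmed this.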


*Request for Assistance:*

   1. *Recommended Tuning Parameters:* How should we adjust our Kafka
   producer configurations (batch size, linger time, timeout settings, etc.)
   for better performance in a multi-pod environment?
   2. *Scaling Considerations:* Are there any best practices for
   configuring Kafka producers when running in *6 pods* to handle load
   effectively?
   3. *Broker Side Analysis:* Could you provide any insights on potential
   broker-side issues (e.g., high latency, under-replicated partitions,
   resource constraints) that might be contributing to this problem?

I would appreciate your guidance on how to *optimize our producer
configurations* and ensure reliable message delivery under high load.


Thank you.
