Hi all, I am encountering a TimeoutException while publishing messages to Kafka during load testing of our Kafka producer. The error message indicates that records are expiring before they can be sent successfully:
```
org.apache.kafka.common.errors.TimeoutException: Expiring 115 record(s) for ORG_LT_17_APR_2024-5:120004 ms has passed since batch creation
```

The application is running in *6 pods*, and the current Kafka producer configuration is as follows:

```
ProducerConfig values:
    acks = -1
    auto.include.jmx.reporter = true
    batch.size = 100000
    bootstrap.servers = [kaf-l1-01.test.com:9092, kaf-l2-02.test.com:9092, kaf-l3-03.test.com:9092, ngkaf-lt2-01.cisco.com:9092, ngkaf-lt2-02.test.com:9092, ngkaf-lt2-03.test.com:9092]
    buffer.memory = 67108864
    client.dns.lookup = use_all_dns_ips
    client.id = crt-pub-lt-rcdn-745cd95fc-kxplm-59152067-3214-4352-9e6d-af31a0c16489
    compression.type = snappy
    connections.max.idle.ms = 540000
    delivery.timeout.ms = 120000
    enable.idempotence = true
    interceptor.classes = []
    key.serializer = class org.apache.kafka.common.serialization.StringSerializer
    linger.ms = 50
    max.block.ms = 60000
    max.in.flight.requests.per.connection = 5
    max.request.size = 1048576
    metadata.max.age.ms = 300000
    metadata.max.idle.ms = 300000
    metric.reporters = []
    metrics.num.samples = 2
    metrics.recording.level = INFO
    metrics.sample.window.ms = 30000
    partitioner.adaptive.partitioning.enable = true
    partitioner.availability.timeout.ms = 0
    partitioner.class = null
    partitioner.ignore.keys = false
    receive.buffer.bytes = 32768
    reconnect.backoff.max.ms = 1000
    reconnect.backoff.ms = 50
    request.timeout.ms = 10000
    retries = 3
    retry.backoff.ms = 5000
    sasl.client.callback.handler.class = null
    sasl.jaas.config = [hidden]
    sasl.kerberos.kinit.cmd = /usr/bin/kinit
    sasl.kerberos.min.time.before.relogin = 60000
    sasl.kerberos.service.name = null
    sasl.kerberos.ticket.renew.jitter = 0.05
    sasl.kerberos.ticket.renew.window.factor = 0.8
    sasl.login.callback.handler.class = null
    sasl.login.class = null
    sasl.login.connect.timeout.ms = null
    sasl.login.read.timeout.ms = null
    sasl.login.refresh.buffer.seconds = 300
    sasl.login.refresh.min.period.seconds = 60
    sasl.login.refresh.window.factor = 0.8
    sasl.login.refresh.window.jitter = 0.05
    sasl.login.retry.backoff.max.ms = 10000
    sasl.login.retry.backoff.ms = 100
    sasl.mechanism = PLAIN
    sasl.oauthbearer.clock.skew.seconds = 30
    sasl.oauthbearer.expected.audience = null
    sasl.oauthbearer.expected.issuer = null
    sasl.oauthbearer.jwks.endpoint.refresh.ms = 3600000
    sasl.oauthbearer.jwks.endpoint.retry.backoff.max.ms = 10000
    sasl.oauthbearer.jwks.endpoint.retry.backoff.ms = 100
    sasl.oauthbearer.jwks.endpoint.url = null
    sasl.oauthbearer.scope.claim.name = scope
    sasl.oauthbearer.sub.claim.name = sub
    sasl.oauthbearer.token.endpoint.url = null
    security.protocol = SASL_SSL
    security.providers = null
    send.buffer.bytes = 131072
    socket.connection.setup.timeout.max.ms = 30000
    socket.connection.setup.timeout.ms = 10000
    ssl.cipher.suites = null
    ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
    ssl.endpoint.identification.algorithm = https
    ssl.engine.factory.class = null
    ssl.key.password = null
    ssl.keymanager.algorithm = SunX509
    ssl.keystore.certificate.chain = null
    ssl.keystore.key = null
    ssl.keystore.location = null
    ssl.keystore.password = null
    ssl.keystore.type = JKS
    ssl.protocol = TLSv1.3
    ssl.provider = null
    ssl.secure.random.implementation = null
    ssl.trustmanager.algorithm = PKIX
    ssl.truststore.certificates = null
    ssl.truststore.location = null
    ssl.truststore.password = null
    ssl.truststore.type = JKS
    transaction.timeout.ms = 60000
    transactional.id = null
    value.serializer = class org.apache.kafka.common.serialization.StringSerializer
```

*Observed Issue:* During load testing, messages fail to be published due to TimeoutException, indicating that they are not delivered within the delivery.timeout.ms limit of *120000 ms (2 minutes)*. The batch expiration is consistently observed when high load is applied.
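For reference, here is a minimal sketch of how a producer with these settings is constructed on our side. The class name and the bootstrap-servers argument are placeholders, the SASL_SSL settings are omitted for brevity, and the commented candidate values are guesses we are considering, not validated fixes:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerFactory {

    // Values mirror the ProducerConfig dump above; only the settings that
    // seem relevant to the batch-expiry behaviour are spelled out here.
    public static KafkaProducer<String, String> create(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                  // acks = -1
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 100_000);          // knob: smaller, e.g. 32_768?
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 67_108_864L);
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, 120_000); // knob: larger, e.g. 180_000?
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, 10_000);   // knob: larger, e.g. 30_000?
        props.put(ProducerConfig.RETRIES_CONFIG, 3);
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, 5_000);
        props.put(ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION, 5);
        // SASL_SSL / security settings omitted here; they are as in the dump above.
        return new KafkaProducer<>(props);
    }
}
```

Note that Kafka requires delivery.timeout.ms >= linger.ms + request.timeout.ms, so any tuning of these knobs has to keep that constraint satisfied.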
*Request for Assistance:*

1. *Recommended Tuning Parameters:* How should we adjust our Kafka producer configuration (batch size, linger time, timeout settings, etc.) for better performance in a multi-pod environment?
2. *Scaling Considerations:* Are there any best practices for configuring Kafka producers when running in *6 pods* to handle load effectively?
3. *Broker-Side Analysis:* Could you provide any insights on potential broker-side issues (e.g., high latency, under-replicated partitions, resource constraints) that might be contributing to this problem? (A sketch of a client-side check I can run is included below.)

I would appreciate your guidance on how to *optimize our producer configuration* and ensure reliable message delivery under high load. Thank you!
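For point 3, this is the kind of check I can run from our side using the Kafka AdminClient (3.x) to flag under-replicated partitions on the topic from the error message. The bootstrap server is a placeholder taken from the config above, and the SASL_SSL client settings are again omitted for brevity:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder broker address; SASL_SSL settings omitted for brevity.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kaf-l1-01.test.com:9092");

        try (Admin admin = Admin.create(props)) {
            Map<String, TopicDescription> topics = admin
                    .describeTopics(List.of("ORG_LT_17_APR_2024"))
                    .allTopicNames()
                    .get();
            for (TopicDescription td : topics.values()) {
                for (TopicPartitionInfo p : td.partitions()) {
                    // A partition is under-replicated when its in-sync replica
                    // set is smaller than its full replica set.
                    if (p.isr().size() < p.replicas().size()) {
                        System.out.printf("Under-replicated: %s-%d isr=%d replicas=%d%n",
                                td.name(), p.partition(), p.isr().size(), p.replicas().size());
                    }
                }
            }
        }
    }
}
```

Happy to run this (or an equivalent broker-side check) during the next load test and share the output if that helps narrow things down.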