Hi all iam encountering a TimeoutException while publishing messages to
Kafka during load testing of our Kafka producer. The error message
indicates that records are expiring before they can be sent successfully:
org.apache.kafka.common.errors.TimeoutException: Expiring 115
record(s) for ORG_LT_17_APR_2024-5:120004 ms has passed since batch
creation
The application is running in *6 pods*, and the current Kafka producer
configurations are as follows:
ProducerConfig values:
acks = -1
auto.include.jmx.reporter = true
batch.size = 100000
bootstrap.servers = [kaf-l1-01.test.com:9092,
kaf-l2-02.test.com:9092, kaf-l3-03.test.com:9092, ngkaf-lt2-01.ci
sco.com:9092, ngkaf-lt2-02.test.com:9092, ngkaf-lt2-03.test.com:9092]
buffer.memory = 67108864
client.dns.lookup = use_all_dns_ips
client.id =
crt-pub-lt-rcdn-745cd95fc-kxplm-59152067-3214-4352-9e6d-af31a0c16489
compression.type = snappy
connections.max.idle.ms = 540000
delivery.timeout.ms = 120000
enable.idempotence = true
interceptor.classes = []
key.serializer = class
org.apache.kafka.common.serialization.StringSerializer
linger.ms = 50
max.block.ms = 60000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.max.age.ms = 300000
metadata.max.idle.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partitioner.adaptive.partitioning.enable = true
partitioner.availability.timeout.ms = 0
partitioner.class = null
partitioner.ignore.keys = false
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 10000
retries = 3
retry.backoff.ms = 5000
sasl.client.callback.handler.class = null
sasl.jaas.config = [hidden]
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.connect.timeout.ms = null
sasl.login.read.timeout.ms = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.login.retry.backoff.max.ms = 10000
sasl.login.retry.backoff.ms = 100
sasl.mechanism = PLAIN
sasl.oauthbearer.clock.skew.seconds = 30
sasl.oauthbearer.expected.audience = null
sasl.oauthbearer.expected.issuer = null
sasl.oauthbearer.jwks.endpoint.refresh.ms = 3600000
sasl.oauthbearer.jwks.endpoint.retry.backoff.max.ms = 10000
sasl.oauthbearer.jwks.endpoint.retry.backoff.ms = 100
sasl.oauthbearer.jwks.endpoint.url = null
sasl.oauthbearer.scope.claim.name = scope
sasl.oauthbearer.sub.claim.name = sub
sasl.oauthbearer.token.endpoint.url = null
security.protocol = SASL_SSL
security.providers = null
send.buffer.bytes = 131072
socket.connection.setup.timeout.max.ms = 30000
socket.connection.setup.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
ssl.endpoint.identification.algorithm = https
ssl.engine.factory.class = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.certificate.chain = null
ssl.keystore.key = null
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.3
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.certificates = null
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
transaction.timeout.ms = 60000
transactional.id = null
value.serializer = class
org.apache.kafka.common.serialization.StringSerializer
*Observed Issue:*
During load testing, messages fail to be published due to TimeoutException,
indicating that they are not delivered within the delivery.timeout.ms limit
of *120000 ms (2 minutes)*. The batch expiration is consistently observed
when high load is applied.
*Request for Assistance:*
1. *Recommended Tuning Parameters:* How should we adjust our Kafka
producer configurations (batch size, linger time, timeout settings, etc.)
for better performance in a multi-pod environment?
2. *Scaling Considerations:* Are there any best practices for
configuring Kafka producers when running in *6 pods* to handle load
effectively?
3. *Broker Side Analysis:* Could you provide any insights on potential
broker-side issues (e.g., high latency, under-replicated partitions,
resource constraints) that might be contributing to this problem?
I would appreciate your guidance on how to *optimize our producer
configurations* and ensure reliable message delivery under high load.
thankyou