[ https://issues.apache.org/jira/browse/KAFKA-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17890366#comment-17890366 ]
Stefan Huber commented on KAFKA-15796:
--------------------------------------

Hi, we are having the same issue. We need to connect to a broker that randomly encounters authentication issues. With the default config, each thread simply stops working after it encounters an authentication exception. When we added authorizationExceptionRetryInterval to the config, failed authentications were retried, but we also noticed that we run out of CPU after a few hours. In a profiler, I can see that a massive amount of time is spent in KafkaConsumer.poll().

We were able to limit the impact somewhat by tuning the backoff settings and enabling restartAfterAuthExceptions. However, after a few days at most, we usually see high CPU load again and have to restart our app manually.

I am not very familiar with Kafka internals, but if there is any data I can provide that helps to fix this issue, please let me know.

> High CPU issue in Kafka Producer when Auth Failed
> --------------------------------------------------
>
>                 Key: KAFKA-15796
>                 URL: https://issues.apache.org/jira/browse/KAFKA-15796
>             Project: Kafka
>          Issue Type: Bug
>          Components: clients, producer
>    Affects Versions: 3.2.2, 3.2.3, 3.3.1, 3.3.2, 3.5.0, 3.4.1, 3.6.0, 3.5.1
>            Reporter: xiaotong.wang
>            Priority: Major
>         Attachments: image-2023-11-07-14-18-32-016.png
>
>
> How to reproduce
> 1. Use a kafka-client 3.x.x Producer with enable.idempotence=true (this is
> the default).
> 2. Start a Kafka server that does not contain the client user's auth info.
> 3. Start the client producer. From 3.x on, the producer will
> initProducerId and the TCM state transitions to INITIALIZING.
> 4. The server rejects the client request, and the producer raises
> AuthenticationException
> (org.apache.kafka.clients.producer.internals.Sender#maybeSendAndPollTransactionalRequest).
> 5. In kafka-client,
> org.apache.kafka.clients.producer.internals.Sender#runOnce catches the
> AuthenticationException and calls
> transactionManager.authenticationFailed(e);
>
>     synchronized void authenticationFailed(AuthenticationException e) {
>         for (TxnRequestHandler request : pendingRequests)
>             request.fatalError(e);
>     }
>
> This method only handles pending requests; the in-flight request is
> missed.
> 6. The TCM state then always stays in INITIALIZING, because of the
> condition: currentState != State.INITIALIZING && !hasProducerId()
> 7. The producer sends a message. The message goes into the batch queue,
> and the Sender wakes up, sets pollTimeout=0, and prepares to send the
> message.
> 8. However, before the Sender sends the producer data (sendProducerData),
> it filters the messages: RecordAccumulator drain
> -> drainBatchesForOneNode -> shouldStopDrainBatchesForPartition.
> When producerIdAndEpoch.isValid()==false, this returns true, so no
> messages are collected.
> 9. The Kafka producer network thread's CPU usage now goes to 100%.
> 10. Even if we add the user auth info and permissions on the Kafka
> server, the producer cannot self-heal.
>
>
>
> Suggestion:
> Also catch AuthenticationException in
> org.apache.kafka.clients.producer.internals.Sender#maybeSendAndPollTransactionalRequest
> and fail the in-flight InitProducerId request.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)