[
https://issues.apache.org/jira/browse/KAFKA-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062166#comment-18062166
]
Gergely Harmadás commented on KAFKA-20109:
------------------------------------------
[~svdewitmam]
{quote}while the controller cannot authorize incoming connections from other
nodes, the other nodes will happily accept connections from the misconfigured
controller
{quote}
That is correct as the principal mapping works as expected on the other nodes.
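For context, on a listener secured with mTLS, {{ssl.principal.mapping.rules}} typically extracts the CN from the client certificate's DN. A purely illustrative rule (not taken from the attached config) would look like:
{code}
ssl.principal.mapping.rules=RULE:^CN=(.*?),.*$/$1/,DEFAULT
{code}
When no rule is configured, the {{DEFAULT}} rule applies and the full DN becomes the principal, which can then fail to match the ACLs or {{super.users}} entries on the misconfigured controller.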
I have also tested with 4.2 and noticed that I hit this issue far less
frequently. I think the reason is KAFKA-16926, which is an optimization for the
active controller heartbeat ({{BEGIN_QUORUM_EPOCH}}): the heartbeat is only
sent to a given controller if that controller has failed to send a {{FETCH}}
request within the defined timeout. So I think what happens is this: if
Controller 3 starts up and sends a {{FETCH}} request to the active controller
before it receives the {{BEGIN_QUORUM_EPOCH}} request, the cascading failure
does not happen. Controller 3 never receives {{BEGIN_QUORUM_EPOCH}} from the
active controller as long as it does not lag behind too much. But if
Controller 3 ever does receive the {{BEGIN_QUORUM_EPOCH}} request from the
active controller, the behavior is pretty much the same as on 4.1.
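To make the timing argument concrete, here is a minimal sketch of how I understand the KAFKA-16926 optimization. This is a simplified model, not the actual Kafka code; the class, method names, and timeout value are all made up for illustration:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Simplified model: the active controller only sends BEGIN_QUORUM_EPOCH
// to a voter that has not sent a FETCH request within the timeout.
public class BeginQuorumEpochSketch {
    static final long FETCH_TIMEOUT_MS = 2000; // hypothetical value

    final Map<Integer, Long> lastFetchTimeMs = new HashMap<>();

    // Called when a FETCH request arrives from a follower controller.
    void onFetch(int voterId, long nowMs) {
        lastFetchTimeMs.put(voterId, nowMs);
    }

    // A voter that keeps fetching in time never receives the heartbeat,
    // so it never triggers the authorization failure described above.
    boolean shouldSendBeginQuorumEpoch(int voterId, long nowMs) {
        Long last = lastFetchTimeMs.get(voterId);
        return last == null || nowMs - last >= FETCH_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        BeginQuorumEpochSketch s = new BeginQuorumEpochSketch();
        // Controller 3 starts up and fetches right away: no heartbeat sent.
        s.onFetch(3, 1000);
        System.out.println(s.shouldSendBeginQuorumEpoch(3, 1500));  // false
        // If it ever lags past the timeout, the heartbeat (and with the
        // broken mapping rules, the cascading failure) comes back.
        System.out.println(s.shouldSendBeginQuorumEpoch(3, 4000));  // true
    }
}
{code}
In this model the race is visible: whether the failure cascades depends on whether the first {{FETCH}} lands before the heartbeat timeout fires.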
> Complete Kafka cluster dies on incorrect SSL config of a single controller
> --------------------------------------------------------------------------
>
> Key: KAFKA-20109
> URL: https://issues.apache.org/jira/browse/KAFKA-20109
> Project: Kafka
> Issue Type: Bug
> Components: config, controller
> Affects Versions: 4.2.0, 4.1.1
> Environment: Debian trixie x86_64, Apache Kafka 3.9.0 - 4.1.1
> Reporter: Sven Dewit
> Assignee: Gergely Harmadás
> Priority: Major
> Attachments: controller3.log, reproduce.tar.gz
>
>
> Hello,
> we've recently run into a bug in Apache Kafka in KRaft mode where a whole
> mTLS-enabled cluster (controllers + brokers) dies if a single controller is
> (re)started with bad SSL principal mapping rules.
> The bad config was of course applied unintentionally during some changes in
> the config management of the system; basically, it led to
> {{ssl.principal.mapping.rules}} missing for the controller listener on that
> one node. As soon as this single controller was restarted, the whole cluster
> died within seconds, both controllers and brokers, with this error message:
> {code:java}
> ERROR Encountered fatal fault: Unexpected error in raft IO thread
> (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
> org.apache.kafka.common.errors.ClusterAuthorizationException: Received
> cluster authorization error in response InboundResponse(correlationId=493,
> data=BeginQuorumEpochResponseData(errorCode=31, topics=[], nodeEndpoints=[]),
> source=controller-3:9093 (id: 103 rack: null isFenced: false)) {code}
> While missing/bad SSL principal mapping is a major misconfiguration on a
> cluster where in-cluster communication is based on mTLS, it still should
> not lead to the whole cluster terminating.
> The issue occurred on version 4.1.1 of Apache Kafka, but could be reproduced
> back to 3.9.0.
> To reproduce, see the attached tarball containing
> * {{gen-test-ca-and-certs.sh}} to create a CA and certificates for brokers
> and controllers to work in mTLS mode
> * {{compose.yml}} to spin up the cluster with {{podman compose}}
> Once the cluster is running, the following steps reproduce the error:
> * {{podman compose down controller-3}} to stop controller 3
> * uncomment line 53 in {{compose.yml}} to delete controller 3's
> {{ssl.principal.mapping.rules}}
> * {{podman compose up controller-3}} and watch the cluster go down the drain
>
> In case I can provide you with any more information or support don't hesitate
> to reach out to me.
>
> Best regards,
> Sven
--
This message was sent by Atlassian Jira
(v8.20.10#820010)