[
https://issues.apache.org/jira/browse/KAFKA-20109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18062166#comment-18062166
]
Gergely Harmadás commented on KAFKA-20109:
------------------------------------------
[~svdewitmam]
{quote}while the controller cannot authorize incoming connections from other
nodes, the other nodes will happily accept connections from the misconfigured
controller
{quote}
That is correct as the principal mapping works as expected on the other nodes.
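For context, on a listener secured with mTLS, {{ssl.principal.mapping.rules}} typically extracts the CN from the client certificate's DN. A purely illustrative rule (not taken from the attached config) would look like:
{code}
ssl.principal.mapping.rules=RULE:^CN=(.*?),.*$/$1/,DEFAULT
{code}
When no rule is configured, the {{DEFAULT}} rule applies and the full DN becomes the principal, which can then fail to match the ACLs or {{super.users}} entries on the misconfigured controller.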
I have also tested with 4.2 and noticed that I hit this issue far less
frequently. I think the reason is KAFKA-16926, which is an optimization for the
active controller heartbeat ({{BEGIN_QUORUM_EPOCH}}): the heartbeat is only
sent to a given controller if that controller has failed to send a {{FETCH}}
request within the defined timeout. So I think what happens is this: if
Controller 3 starts up and sends a {{FETCH}} request to the active controller
before it receives the {{BEGIN_QUORUM_EPOCH}} request, the cascading failure
does not happen. Controller 3 never receives {{BEGIN_QUORUM_EPOCH}} from the
active controller as long as it does not lag behind too much. But if
Controller 3 ever does receive the {{BEGIN_QUORUM_EPOCH}} request from the
active controller, the behavior is pretty much the same as on 4.1.
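To make the timing argument concrete, here is a minimal sketch of how I understand the KAFKA-16926 optimization. This is a simplified model, not the actual Kafka code; the class, method names, and timeout value are all made up for illustration:
{code:java}
import java.util.HashMap;
import java.util.Map;

// Simplified model: the active controller only sends BEGIN_QUORUM_EPOCH
// to a voter that has not sent a FETCH request within the timeout.
public class BeginQuorumEpochSketch {
    static final long FETCH_TIMEOUT_MS = 2000; // hypothetical value

    final Map<Integer, Long> lastFetchTimeMs = new HashMap<>();

    // Called when a FETCH request arrives from a follower controller.
    void onFetch(int voterId, long nowMs) {
        lastFetchTimeMs.put(voterId, nowMs);
    }

    // A voter that keeps fetching in time never receives the heartbeat,
    // so it never triggers the authorization failure described above.
    boolean shouldSendBeginQuorumEpoch(int voterId, long nowMs) {
        Long last = lastFetchTimeMs.get(voterId);
        return last == null || nowMs - last >= FETCH_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        BeginQuorumEpochSketch s = new BeginQuorumEpochSketch();
        // Controller 3 starts up and fetches right away: no heartbeat sent.
        s.onFetch(3, 1000);
        System.out.println(s.shouldSendBeginQuorumEpoch(3, 1500));  // false
        // If it ever lags past the timeout, the heartbeat (and with the
        // broken mapping rules, the cascading failure) comes back.
        System.out.println(s.shouldSendBeginQuorumEpoch(3, 4000));  // true
    }
}
{code}
In this model the race is visible: whether the failure cascades depends on whether the first {{FETCH}} lands before the heartbeat timeout fires.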
> Complete Kafka cluster dies on incorrect SSL config of a single controller
> --------------------------------------------------------------------------
>
> Key: KAFKA-20109
> URL: https://issues.apache.org/jira/browse/KAFKA-20109
> Project: Kafka
> Issue Type: Bug
> Components: config, controller
> Affects Versions: 4.2.0, 4.1.1
> Environment: Debian trixie x86_64, Apache Kafka 3.9.0 - 4.1.1
> Reporter: Sven Dewit
> Assignee: Gergely Harmadás
> Priority: Major
> Attachments: controller3.log, reproduce.tar.gz
>
>
> Hello,
> we've recently run into a bug in Apache Kafka in KRaft mode where a whole
> mTLS-enabled cluster (controllers + brokers) dies if a single controller is
> (re)started with bad SSL principal mapping rules.
> The bad config was of course applied unintentionally during some changes in
> the config management of the system; basically, it led to
> {{ssl.principal.mapping.rules}} missing for the controller listener on that
> one node. As soon as this single controller was restarted, the whole cluster
> died within seconds, both controllers and brokers, with this error message:
> {code:java}
> ERROR Encountered fatal fault: Unexpected error in raft IO thread
> (org.apache.kafka.server.fault.ProcessTerminatingFaultHandler)
> org.apache.kafka.common.errors.ClusterAuthorizationException: Received
> cluster authorization error in response InboundResponse(correlationId=493,
> data=BeginQuorumEpochResponseData(errorCode=31, topics=[], nodeEndpoints=[]),
> source=controller-3:9093 (id: 103 rack: null isFenced: false)) {code}
> While missing/bad SSL principal mapping is a major misconfiguration on a
> cluster where in-cluster communication is based on mTLS, it still should
> not lead to the whole cluster terminating.
> The issue occurred on version 4.1.1 of Apache Kafka, but could be reproduced
> back to 3.9.0.
> To reproduce, see the attached tarball containing
> * {{gen-test-ca-and-certs.sh}} to create a CA and certificates for brokers
> and controllers to work in mTLS mode
> * {{compose.yml}} to spin up the cluster with {{podman compose}}
> Once the cluster is running, the following steps reproduce the error:
> * {{podman compose down controller-3}} to stop controller 3
> * uncomment line 53 in {{compose.yml}} to delete controller 3's
> {{ssl.principal.mapping.rules}}
> * {{podman compose up controller-3}} and watch the cluster go down the drain
>
> In case I can provide you with any more information or support don't hesitate
> to reach out to me.
>
> Best regards,
> Sven
--
This message was sent by Atlassian Jira
(v8.20.10#820010)