[
https://issues.apache.org/jira/browse/KAFKA-20651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085856#comment-18085856
]
ibenchhida commented on KAFKA-20651:
------------------------------------
Hi,
I'd like to draw attention to the criticality of this bug and encourage review
of the v1 patch.
*Problem*
Under specific (but realistic) ACL configurations — multiple principals with
PREFIXED patterns on a large topic set — StandardAuthorizer.authorize() causes
complete CPU saturation of all request handler threads, making the broker
unresponsive. The broker is eventually killed by the systemd watchdog.
*Root cause*
Two compounding algorithmic issues:
1. StandardAcl.kafkaPrincipal() parses and allocates a new KafkaPrincipal
object on every invocation, with no caching. This occurs at findResult() line
493:
if (!matchingPrincipals.contains(acl.kafkaPrincipal()))
2. findAclRule() iterates all ACLs in AclCache for every authorization check,
regardless of whether they belong to the requesting principal. With N
principals × M ACLs each, every authorization check costs O(N×M), and
handleTopicMetadataRequest performs one such check per topic.
The result is O(N_ACLs × N_topics) per metadata request, where N_ACLs = N×M.
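To make the multiplication concrete, here is a minimal, self-contained Java model of the scan. It is only a sketch: LinearScanDemo, parsePrincipal and authorize are hypothetical stand-ins, not Kafka's classes, and the model assumes every ACL is visited and its principal parsed on every check, which simplifies the real matching logic. The 7,000 and 504 are the ACL and topic counts from the reproduction described further down.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the scan; these are NOT Kafka's real classes.
final class LinearScanDemo {
    static long parseCalls = 0;

    // Mimics an uncached parse: "User:alice" is split and freshly allocated on each call.
    static String[] parsePrincipal(String s) {
        parseCalls++;
        int i = s.indexOf(':');
        return new String[] { s.substring(0, i), s.substring(i + 1) };
    }

    // One authorization check: visits every ACL and parses its principal.
    static boolean authorize(List<String> aclPrincipals, String requester) {
        boolean allowed = false;
        for (String p : aclPrincipals) {
            String[] parsed = parsePrincipal(p);
            if (parsed[1].equals(requester)) {
                allowed = true;
            }
        }
        return allowed;
    }

    // One metadata request: one authorization check per topic.
    static long parsesForMetadataRequest(int nAcls, int nTopics) {
        List<String> acls = new ArrayList<>();
        for (int i = 0; i < nAcls; i++) {
            acls.add("User:client-" + i); // one ACL per distinct principal
        }
        parseCalls = 0;
        for (int t = 0; t < nTopics; t++) {
            authorize(acls, "client-0");
        }
        return parseCalls;
    }

    public static void main(String[] args) {
        // 7,000 ACLs x 504 topics
        System.out.println(parsesForMetadataRequest(7_000, 504)); // prints 3528000
    }
}
```

Under these assumptions a single metadata request from a single client triggers about 3.5 million parse-and-allocate operations.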
*Evidence from production*
Thread dump from an affected broker shows a request handler stuck in this loop:
at StandardAuthorizerData.findResult(493)
at StandardAuthorizerData.checkSection(428)
at StandardAuthorizerData.findAclRule(367)
at StandardAuthorizerData.authorize(247)
at StandardAuthorizer.authorize(143)
at AuthHelper.filterByAuthorized(113)
at KafkaApis.handleTopicMetadataRequest(1350)
at KafkaRequestHandler.run(159)
Handler #1 accumulated 656 seconds of CPU out of 24 minutes of uptime (~45% of
a single core) — and this was just one of 8 handlers. The handler was running
continuously on handleTopicMetadataRequest, iterating ACLs.
*Reproduction*
On a 6-VM test cluster with:
- 7,000 PREFIXED ACLs (topic-* pattern with distinct principals)
- 504 topics
- 60 concurrent clients
All 8 request handler threads are RUNNABLE and pinned at ~98% CPU, exactly
matching the production pattern.
*Why this affects any KRaft cluster*
This is not a configuration mistake or an edge case. The bug is in
StandardAuthorizerData.findAclRule() which scans the entire ACL set linearly.
Any cluster using KRaft's built-in StandardAuthorizer with non-trivial ACLs —
especially PREFIXED patterns with multiple principals — will hit this under
sufficient metadata request load.
The issue has existed since StandardAuthorizer became the default for KRaft (Kafka
3.4.0+). There is no effective workaround for users who need fine-grained
access control with many principals.
*The fix*
Two complementary changes:
1. Cache KafkaPrincipal in StandardAcl — a ConcurrentHashMap caches parsed
principals by their string representation, eliminating repeated allocations.
(patch v1 of https://issues.apache.org/jira/browse/KAFKA-20651)
2. Principal index in AclCache (currently being finalized) — a secondary
aclsByPrincipal map + rewritten findAclRule iterates only over ACLs belonging
to the requesting principal, reducing per-request ACL scanning from O(N_total)
to O(N_per_principal).
Together, these reduce a single handleTopicMetadataRequest authorization from
iterating thousands of unrelated ACLs to just the 1-2 ACLs that belong to the
requesting principal.
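The two changes can be sketched together as follows. This is an illustration under stated assumptions, not the code from the patch: PrincipalIndexDemo, Principal, Acl, cachedPrincipal and addAcl are hypothetical names, only the aclsByPrincipal map name is taken from the description above, and the model handles ALLOW rules on prefixes only.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins; NOT the classes or code from the actual patch.
final class PrincipalIndexDemo {
    static final class Principal {
        final String type;
        final String name;
        Principal(String type, String name) { this.type = type; this.name = name; }
    }

    static final class Acl {
        final Principal principal;
        final String topicPrefix;
        Acl(Principal principal, String topicPrefix) {
            this.principal = principal;
            this.topicPrefix = topicPrefix;
        }
    }

    final AtomicLong parseCalls = new AtomicLong();
    final AtomicLong aclsScanned = new AtomicLong();

    // Change 1: parse each distinct principal string once and reuse the instance.
    private final Map<String, Principal> principalCache = new ConcurrentHashMap<>();
    // Change 2: secondary index, keyed by the cached instance (identity equality
    // is enough here because the cache hands out one instance per string).
    private final Map<Principal, List<Acl>> aclsByPrincipal = new ConcurrentHashMap<>();

    Principal cachedPrincipal(String s) {
        return principalCache.computeIfAbsent(s, k -> {
            parseCalls.incrementAndGet();
            int i = k.indexOf(':');
            return new Principal(k.substring(0, i), k.substring(i + 1));
        });
    }

    // Single-threaded sketch: the ArrayList values are not safe for concurrent writers.
    void addAcl(String principal, String topicPrefix) {
        Principal p = cachedPrincipal(principal);
        aclsByPrincipal.computeIfAbsent(p, k -> new ArrayList<>()).add(new Acl(p, topicPrefix));
    }

    // Scans only the requesting principal's ACLs; unknown principals are denied.
    boolean authorize(String requester, String topic) {
        Principal p = principalCache.get(requester);
        if (p == null) {
            return false;
        }
        for (Acl acl : aclsByPrincipal.getOrDefault(p, Collections.emptyList())) {
            aclsScanned.incrementAndGet();
            if (topic.startsWith(acl.topicPrefix)) {
                return true;
            }
        }
        return false;
    }

    // Two ACLs per principal, then one authorization check per topic for one client.
    static long[] run(int nPrincipals, int nTopics) {
        PrincipalIndexDemo d = new PrincipalIndexDemo();
        for (int i = 0; i < nPrincipals; i++) {
            d.addAcl("User:client-" + i, "other-");
            d.addAcl("User:client-" + i, "topic-");
        }
        for (int t = 0; t < nTopics; t++) {
            d.authorize("User:client-0", "topic-" + t);
        }
        return new long[] { d.parseCalls.get(), d.aclsScanned.get() };
    }

    public static void main(String[] args) {
        long[] r = run(3_500, 504);
        // prints "3500 parses, 1008 ACLs scanned"
        System.out.println(r[0] + " parses, " + r[1] + " ACLs scanned");
    }
}
```

In this model, 7,000 ACLs cost 3,500 parses in total (one per distinct principal, paid when the ACL is loaded), and each check touches 2 ACLs instead of 7,000. A real index would additionally have to consult ACLs for the wildcard principal (User:*) and honor DENY rules, both of which this sketch ignores.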
*Call for review*
I'd appreciate reviews of the v1 patch. This bug causes production outages
(brokers killed by watchdog, load average reaching 58 on 128-core machines) and
affects any KRaft user with non-trivial ACL configurations.
> Cache parsed KafkaPrincipal in StandardAcl.kafkaPrincipal()
> -----------------------------------------------------------
>
> Key: KAFKA-20651
> URL: https://issues.apache.org/jira/browse/KAFKA-20651
> Project: Kafka
> Issue Type: Bug
> Components: core
> Affects Versions: 3.9.2
> Environment: KRaft clusters using StandardAuthorizer (3.4.0+)
> Reporter: ibenchhida
> Priority: Critical
> Labels: authorization, performance
> Attachments: KAFKA-20651.patch
>
>
> kafkaPrincipal() is called frequently during authorization (once per matching
> ACL). Each call parses the principal string and allocates a new
> KafkaPrincipal object.
> This adds a ConcurrentHashMap<String, KafkaPrincipal> cache to avoid
> redundant parsing and allocation. The cache is bounded by the number of
> distinct principal strings in the ACL store (typically orders of magnitude
> smaller than total ACL count).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)