[ 
https://issues.apache.org/jira/browse/KAFKA-20651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085856#comment-18085856
 ] 

ibenchhida edited comment on KAFKA-20651 at 6/5/26 8:20 AM:
------------------------------------------------------------

Hi,
I'd like to draw attention to the severity of this bug and ask for review of 
the v1 patch.
*Problem*
Under specific (but realistic) ACL configurations — multiple principals with 
access controls on a large topic set — StandardAuthorizer.authorize() causes 
complete CPU saturation of all request handler threads, making the broker 
unresponsive. The broker is eventually killed by the systemd watchdog.
*Root cause*
Two compounding algorithmic issues in StandardAuthorizerData, on the 
findAclRule() / findResult() path:
1. StandardAcl.kafkaPrincipal() (line 106) parses and allocates a new 
KafkaPrincipal object on every invocation, with no caching. This occurs at 
findResult() line 493:
if (!matchingPrincipals.contains(acl.kafkaPrincipal()))
2. findAclRule() scans the entire AclCache linearly for every authorization 
check, regardless of whether the ACL belongs to the requesting principal. With 
a large number of ACLs — whether LITERAL or PREFIXED — every 
handleTopicMetadataRequest call incurs O(N_ACLs × N_topics) cost.
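To make the two costs concrete, here is a minimal Java sketch of the pattern described above. Principal, Acl and LinearScanModel are hypothetical stand-ins, not the real Kafka classes, and the scan is simplified to exact topic matches with no early exit. It only illustrates that the number of principal parses grows as N_ACLs × N_topics.

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for KafkaPrincipal (not the real Kafka class).
record Principal(String type, String name) {
    // Counts parse calls for the illustration only; not thread-safe.
    static int parseCount = 0;

    // Parses "Type:name" and allocates a new object on every call (no caching).
    static Principal parse(String s) {
        int idx = s.indexOf(':');
        if (idx <= 0 || idx == s.length() - 1) {
            throw new IllegalArgumentException("Expected Type:name, got: " + s);
        }
        parseCount++;
        return new Principal(s.substring(0, idx), s.substring(idx + 1));
    }
}

// Hypothetical simplified ACL: a principal string and an exact topic name.
record Acl(String principal, String topic) {
    Principal kafkaPrincipal() {
        return Principal.parse(principal);
    }
}

class LinearScanModel {
    // For each topic, walks the whole ACL list (issue 2) and parses the
    // principal of every ACL it visits (issue 1).
    static int authorizedTopics(List<Acl> acls, Set<Principal> matching, List<String> topics) {
        int allowed = 0;
        for (String topic : topics) {
            boolean match = false;
            for (Acl acl : acls) {
                if (!matching.contains(acl.kafkaPrincipal())) {
                    continue; // ACL belongs to another principal, but was still parsed
                }
                if (acl.topic().equals(topic)) {
                    match = true;
                }
            }
            if (match) {
                allowed++;
            }
        }
        return allowed;
    }
}
```

In this model, 3 ACLs and 2 topics cost 6 parses regardless of how many ACLs belong to the caller.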
*Evidence from production*
Thread dump from an affected broker shows a request handler stuck in this loop:
at StandardAuthorizerData.findResult(493)
at StandardAuthorizerData.checkSection(428)
at StandardAuthorizerData.findAclRule(367)
at StandardAuthorizerData.authorize(247)
at StandardAuthorizer.authorize(143)
at AuthHelper.filterByAuthorized(113)
at KafkaApis.handleTopicMetadataRequest(1350)
at KafkaRequestHandler.run(159)
Handler #81 accumulated 656 seconds of CPU out of 24 minutes of uptime (~45% of 
a single core) — and this was just one of 8 handlers. The handler was running 
continuously on handleTopicMetadataRequest, iterating ACLs. Across 7 brokers, 
the cluster reached load 58 on 128-core machines with 756% CPU per broker.
*Reproduction*
Reproduced on a 6-VM test cluster with:
 - 7,000 ACLs (6,992 LITERAL + 8 PREFIXED) across 89 distinct principals
 - 504 topics
 - 60 concurrent clients
With this setup, all request handler threads are CPU-saturated with load >70, 
and the stack trace matches the production one exactly. The bug reproduces 
identically with both LITERAL and PREFIXED patterns: the linear scan is the 
issue, not the pattern type.
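As a rough scale check, assuming the O(N_ACLs × N_topics) model from the root-cause section holds and each metadata request covers every topic (both are simplifications, since real scans may stop early), the reproduction numbers work out as follows. ScanCostEstimate is a hypothetical helper for the arithmetic only.

```java
class ScanCostEstimate {
    // ACL visits for one metadata request under the full-scan model:
    // every topic triggers one pass over every ACL.
    static long aclVisits(long aclCount, long topicCount) {
        return aclCount * topicCount;
    }
}
```

Under these assumptions, ScanCostEstimate.aclVisits(7_000, 504) is 3,528,000 ACL visits (and as many principal parses) for a single request. If each of the 60 clients issues one such request, that is 211,680,000 visits.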
*Why this affects any KRaft cluster*
This is not a configuration mistake or an edge case. The bug is in 
StandardAuthorizerData.findAclRule() which scans the entire ACL set linearly. 
Any cluster using KRaft's built-in StandardAuthorizer with non-trivial ACLs — 
regardless of pattern type — will hit this under sufficient metadata request 
load.
The issue has existed since StandardAuthorizer became the default for KRaft 
(Kafka 3.4.0+). There is no effective workaround for users who need 
fine-grained access control with many principals.
*The fix*
Two complementary changes:
1. Cache KafkaPrincipal in StandardAcl: a ConcurrentHashMap caches parsed 
principals by their string representation, eliminating repeated parsing and 
allocation (patch v1 of https://issues.apache.org/jira/browse/KAFKA-20651).
2. Principal index in AclCache (currently being finalized): a secondary 
aclsByPrincipal map, plus a rewritten findAclRule that iterates only over ACLs 
belonging to the requesting principal, reduces per-request ACL scanning from 
O(N_total) to O(N_per_principal).
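The caching idea in change 1 can be sketched roughly as below. This is not the attached patch: CachedPrincipal and PrincipalCache are hypothetical names, and the "Type:name" parsing is a simplified stand-in for the real principal parser. Entries are never evicted in this sketch, so its size is bounded only by the number of distinct principal strings passed in.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for KafkaPrincipal (not the real Kafka class).
record CachedPrincipal(String type, String name) {}

class PrincipalCache {
    // Keyed by the principal's string form, e.g. "User:alice".
    private static final ConcurrentHashMap<String, CachedPrincipal> CACHE =
        new ConcurrentHashMap<>();

    // Returns the cached instance, parsing only on the first lookup of a string.
    static CachedPrincipal get(String principal) {
        return CACHE.computeIfAbsent(principal, PrincipalCache::parse);
    }

    // Simplified "Type:name" parser; invalid input throws and is not cached.
    private static CachedPrincipal parse(String s) {
        int idx = s.indexOf(':');
        if (idx <= 0 || idx == s.length() - 1) {
            throw new IllegalArgumentException("Expected Type:name, got: " + s);
        }
        return new CachedPrincipal(s.substring(0, idx), s.substring(idx + 1));
    }

    static int size() {
        return CACHE.size();
    }
}
```

Repeated calls with the same string return the same object, so the per-ACL check no longer allocates once the cache is warm.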
Together, these reduce a single handleTopicMetadataRequest authorization from 
iterating thousands of unrelated ACLs to just 1-2 per principal.
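A rough sketch of the principal index idea follows. Since that change is still being finalized, everything here is an assumption for illustration: IndexedAcl and PrincipalIndex are hypothetical names, resource matching is reduced to exact topic names (no LITERAL/PREFIXED handling), and the rule that DENY takes precedence over ALLOW is modeled in the simplest possible way.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical simplified ACL: principal string, exact topic name, allow/deny.
record IndexedAcl(String principal, String topic, boolean allow) {}

class PrincipalIndex {
    // Secondary index: principal string -> ACLs that name that principal.
    private final Map<String, List<IndexedAcl>> aclsByPrincipal = new ConcurrentHashMap<>();

    void add(IndexedAcl acl) {
        aclsByPrincipal
            .computeIfAbsent(acl.principal(), k -> new CopyOnWriteArrayList<>())
            .add(acl);
    }

    // Visits only the ACLs of the matching principals (for example the user
    // and a wildcard such as "User:*"), never the full ACL set.
    boolean authorize(Set<String> matchingPrincipals, String topic) {
        boolean allowed = false;
        for (String principal : matchingPrincipals) {
            for (IndexedAcl acl : aclsByPrincipal.getOrDefault(principal, List.of())) {
                if (!acl.topic().equals(topic)) {
                    continue;
                }
                if (!acl.allow()) {
                    return false; // a matching DENY wins immediately
                }
                allowed = true;
            }
        }
        return allowed;
    }
}
```

The work per check is proportional to the number of ACLs held by the matching principals, which is the O(N_per_principal) bound described above.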
*Call for review*
I'd appreciate reviews of the v1 patch. This bug causes production outages 
(brokers killed by watchdog, load reaching 58 on 128-core machines) and affects 
any KRaft user with non-trivial ACL configurations.



> Cache parsed KafkaPrincipal in StandardAcl.kafkaPrincipal()
> -----------------------------------------------------------
>
>                 Key: KAFKA-20651
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20651
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.9.2
>         Environment: KRaft clusters using StandardAuthorizer (3.4.0+)
>            Reporter: ibenchhida
>            Priority: Critical
>              Labels: authorization, performance
>         Attachments: KAFKA-20651.patch
>
>
> kafkaPrincipal() is called frequently during authorization (once per matching 
> ACL). Each call parses the principal string and allocates a new 
> KafkaPrincipal object.
> This adds a ConcurrentHashMap<String, KafkaPrincipal> cache to avoid 
> redundant parsing and allocation. The cache is bounded by the number of 
> distinct principal strings in the ACL store (typically orders of magnitude 
> smaller than total ACL count).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
