[ 
https://issues.apache.org/jira/browse/KAFKA-20651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085856#comment-18085856
 ] 

ibenchhida commented on KAFKA-20651:
------------------------------------

Hi,
 
I'd like to draw attention to the criticality of this bug and encourage review 
of the v1 patch.



*Problem*
Under specific (but realistic) ACL configurations — multiple principals with 
PREFIXED patterns on a large topic set — StandardAuthorizer.authorize() causes 
complete CPU saturation of all request handler threads, making the broker 
unresponsive. The broker is eventually killed by the systemd watchdog.
 
*Root cause*
Two compounding algorithmic issues:
1. StandardAcl.kafkaPrincipal() parses and allocates a new KafkaPrincipal 
object on every invocation, with no caching. This occurs at findResult() line 
493:
if (!matchingPrincipals.contains(acl.kafkaPrincipal()))
2. findAclRule() iterates all ACLs in AclCache for every authorization check, 
regardless of whether they belong to the requesting principal. With N 
principals × M ACLs each, every authorization check costs O(N×M).
Because handleTopicMetadataRequest authorizes each topic separately, the result 
is O(N_ACLs × N_topics) per metadata request, where N_ACLs = N×M.
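
To make the cost concrete, here is a minimal Java sketch of the two issues 
combined. It is not the Kafka code: LinearScanSketch, Principal, Acl, 
parsePrincipal() and filterAuthorized() are hypothetical stand-ins, and the 
logic is simplified (ALLOW rules only, no early exit, no DENY or wildcard 
handling) so that the parse count is easy to reason about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical stand-ins for illustration; these are NOT the real Kafka classes.
final class LinearScanSketch {
    record Principal(String type, String name) {}

    // An ACL that keeps its principal as a string and re-parses it on every call.
    static final class Acl {
        static long parseCount = 0;   // counts parse + allocate events
        final String principal;       // e.g. "User:alice"
        final String topicPrefix;     // PREFIXED pattern, e.g. "topic-"

        Acl(String principal, String topicPrefix) {
            this.principal = principal;
            this.topicPrefix = topicPrefix;
        }

        Principal parsePrincipal() {
            parseCount++;
            int idx = principal.indexOf(':');
            return new Principal(principal.substring(0, idx), principal.substring(idx + 1));
        }
    }

    // One authorization check per topic; each check scans every ACL.
    // Simplified: ALLOW rules only, no early exit, no DENY or wildcard handling.
    static List<String> filterAuthorized(List<Acl> acls, Set<Principal> matching, List<String> topics) {
        List<String> allowed = new ArrayList<>();
        for (String topic : topics) {
            boolean ok = false;
            for (Acl acl : acls) {
                // Parses and allocates even when the ACL belongs to another principal.
                if (!matching.contains(acl.parsePrincipal())) continue;
                if (topic.startsWith(acl.topicPrefix)) ok = true;
            }
            if (ok) allowed.add(topic);
        }
        return allowed;
    }

    public static void main(String[] args) {
        List<Acl> acls = new ArrayList<>();
        for (int i = 0; i < 100; i++) acls.add(new Acl("User:u" + i, "topic-"));
        List<String> topics = new ArrayList<>();
        for (int i = 0; i < 10; i++) topics.add("topic-" + i);
        List<String> allowed = filterAuthorized(acls, Set.of(new Principal("User", "u5")), topics);
        System.out.println(allowed.size() + " topics allowed, " + Acl.parseCount + " principal parses");
    }
}
```

In this sketch, 100 ACLs and 10 topics cost 1,000 principal parses for a 
single request, although only one ACL belongs to the requesting principal.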
 
*Evidence from production*
Thread dump from an affected broker shows a request handler stuck in this loop:
at StandardAuthorizerData.findResult(493)
at StandardAuthorizerData.checkSection(428)
at StandardAuthorizerData.findAclRule(367)
at StandardAuthorizerData.authorize(247)
at StandardAuthorizer.authorize(143)
at AuthHelper.filterByAuthorized(113)
at KafkaApis.handleTopicMetadataRequest(1350)
at KafkaRequestHandler.run(159)
Handler #1 accumulated 656 seconds of CPU out of 24 minutes of uptime (~45% of 
a single core) — and this was just one of 8 handlers. The handler was running 
continuously on handleTopicMetadataRequest, iterating ACLs.
 
*Reproduction*
On a 6-VM test cluster with:
- 7,000 PREFIXED ACLs (topic-* pattern with distinct principals)
- 504 topics
- 60 concurrent clients
All 8 request handler threads are RUNNABLE and pinned at ~98% CPU, matching the 
production pattern.
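
As a rough back-of-the-envelope check of these numbers, the sketch below 
multiplies them out. It assumes that each metadata request covers all 504 
topics and that every ACL is visited once per topic; both are assumptions 
made for illustration, not measurements. ScanCostEstimate and 
aclVisitsPerRequest() are hypothetical names.

```java
// Back-of-the-envelope estimate; the inputs come from the reproduction setup,
// the "every topic, every ACL" model is an assumption.
final class ScanCostEstimate {
    // Upper bound on ACL visits for one metadata request that covers every topic.
    static long aclVisitsPerRequest(long acls, long topics) {
        return acls * topics;
    }

    public static void main(String[] args) {
        long perRequest = aclVisitsPerRequest(7_000, 504);
        long perRound = perRequest * 60;   // one request from each of the 60 clients
        System.out.println("ACL visits per request: " + perRequest);
        System.out.println("ACL visits for one request per client: " + perRound);
    }
}
```

Under those assumptions, one request visits about 3.5 million ACL entries, 
and one request from each of the 60 clients visits about 212 million.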
 
*Why this affects any KRaft cluster*
This is not a configuration mistake or an edge case. The bug is in 
StandardAuthorizerData.findAclRule() which scans the entire ACL set linearly. 
Any cluster using KRaft's built-in StandardAuthorizer with non-trivial ACLs — 
especially PREFIXED patterns with multiple principals — will hit this under 
sufficient metadata request load.
The issue has existed since StandardAuthorizer became the default for KRaft 
(Kafka 3.4.0+). There is no effective workaround for users who need 
fine-grained access control with many principals.
 
*The fix*
Two complementary changes:
1. Cache KafkaPrincipal in StandardAcl — a ConcurrentHashMap caches parsed 
principals by their string representation, eliminating repeated allocations. 
(patch v1 of https://issues.apache.org/jira/browse/KAFKA-20651)
2. Principal index in AclCache (currently being finalized) — a secondary 
aclsByPrincipal map + rewritten findAclRule iterates only over ACLs belonging 
to the requesting principal, reducing per-request ACL scanning from O(N_total) 
to O(N_per_principal).
Together, these reduce a single handleTopicMetadataRequest authorization from 
iterating thousands of unrelated ACLs to just the 1-2 that belong to the 
requesting principal.
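
The two changes can be sketched together as follows. This is a simplified 
illustration, not the patch itself: PrincipalIndexSketch, Principal, Acl, 
parseCached() and isAllowed() are hypothetical names, only the aclsByPrincipal 
map name is taken from the description above, and the sketch handles ALLOW 
rules only (no DENY rules, no wildcard principal).

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-ins for illustration; these are NOT the real Kafka classes.
final class PrincipalIndexSketch {
    record Principal(String type, String name) {}
    record Acl(String principal, String topicPrefix) {}

    // Change 1: cache parsed principals, keyed by their string representation.
    private static final Map<String, Principal> PRINCIPAL_CACHE = new ConcurrentHashMap<>();

    static Principal parseCached(String s) {
        return PRINCIPAL_CACHE.computeIfAbsent(s, str -> {
            int idx = str.indexOf(':');
            if (idx < 0) throw new IllegalArgumentException("Expected type:name, got " + str);
            return new Principal(str.substring(0, idx), str.substring(idx + 1));
        });
    }

    // Change 2: secondary index from principal to the ACLs that name it.
    private final Map<Principal, List<Acl>> aclsByPrincipal = new ConcurrentHashMap<>();

    void add(Acl acl) {
        aclsByPrincipal
            .computeIfAbsent(parseCached(acl.principal()), p -> new CopyOnWriteArrayList<>())
            .add(acl);
    }

    // Visits only the requesting principal's ACLs instead of the whole ACL set.
    boolean isAllowed(Principal requester, String topic) {
        for (Acl acl : aclsByPrincipal.getOrDefault(requester, List.of())) {
            if (topic.startsWith(acl.topicPrefix())) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PrincipalIndexSketch index = new PrincipalIndexSketch();
        for (int i = 0; i < 7_000; i++) index.add(new Acl("User:u" + i, "topic-"));
        System.out.println(index.isAllowed(parseCached("User:u5"), "topic-orders"));
        System.out.println(index.isAllowed(new Principal("User", "nobody"), "topic-orders"));
    }
}
```

The cache grows only with the number of distinct principal strings, and each 
lookup touches one principal's ACL list regardless of the total ACL count. A 
real implementation would also need to consult ACLs for the wildcard principal 
and DENY rules, which this sketch omits.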
 
*Call for review*
I'd appreciate reviews of the v1 patch. This bug causes production outages 
(brokers killed by watchdog, load reaching 58 on 128-core machines) and affects 
any KRaft user with non-trivial ACL configurations.
 

> Cache parsed KafkaPrincipal in StandardAcl.kafkaPrincipal()
> -----------------------------------------------------------
>
>                 Key: KAFKA-20651
>                 URL: https://issues.apache.org/jira/browse/KAFKA-20651
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 3.9.2
>         Environment: KRaft clusters using StandardAuthorizer (3.4.0+)
>            Reporter: ibenchhida
>            Priority: Critical
>              Labels: authorization, performance
>         Attachments: KAFKA-20651.patch
>
>
> kafkaPrincipal() is called frequently during authorization (once per matching 
> ACL). Each call parses the principal string and allocates a new 
> KafkaPrincipal object.
> This adds a ConcurrentHashMap<String, KafkaPrincipal> cache to avoid 
> redundant parsing and allocation. The cache is bounded by the number of 
> distinct principal strings in the ACL store (typically orders of magnitude 
> smaller than total ACL count).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
