[ https://issues.apache.org/jira/browse/KAFKA-2587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14941474#comment-14941474 ]
Parth Brahmbhatt commented on KAFKA-2587:
-----------------------------------------

I looked at the code to reason about why this can happen. The state reported is indeed one of the valid states during our test:
https://github.com/apache/kafka/blob/5764e54de147af81aac85acc00687c23e9646a5c/core/src/test/scala/unit/kafka/security/auth/SimpleAclAuthorizerTest.scala#L217

After that line we remove all acls for that resource, add one acl back to it, and remove that one acl. All those steps pass verification:
https://github.com/apache/kafka/blob/5764e54de147af81aac85acc00687c23e9646a5c/core/src/test/scala/unit/kafka/security/auth/SimpleAclAuthorizerTest.scala#L225
and
https://github.com/apache/kafka/blob/5764e54de147af81aac85acc00687c23e9646a5c/core/src/test/scala/unit/kafka/security/auth/SimpleAclAuthorizerTest.scala#L226

Given that we are using the same instance of the authorizer, the cache of that instance is updated immediately for both add and remove:
https://github.com/apache/kafka/blob/5764e54de147af81aac85acc00687c23e9646a5c/core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala#L171
https://github.com/apache/kafka/blob/5764e54de147af81aac85acc00687c23e9646a5c/core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala#L189

The only other place that can update the cache is the notification handler, as part of handling the acl-changed notification:
https://github.com/apache/kafka/blob/5764e54de147af81aac85acc00687c23e9646a5c/core/src/main/scala/kafka/security/auth/SimpleAclAuthorizer.scala#L269

However, in that case we read the data from zookeeper and then update the cache. If notification processing was delayed for some reason, it should still read the acls from zk and then update the cache.
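To make the two cache-update paths above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not Kafka's actual SimpleAclAuthorizer code: the class ToyAuthorizer, its method names, and the plain dict standing in for zookeeper are all hypothetical. It shows why a delayed notification should be harmless as long as the handler re-reads the store before it writes the cache.

```python
# Toy model of the two cache-update paths. Every name here is hypothetical;
# none of it is Kafka's actual SimpleAclAuthorizer code.

class ToyAuthorizer:
    def __init__(self, zk):
        self.zk = zk        # dict standing in for zookeeper: resource -> set of acls
        self.cache = {}     # per-instance acl cache: resource -> set of acls

    def add_acl(self, resource, acl):
        # Direct path: write to "zk", then update this instance's cache at once.
        acls = set(self.zk.get(resource, set())) | {acl}
        self.zk[resource] = acls
        self.cache[resource] = set(acls)

    def remove_acl(self, resource, acl):
        # Direct path: removing the last acl deletes the "zk" entry as well.
        acls = set(self.zk.get(resource, set())) - {acl}
        if acls:
            self.zk[resource] = acls
            self.cache[resource] = set(acls)
        else:
            self.zk.pop(resource, None)
            self.cache.pop(resource, None)

    def handle_acl_changed(self, resource):
        # Notification path: re-read "zk" and overwrite the cache with
        # whatever is stored there at the time of the read.
        acls = set(self.zk.get(resource, set()))
        if acls:
            self.cache[resource] = acls
        else:
            self.cache.pop(resource, None)


zk = {}
auth = ToyAuthorizer(zk)
auth.add_acl("topic-a", "acl-1")
auth.remove_acl("topic-a", "acl-1")

# Two delayed notifications (one per change) arrive only now. Each one
# re-reads the current "zk" state, so the cache stays in the final state.
auth.handle_acl_changed("topic-a")
auth.handle_acl_changed("topic-a")
print(auth.cache)
```

In this model the read and the cache write inside handle_acl_changed happen back to back. If another thread could run between those two steps, the handler would be writing a value that may already be out of date.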
There are pathological cases that can lead to this failure, for example:
1) The notification handler starts, reads acls from zk, and a thread switch happens before it can update the cache.
2) All the other cache updates go through (remove resource, add the acl, remove the acl).
3) Before verification finishes for the last "remove one acl", a thread switch happens and the notification handler updates the cache with the stale acls that it read before.

Even in this case there should be follow-up notifications about adding an acl and removing an acl, which should again cause the notification process to read state from zookeeper and update the cache to the correct state. Plus, this seems unlikely enough that it would not happen every other day.

I will continue to look into this. In the meantime, if this is a continuous dev pain, we can remove the last 3 lines of the test, which remove the last acl and try to verify that the zookeeper path is deleted.

> Transient test failure: `SimpleAclAuthorizerTest`
> -------------------------------------------------
>
>                 Key: KAFKA-2587
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2587
>             Project: Kafka
>          Issue Type: Sub-task
>            Reporter: Ismael Juma
>            Assignee: Parth Brahmbhatt
>             Fix For: 0.9.0.0
>
>
> I've seen `SimpleAclAuthorizerTest` fail a couple of times since its recent
> introduction. Here's one such build:
> https://builds.apache.org/job/kafka-trunk-git-pr/576/console
> [~parth.brahmbhatt], can you please take a look and see if it's an easy fix?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)