michaeljmarshall opened a new pull request #14177: URL: https://github.com/apache/pulsar/pull/14177
### Motivation

In Pulsar 2.8, a bug can leave an incorrect value cached in the `childrenCache`. The broker then serves the stale value until it is evicted from the cache.

### Steps to Reproduce Issue

Start a 2.8 cluster with at least 2 brokers. It is helpful to run with debug logging to observe the ZK watch notifications. Run the following bash commands in order:

```
BROKER_1=192.168.6.228
BROKER_2=192.168.79.61
bin/pulsar-admin --admin-url http://$BROKER_1:8080 tenants create test
bin/pulsar-admin --admin-url http://$BROKER_1:8080 namespaces create test/a
bin/pulsar-admin --admin-url http://$BROKER_2:8080 topics list test/a
bin/pulsar-admin --admin-url http://$BROKER_1:8080 topics create persistent://test/a/a
bin/pulsar-admin --admin-url http://$BROKER_2:8080 topics list test/a
```

When broker 2 handles the first `bin/pulsar-admin --admin-url http://$BROKER_2:8080 topics list test/a` command, it caches a miss in the `childrenCache` in `AbstractMetadataStore` for path `/managed-ledgers/test/a/persistent`. After caching the miss, broker 2 only logs two ZK events:

> 05:21:16.810 [main-EventThread] DEBUG org.apache.pulsar.metadata.impl.ZKMetadataStore - Received ZK watch : WatchedEvent state:SyncConnected type:NodeCreated path:/admin/local-policies/test/a

> 05:21:19.808 [main-EventThread] DEBUG org.apache.pulsar.metadata.impl.ZKMetadataStore - Received ZK watch : WatchedEvent state:SyncConnected type:NodeCreated path:/managed-ledgers/test/a/persistent

Note that the second event is of type `NodeCreated`. Because of its type, the `AbstractMetadataStore` does not invalidate the correct node in the `childrenCache`, so the cached miss for `/managed-ledgers/test/a/persistent` survives and the second `topics list` call on broker 2 returns the stale result:

https://github.com/apache/pulsar/blob/42422d84ab5c6d24b57138c39453b45d7dcfba35/pulsar-metadata/src/main/java/org/apache/pulsar/metadata/impl/AbstractMetadataStore.java#L163-L182

I was not able to reproduce this issue in 2.9. My theory is that we get around it because we have a persistent watch at `/`.
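
To make the missed invalidation concrete, here is a minimal, self-contained sketch of the scenario. It is a toy model, not the actual `AbstractMetadataStore` code: the class `ChildrenCacheSketch`, the `EventType` enum, the `invalidatePathOnCreateDelete` flag, and the in-memory `store` map are all hypothetical names introduced for illustration, and the "without fix" branch only approximates the behavior described above (a `NodeCreated` or `NodeDeleted` event does not invalidate the children entry of the notified path itself).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a broker-side children cache. All names are hypothetical.
class ChildrenCacheSketch {
    enum EventType { CREATED, DELETED, CHILDREN_CHANGED }

    // Stand-in for the metadata store (e.g. ZooKeeper): path -> child names.
    private final Map<String, List<String>> store = new HashMap<>();
    // Stand-in for childrenCache; an empty list represents a cached miss.
    private final Map<String, List<String>> childrenCache = new HashMap<>();
    // When true, models the change proposed in this PR.
    private final boolean invalidatePathOnCreateDelete;

    ChildrenCacheSketch(boolean invalidatePathOnCreateDelete) {
        this.invalidatePathOnCreateDelete = invalidatePathOnCreateDelete;
    }

    // Read through the cache; a missing path is cached as an empty list.
    List<String> getChildren(String path) {
        return childrenCache.computeIfAbsent(path,
                p -> new ArrayList<>(store.getOrDefault(p, Collections.emptyList())));
    }

    // Simulates a write made by another broker directly to the store.
    void createChild(String parent, String child) {
        store.computeIfAbsent(parent, p -> new ArrayList<>()).add(child);
    }

    // Simplified notification handler.
    void onNotification(EventType type, String path) {
        if (type == EventType.CREATED || type == EventType.DELETED) {
            String parent = parentOf(path);
            if (parent != null) {
                childrenCache.remove(parent);
            }
            if (invalidatePathOnCreateDelete) {
                // The proposed change: also drop the entry for the path itself.
                childrenCache.remove(path);
            }
        }
        if (type == EventType.CHILDREN_CHANGED) {
            childrenCache.remove(path);
        }
    }

    static String parentOf(String path) {
        int idx = path.lastIndexOf('/');
        return idx <= 0 ? null : path.substring(0, idx);
    }

    // Replays the reproduction steps from broker 2's point of view.
    static List<String> replay(boolean withFix) {
        ChildrenCacheSketch broker2 = new ChildrenCacheSketch(withFix);
        String path = "/managed-ledgers/test/a/persistent";
        broker2.getChildren(path);                        // first "topics list": caches a miss
        broker2.createChild(path, "a");                   // broker 1 creates the topic
        broker2.onNotification(EventType.CREATED, path);  // only event broker 2 sees for this path
        return broker2.getChildren(path);                 // second "topics list"
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + replay(false)); // prints "without fix: []"
        System.out.println("with fix:    " + replay(true));  // prints "with fix:    [a]"
    }
}
```

In this model the stale empty list is returned only when the `CREATED` event leaves the entry for the notified path in place, which mirrors what broker 2 does in the reproduction.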
Note also that when creating a second topic in the namespace, we see the following notification:

> 07:14:42.181 [main-EventThread] DEBUG org.apache.pulsar.metadata.impl.ZKMetadataStore - Received ZK watch : WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/managed-ledgers/test/a/persistent

In this case, we properly invalidate the child node in the cache.

Also note that in 2.7 we _always_ invalidate the child node for a notification. I don't believe this is strictly necessary because we'll get `NodeChildrenChanged` notifications when the event is not created/deleted.

https://github.com/apache/pulsar/blob/77f7965673119ff40c929b065ee837fe2256a221/pulsar-zookeeper-utils/src/main/java/org/apache/pulsar/zookeeper/ZooKeeperCache.java#L146

### Modifications

* Invalidate the `path` for `childrenCache` when the `path` is created or deleted.

### Verifying this change

I added a test that failed before this change and passes after the change.

### Does this pull request potentially affect one of the following parts:

This is an internal change.

### Documentation

- [x] `no-need-doc`
