michaeljmarshall commented on issue #6368: URL: https://github.com/apache/pulsar/issues/6368#issuecomment-770564365
The fundamental problem with this test is that the broker's implementation of anti-affinity does not guarantee that two namespaces in the same anti-affinity group will be placed on different broker nodes. When all of the preferred brokers are overloaded, the `selectBrokerForAssignment` method will choose a broker that has a namespace in the same anti-affinity group. However, the test essentially asserts that two namespaces in the same anti-affinity group should never get placed on the same broker. I propose that we set the overloaded threshold config high enough that it won't allow for the preferred broker to be overridden. Additional detail on how I reached this conclusion follows. I ran `$ mvn clean test -Dtest=AntiAffinityNamespaceGroupTest -DfailIfNoTests=false -pl pulsar-broker` locally until I got a failure. This snippet from the failed test's logs show the essential information to expose the problem: ``` 15:23:57.910 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1/0x00000000_0xffffffff 15:23:57.910 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:57033 is overloaded: max usage=0.8523102402687073 15:23:57.910 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:57033 is overloaded: CPU: 85.231026%, MEMORY: 27.4213%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0% 15:23:57.910 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Selected broker Optional[localhost:57041] from candidate brokers [localhost:57033, localhost:57041] 15:23:57.914 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2/0x00000000_0xffffffff 15:23:57.914 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:57033 is overloaded: max usage=0.8523102402687073 15:23:57.914 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:57033 is overloaded: CPU: 85.231026%, MEMORY: 27.4213%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0% 15:23:57.914 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Selected broker Optional[localhost:57033] from candidate brokers [localhost:57033] 15:23:57.916 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:57033 is overloaded: max usage=0.8523102402687073 15:23:57.916 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:57033 is overloaded: CPU: 85.231026%, MEMORY: 27.4213%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0% ``` The test fails with the following error: ```text ------------------------------------------------------------------------------- Test set: org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest ------------------------------------------------------------------------------- Tests run: 7, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 65.387 s <<< FAILURE! - in org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest testBrokerSelectionForAntiAffinityGroup(org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest) Time elapsed: 0.177 s <<< FAILURE! java.lang.AssertionError: did not expect [localhost:57041] but found [localhost:57041] at org.testng.Assert.fail(Assert.java:99) at org.testng.Assert.failEquals(Assert.java:1041) at org.testng.Assert.assertNotEqualsImpl(Assert.java:147) at org.testng.Assert.assertNotEquals(Assert.java:1531) at org.testng.Assert.assertNotEquals(Assert.java:1535) at org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest.testBrokerSelectionForAntiAffinityGroup(AntiAffinityNamespaceGroupTest.java:427) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.testng.internal.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:132) at org.testng.internal.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:45) at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:73) at org.testng.internal.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` Given the second log line `Selected broker Optional[localhost:57033] from candidate brokers [localhost:57033]`, it's clear that the anti-affinity was applied, but the default overloaded threshold of 85% led the method to override the preferred broker, which led to the test's failure. First, I don't think we want to remove this test. It adds value by testing `selectBrokerForAssignment`'s usage of the anti-affinity method (`LoadManagerShared.filterAntiAffinityGroupOwnedBrokers`). Here is a reference to the `selectBrokerForAssignment` method: https://github.com/apache/pulsar/blob/d40fb5837fe33fb9f7630f7674e6953bb63ca164/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/impl/ModularLoadManagerImpl.java#L763-L861 Instead, I think it would make the most sense to set the `overloadedThreshold` high enough that it won't allow the preferred broker to be overridden. In this way, we remove the main caveat on the anti-affinity logic because no broker will be too overloaded. Note that the other caveat is that a second broker has to be available, but I don't believe that has been a problem. For reference, here are the full logs from the test: ```text ------- Starting test [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest].testBrokerSelectionForAntiAffinityGroup([])------- 15:23:57.795 [pulsar-ordered-OrderedExecutor-0-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperChildrenCache - [State:CONNECTED Timeout:30000 sessionid:0x100468ba975000d local:/127.0.0.1:57037 remoteserver:localhost/127.0.0.1:57017 lastZxid:71 xid:8 sent:8 recv:9 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/admin/clusters/use/failureDomain 15:23:57.795 [pulsar-ordered-OrderedExecutor-0-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperChildrenCache - [State:CONNECTED Timeout:30000 sessionid:0x100468ba9750007 local:/127.0.0.1:57029 remoteserver:localhost/127.0.0.1:57017 lastZxid:58 xid:10 sent:10 recv:11 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/admin/clusters/use/failureDomain 15:23:57.796 [pulsar-ordered-OrderedExecutor-0-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /admin/clusters/use/failureDomain 15:23:57.796 [pulsar-ordered-OrderedExecutor-0-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /admin/clusters/use/failureDomain 15:23:57.797 [pulsar-web-454-5] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [31/Jan/2021:15:23:57 -0700] "POST /admin/v2/clusters/use/failureDomains/domain1 HTTP/1.1" 204 0 "-" "Pulsar-Java-v2.8.0-SNAPSHOT" 17 15:23:57.799 [pulsar-modular-load-manager-445-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Cluster domain refreshed {localhost:57033=domain1} 15:23:57.800 [pulsar-modular-load-manager-492-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Cluster domain refreshed {localhost:57033=domain1} 15:23:57.814 [pulsar-ordered-OrderedExecutor-0-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperChildrenCache - [State:CONNECTED Timeout:30000 sessionid:0x100468ba975000d local:/127.0.0.1:57037 remoteserver:localhost/127.0.0.1:57017 lastZxid:72 xid:10 sent:10 recv:12 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/admin/clusters/use/failureDomain 15:23:57.814 [pulsar-ordered-OrderedExecutor-0-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperChildrenCache - [State:CONNECTED Timeout:30000 sessionid:0x100468ba9750007 local:/127.0.0.1:57029 remoteserver:localhost/127.0.0.1:57017 lastZxid:72 xid:12 sent:12 recv:14 queuedpkts:0 pendingresp:0 queuedevents:0] Received ZooKeeper watch event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/admin/clusters/use/failureDomain 15:23:57.815 [pulsar-web-454-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [31/Jan/2021:15:23:57 -0700] "POST /admin/v2/clusters/use/failureDomains/domain2 HTTP/1.1" 204 0 "-" "Pulsar-Java-v2.8.0-SNAPSHOT" 10 15:23:57.816 [pulsar-ordered-OrderedExecutor-0-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /admin/clusters/use/failureDomain 15:23:57.816 [pulsar-ordered-OrderedExecutor-0-0-EventThread] INFO org.apache.pulsar.zookeeper.ZooKeeperChildrenCache - reloadCache called in zookeeperChildrenCache for path /admin/clusters/use/failureDomain 15:23:57.818 [pulsar-modular-load-manager-445-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Cluster domain refreshed {localhost:57033=domain1, localhost:57041=domain2} 15:23:57.819 [pulsar-modular-load-manager-492-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Cluster domain refreshed {localhost:57033=domain1, localhost:57041=domain2} 15:23:57.831 [metadata-store-417-1] INFO org.apache.pulsar.broker.admin.impl.TenantsBase - [null] Created tenant tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89 15:23:57.832 [metadata-store-417-1] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [31/Jan/2021:15:23:57 -0700] "PUT /admin/v2/tenants/tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89 HTTP/1.1" 204 0 "-" "Pulsar-Java-v2.8.0-SNAPSHOT" 12 15:23:57.840 [pulsar-web-454-6] INFO org.apache.pulsar.broker.web.PulsarWebResource - Successfully validated clusters on tenant [tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89] 15:23:57.854 [pulsar-web-454-6] INFO org.apache.pulsar.broker.admin.impl.NamespacesBase - [null] Created namespace tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1 15:23:57.855 [pulsar-web-454-6] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [31/Jan/2021:15:23:57 -0700] "PUT /admin/namespaces/tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1 HTTP/1.1" 204 0 "-" "Pulsar-Java-v2.8.0-SNAPSHOT" 20 15:23:57.860 [pulsar-web-454-7] INFO org.apache.pulsar.broker.web.PulsarWebResource - Successfully validated clusters on tenant [tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89] 15:23:57.866 [pulsar-web-454-7] INFO org.apache.pulsar.broker.admin.impl.NamespacesBase - [null] Created namespace tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2 15:23:57.867 [pulsar-web-454-7] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [31/Jan/2021:15:23:57 -0700] "PUT /admin/namespaces/tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2 HTTP/1.1" 204 0 "-" "Pulsar-Java-v2.8.0-SNAPSHOT" 9 15:23:57.870 [pulsar-web-454-8] INFO org.apache.pulsar.broker.admin.impl.NamespacesBase - [null] Setting anti-affinity group group for tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1 15:23:57.871 [pulsar-web-454-8] INFO org.apache.pulsar.broker.admin.impl.NamespacesBase - local-policies for tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1 is not setup at path /admin/local-policies/tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1 15:23:57.889 [pulsar-ordered-OrderedExecutor-2-0-EventThread] INFO org.apache.pulsar.broker.cache.LocalZooKeeperCacheService - Successfully copyied bundles data to local zk at /admin/local-policies/tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1 15:23:57.893 [pulsar-web-454-8] INFO org.apache.pulsar.broker.admin.impl.NamespacesBase - [null] Successfully updated local-policies configuration: namespace=tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1, map={"bundles":{"boundaries":["0x00000000","0xffffffff"],"numBundles":1},"namespaceAntiAffinityGroup":"group"} 15:23:57.893 [pulsar-web-454-8] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [31/Jan/2021:15:23:57 -0700] "POST /admin/namespaces/tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1/antiAffinity HTTP/1.1" 204 0 "-" "Pulsar-Java-v2.8.0-SNAPSHOT" 24 15:23:57.897 [pulsar-web-454-3] INFO org.apache.pulsar.broker.admin.impl.NamespacesBase - [null] Setting anti-affinity group group for tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2 15:23:57.898 [pulsar-web-454-3] INFO org.apache.pulsar.broker.admin.impl.NamespacesBase - local-policies for tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2 is not setup at path /admin/local-policies/tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2 15:23:57.901 [pulsar-ordered-OrderedExecutor-2-0-EventThread] INFO org.apache.pulsar.broker.cache.LocalZooKeeperCacheService - Successfully copyied bundles data to local zk at /admin/local-policies/tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2 15:23:57.905 [pulsar-web-454-3] INFO org.apache.pulsar.broker.admin.impl.NamespacesBase - [null] Successfully updated local-policies configuration: namespace=tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2, map={"bundles":{"boundaries":["0x00000000","0xffffffff"],"numBundles":1},"namespaceAntiAffinityGroup":"group"} 15:23:57.906 [pulsar-web-454-3] INFO org.eclipse.jetty.server.RequestLog - 127.0.0.1 - - [31/Jan/2021:15:23:57 -0700] "POST /admin/namespaces/tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2/antiAffinity HTTP/1.1" 204 0 "-" "Pulsar-Java-v2.8.0-SNAPSHOT" 11 15:23:57.910 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 2 brokers being considered for assignment of tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns1/0x00000000_0xffffffff 15:23:57.910 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:57033 is overloaded: max usage=0.8523102402687073 15:23:57.910 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:57033 is overloaded: CPU: 85.231026%, MEMORY: 27.4213%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0% 15:23:57.910 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Selected broker Optional[localhost:57041] from candidate brokers [localhost:57033, localhost:57041] 15:23:57.914 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 1 brokers being considered for assignment of tenant-b0b0f443-a4b2-4df9-aa38-ff71fe743e89/use/ns2/0x00000000_0xffffffff 15:23:57.914 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:57033 is overloaded: max usage=0.8523102402687073 15:23:57.914 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:57033 is overloaded: CPU: 85.231026%, MEMORY: 27.4213%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0% 15:23:57.914 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] INFO org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Selected broker Optional[localhost:57033] from candidate brokers [localhost:57033] 15:23:57.916 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker http://localhost:57033 is overloaded: max usage=0.8523102402687073 15:23:57.916 [TestNG-method=testBrokerSelectionForAntiAffinityGroup-1] WARN org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker localhost:57033 is overloaded: CPU: 85.231026%, MEMORY: 27.4213%, DIRECT MEMORY: 2.4414062%, BANDWIDTH IN: 0.0%, BANDWIDTH OUT: 0.0% !!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.broker.loadbalance.AntiAffinityNamespaceGroupTest].testBrokerSelectionForAntiAffinityGroup([])------- 15:23:57.939 [main] INFO org.eclipse.jetty.server.AbstractConnector - Stopped PulsarServerConnector@57a9be7f{HTTP/1.1, (http/1.1)}{0.0.0.0:0} 15:23:57.939 [main] INFO org.eclipse.jetty.server.session - node0 Stopped scavenging 15:23:57.940 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.h.ContextHandler@292a73f2{/static,null,STOPPED} 15:23:57.940 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@4c384498{/metrics,null,STOPPED} 15:23:57.941 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@2b60c832{/lookup,null,STOPPED} 15:23:57.942 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@68c3670{/admin/v3,null,STOPPED} 15:23:57.942 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@52463130{/admin/v2,null,STOPPED} 15:23:57.943 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@13f70dd5{/admin,null,STOPPED} 15:23:57.943 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@4260a1c{/,null,STOPPED} 15:23:57.945 [main] INFO org.eclipse.jetty.server.handler.ContextHandler - Stopped o.e.j.s.ServletContextHandler@2f94339e{/,null,STOPPED} 15:23:57.945 [main] INFO org.apache.pulsar.broker.web.WebService - Web service closed 15:23:57.945 [main] INFO org.apache.pulsar.broker.service.BrokerService - Shutting down Pulsar Broker service ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
