[GitHub] [pulsar] xlu233 opened a new issue, #20405: [Bug] Missing watch event of metadata(zookeeper) when broker and zookeeper broken down at the same time

via GitHub Thu, 25 May 2023 20:54:50 -0700


xlu233 opened a new issue, #20405:
URL: https://github.com/apache/pulsar/issues/20405


   ### Search before asking
   
   - [X] I searched in the [issues](https://github.com/apache/pulsar/issues) 
and found nothing similar.
   
   
   ### Version
   
   OS: eulerosv2r7
   Pulsar Version: 2.10.0
   
   **Background**
   when broker and zookeeper broken down at the same time, sometimes the 
operation of deleting topic got failed,
   specifically,  broker list is [153, 154, 155], i fault the broker 154 and 
the zookeeper conneted to the 153 and 155.
   
   **In a case**, proxy send topic delete request to broker 155, at the time 
broker 155 add all brokers as the availableBrokers, the cache 
/loadbalance/brokers is not updated. So broker 155 redirect request to 154 and 
proxy cause 502 error
   
   ```
   2023-05-24T07:20:17,211+0800 [pulsar-external-web-5-6] INFO  
org.eclipse.jetty.server.RequestLog - 10.75.247.191 - - [24/May/2023:07:20:17 
+0800] "DELETE 
/admin/v2/persistent/new_tenant_1/new_namespace_1/new_topic_1/partitions 
HTTP/1.1" 502 0 "-" "-" 12
   ```
   here is log in broker 155
   ```
   2023-05-24T07:20:12,170+0800 [main-EventThread] DEBUG 
org.apache.pulsar.common.naming.NamespaceBundleFactory - 
[new_tenant_1/new_namespace_1] Get bundles from getLocalZkCacheService: 
bundles: BundlesDataImpl(boundaries=[0x00000000, 0x40000000, 0x80000000, 
0xc0000000, 0xffffffff], numBundles=4), version: Optional[0]
   2023-05-24T07:20:12,170+0800 [main-EventThread] DEBUG 
org.apache.pulsar.broker.namespace.NamespaceService - findBrokerServiceUrl: 
new_tenant_1/new_namespace_1/0xc0000000_0xffffffff - options: 
LookupOptions(authoritative=false, readOnly=false, loadTopicsInBundle=false, 
requestHttps=false, advertisedListenerName=null)
   2023-05-24T07:20:12,174+0800 
[main-SendThread(gemaspulsar.zookeeper.replicator1.cn-north-7.myhuaweicloud.com:2181)]
 DEBUG org.apache.zookeeper.ClientCnxn - Reading reply session id: 
0x1018b969af30001, packet:: clientPath:null serverPath:null finished:false 
header:: 2293,22  replyHeader:: 2293,2649994821664,0  request:: 
org.apache.zookeeper.MultiOperationRecord@5a4a274b response:: 
org.apache.zookeeper.MultiResponse@ffffff9b
   2023-05-24T07:20:12,174+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.loadbalance.impl.LoadManagerShared - Added Shared 
Broker - [10.172.167.154] as possible Candidates for namespace - 
[new_tenant_1/new_namespace_1]
   2023-05-24T07:20:12,174+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.loadbalance.impl.LoadManagerShared - Added Shared 
Broker - [10.172.167.153] as possible Candidates for namespace - 
[new_tenant_1/new_namespace_1]
   2023-05-24T07:20:12,174+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.loadbalance.impl.LoadManagerShared - Added Shared 
Broker - [10.172.167.155] as possible Candidates for namespace - 
[new_tenant_1/new_namespace_1]
   2023-05-24T07:20:12,174+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.loadbalance.impl.LoadManagerShared - Policies not 
present for namespace - [new_tenant_1/new_namespace_1] so only considering 
shared [3] brokers for possible owner
   2023-05-24T07:20:12,174+0800 [pulsar-2-5] INFO  
org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - 3 brokers 
being considered for assignment of 
new_tenant_1/new_namespace_1/0xc0000000_0xffffffff
   2023-05-24T07:20:12,175+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker 
http://10.172.167.154:8080 has long term message rate 200.0
   2023-05-24T07:20:12,175+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker 
http://10.172.167.153:8080 has long term message rate 200.00051005213578
   2023-05-24T07:20:12,175+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate - Broker 
http://10.172.167.155:8080 has long term message rate 200.0
   2023-05-24T07:20:12,175+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl - Selected 
broker Optional[10.172.167.154:8080] from candidate brokers 
[10.172.167.154:8080, 10.172.167.153:8080, 10.172.167.155:8080]
   2023-05-24T07:20:12,175+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.namespace.NamespaceService - 
http://10.172.167.155:8080 : redirecting to the least loaded broker, lookup 
address=http://10.172.167.154:8080
   2023-05-24T07:20:12,175+0800 [pulsar-2-5] DEBUG 
org.apache.pulsar.broker.namespace.NamespaceService - Redirecting to broker 
http://10.172.167.154:8080 to acquire ownership of bundle 
new_tenant_1/new_namespace_1/0xc0000000_0xffffffff
   
   ```
   
   **another case** is proxy sending delete request to broker 155, broker 155 
find 154 is active because broker thinks broker 154 is in availableBrokers(see 
it at function **isBrokerActive**), the cache availableBrokers is not update, 
and then it will redict lookup request to broker 154, but the fact broker 154 
is broken.  
   
   here is the log: 
   ```
   2023-05-25T13:20:20,131+0800 [pulsar-web-35-16] DEBUG 
org.apache.pulsar.common.naming.NamespaceBundleFactory - Loading cache with 
bundles for new_tenant_1/new_namespace_1
   2023-05-25T13:20:20,135+0800 [main-EventThread] DEBUG 
org.apache.pulsar.common.naming.NamespaceBundleFactory - 
[new_tenant_1/new_namespace_1] Get bundles from getLocalZkCacheService: 
bundles: BundlesDataImpl(boundaries=[0x00000000, 0x40000000, 0x80000000, 
0xc0000000, 0xffffffff], numBundles=4), version: Optional[0]
   2023-05-25T13:20:20,135+0800 [main-EventThread] DEBUG 
org.apache.pulsar.broker.namespace.NamespaceService - findBrokerServiceUrl: 
new_tenant_1/new_namespace_1/0xc0000000_0xffffffff - options: 
LookupOptions(authoritative=false, readOnly=false, loadTopicsInBundle=false, 
requestHttps=false, advertisedListenerName=null)
   2023-05-25T13:20:20,140+0800 [pulsar-2-8] DEBUG 
org.apache.pulsar.broker.namespace.NamespaceService - Broker 
http://10.172.167.154:8080 (10.172.167.154:8080) **is available for**.
   5-25T13:20:20,140+0800 [pulsar-2-8] DEBUG 
org.apache.pulsar.broker.namespace.NamespaceService - Redirecting to broker 
http://10.172.167.154:8080 to acquire ownership of bundle 
new_tenant_1/new_namespace_1/0xc0000000_0xffffffff
   5-25T13:20:20,140+0800 [pulsar-2-8] 
pache.pulsar.broker.web.PulsarWebResource - Redirecting the rest call to 
http://10.172.167.154:8080/admin/v2/persistent/new_tenant_1/new_namespace_1/new_topic_1/partitions?authoritative=false
   ```
   ```
   private boolean isBrokerActive(String candidateBroker) {
           String candidateBrokerHostAndPort = 
parseHostAndPort(candidateBroker);
           Set<String> availableBrokers = getAvailableBrokers();
           if (availableBrokers.contains(candidateBrokerHostAndPort)) {
               if (LOG.isDebugEnabled()) {
                   LOG.debug("Broker {} ({}) is available for.", 
candidateBroker, candidateBrokerHostAndPort);
               }
               return true;
           } else {
               LOG.warn("Broker {} ({}) couldn't be found in available brokers 
{}",
                       candidateBroker, candidateBrokerHostAndPort,
                       
availableBrokers.stream().collect(Collectors.joining(",")));
               return false;
           }
       }
   ```
   
   **one more** 
   the cache  /loadbalance/leader sometime will not update when leader broker 
broken down, as you can see following
   ```
   2023-05-24T08:20:09,675+0800 [pulsar-web-35-1] DEBUG 
org.apache.pulsar.broker.namespace.NamespaceService - findBrokerServiceUrl: 
new_tenant_1/new_namespace_1/0x40000000_0x80000000 - options: 
LookupOptions(authoritative=false, readOnly=false, loadTopicsInBundle=false, 
requestHttps=false, advertisedListenerName=null)
   2023-05-24T08:20:09,675+0800 [pulsar-2-5] WARN  
org.apache.pulsar.broker.namespace.NamespaceService - Broker 
http://10.172.167.154:8080 (10.172.167.154:8080) couldn't be found in available 
brokers 10.172.167.153:8080,10.172.167.155:8080
   2023-05-24T08:20:09,675+0800 [pulsar-2-5] WARN  
org.apache.pulsar.broker.namespace.NamespaceService - The current leader broker 
LeaderBroker(serviceUrl=http://10.172.167.154:8080) isn't active. Handling load 
manager decisions in a decentralized way. 
NamespaceBundle[new_tenant_1/new_namespace_1/0x40000000_0x80000000]
   ```
   
   
   I think the reason is pulsar broker miss the watch event, when zk client 
lost connection and broker broken down before connection recovered, the 
function addWatch could not trigger watch event.
   
   
   
   
   ### Minimal reproduce step
   
   To valid the hypothesis，I reproduced the problem by following the steps below
   
   1. first we create a zk client and register listenser as following code, we 
only connect to one IP in the zk cluster.
   ```
   MetadataStoreExtended client = 
MetadataStoreExtended.create("10.155.192.47:2181", 
MetadataStoreConfig.builder().build());
   client.registerListener(notification -> {
       if ("/test".equals(notification.getPath()) && 
NotificationType.ChildrenChanged.equals(notification.getType())){
           client.getChildren("/test").whenComplete((nodes, ex) ->{
               System.out.println("node ChildrenChanged: " + nodes);
           });
       }
   });
   ```
   2. Disconnect the zk server the client connect to
   3. create node in another zk node
   4. recover the zk server the client connect to
   
   ### What did you expect to see?
   
   We should be able to see that zk client can receive the watch events during 
the disconnection time, but we can not get  change.
   
   I found discussion in https://github.com/apache/zookeeper/pull/1106
   
   ### What did you see instead?
   
   not any event recieved after the connection recover
   ```
   2023-05-26 11:51:27,661 - WARN  - 
[metadata-store-zk-session-watcher-2-1:ZKSessionWatcher@158] - ZooKeeper client 
is disconnected. Waiting to reconnect, time remaining = 9.68 seconds sessionID 
= 305aaad27c20001
   2023-05-26 11:51:29,677 - WARN  - 
[metadata-store-zk-session-watcher-2-1:ZKSessionWatcher@113] - timeout check zk 
state, use Disconnected, sessionId is 305aaad27c20001, zk state is CONNECTING
   2023-05-26 11:51:29,677 - WARN  - 
[metadata-store-zk-session-watcher-2-1:ZKSessionWatcher@158] - ZooKeeper client 
is disconnected. Waiting to reconnect, time remaining = 7.664 seconds sessionID 
= 305aaad27c20001
   2023-05-26 11:51:31,691 - WARN  - 
[metadata-store-zk-session-watcher-2-1:ZKSessionWatcher@113] - timeout check zk 
state, use Disconnected, sessionId is 305aaad27c20001, zk state is CONNECTING
   2023-05-26 11:51:31,691 - WARN  - 
[metadata-store-zk-session-watcher-2-1:ZKSessionWatcher@158] - ZooKeeper client 
is disconnected. Waiting to reconnect, time remaining = 5.65 seconds sessionID 
= 305aaad27c20001
   2023-05-26 11:51:33,399 - INFO  - 
[main-SendThread(10.155.192.47:2181):ClientCnxn$SendThread@1171] - Opening 
socket connection to server 10.155.192.47/10.155.192.47:2181.
   2023-05-26 11:51:33,399 - INFO  - 
[main-SendThread(10.155.192.47:2181):ClientCnxn$SendThread@1173] - SASL config 
status: Will not attempt to authenticate using SASL (unknown error)
   2023-05-26 11:51:33,428 - INFO  - 
[main-SendThread(10.155.192.47:2181):ClientCnxn$SendThread@1005] - Socket 
connection established, initiating session, client: /10.78.75.128:55422, 
server: 10.155.192.47/10.155.192.47:2181
   2023-05-26 11:51:33,464 - INFO  - 
[main-SendThread(10.155.192.47:2181):ClientCnxn$SendThread@1446] - Session 
establishment complete on server 10.155.192.47/10.155.192.47:2181, session id = 
0x305aaad27c20001, negotiated timeout = 30000
   2023-05-26 11:51:33,465 - INFO  - 
[main-EventThread:ZooKeeperWatcherBase@130] - ZooKeeper client is connected now.
   2023-05-26 11:51:33,477 - INFO  - [metadata-store-1-1:ZKSessionWatcher@126] 
- Got ZK session watch event: WatchedEvent state:SyncConnected type:None 
path:null, sessionId: 305aaad27c20001
   2023-05-26 11:51:33,477 - INFO  - [metadata-store-1-1:ZKSessionWatcher@170] 
- ZooKeeper client reconnection with server quorum. Current status: 
ConnectionLost, sessionId is 305aaad27c20001
   ```
   
   ### Anything else?
   
   see persistentRecursiveWatches at: 
https://github.com/apache/zookeeper/pull/1106
   
   ### Are you willing to submit a PR?
   
   - [X] I'm willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pulsar.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [pulsar] xlu233 opened a new issue, #20405: [Bug] Missing watch event of metadata(zookeeper) when broker and zookeeper broken down at the same time

Reply via email to