BewareMyPower commented on PR #23298:
URL: https://github.com/apache/pulsar/pull/23298#issuecomment-2347953367

   From another perspective, this PR resolves the case the broker exited 
unexpectedly (e.g. by a crash). Just like the incident we met online:
   
   ```
   Sep 12, 2024 @ 07:00:10.089  #  SIGSEGV (0xb) at pc=0x00000000000049c6, 
pid=1, tid=454
   ```
   
   The JVM process exited by a SIGSEGV signal and no cleanup was done, even the 
metadata store client was not closed. Then when the broker restarted, it would 
fail with `LockBusyException`. After some time, the metadata store server 
detected the client's session timeout and then it deleted the node. Then 
restarting the broker would succeed.
   
   When programming, we should handle exceptions rather than errors.
   - Unexpected crash of some other unexpected logics that skip the unregister 
are "exceptions": it can be recovered by ignoring the version compare, like 
what this PR does.
   - Misconfiguration at the beginning are "errors": it cannot be recovered 
until the operators realized this error and updated the correct configuration.
   
   There is another solution that we can add retry logic to 
`BrokerRegistryImpl#register`, here is a simple example (without using 
`Backoff`):
   
   ```java
               final var deadline = System.currentTimeMillis() + 
pulsar.getConfig().getMetadataStoreSessionTimeoutMillis();
               Throwable throwable = null;
               while (true) {
                   if (System.currentTimeMillis() > deadline) {
                       throw MetadataStoreException.unwrap(throwable == null ? 
new TimeoutException("Failed to register")
                               : throwable);
                   }
                   try {
                       this.brokerLookupDataLock = 
brokerLookupDataLockManager.acquireLock(keyPath(brokerId), brokerLookupData)
                               
.get(conf.getMetadataStoreOperationTimeoutSeconds(), TimeUnit.SECONDS);
                       this.state = State.Registered;
                       break;
                   } catch (InterruptedException | ExecutionException | 
TimeoutException e) {
                       throwable = e;
                       try {
                           Thread.sleep(100);
                       } catch (InterruptedException ex) {
                           throw MetadataStoreException.unwrap(ex);
                       }
                   }
               }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to