pushkar-engagio opened a new issue #7328:
URL: https://github.com/apache/pulsar/issues/7328


   **Describe the bug**
   Long-term failure of a single bookie causes the entire cluster to go down.
   
   **To Reproduce**
   Steps to reproduce the behavior:
   I had 6 bookies in the cluster (500 GB journal storage, 1 TB ledger storage).
One of the bookies failed and could not start. Once the cluster detected the
downed bookie, it kicked off the recovery process for under-replicated ledgers.
The ledgers replicated fine for a few minutes, but during the recovery process
another bookie went down (the service was still running on that bookie, but the
bookie was dropped from the cluster, i.e. it did not show up in either the
read-only or the read-write bookie list). This caused additional ledgers to
become under-replicated. This process continued until I was down to a single
BookKeeper node, taking down the entire cluster.
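
   To illustrate why each additional lost bookie widens the set of
under-replicated ledgers, here is a minimal toy model. It is not BookKeeper
code: the function name, the bookie names and the ledger placement are all
hypothetical and chosen only for illustration; only the count of six bookies
comes from the setup described above.

```python
# Toy model only: not BookKeeper code. All names and placements are hypothetical.

def under_replicated(ledgers, live_bookies):
    """Return sorted ids of ledgers with at least one replica on a non-live bookie."""
    live = set(live_bookies)
    return sorted(
        ledger_id
        for ledger_id, ensemble in ledgers.items()
        if any(bookie not in live for bookie in ensemble)
    )

# Six bookies, as in the report; the ledger-to-bookie placement is invented.
all_bookies = ["bk1", "bk2", "bk3", "bk4", "bk5", "bk6"]
ledgers = {
    1: ["bk1", "bk2", "bk3"],
    2: ["bk2", "bk3", "bk4"],
    3: ["bk4", "bk5", "bk6"],
}

# First failure: only ledgers that had a replica on bk1 are affected.
after_first_failure = under_replicated(
    ledgers, [b for b in all_bookies if b != "bk1"]
)

# Second failure during recovery: ledgers with a replica on bk4 are added.
after_second_failure = under_replicated(
    ledgers, [b for b in all_bookies if b not in ("bk1", "bk4")]
)

print(after_first_failure)
print(after_second_failure)
```

   In this toy placement, losing a second bookie while recovery is still
running leaves every ledger under-replicated, which is the pattern I saw.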
   
   **Expected behavior**
   After a bookie failure, BookKeeper should re-replicate the under-replicated
ledgers from the downed bookies, so that another BookKeeper node can be added
to replace the downed bookie.
   
   **Desktop (please complete the following information):**
   - Pulsar version: 2.3.0
   - Operating system: Amazon Linux 2
   - Java version: openjdk version "1.8.0_222"
   


