pushkar-engagio opened a new issue #7328:
URL: https://github.com/apache/pulsar/issues/7328
**Describe the bug**
A long-term failure of a single bookie causes the entire cluster to go down.
**To Reproduce**
Steps to reproduce the behavior:
I had 6 bookies in the cluster (500 GB journal storage, 1 TB ledger storage).
One of the bookies failed and could not start. Once the cluster detected the
downed bookie, it kicked off the recovery process for underreplicated ledgers.
The ledgers replicated fine for a few minutes, but during the recovery process
another bookie went down (the service was still running on the bookie, but the
bookie was dropped from the cluster, i.e. it did not show up in the read-only
or read-write bookie list). This caused additional ledgers to become
underreplicated. This process continued until I was down to a single
BookKeeper node, taking down the entire cluster.
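
To illustrate why each additional lost bookie enlarges the recovery workload, here is a minimal toy model. It is not BookKeeper code: the bookie names, the ledger layout, and the `underreplicated` helper are all hypothetical, and the two-bookie ensembles are an assumption, not my actual configuration.

```python
from typing import Dict, List, Set


def underreplicated(ledgers: Dict[int, List[str]], live: Set[str]) -> Dict[int, List[str]]:
    """Map each affected ledger id to the ensemble members that are not live."""
    missing: Dict[int, List[str]] = {}
    for ledger_id, ensemble in ledgers.items():
        lost = [bookie for bookie in ensemble if bookie not in live]
        if lost:
            missing[ledger_id] = lost
    return missing


# Hypothetical layout: 6 bookies, each ledger stored on 2 of them.
bookies = [f"bookie-{i}" for i in range(1, 7)]
ledgers = {
    1: ["bookie-1", "bookie-2"],
    2: ["bookie-2", "bookie-3"],
    3: ["bookie-4", "bookie-5"],
}

# First failure: only ledgers that had a copy on bookie-1 need recovery.
after_first = underreplicated(ledgers, set(bookies) - {"bookie-1"})

# Second failure during recovery: more ledgers are affected, and ledger 1
# has now lost every copy in this toy layout.
after_second = underreplicated(ledgers, set(bookies) - {"bookie-1", "bookie-2"})
```

In this sketch, the second failure both adds a new underreplicated ledger and removes the last copy of an existing one, which is the kind of cascade described above.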
**Expected behavior**
After a bookie failure, BookKeeper should re-replicate the underreplicated
ledgers from the downed bookies, so that another BookKeeper node can be added
to replace the downed bookie.
**Desktop (please complete the following information):**
- Pulsar version: 2.3.0
- Operating system: Amazon Linux 2
- Java version: openjdk version "1.8.0_222"
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]