[GitHub] [pulsar] devinbost opened a new issue #6054: Catastrophic frequent random subscription freezes, especially on high-traffic topics.

GitBox Wed, 12 May 2021 23:33:47 -0700


devinbost opened a new issue #6054:
URL: https://github.com/apache/pulsar/issues/6054

**Describe the bug**
Topics randomly freeze, causing catastrophic topic outages on a weekly (or
more frequent) basis. This has been an issue for as long as my team has used
Pulsar, and it has been communicated to a number of folks on the Pulsar PMC.

(I thought an issue was already created for this bug, but I couldn't find it
anywhere.)

**To Reproduce**
We have not figured out how to reproduce the issue. It appears to be random
(non-deterministic) and leaves no apparent clues in the broker logs.

**Expected behavior**
Topics should never randomly stop working such that the only resolution is
restarting the affected broker.

**Steps to Diagnose and Temporarily Resolve**

![image](https://user-images.githubusercontent.com/7418031/72367014-a3e81600-36b8-11ea-8e50-ea4ce7c9b329.png)
**Step 2**: Check the rate out on the topic. (Click on the topic in the
dashboard, or pull the topic's stats, for example with
`pulsar-admin topics stats`, and look at `msgRateOut`.)
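
The `msgRateOut` check can also be scripted against the broker's admin REST API. The following is a minimal sketch, assuming the admin endpoint is reachable at `http://localhost:8080` and the topic is persistent. The URL, the helper names `fetch_topic_stats` and `looks_frozen`, and the extra requirement that a subscription has backlog (to tell an idle topic from a stuck one) are assumptions for illustration, not part of the original report.

```python
import json
import urllib.request

# Assumed admin URL; replace with your broker's web service address.
ADMIN_URL = "http://localhost:8080"


def fetch_topic_stats(tenant, namespace, topic, admin_url=ADMIN_URL):
    """Fetch stats for a persistent topic from the Pulsar admin REST API."""
    url = f"{admin_url}/admin/v2/persistent/{tenant}/{namespace}/{topic}/stats"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def looks_frozen(stats):
    """Return True if nothing is dispatched while some subscription has backlog."""
    rate_out = stats.get("msgRateOut", 0.0)
    backlog = sum(
        sub.get("msgBacklog", 0)
        for sub in stats.get("subscriptions", {}).values()
    )
    # The backlog condition is an added heuristic: a rate of 0 with no
    # backlog may simply mean the topic is idle.
    return rate_out == 0 and backlog > 0
```

A call such as `looks_frozen(fetch_topic_stats("public", "default", "my-topic"))` (hypothetical tenant, namespace, and topic) would flag a candidate topic for the verification that follows.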

If the rate out is 0, this is likely a frozen topic. To verify, do the
following:

In the Pulsar dashboard, click on the broker that hosts the topic. If
multiple topics on that broker have a rate out of 0, proceed to the next
step. If not, the cause may be a different issue and needs further
investigation.
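
The broker-wide check can be sketched in the same way. The example below assumes the per-topic stats for one broker have already been collected into a dictionary keyed by topic name; the helper names and the `min_topics=2` threshold (one reading of "multiple") are assumptions for illustration.

```python
def zero_rate_topics(stats_by_topic):
    """Return the sorted names of topics whose msgRateOut is 0."""
    return sorted(
        topic
        for topic, stats in stats_by_topic.items()
        if stats.get("msgRateOut", 0.0) == 0
    )


def broker_looks_stuck(stats_by_topic, min_topics=2):
    """Return True if at least min_topics topics on the broker show no rate out."""
    return len(zero_rate_topics(stats_by_topic)) >= min_topics
```

If `broker_looks_stuck` returns `False`, only one topic (or none) is affected, which matches the case above where something else may be wrong.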

![image](https://user-images.githubusercontent.com/7418031/72367085-c843f280-36b8-11ea-8f99-d24ec1edc933.png)

![image](https://user-images.githubusercontent.com/7418031/72367102-d560e180-36b8-11ea-86e4-9f1078adb13b.png)

**Step 3**: Stop the broker on the server that hosts the topic:
`pulsar-broker stop`.

**Step 4**: Wait for the backlog to be consumed and all the functions to be
rescheduled. (This typically takes about 5-10 minutes.)
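
The wait in Step 4 can be automated by polling the topic stats until the backlog reaches zero. This is a minimal sketch: `get_stats` is a hypothetical callable that returns the topic's stats dictionary (for instance from the admin REST API), and the 600-second timeout and 15-second interval are assumed defaults based on the 5-10 minute estimate above.

```python
import time


def total_backlog(stats):
    """Sum msgBacklog across all subscriptions in a topic stats dictionary."""
    return sum(
        sub.get("msgBacklog", 0)
        for sub in stats.get("subscriptions", {}).values()
    )


def wait_for_drain(get_stats, timeout_s=600, interval_s=15,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the backlog is 0 (True) or the timeout elapses (False)."""
    deadline = clock() + timeout_s
    while True:
        if total_backlog(get_stats()) == 0:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real delays. This sketch does not check whether functions have been rescheduled; that part of Step 4 still needs a manual look.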

**Environment:**
```
Docker on bare metal running: apachepulsar/pulsar-all:2.4.0
on CentOS.
Brokers are the function workers.
```
This has been an issue with previous versions of Pulsar as well.

**Additional context**

The problem was MUCH worse with Pulsar 2.4.2, so our team needed to roll back
to 2.4.0 (which also has the problem, but less frequently).
This is preventing the team from progressing in the use of Pulsar, and it's
causing SLA problems for those who use our service.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
