devinbost opened a new issue #6054:
URL: https://github.com/apache/pulsar/issues/6054


   **Describe the bug**
   Topics randomly freeze, causing catastrophic topic outages on a weekly (or 
more frequent) basis. This has been an issue for as long as my team has used 
Pulsar, and it has been communicated to a number of people on the Pulsar PMC.
   
   (I thought an issue was already created for this bug, but I couldn't find it 
anywhere.)
   
   **To Reproduce**
   We have not figured out how to reproduce the issue. It appears to be random 
(non-deterministic), and the broker logs don't seem to contain any clues. 
   
   **Expected behavior**
   Topics should never randomly stop working such that the only resolution is 
restarting the affected broker. 
   
   **Steps to Diagnose and Temporarily Resolve**
   
![image](https://user-images.githubusercontent.com/7418031/72367014-a3e81600-36b8-11ea-8e50-ea4ce7c9b329.png)
   **Step 2**: Check the rate out on the topic. (Click on the topic in the 
dashboard, or get the topic's stats, e.g. with `pulsar-admin topics stats`, 
and look at `msgRateOut`.)
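
The stats check in Step 2 could be scripted roughly as follows. This is a minimal sketch, assuming the broker's admin REST API is reachable at `http://localhost:8080` and that the topic is persistent; the helper names `fetch_topic_stats` and `rate_out_is_zero` are hypothetical and not part of Pulsar.

```python
import json
import urllib.request

# Assumption: admin REST endpoint of a broker; adjust for your deployment.
ADMIN_URL = "http://localhost:8080"


def fetch_topic_stats(tenant, namespace, topic, admin_url=ADMIN_URL):
    """Fetch the stats JSON for a persistent topic from the admin REST API."""
    url = f"{admin_url}/admin/v2/persistent/{tenant}/{namespace}/{topic}/stats"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def rate_out_is_zero(stats):
    """Return True when the topic-level msgRateOut is 0 (or absent)."""
    return float(stats.get("msgRateOut", 0.0)) == 0.0
```

For example, `rate_out_is_zero(fetch_topic_stats("my-tenant", "my-ns", "my-topic"))`, where the tenant, namespace, and topic names are placeholders.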
   
   If the rate out is 0, this is likely a frozen topic, but to verify, do the 
following: 
   
   In the Pulsar dashboard, click on the broker that the topic is living on. 
If you see multiple topics with a rate out of 0, proceed to the next step. If 
not, it could be another issue and needs further investigation.
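
The broker-level check above can be sketched as a small helper. It assumes you have already collected the stats of each topic on the broker into a dict keyed by topic name (how that dict is gathered is left open here), and the names `zero_rate_out_topics` and `broker_looks_frozen` are hypothetical. Note that an idle topic also reports a rate out of 0, so this is only a hint, not proof of a freeze.

```python
def zero_rate_out_topics(stats_by_topic):
    """Return the sorted names of topics whose msgRateOut is 0 (or absent)."""
    return sorted(
        name
        for name, stats in stats_by_topic.items()
        if float(stats.get("msgRateOut", 0.0)) == 0.0
    )


def broker_looks_frozen(stats_by_topic, min_topics=2):
    """Apply the rule of thumb: several topics on one broker with rate out 0."""
    return len(zero_rate_out_topics(stats_by_topic)) >= min_topics
```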
   
![image](https://user-images.githubusercontent.com/7418031/72367085-c843f280-36b8-11ea-8f99-d24ec1edc933.png)
   
   
![image](https://user-images.githubusercontent.com/7418031/72367102-d560e180-36b8-11ea-86e4-9f1078adb13b.png)
   
   **Step 3**: Stop the broker on the server that the topic is living on: 
`pulsar-broker stop`.
   
   **Step 4**: Wait for the backlog to be consumed and all the functions to be 
rescheduled (typically about 5-10 minutes). 
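
Rather than waiting a fixed time, the backlog part of Step 4 could be polled. The sketch below assumes the topic stats JSON lists each subscription under `subscriptions` with a `msgBacklog` count, and that `get_stats` is a caller-supplied function returning the latest stats dict; `total_backlog` and `wait_for_drain` are hypothetical names. It does not check whether functions have been rescheduled.

```python
import time


def total_backlog(stats):
    """Sum msgBacklog over all subscriptions in a topic stats dict."""
    subs = stats.get("subscriptions", {})
    return sum(int(sub.get("msgBacklog", 0)) for sub in subs.values())


def wait_for_drain(get_stats, timeout_s=600, interval_s=15, sleep=time.sleep):
    """Poll until the backlog reaches 0; return False if timeout_s elapses."""
    waited = 0
    while True:
        if total_backlog(get_stats()) == 0:
            return True
        if waited >= timeout_s:
            return False
        sleep(interval_s)
        waited += interval_s
```

The default timeout of 600 seconds mirrors the upper end of the 5-10 minute wait mentioned above.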
   
   **Environment:**
   ```
   Docker on bare metal running: `apachepulsar/pulsar-all:2.4.0`
   on CentOS.
   Brokers are the function workers. 
   ```
   This has been an issue with previous versions of Pulsar as well. 
   
   **Additional context**
   
   The problem was MUCH worse with Pulsar 2.4.2, so our team needed to roll 
back to 2.4.0 (which has the problem, but less frequently). 
   This is preventing the team from progressing in its use of Pulsar, and it's 
causing SLA problems for those who use our service. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

