chenlinzhong opened a new issue, #16634:
URL: https://github.com/apache/doris/issues/16634

   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Description
   
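   Under high load, brpc bthreads on the BE can fail to get scheduled, which stalls RPC callbacks. A common symptom is the error:
   timeout when waiting for send fragments RPC
   This RPC is cheap and should normally return within a few hundred microseconds, but under high load it can stall for a long time. Running perf top under high load shows the threads mostly waiting on locks: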
   
![image](https://user-images.githubusercontent.com/11487604/218252468-df197938-73d5-4fbc-875e-50c6fb0de522.png)
   
   
   
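Root cause: a bthread can only be scheduled once it finds an idle pthread. If every pthread is blocked, the bthreads keep waiting. So what blocks the pthreads for so long?
   * Mainly calls to pthread-blocking functions, such as waiting on a lock (std::mutex).
   
   The Doris BE interfaces use std::mutex heavily, so under high load the pthread workers are easily exhausted. A single simple query can generate ten thousand or even tens of thousands of QPS between BEs, and the exhaustion affects the handling of every brpc request.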
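
The starvation described above can be reproduced without brpc. The following is a minimal sketch in plain C++: `FixedWorkers` is a hypothetical stand-in for a fixed set of worker pthreads (it is not brpc's scheduler), and `hot_lock` plays the role of a contended std::mutex. Once both workers are tied up by lock waiters, a trivial request queued behind them cannot run until the lock is released.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a fixed set of worker pthreads (not brpc code).
class FixedWorkers {
public:
    explicit FixedWorkers(int n) {
        for (int i = 0; i < n; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }
    // Drains the remaining queue, then joins every worker.
    ~FixedWorkers() {
        {
            std::lock_guard<std::mutex> g(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> g(mu_);
            queue_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested and queue drained
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();  // runs on the worker thread itself
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    bool stop_ = false;
    std::vector<std::thread> threads_;
};

// Returns true if the short request could not run while all workers were
// tied up by lock waiters, and did run once the lock was released.
bool short_request_starved() {
    std::mutex hot_lock;  // the contended std::mutex
    std::atomic<int> started{0};
    std::atomic<bool> short_done{false};
    bool starved = false;
    {
        FixedWorkers workers(2);
        std::unique_lock<std::mutex> holder(hot_lock);  // lock is already held
        for (int i = 0; i < 2; ++i) {
            workers.submit([&] {
                ++started;
                std::lock_guard<std::mutex> g(hot_lock);  // blocks the whole thread
            });
        }
        workers.submit([&] { short_done = true; });  // the cheap request
        while (started.load() < 2) std::this_thread::yield();
        // Both workers are occupied by lock waiters; the cheap request is
        // still sitting in the queue.
        starved = !short_done.load();
        holder.unlock();
    }  // workers destructor drains the queue and joins
    return starved && short_done.load();
}
```

Calling `short_request_starved()` returns `true`: the cheap request only completes after the lock holder lets go, regardless of how little work it does itself.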
   
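   How to fix it
   There are generally three options:
   - 1. brpc_num_thread: increase the number of threads
   - 2. Replace std::mutex with bthread::mutex
   - 3. brpc + thread pool: brpc no longer processes requests and only handles network send/receive; each request is handed to a thread pool as soon as it is received
   
   Option 1 treats the symptom, not the cause: once the pthreads are exhausted, all brpc interfaces are still affected.
   Option 2 is hard to put into practice with many contributors: not everyone can know which functions need a bthread mutex and which need a pthread mutex.
   Option 3 can fully remove the impact of pthread exhaustion.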
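
Option 3 can be sketched in plain C++ as follows. This is an illustration under assumptions, not the Doris implementation: `WorkPool`, `handle_request`, and the `done` callback are hypothetical names, with `done` standing in for an RPC completion callback. The handler only enqueues work and returns, so the thread that received the request never waits on `state_lock`; only pool threads do.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical dedicated pool that runs the service logic.
class WorkPool {
public:
    explicit WorkPool(int n) {
        for (int i = 0; i < n; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }
    // Drains the remaining queue, then joins every worker.
    ~WorkPool() {
        {
            std::lock_guard<std::mutex> g(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> g(mu_);
            queue_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stop requested and queue drained
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    bool stop_ = false;
    std::vector<std::thread> threads_;
};

// Hypothetical RPC handler: it never takes state_lock itself. The blocking
// work and the completion callback both run on a pool thread.
void handle_request(WorkPool& pool, std::mutex& state_lock, int& state,
                    std::function<void()> done) {
    pool.submit([&state_lock, &state, done = std::move(done)] {
        std::lock_guard<std::mutex> g(state_lock);  // may block a pool thread only
        ++state;
        done();
    });
}

// Returns {callbacks completed while the lock was held elsewhere,
//          callbacks completed after the lock was released}.
std::pair<int, int> run_handoff_demo() {
    std::mutex state_lock;
    int state = 0;
    std::atomic<int> completed{0};
    int completed_while_locked = 0;
    {
        WorkPool pool(1);
        std::unique_lock<std::mutex> holder(state_lock);  // contended lock
        for (int i = 0; i < 3; ++i) {
            // Each call returns immediately even though state_lock is held.
            handle_request(pool, state_lock, state, [&completed] { ++completed; });
        }
        completed_while_locked = completed.load();
        holder.unlock();
    }  // pool destructor drains the queue and joins
    return {completed_while_locked, completed.load()};
}
```

In this sketch all three handler calls return while the lock is still held, and the three callbacks fire only after it is released. A real pool would also need a bounded queue and a rejection policy so that a backlog cannot grow without limit.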
   
   
   ### Solution
   
   Add a thread pool to handle the BE service logic, so that request processing no longer runs on brpc threads.
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

