[GitHub] [doris] freemandealer opened a new issue, #17333: [Enhancement] better merge single_replica_load_rpc_service into brpc_service

via GitHub Wed, 01 Mar 2023 23:19:02 -0800


freemandealer opened a new issue, #17333:
URL: https://github.com/apache/doris/issues/17333


   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Description
   
   It would be better to merge `single_replica_load_rpc_service` into 
`brpc_service`.
   
   For background on `single replica load`, see: 
   [single replica load 
design](https://cwiki.apache.org/confluence/display/DORIS/DSIP-015:+Support+single+replica+load+for+load)
   [single replica load 
implementation](https://github.com/apache/doris/commit/f730a048b1e3baf3d78b27292698eecbae3fcef7)
   
   The code would be much clearer with only one RPC service. The original 
code uses two services because RPCs time out under heavy workloads when only 
one is used.
   
   However, according to the design of the brpc thread model, even when the 
two services run in separate service thread pools, they share the underlying 
pthreads and CPU cores. I think the two-service solution works only because it 
claims more worker pthreads, which can also be achieved by simply setting a 
larger value for the `--bthread_concurrency` gflag.
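
As a rough illustration of the sharing argument, here is a toy model in plain C++. It is not brpc code: `SharedWorkers` and `distinct_threads_used` are hypothetical names, and the pool is a simplified stand-in for the shared worker pthreads. Two logical "services" submit jobs to one pool, and the number of distinct OS threads that ever run a job is bounded by the pool size, not by the number of services.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <utility>
#include <vector>

// Toy stand-in for a process-wide pool of worker threads (assumption:
// this only models the sharing, it is not how brpc is implemented).
class SharedWorkers {
public:
    explicit SharedWorkers(std::size_t n) {
        assert(n > 0);
        for (std::size_t i = 0; i < n; ++i) {
            threads_.emplace_back([this] { run(); });
        }
    }

    // Drains the remaining jobs, then joins every worker.
    ~SharedWorkers() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return done_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // shutting down and queue drained
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();
        }
    }

    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool done_ = false;
    std::vector<std::thread> threads_;
};

// Runs jobs from two logical services on one pool and returns how many
// distinct OS threads executed them in total.
std::size_t distinct_threads_used(std::size_t workers, int jobs_per_service) {
    std::mutex ids_mu;
    std::set<std::thread::id> ids;
    {
        SharedWorkers pool(workers);
        for (int service = 0; service < 2; ++service) {
            for (int i = 0; i < jobs_per_service; ++i) {
                pool.submit([&] {
                    std::lock_guard<std::mutex> lock(ids_mu);
                    ids.insert(std::this_thread::get_id());
                });
            }
        }
    }  // pool destructor waits for all submitted jobs
    return ids.size();
}
```

Under this model, calling `distinct_threads_used(4, 100)` can never report more than 4 threads, however many services submit work. Raising the worker count is the only way to add capacity, which is the effect the gflag has.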
   
   So the two services achieve nothing but messiness. We should merge them 
into one, then identify and fix the root cause of the timeout problem.
   
   ### Solution
   
   step1. merge `single_replica_load_rpc_service` into `brpc_service`
   
   step2. test the system with heavy load jobs to try to reproduce the RPC 
timeout problem
   
   step3. form hypotheses about the root cause and verify them
   
   > some of my guesses:
   >
   > ```c++
   > void PInternalServiceImpl::request_slave_tablet_pull_rowset(...) {
   >     ...
   >     worker_pool.offer([=]() {
   >         ...
   >     });
   >     ...
   > }
   > ```
   >
   > If the worker pool is full (possible under heavy load), the calling 
   > pthread will wait on a std::condition_variable. If most of the pthreads 
   > are blocked this way, few pthreads are left to handle RPC requests, and 
   > timeouts follow.
   >
   > If this is true, we need to use bthread's condition variable instead of 
   > std::condition_variable. The former only yields the underlying pthread to 
   > other RPC jobs rather than blocking it.
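
The blocking behaviour guessed at above can be sketched in plain C++. This is a minimal model under stated assumptions, not the actual Doris worker pool: `BoundedTaskQueue`, `try_take`, and `second_offer_blocks_until_slot_freed` are hypothetical names, and the queue capacity of 1 is chosen only to make the effect easy to see. The point is that `offer()` parks the calling OS thread on a `std::condition_variable` until a slot is freed.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical bounded task queue standing in for the worker pool.
class BoundedTaskQueue {
public:
    explicit BoundedTaskQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Blocks the calling OS thread while the queue is full.
    void offer(std::function<void()> task) {
        std::unique_lock<std::mutex> lock(mu_);
        not_full_.wait(lock, [this] { return tasks_.size() < capacity_; });
        tasks_.push(std::move(task));
    }

    // Removes one task if present and wakes one blocked producer.
    bool try_take(std::function<void()>* out) {
        std::lock_guard<std::mutex> lock(mu_);
        if (tasks_.empty()) return false;
        *out = std::move(tasks_.front());
        tasks_.pop();
        not_full_.notify_one();
        return true;
    }

private:
    const std::size_t capacity_;
    std::mutex mu_;
    std::condition_variable not_full_;
    std::queue<std::function<void()>> tasks_;
};

// Returns true if a second offer() stayed blocked until a slot was freed.
bool second_offer_blocks_until_slot_freed() {
    BoundedTaskQueue queue(1);
    queue.offer([] {});  // fills the only slot

    std::atomic<bool> returned{false};
    // This thread plays the role of a pthread running an RPC handler.
    std::thread caller([&] {
        queue.offer([] {});  // cannot complete while the queue is full
        returned = true;
    });

    // The pause is only illustrative: offer() cannot return before a
    // slot is freed, no matter how long we wait here.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    const bool was_blocked = !returned.load();

    std::function<void()> task;
    queue.try_take(&task);  // frees a slot and wakes the caller
    caller.join();
    return was_blocked && returned.load();
}
```

While `caller` is parked inside `offer()`, its OS thread can do nothing else. If that thread were one of a small number of RPC worker pthreads, every handler stuck this way would reduce the capacity left for other requests, which is the failure mode the guess describes.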
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
