[GitHub] [doris] chenlinzhong opened a new issue, #16203: [Bug] BufferControlBlock may block all fragment handle threads leads to be out of work

via GitHub Sun, 29 Jan 2023 00:37:27 -0800


chenlinzhong opened a new issue, #16203:
URL: https://github.com/apache/doris/issues/16203


   ### Search before asking
   
   - [X] I had searched in the 
[issues](https://github.com/apache/doris/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Version
   
   All versions
   
   ### What's Wrong?
   
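   The current vresultsink implementation can hang in certain scenarios and never exit:
   - vresultsink first writes its data into a BufferControlBlock.
   - The BufferControlBlock capacity is 4096 rows (hard-coded).
   - When the number of rows in the buffer exceeds this limit, the writer is suspended until FE fetches the data and wakes it up again.
   - If FE's StmtExecutor exits abnormally for some reason, the suspended writer is never woken up and its thread stays blocked. Over time this can block every thread in the fragment handling pool (the default maximum is fragment_pool_thread_num_max=512), at which point the BE stops serving requests. The heartbeat thread still works normally, so the BE looks alive but is effectively dead.

   <img width="851" alt="image" src="https://user-images.githubusercontent.com/11487604/215314234-3faef20c-a509-423b-bf28-8cde5e702045.png">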
   
   ```
   #0  0x00007f9d2f3a143c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
   #1  0x0000564a83007bec in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
   #2  0x0000564a7f1f4405 in doris::BufferControlBlock::add_batch(std::unique_ptr<doris::TFetchDataResult, std::default_delete<doris::TFetchDataResult> >&) ()
   #3  0x0000564a809300f0 in doris::vectorized::VMysqlResultWriter::append_block(doris::vectorized::Block&) ()
   #4  0x0000564a808b6df9 in doris::vectorized::VResultSink::send(doris::RuntimeState*, doris::vectorized::Block*) ()
   #5  0x0000564a7f2022d5 in doris::PlanFragmentExecutor::open_vectorized_internal() ()
   #6  0x0000564a7f20393f in doris::PlanFragmentExecutor::open() ()
   ```
   
   
   
   
   ### What You Expected?
   
   BufferControlBlock should work correctly and should not block fragment threads indefinitely.
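
As a rough illustration of the expected behaviour, here is a minimal sketch of a bounded buffer whose producer cannot wait forever. It is not the Doris implementation: `BoundedResultBuffer`, its `Status` enum, the `cancel()` method, and the timeout parameter are all hypothetical names chosen for this example. The only idea it shows is that the wait in `add_batch` uses a predicate that also checks a cancel flag, and a bounded `wait_for` instead of an unbounded `wait`.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for BufferControlBlock; NOT the Doris class.
class BoundedResultBuffer {
public:
    enum class Status { OK, CANCELLED, TIMED_OUT };

    explicit BoundedResultBuffer(std::size_t max_rows) : _max_rows(max_rows) {}

    // Producer side: blocks while the buffer is full, but gives up when the
    // buffer is cancelled or the timeout expires instead of waiting forever.
    Status add_batch(std::vector<int> batch, std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(_mutex);
        const bool ready = _not_full.wait_for(lock, timeout, [this] {
            return _cancelled || _buffered_rows < _max_rows;
        });
        if (_cancelled) return Status::CANCELLED;
        if (!ready) return Status::TIMED_OUT;
        _buffered_rows += batch.size();
        _batches.push_back(std::move(batch));
        return Status::OK;
    }

    // Consumer side (the role FE plays in this issue): takes one batch and
    // wakes a producer that may be waiting for free space.
    bool get_batch(std::vector<int>* out) {
        std::lock_guard<std::mutex> lock(_mutex);
        if (_batches.empty()) return false;
        *out = std::move(_batches.front());
        _batches.pop_front();
        _buffered_rows -= out->size();
        _not_full.notify_one();
        return true;
    }

    // Called once the consumer is known to be gone; releases every blocked producer.
    void cancel() {
        std::lock_guard<std::mutex> lock(_mutex);
        _cancelled = true;
        _not_full.notify_all();
    }

private:
    const std::size_t _max_rows;
    std::size_t _buffered_rows = 0;
    bool _cancelled = false;
    std::deque<std::vector<int>> _batches;
    std::mutex _mutex;
    std::condition_variable _not_full;
};
```

With a buffer like this, a writer that finds the buffer full returns `TIMED_OUT` or `CANCELLED` and its thread goes back to the pool, rather than staying parked in `pthread_cond_wait` as in the stack trace. Whether a timeout, an explicit cancel on query teardown, or both is the right fix for Doris is a separate design question.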
   
   
   ### How to Reproduce?
   
   1. Modify the code:
   <img width="924" alt="image" src="https://user-images.githubusercontent.com/11487604/215314795-639d6d29-0a07-4898-afd7-c1b34d6d9e0f.png">
   <img width="970" alt="image" src="https://user-images.githubusercontent.com/11487604/215314881-8ff0f9bb-6982-4d6f-a96f-0254521f779e.png">
   2. Create a table t_user and insert 100,000 rows.
   3. Run select * from t_user. The problem reproduces almost 100% of the time.
   
   
   ### Anything Else?
   
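   As fewer and fewer threads remain available, this is also one cause of the fragment timeout errors that many of us run into, especially on heavily loaded clusters.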
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

