[I] Possible deadlock when used together with TensorFlow (brpc)

via GitHub Tue, 11 Jul 2023 00:15:37 -0700


zgaze opened a new issue, #2313:
URL: https://github.com/apache/brpc/issues/2313


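   **Describe the bug**
   When brpc is used together with TensorFlow, there is some probability of a deadlock. Strictly speaking, I am not sure this is a brpc bug.
   However, the same usage with our company's internal RPC framework does not deadlock, so it is fairly likely that brpc's bthread is involved.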
   
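   **Symptoms and description**
   Usage:
   1. The RPC handler calls TF's session run, with a TF timeout set. This is the TF main thread, and it should be a bthread.
   2. TF should have its own thread pool that processes the current request (pv).
   3. The TF graph contains one asynchronous node. Inside this node I use brpc's DynamicPartitionChannel to send requests to multiple downstream nodes, with a timeout set. The RPC callback triggers TF's done callback.
   Symptoms:
   1. The deadlocked threads all show a single stack (there is no stack directly related to bthread). It is the RPC thread (a bthread), waiting for the TF result to return. In effect these worker threads are deadlocked and can no longer serve requests.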
   <img width="1203" alt="image" src="https://github.com/apache/brpc/assets/20501506/73d997f8-8785-4355-8617-9678bd5199c0">
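   2. The brpc requests to downstream nodes (issued inside the TF framework, so presumably on pthreads) hit multiple timeouts. One minute later the bthreads start reporting timeouts in large numbers, and then the deadlock occurs.
   3. Of the many servers in production, only one or two have this request. It shows up occasionally after about two weeks of stable running.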
   
   TensorFlow uses the nsync library for synchronization, for example for condition_variable. I am not sure whether it conflicts with bthread.
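
   One way a thread-level condition variable could interact badly with a fixed set of workers is plain worker starvation. The sketch below is only an illustration of that general mechanism, not a diagnosis of this issue and not brpc or TensorFlow code: `WorkerPool`, `Pending`, and `SimulateHandlers` are hypothetical names, `std::thread` stands in for a worker, and `std::condition_variable` stands in for an nsync-style wait.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <deque>
#include <functional>
#include <memory>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical model of a fixed set of workers sharing one FIFO queue.
// It is NOT brpc's scheduler; std::thread merely stands in for a worker.
class WorkerPool {
public:
    ~WorkerPool() {
        { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }
    void Submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(mu_); queue_.push_back(std::move(task)); }
        cv_.notify_one();
    }
    void Start(int workers) {
        assert(workers > 0);
        for (int i = 0; i < workers; ++i) threads_.emplace_back([this] { Loop(); });
    }
private:
    void Loop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !queue_.empty(); });
                if (queue_.empty()) return;  // stopping and nothing left to run
                task = std::move(queue_.front());
                queue_.pop_front();
            }
            task();
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    std::vector<std::thread> threads_;
    bool stop_ = false;
};

// State shared between one handler and the callback that completes it.
struct Pending {
    std::mutex mu;
    std::condition_variable cv;
    bool done = false;
};

// Runs `requests` handlers on `workers` threads. Each handler schedules its
// completion callback on the same pool, then blocks the worker thread until
// the callback fires or `timeout` expires. Returns how many handlers saw
// their callback in time.
int SimulateHandlers(int workers, int requests, std::chrono::milliseconds timeout) {
    std::mutex mu;
    std::condition_variable cv;
    int finished = 0, completed = 0;
    WorkerPool pool;  // declared after the state above so it is joined first
    for (int i = 0; i < requests; ++i) {
        pool.Submit([&, timeout] {
            auto p = std::make_shared<Pending>();
            pool.Submit([p] {  // the "RPC callback": needs a free worker
                { std::lock_guard<std::mutex> lk(p->mu); p->done = true; }
                p->cv.notify_one();
            });
            bool ok;
            {   // thread-level blocking wait: this worker runs nothing else
                std::unique_lock<std::mutex> lk(p->mu);
                ok = p->cv.wait_for(lk, timeout, [&] { return p->done; });
            }
            { std::lock_guard<std::mutex> lk(mu); ++finished; completed += ok; }
            cv.notify_one();
        });
    }
    pool.Start(workers);  // handlers are queued ahead of any callback
    std::unique_lock<std::mutex> lk(mu);
    cv.wait(lk, [&] { return finished == requests; });
    return completed;
}
```

   In this model, `SimulateHandlers(1, 1, std::chrono::milliseconds(100))` returns 0: the handler occupies the only worker, so its callback cannot run until the wait times out. With a second worker, `SimulateHandlers(2, 1, std::chrono::milliseconds(2000))` returns 1, because the spare worker runs the callback. Whether the service described here reaches a comparable state, with every worker parked in a thread-level wait, is an assumption that the stack trace alone does not confirm.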
   
   I am happy to help with troubleshooting, and any suggestions are welcome.
   
   **Expected behavior**
   
   
   **Versions**
   OS:
   Compiler: gcc 4.9.2
   brpc: 1.2.0
   protobuf: 3.8
   
   **Additional context/screenshots**
   
   

