Re: [I] Possible deadlock when used together with TensorFlow (brpc)

via GitHub Wed, 12 Jul 2023 01:48:04 -0700


zgaze commented on issue #2313:
URL: https://github.com/apache/brpc/issues/2313#issuecomment-1632102012


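   > I'm curious how you use brpc inside TF. Your stack trace shows DirectSession; isn't DirectSession for single-machine local training? It shouldn't go through RPC, should it? Or have you built some special logic on top of DirectSession?
   > Also, the stack trace shows nothing related to brpc. You can print the stacks of all threads in gdb with the command: taas bt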
   
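   It is single-machine. We only expose an RPC interface for TF, similar to TensorFlow Serving: a user calls the HTTP endpoint, the TF computation runs on a single machine, and the result is returned.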
   
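   I checked the other stacks and found nothing abnormal.
The only thing is that 33 threads are blocked in DirectSession's wait. At that point, the ops in the graph call downstream services through brpc. Those calls are asynchronous, but there is no bthread left to handle their callbacks, so the callbacks are presumably piling up in a queue (this is a guess; once I reproduce the problem I will dump the memory contents with gdb to check).
   
   "Waiting inside the RPC worker threads for the computation to finish blocked all of the workers": I think this conclusion is correct, so I am now looking for a solution.
   I am not sure whether making the "wait for the computation to finish" step asynchronous would solve the problem completely.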
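
To make the suspected mechanism concrete, here is a minimal sketch using only the Python standard library. It is an analogy, not brpc or TensorFlow code: the fixed-size `ThreadPoolExecutor` stands in for the brpc worker pool, `handler` for a request handler that waits on the computation, and the small lambda for a downstream callback that needs a free worker. All names (`run_blocking`, `run_async`, `handler`, `on_done`) are hypothetical, and the timeout exists only so that the demo ends instead of hanging.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, TimeoutError


def run_blocking(num_workers, timeout=0.5):
    """Every handler waits, inside a worker, for a callback queued on the same pool."""
    pool = ThreadPoolExecutor(max_workers=num_workers)
    gate = threading.Barrier(num_workers)  # reusable: used twice per handler

    def handler():
        gate.wait()  # all workers are now occupied by handlers
        callback = pool.submit(lambda: "done")  # needs a free worker to run
        try:
            outcome = callback.result(timeout=timeout)  # blocks this worker
        except TimeoutError:
            outcome = "starved"  # no worker was free to run the callback
        gate.wait()  # keep every worker busy until all handlers have decided
        return outcome

    futures = [pool.submit(handler) for _ in range(num_workers)]
    results = [f.result() for f in futures]
    pool.shutdown()  # the queued callbacks finally run once workers are free
    return results


def run_async(num_workers):
    """Every handler registers a completion callback and returns immediately."""
    pool = ThreadPoolExecutor(max_workers=num_workers)
    gate = threading.Barrier(num_workers)
    results = []
    lock = threading.Lock()
    all_done = threading.Event()

    def on_done(future):
        with lock:
            results.append(future.result())
            if len(results) == num_workers:
                all_done.set()

    def handler():
        gate.wait()  # same starting condition: all workers busy at once
        pool.submit(lambda: "done").add_done_callback(on_done)
        # Returning here releases the worker, so the callback can be scheduled.

    for _ in range(num_workers):
        pool.submit(handler)
    finished = all_done.wait(timeout=5)
    pool.shutdown()
    return results if finished else None


if __name__ == "__main__":
    print("blocking:", run_blocking(4))
    print("async:   ", run_async(4))
```

In the blocking variant every handler should report that its callback was starved, because all workers are held by handlers waiting on work that can only run on those same workers. In the asynchronous variant the handlers return first, so the callbacks get a worker. Whether the same change is sufficient in the real brpc and TensorFlow setup is still an open question here.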
   
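   I would also like to know: for long-running I/O like this, if it is not done asynchronously, can brpc likewise end up with all of its worker threads occupied? For example, when every worker is stuck in a synchronous read.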



