I really like this proposal. Thanks for the great write-up, Przemyslaw.

I haven't fully thought through the pros and cons, but would it be possible to 
record a CUDA event (cudaEventRecord) by default after every block of operators 
is called, and have any dependent block of ops wait on that event with 
cudaStreamWaitEvent? Would this unblock our GPU worker threads, since we would 
no longer be calling cudaStreamSynchronize?

If I'm understanding correctly, that would be equivalent to what you're 
proposing in your second scenario (when we have two CUDA streams)? Would it 
add a lot of overhead in scenario 1, where we use the same stream?
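
To make the question concrete, here is a rough sketch of the two-stream case 
using only the CUDA runtime API. It is not MXNet code: the kernels `fill` and 
`add_one`, the stream names `producer` and `consumer`, and the event name 
`block_done` are all made up for illustration, and I am assuming each "block of 
operators" can be reduced to kernel launches on one stream.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort on any CUDA runtime error so failures are not silently ignored.
#define CHECK(call)                                                   \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "CUDA error: %s (line %d)\n",              \
                   cudaGetErrorString(err_), __LINE__);               \
      std::exit(EXIT_FAILURE);                                        \
    }                                                                 \
  } while (0)

// Hypothetical stand-in for the first block of operators.
__global__ void fill(float* x, int n, float v) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] = v;
}

// Hypothetical stand-in for a dependent block of operators.
__global__ void add_one(const float* in, float* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = in[i] + 1.0f;
}

int main() {
  const int n = 1 << 20;
  const int threads = 256;
  const int blocks = (n + threads - 1) / threads;

  float *a = nullptr, *b = nullptr;
  CHECK(cudaMalloc(&a, n * sizeof(float)));
  CHECK(cudaMalloc(&b, n * sizeof(float)));

  cudaStream_t producer, consumer;
  CHECK(cudaStreamCreate(&producer));
  CHECK(cudaStreamCreate(&consumer));

  // Timing is not needed, so disable it on the event.
  cudaEvent_t block_done;
  CHECK(cudaEventCreateWithFlags(&block_done, cudaEventDisableTiming));

  // First block of ops, then record an event marking its completion.
  fill<<<blocks, threads, 0, producer>>>(a, n, 1.0f);
  CHECK(cudaGetLastError());
  CHECK(cudaEventRecord(block_done, producer));

  // The dependent block waits on the event on the device side.
  // The host thread returns from this call immediately.
  CHECK(cudaStreamWaitEvent(consumer, block_done, 0));
  add_one<<<blocks, threads, 0, consumer>>>(a, b, n);
  CHECK(cudaGetLastError());

  // The host blocks only here, when it actually needs the result.
  CHECK(cudaStreamSynchronize(consumer));
  float first = 0.0f;
  CHECK(cudaMemcpy(&first, b, sizeof(float), cudaMemcpyDeviceToHost));
  std::printf("b[0] = %f\n", first);

  CHECK(cudaEventDestroy(block_done));
  CHECK(cudaStreamDestroy(producer));
  CHECK(cudaStreamDestroy(consumer));
  CHECK(cudaFree(a));
  CHECK(cudaFree(b));
  return 0;
}
```

In the single-stream case, work submitted to one stream already executes in 
submission order, so the wait would be redundant for correctness. What it would 
cost in practice is the part I haven't measured.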
