Thanks for the thoughtful and valuable comments @arcadiaphy.

> I've deployed many models with the Scala API and run them in multiple
> threads. The whole system has run smoothly in a production environment for
> more than 2 months.

> The inference backend is the graph executor, which is created for each
> thread with shared model parameters. The executors can be reshaped
> dynamically and independently in each thread according to the shape of the
> input data.

Yes, if I am not mistaken, this is very similar to how the C Predict API
supports multi-threaded inference today.
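
For concreteness, here is a minimal Python sketch of that pattern, assuming a toy symbol and made-up shapes rather than the Scala code referenced above: every thread binds its own executor against one shared set of parameter NDArrays.

```python
import os
# As the quoted comment below explains, the dependency engine is not thread
# safe, so this per-thread-executor pattern is normally run with the naive
# engine (selected before mxnet is imported).
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

import threading
import mxnet as mx

ctx = mx.cpu()

# A toy symbol standing in for a real pretrained model.
data = mx.sym.Variable('data')
net = mx.sym.FullyConnected(data, num_hidden=10, name='fc')
net = mx.sym.softmax(net, name='out')

# Model parameters are allocated once and shared (read-only) by every executor.
shared_params = {
    'fc_weight': mx.nd.random.uniform(shape=(10, 32), ctx=ctx),
    'fc_bias': mx.nd.zeros((10,), ctx=ctx),
}

def infer(batch):
    # Each thread binds its own executor; only the 'data' array is private.
    args = dict(shared_params)
    args['data'] = mx.nd.zeros(batch.shape, ctx=ctx)
    executor = net.bind(ctx, args=args, grad_req='null')
    executor.forward(is_train=False, data=batch)
    return executor.outputs[0].asnumpy()

def run(batch):
    print(infer(batch).shape)

threads = [threading.Thread(target=run, args=(mx.nd.random.uniform(shape=(4, 32)),))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Per-thread reshaping, as described in the quote, would then go through `Executor.reshape`.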

> Like what's mentioned above, the dependency engine is not thread safe, so if
> you run it in the threaded engine, deadlocks and core dumps will happen.
> Therefore, the naive engine is the only option left. Without dependency
> scheduling, any write dependency on the model parameters is likely to be
> executed simultaneously and mess up the internal data. If MKLDNN is used to
> accelerate inference, you will get non-deterministic results per inference
> because MXNet stealthily reorders the data in the ndarray (a write dependency
> is involved) for MKLDNN operators. I've used a temporary method to address
> this issue which is not suitable for an official PR.

This is a very useful point. In my proposal I was concentrating mostly on the 
ThreadedEngine rather than the NaiveEngine. That said, I recently added tests 
for the NaiveEngine in my PR and everything seemed to work fine. So far I have 
not been able to reproduce the correctness issue you mention with MKLDNN (the 
hidden write) and the NaiveEngine, but that could be because the Reorder 
doesn't happen in the spawned thread. Here is my test: 
https://github.com/apache/incubator-mxnet/pull/16654/files#diff-1335fbaf3930b1438d9be18edb07a1a6R1384
 . I am not sure whether something changed with MKLDNN 1.0 or whether my test 
simply doesn't catch that use case; I will dig more into this.
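
The kind of check I have in mind looks roughly like the sketch below (Python rather than the actual C++ test linked above, and reusing the illustrative `infer` helper and toy model from the previous sketch): run the identical input concurrently many times and compare against a single-threaded baseline, so a racing hidden reorder would surface as mismatching outputs.

```python
import threading
import numpy as np
import mxnet as mx

fixed_batch = mx.nd.random.uniform(shape=(4, 32))
reference = infer(fixed_batch)            # single-threaded baseline
results = [None] * 16

def check(i):
    # Every thread runs inference on the identical input.
    results[i] = infer(fixed_batch)

threads = [threading.Thread(target=check, args=(i,)) for i in range(len(results))]
for t in threads:
    t.start()
for t in threads:
    t.join()

# A hidden MKLDNN reorder racing across threads would show up as a mismatch here.
for r in results:
    assert np.allclose(r, reference), "non-deterministic output across threads"
```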


> Multithreaded inference should be used with caution. Sharing model parameters 
> can reduce the memory footprint of your program, but a lot of memory is 
> consumed by global resources (temporary workspace, random number generator, 
> ...) or the MKLDNN op cache, which are stored in static thread_local 
> variables. So the thread count is the most important factor for the memory 
> footprint: any thread involving an MXNet operation, even a trivial imperative 
> operator call, will incur memory overhead by creating its own set of 
> thread_local variables. I've spent a lot of time tracking down memory leaks, 
> and the best solution is to limit the number of threads.
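
To make the "limit the thread count" advice concrete, here is a rough sketch, again reusing the illustrative `net`, `shared_params`, and `ctx` from the first sketch: requests are served by a fixed pool of long-lived workers, each lazily creating one executor in thread-local storage, so the static thread_local workspace and op-cache cost is paid at most `max_workers` times instead of once per request thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
import mxnet as mx

tls = threading.local()

def get_executor():
    # One executor per pool worker, created lazily and then reused, so the
    # static thread_local resources are allocated only in these workers.
    if not hasattr(tls, 'executor'):
        args = dict(shared_params)
        args['data'] = mx.nd.zeros((4, 32), ctx=ctx)
        tls.executor = net.bind(ctx, args=args, grad_req='null')
    return tls.executor

def serve(batch):
    executor = get_executor()
    executor.forward(is_train=False, data=batch)
    return executor.outputs[0].asnumpy()

# A bounded pool keeps the number of threads that ever touch MXNet fixed.
with ThreadPoolExecutor(max_workers=4) as pool:
    batches = [mx.nd.random.uniform(shape=(4, 32)) for _ in range(100)]
    outputs = list(pool.map(serve, batches))
```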

> A new method to do multithreaded inference with the threaded engine is very 
> welcome here. It will solve the above issues automatically and ensure result 
> correctness by enforcing dependency checking.

Yes, the earlier approach, which uses one graph executor per thread, can 
consume a lot of memory for global resources. Sharing the cached op will 
alleviate that pain. As you know, we still have a lot of customers using the 
graph executor as the backend, so it would be a great addition if you are 
interested in contributing towards making the graph executor thread safe for 
inference use cases as well.
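
Purely as an illustration of what "sharing the cached op" could look like from the Python/Gluon side, here is a hypothetical sketch: a single hybridized block holds one cached op and one set of parameters, and every worker thread calls that same instance instead of building its own executor. Whether this is actually safe depends on the engine and on the thread-safe CachedOp work discussed in this thread, so treat it as a sketch of the goal rather than supported usage today.

```python
from concurrent.futures import ThreadPoolExecutor
import mxnet as mx
from mxnet.gluon import nn

# One block, one set of parameters, one cached op after the first hybridized call.
model = nn.Dense(10)
model.initialize(ctx=mx.cpu())
model.hybridize(static_alloc=True, static_shape=True)
model(mx.nd.zeros((4, 32)))               # first call builds the cached op

def predict(batch):
    # All threads reuse the same cached op and parameters instead of
    # creating their own executor and duplicating thread_local state.
    return model(batch).asnumpy()

with ThreadPoolExecutor(max_workers=4) as pool:
    batches = [mx.nd.random.uniform(shape=(4, 32)) for _ in range(8)]
    outputs = list(pool.map(predict, batches))
```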
