Thanks to @nswamy for his input and the design discussions related to this project, 
and to @frankfliu for explaining the requirements and the use case from the 
customer's perspective.

# Problem Statement

One of the big use cases that MXNet does not currently cater to is loading a model 
and being able to run parallel inference on it from multiple threads while sharing 
the parameters. There have been multiple user requests for this 
[[1]](https://github.com/apache/incubator-mxnet/issues/3946). There has also been 
a lot of confusion around the current state of MXNet with respect to thread safety.

This doc attempts to address three things:

1. Clarify the current state of MXNet with respect to thread safety.
2. Give an idea of the benefits to expect from adding this feature.
3. Solve the problem of parallel inference by providing a multi-threaded inference 
API (C APIs and frontend APIs in CPP and Python).

# Current State of MXNet Thread Safety

## MXNet Dependency Engine Thread Safety

Examining the MXNet dependency engine code, it looks like it was designed to be 
thread safe. To check for thread-safety issues, I pushed a Convolution op into the 
MXNet engine from multiple threads, using the CPP package (a sketch of the pattern 
is shown after the usage below). The script is provided here: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/cpp-package/example/multithreading_engine_push_mxnet_op.cpp

```
./build/cpp-package/example/multithreading_engine_push_mxnet_op 2
```
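
Conceptually, the test does something like the following minimal sketch: spawn 
several threads and push work into the engine through the internal 
`Engine::PushSync` interface declared in `include/mxnet/engine.h`. This only 
illustrates the threading pattern (the actual PoC pushes a real Convolution op), 
and the engine calls used here are an assumption based on that header:

```
#include <mxnet/engine.h>
#include <thread>
#include <vector>

int main() {
  using namespace mxnet;
  Engine* engine = Engine::Get();

  // One engine variable per thread; the engine tracks dependencies on these.
  std::vector<engine::VarHandle> vars;
  for (int i = 0; i < 4; ++i) vars.push_back(engine->NewVariable());

  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i) {
    threads.emplace_back([engine, &vars, i]() {
      // Each thread pushes work into the (supposedly thread-safe) engine.
      engine->PushSync([](RunContext rctx) { /* op computation goes here */ },
                       Context::CPU(), {}, {vars[i]});
    });
  }
  for (auto& t : threads) t.join();
  engine->WaitForAll();  // block until all pushed work has finished
  return 0;
}
```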

The script pushes a Convolution op to the engine from multiple threads. You can 
verify the correctness of the results with this script: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/test_cached_op_ts_check.py

```
python3 test_cached_op_ts_check.py
```

## MXNet Graph Executor Thread Safety

I removed the NaiveEngine-only restriction for the C Predict API and tried to run 
multi-threaded inference with the C Predict API on the ThreadedEngine by commenting 
out the check: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/src/c_api/c_predict_api.cc

When running this example, the program core dumps with memory errors in Graph 
Executor Bind. This shows that the graph executor is not thread safe. 

## Cached Op (Gluon Backend) Thread Safety

I created a cached op in the main thread and spawned multiple threads that each 
invoke the same cached op (a sketch of this pattern follows the usage below). Here 
is the script that does this: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/cpp-package/example/multithreading_engine_push_cached_op.cpp

```
# Usage
./build/cpp-package/example/multithreading_engine_push_cached_op <num_threads> 
<context> <thread_safe>

# Example (thread_safe=0 uses the cached op available in master)
./build/cpp-package/example/multithreading_engine_push_cached_op 20 cpu 0
```
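
The core of that example looks roughly like the following sketch against the C API 
in current master (symbol construction, NDArray setup and error checking are 
elided; the flag keys are just illustrative):

```
#include <mxnet/c_api.h>

#include <thread>
#include <vector>

// Create the cached op once on the main thread, then invoke it concurrently.
void run(SymbolHandle sym, int num_threads, std::vector<NDArrayHandle>* inputs) {
  CachedOpHandle op = nullptr;
  const char* keys[] = {"static_alloc"};
  const char* vals[] = {"false"};
  MXCreateCachedOpEx(sym, 1, keys, vals, &op);

  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([op, inputs]() {
      int num_outputs = 0;
      NDArrayHandle* outputs = nullptr;
      const int* out_stypes = nullptr;
      // Every thread invokes the same cached op handle; with the current
      // CachedOp this is where the races on shared state show up.
      MXInvokeCachedOpEx(op, static_cast<int>(inputs->size()), inputs->data(),
                         &num_outputs, &outputs, &out_stypes);
    });
  }
  for (auto& w : workers) w.join();
  MXNDArrayWaitAll();
  MXFreeCachedOp(op);
}
```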

I see multiple failures when running this: one is in the dmlc ThreadLocalStore 
[[2]](https://github.com/dmlc/dmlc-core/issues/571), another is in MXPlanMemory 
while retrieving the forward_ref_count attribute. These errors are caused by race 
conditions when reading and writing shared state in CachedOp.

# Proposed Solution

### Additions (Prioritized for 1.6)

Proposing to add a minimal thread-safe cached op for inference, with the following 
properties (a simplified sketch follows this list):
1. Similar to cached op, except it supports only inference use cases. 
2. Doesn't support inlining, dynamic shapes, bulking, or static alloc. 
3. Uses static thread_local variables for GraphInfo (which maintains the fwd_graph 
state), for buff (which maintains all NDArray state), and for op_states. [There is 
scope for additional optimization here w.r.t. separating the buffers for inputs 
and params.]
4. The above means that we can instantiate only one thread-safe cached op per 
process. The frontend API for SymbolBlockThreadSafe needs to be a singleton 
because of this limitation.
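
The following is a simplified sketch (not the actual PoC code) of how per-thread 
state can be kept in static thread_local members so that concurrent forward calls 
do not race on shared CachedOp state. The stand-in types are hypothetical; the 
real implementation lives in cached_op_threadsafe.h/.cc linked below:

```
#include <vector>

// Hypothetical stand-ins for the real MXNet types; the actual PoC uses the
// forward nnvm graph, mxnet::NDArray buffers and per-op state objects.
struct GraphInfo { /* forward graph bookkeeping  */ };
struct Buffer    { /* per-thread NDArray storage */ };
struct OpState   { /* per-op auxiliary state     */ };

class CachedOpThreadSafe {
 public:
  void Forward() {
    // Each thread gets its own copy of the graph bookkeeping, array buffers
    // and op states, so concurrent Forward() calls do not race on them.
    static thread_local GraphInfo graph_info;
    static thread_local Buffer buff;
    static thread_local std::vector<OpState> op_states;
    // ... run the (read-only) forward graph against this thread-local state ...
  }
};
```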

### C API Changes (Prioritized for 1.6)

Add a new thread_safe flag to MXCreateCachedOpEx. When set to true, this should 
create a thread-safe cached op instead of a regular cached op.

```
  /*!
   * \brief create cached operator
   */
  MXNET_DLL int MXCreateCachedOpEx(SymbolHandle handle,
                                   int num_flags,
                                   const char** keys,
                                   const char** vals,
                                   CachedOpHandle *out,
                                   bool thread_safe = false);
```

Add a similar thread_safe flag to the Invoke and Free APIs so that the thread-safe 
cached op versions are invoked instead of the default versions (a usage sketch 
follows the declarations below). 

```
  /*!
   * \brief invoke a cached op
   * \param handle the handle to the cached op
   * \param num_inputs number of input NDArrays
   * \param inputs input NDArrays
   * \param num_outputs number of output NDArrays
   * \param outputs output NDArrays
   * \param out_stypes output ndarrays' stypes
   * \param thread_safe whether to invoke thread safe version of cached op.
   * \return 0 when success, -1 when failure happens
   */
  MXNET_DLL int MXInvokeCachedOpEx(CachedOpHandle handle,
                                   int num_inputs,
                                   NDArrayHandle *inputs,
                                   int *num_outputs,
                                   NDArrayHandle **outputs,
                                   const int** out_stypes,
                                   bool thread_safe = false);

  /*!
   * \brief free cached operator
   */
  MXNET_DLL int MXFreeCachedOp(CachedOpHandle handle, bool thread_safe = false);
```
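
A rough usage sketch of the proposed flag, using the signatures as proposed above 
(they may still change during review; symbol and input setup are elided):

```
#include <mxnet/c_api.h>

// Sketch: create, invoke and free a thread-safe cached op through the
// proposed C API additions.
void create_invoke_free(SymbolHandle sym, int num_inputs, NDArrayHandle* inputs) {
  CachedOpHandle op = nullptr;
  // thread_safe=true selects the new thread-safe cached op implementation.
  MXCreateCachedOpEx(sym, 0, nullptr, nullptr, &op, true);

  int num_outputs = 0;
  NDArrayHandle* outputs = nullptr;
  const int* out_stypes = nullptr;
  // thread_safe=true routes the call to the thread-safe invoke path; this
  // call can then be made concurrently from multiple threads.
  MXInvokeCachedOpEx(op, num_inputs, inputs, &num_outputs, &outputs,
                     &out_stypes, true);

  MXFreeCachedOp(op, true);  // free the thread-safe cached op
}
```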

#### Please see the PoC here for details:

1. Thread-safe cached op code: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/src/imperative/cached_op_threadsafe.h 
and 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/src/imperative/cached_op_threadsafe.cc
2. Example code for invoking cached op inference from multiple threads: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/cpp-package/example/multithreading_engine_push_cached_op.cpp

```
# Usage
./build/cpp-package/example/multithreading_engine_push_cached_op <num_threads> 
<context> <thread_safe>

# Example
./build/cpp-package/example/multithreading_engine_push_cached_op 20 cpu 1
```

#### Use Cases Tested:

1. Create cached op with a single op (Convolution) from main thread. Spawn 
additional threads and invoke cached op from each thread.
2. Create cached op with a full model (resnet-18) from main thread. Spawn 
additional threads and invoke cached op from each thread.

### CPP Frontend Changes (Prioritized for 1.6)

1. Add a singleton SymbolBlock (thread-safe version) with an imports API like the 
one in Python, targeted at inference (a rough interface sketch follows below).
2. Params will be loaded using the NDArray module.
3. Initially only a single context will be supported; this can be extended to 
multiple contexts later.
4. The forward call will invoke the cached op, passing the input NDArrays and 
param NDArrays.
5. Initially, sparse storage types and casting won't be supported.
6. This will be added to the contrib API.

@access2rohit will be helping me with the CPP API changes.
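
A rough sketch of what such a singleton contrib interface might look like; all 
names here are hypothetical and not a final API:

```
#include <string>
#include <vector>

#include "mxnet-cpp/MxNetCpp.h"

namespace mxnet {
namespace cpp {
namespace contrib {

// Hypothetical singleton wrapper: imports a symbol + params once and lets
// multiple threads call Forward() concurrently on the thread-safe cached op.
class SymbolBlockThreadSafe {
 public:
  // Only one instance per process, mirroring the one-thread-safe-cached-op
  // per-process limitation described above.
  static SymbolBlockThreadSafe& Imports(const std::string& symbol_file,
                                        const std::string& params_file,
                                        const Context& ctx) {
    static SymbolBlockThreadSafe instance(symbol_file, params_file, ctx);
    return instance;
  }

  // Invokes the underlying thread-safe cached op with the inputs plus the
  // params loaded at import time.
  std::vector<NDArray> Forward(const std::vector<NDArray>& inputs);

 private:
  SymbolBlockThreadSafe(const std::string& symbol_file,
                        const std::string& params_file, const Context& ctx);
};

}  // namespace contrib
}  // namespace cpp
}  // namespace mxnet
```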

### Python Frontend Changes (Lower Priority, Post 1.6)

1. Add a SymbolBlock (thread-safe version, singleton) inheriting from SymbolBlock, 
with imports and forward APIs. 
2. Here is a PoC: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/python/mxnet/gluon/contrib/block.py
 and an example of how to call it: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/test_symbolblock_cached_op_ts.py
3. The PoC is currently not functional and hangs intermittently. This could be 
because of WaitForVar/WaitForAll thread-safety issues, cross-device copy 
thread-safety issues, and/or issues with the usage of Python thread locals. This 
requires more investigation.

# Existing Issues

1. dmlc-core ThreadLocalStore issue 
[[2]](https://github.com/dmlc/dmlc-core/issues/571). Reverting to MX_THREAD_LOCAL 
fixes the issue, but the downsides of reverting need to be explored. (HIGH 
PRIORITY FOR 1.6)
2. WaitForVar and WaitForAll are not thread safe. (HIGH PRIORITY FOR 1.6)
3. Python API issues mentioned above. (LOWER PRIORITY, POST 1.6)

# Expected Benefits

One big benefit is being able to run inference on the same model with shared 
params from multiple threads. The current workaround is to use the multiprocessing 
library and import MXNet in each process; sharing params across threads instead 
saves a lot of memory footprint and improves inference throughput on a single 
machine. To obtain some numbers, I wrote a multiprocessing script in Python that 
loads the model and runs inference from multiple processes. 

Please see here for the Python script: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/test_symbolblock_cached_op_ts.py
This runs out of memory with 12 parallel inferences. 

For the same model inference run from CPP, please see the example here: 
https://github.com/anirudh2290/mxnet/tree/multithreaded_inference_poc/cpp-package/example/multithreading_engine_push_cached_op_full_model.cpp

```
# Usage 
./build/cpp-package/example/multithreading_engine_push_cached_op_full_model 
<num_threads> <context>

# Example
./build/cpp-package/example/multithreading_engine_push_cached_op_full_model 20 
cpu
```

This is able to run more than 960 parallel inferences, though latency increases 
with a higher number of parallel inferences.


# Model Coverage

|Models Tested|MKLDNN|CUDNN|NO-CUDNN|
| --- | --- | --- | --- |
| resnet-18 | Yes | Yes | Yes |

This is a work-in-progress list; more models will be added to it.

# What will not be supported for 1.6?

Since this is a new interface where many things can go wrong, we are starting 
small and will incrementally add support. A lot of these features may just work, 
but they require verification effort that won't be feasible for 1.6.

1. Only operators tested with the existing model coverage are supported. Other 
operators (stateful operators, custom operators) are not supported.
2. Only dense storage types are supported currently.
3. Multi-GPU inference is not supported currently.
4. Instantiating multiple instances of SymbolBlockThreadSafe is not supported; 
parallel inference can only be run on one model per process.
5. Dynamic shapes are not supported.
6. static_alloc and static_shape are not supported.
7. Bulking of ops is not supported.
8. This is only for inference use cases; backward pass/training use cases are not 
supported.
9. Graph rewrites with the subgraph API are currently not supported.
10. Python frontend changes (see above).


# References 

1. https://github.com/apache/incubator-mxnet/issues/3946
2. https://github.com/dmlc/dmlc-core/issues/571

