QueensGambit opened a new issue #16173: Saving and loading of cuDNN optimization and graph fusion
URL: https://github.com/apache/incubator-mxnet/issues/16173
 
 
   Hello everyone,
   
   there are several tasks which are executed repeatedly when binding MXNet graphs and which produce the same outcome as long as the graph is unchanged.
   In theory, these results could be saved to disk and reloaded later.
   These tasks include cuDNN convolution autotuning, TensorRT graph fusion, and Intel MKLDNN graph optimization.
   
   Here is a short overview:
   
   ## cuDNN convolution autotuning
   
   * **Description**: runs performance tests for convolutional layers to determine which convolution algorithms perform best for the given computation graph
   * **Indicated by**:
   ```
   incubator-mxnet/src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97:
   Running performance tests to find the best convolution algorithm, 
   this can take a while...
   (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
   ```
   * **Optimization time**: medium (e.g. 9 seconds)
   * **Urgency**: high
   * **Note**: An experimental version of exporting and loading the cuDNN optimization has already been implemented by @KellenSunderland and his team (https://github.com/apache/incubator-mxnet/issues/14539).
   * **Related Issues**: https://github.com/apache/incubator-mxnet/issues/10567, https://github.com/apache/incubator-mxnet/issues/14539
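   
   For reference, autotuning can already be switched off today via the environment variable from the log message above; a minimal sketch (the variable should be set before the model is bound):
   
   ```python
   # Minimal sketch: disable cuDNN convolution autotuning via the documented
   # MXNET_CUDNN_AUTOTUNE_DEFAULT environment variable (0 = off).
   import os
   os.environ['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
   
   import mxnet as mx  # imported after the variable is set
   ```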
   
   ## TensorRT graph fusion
   
   * **Description**: attempts to fuse multiple CUDA operations into a single one, which saves memory transfer time
   * **Indicated by**: multiple log messages, e.g. in case the current GPU does not support fp16:
   ```
   ../src/operator/subgraph/tensorrt/onnx_to_tensorrt.cc:121:
    TensorRT can't use fp16 on this platform
   ```
   * **Optimization time**: long (e.g. 24 seconds)
   * **Urgency**: high
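   
   For context, a hedged sketch of how this pass is currently triggered with the contrib API (following the MXNet 1.5 TensorRT tutorial; the exact calls may differ between releases). Since the fused engine is not persisted, every process start pays the fusion cost again:
   
   ```python
   # Hedged sketch, assuming the MXNet 1.5 contrib TensorRT integration:
   # the fusion pass runs on every process start and its result is not persisted.
   import mxnet as mx
   
   sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-50', 0)
   trt_sym = sym.get_backend_symbol('TensorRT')  # runs the graph fusion pass
   mx.contrib.tensorrt.init_tensorrt_params(trt_sym, arg_params, aux_params)
   executor = trt_sym.simple_bind(ctx=mx.gpu(0), data=(1, 3, 224, 224),
                                  grad_req='null', force_rebind=True)
   ```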
   
   ## MKLDNN graph optimization
   
   * **Description**: applies MKLDNN optimizations to Convolution and FullyConnected layers; enabled by `MXNET_SUBGRAPH_BACKEND=MKLDNN`
   * **Indicated by**:
   ```
   src/operator/subgraph/build_subgraph.cc:686:
   start to execute MKLDNN convolution optimization pass.
   src/operator/subgraph/build_subgraph.cc:686:
   start to execute MKLDNN FullyConnected optimization pass.
   ```
   * **Optimization time**: very fast (e.g. < 1 second)
   * **Urgency**: low
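   
   A minimal sketch of enabling this pass (the variable must be set before the model is bound):
   
   ```python
   # Minimal sketch: enable the MKLDNN subgraph optimization pass.
   import os
   os.environ['MXNET_SUBGRAPH_BACKEND'] = 'MKLDNN'
   
   import mxnet as mx  # imported after the variable is set
   ```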
   
   ## Experimental support
   
   First, I suggest adding two experimental API methods (Python, C++, ...) for each optimization technique independently, which can be called via `model.save_cache()` and `model.load_cache()`.
   These methods are only supported for static graphs (e.g. after `net.hybridize()` in the case of Gluon).
   
   ```python
   def load_cache(filename, type='cudnn'):
       """Load an optimization cache from a file previously saved by `save_cache`.
   
       Parameters
       ----------
       filename : str
           Path to the cache file.
       type : str, default 'cudnn'
           Must be in {'cudnn', 'tensorrt', 'mkldnn'}.
   
       References
       ----------
       `Saving and Loading Optimization Cache for Models \
       `_
       """
       if type == 'cudnn':
           raise NotImplementedError
           # _load_cudnn_cache(filename)
       elif type == 'tensorrt':
           raise NotImplementedError
           # _load_tensorrt_cache(filename)
       elif type == 'mkldnn':
           raise NotImplementedError
           # _load_mkldnn_cache(filename)
       else:
           raise ValueError("type must be in {'cudnn', 'tensorrt', 'mkldnn'}")
   ```
   
   ```python
   def save_cache(filename, type='cudnn'):
       """Save the optimization cache for the graph.
   
       Must be run after `model.bind(optimize=True)`.
   
       Parameters
       ----------
       filename : str
           Path to the cache file.
       type : str, default 'cudnn'
           Must be in {'cudnn', 'tensorrt', 'mkldnn'}.
   
       References
       ----------
       `Saving and Loading Optimization Cache for Models \
       `_
       """
       if type == 'cudnn':
           raise NotImplementedError
           # _save_cudnn_cache(filename)
       elif type == 'tensorrt':
           raise NotImplementedError
           # _save_tensorrt_cache(filename)
       elif type == 'mkldnn':
           raise NotImplementedError
           # _save_mkldnn_cache(filename)
       else:
           raise ValueError("type must be in {'cudnn', 'tensorrt', 'mkldnn'}")
   ```
   This addition requires a new boolean parameter `optimize` for the `bind()` methods, which is set to `True` by default for backward compatibility:
   ```python
   def bind(self, ctx, args, args_grad=None, grad_req='write',
            aux_states=None, group2ctx=None, shared_exec=None, optimize=True):
       """
       # ...
       optimize : boolean, optional
           When set to True, the model is optimized with the current back-end
           (e.g. cuDNN, TensorRT, MKLDNN).
       """
   ```
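   
   A hypothetical usage example of the proposed methods (file name, exception type, and control flow are illustrative only):
   
   ```python
   # Hypothetical usage of the proposed API; all names follow the suggestion above.
   cache_file = 'model.cudnn.cache'
   try:
       model.load_cache(cache_file, type='cudnn')  # reuse earlier autotuning results
       model.bind(ctx, args, optimize=False)       # skip the expensive pass
   except IOError:
       model.bind(ctx, args, optimize=True)        # run the autotuning once ...
       model.save_cache(cache_file, type='cudnn')  # ... and persist the result
   ```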
   
   ## Automatic integration with `bind()`
   As soon as caching and reloading have surpassed experimental status, we can consider integrating them as an automatic procedure into `bind()`:
   
   For a unified module, I imagine the following process:
   
   For every model, a unique fingerprint is generated, similar to a git commit hash in terms of length.
   The last seven digits of the fingerprint determine the filename of the cache file.
   The fingerprint is based on the following ingredients:
   
   For **cuDNN convolution autotuning:**
   * MXNet version (major, minor)
   * model structure
   * CUDA version (major, minor)
   * cuDNN version (major, minor)
   
   For **TensorRT graph optimization:**
   * MXNet version (major, minor)
   * model structure
   * CUDA version (major, minor)
   * cuDNN version (major, minor)
   * TensorRT version (major, minor)
   
   For **MKLDNN graph optimization:**
   * MXNet version (major, minor)
   * model structure
   * MKLDNN version (major, minor)
   
   On every `bind()` call, MXNet generates the fingerprint and attempts to load the corresponding cache file from the current working directory.
   If the file is not found or loading fails, the optimization is run and the result is saved to `<fingerprint_digits>.cache` afterwards.
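   
   A minimal sketch of how such a fingerprint could be computed (hash function, input encoding, and version granularity are assumptions for illustration):
   
   ```python
   import hashlib
   
   def make_fingerprint(symbol_json, versions):
       """Illustrative only: hash the model structure together with the
       relevant library versions, git-commit-hash style."""
       h = hashlib.sha1()
       h.update(symbol_json.encode('utf-8'))           # model structure
       for name, version in sorted(versions.items()):  # e.g. {'cuda': '10.1', ...}
           h.update('{}={}'.format(name, version).encode('utf-8'))
       return h.hexdigest()
   
   # e.g. for cuDNN convolution autotuning:
   # fp = make_fingerprint(sym.tojson(),
   #                       {'mxnet': '1.5', 'cuda': '10.1', 'cudnn': '7.6'})
   # cache_file = fp[-7:] + '.cache'  # last seven digits, as proposed above
   ```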
   
   ---
   
   Is an important detail missing, or do you recommend changes to certain aspects (e.g. naming conventions)?
   I am interested to hear your thoughts on this.
   
   
   Best regards,
   ~Johannes Czech
