samskalicky commented on a change in pull request #19385:
URL: https://github.com/apache/incubator-mxnet/pull/19385#discussion_r508916692
##########
File path:
docs/python_docs/python/tutorials/performance/backend/tensorrt/tensorrt.md
##########
@@ -33,74 +33,92 @@ from mxnet.gluon.model_zoo import vision
import time
import os
+ctx=mx.gpu(0)
+
+path = 'resnet18_v2'
batch_shape = (1, 3, 224, 224)
-resnet18 = vision.resnet18_v2(pretrained=True)
-resnet18.hybridize()
-resnet18.forward(mx.nd.zeros(batch_shape))
-resnet18.export('resnet18_v2')
-sym, arg_params, aux_params = mx.model.load_checkpoint('resnet18_v2', 0)
+x = mx.nd.zeros(batch_shape, ctx=ctx)
+
+model = vision.resnet18_v2(pretrained=True)
+model.hybridize(static_shape=True, static_alloc=True)
+model.collect_params().reset_ctx(ctx)
+
+[p.reset_ctx(ctx) for _,p in model.collect_params().items() if p]
```
-In our first section of code we import the modules needed to run MXNet, and to
-time our benchmark runs. We then download a pretrained version of Resnet18,
-hybridize it, and load it symbolically. It's important to note that the
-experimental version of TensorRT integration will only work with the symbolic
-MXNet API. If you're using Gluon, you must
-[hybridize](https://gluon.mxnet.io/chapter07_distributed-learning/hybridize.html)
-your computation graph and export it as a symbol before running inference.
-This may be addressed in future releases of MXNet, but in general if you're
-concerned about getting the best inference performance possible from your
-models, it's a good practice to hybridize.
+In our first section of code we import the modules needed to run MXNet, and to
+time our benchmark runs. We then download a pretrained version of Resnet18. We
+hybridize (link to hybridization) it with static_alloc and static_shape to get
+the best performance.
## MXNet Baseline Performance
```python
-# Create sample input
-input = mx.nd.zeros(batch_shape)
-
-# Execute with MXNet
-executor = sym.simple_bind(ctx=mx.gpu(0), data=batch_shape, grad_req='null',
- force_rebind=True)
-executor.copy_params_from(arg_params, aux_params)
-
# Warmup
-print('Warming up MXNet')
-for i in range(0, 10):
- y_gen = executor.forward(is_train=False, data=input)
- y_gen[0].wait_to_read()
+for i in range(0, 1000):
+ out = model(x)
+ out[0].wait_to_read()
# Timing
-print('Starting MXNet timed run')
-start = time.process_time()
+start = time.time()
for i in range(0, 10000):
- y_gen = executor.forward(is_train=False, data=input)
- y_gen[0].wait_to_read()
-end = time.time()
-print(time.process_time() - start)
+ out = model(x)
+ out[0].wait_to_read()
+print(time.time() - start)
```
-We are interested in inference performance, so to simplify the benchmark we'll
-pass a tensor filled with zeros as an input. We bind a symbol as usual,
-returning an MXNet executor, and we run forward on this executor in a loop. To
-help improve the accuracy of our benchmarks we run a small number of
-predictions as a warmup before running our timed loop. On a modern PC with an
-RTX 2070 GPU the time taken for our MXNet baseline is **17.20s**. Next we'll
-run the same model with TensorRT enabled, and see how the performance compares.
+For this experiment we are strictly interested in inference performance, so to
+simplify the benchmark we'll pass a tensor filled with zeros as an input.
+To help improve the accuracy of our benchmarks we run a small number of
+predictions as a warmup before running our timed loop. This will ensure various
+lazy operations, which do not represent real-world usage, have completed before
+we measure relative performance improvement. On a system with a V100 GPU, the
+time taken for our MXNet baseline is **19.5s** (512 samples/s).
## MXNet with TensorRT Integration Performance
```python
-# Execute with TensorRT
-print('Building TensorRT engine')
-trt_sym = sym.get_backend_symbol('TensorRT')
-arg_params, aux_params = mx.contrib.tensorrt.init_tensorrt_params(trt_sym,
- arg_params, aux_params)
-mx.contrib.tensorrt.set_use_fp16(True)
-executor = trt_sym.simple_bind(ctx=mx.gpu(), data=batch_shape,
- grad_req='null', force_rebind=True)
-executor.copy_params_from(arg_params, aux_params)
+[...]
+
+[p.reset_ctx(ctx) for _,p in model.collect_params().items() if p]
Review comment:
do you need this here? context is still `mx.gpu()`
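
To make the point concrete, here is a minimal sketch of the redundancy using mock `Parameter`/`ParameterDict` stand-ins (not MXNet's real classes, just an illustration of the semantics): once `collect_params().reset_ctx(ctx)` has run, every parameter already lives on the target context, so the per-parameter comprehension changes nothing.

```python
class Parameter:
    """Illustrative stand-in for mx.gluon.Parameter (not the real class)."""
    def __init__(self, name):
        self.name = name
        self.ctx = "cpu"          # parameters start on CPU

    def reset_ctx(self, ctx):
        self.ctx = ctx            # move this parameter to the given context


class ParameterDict(dict):
    """Illustrative stand-in for the dict returned by collect_params()."""
    def reset_ctx(self, ctx):
        # reset_ctx on the dict already visits every parameter...
        for p in self.values():
            p.reset_ctx(ctx)


params = ParameterDict(weight=Parameter("weight"), bias=Parameter("bias"))
params.reset_ctx("gpu(0)")        # the dict-level call does all the work
assert all(p.ctx == "gpu(0)" for p in params.values())

# ...so the extra per-parameter pass from the diff is a no-op:
before = {k: p.ctx for k, p in params.items()}
[p.reset_ctx("gpu(0)") for _, p in params.items() if p]
assert {k: p.ctx for k, p in params.items()} == before
```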
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]