trevor-m opened a new pull request #8172:
URL: https://github.com/apache/tvm/pull/8172


   This PR makes two changes that help reduce GPU memory usage when TensorRT is 
used for models with a dynamic batch dimension (tensors with shapes like 
`(relay.Any(), 3, 224, 224)`).
   
   1. TensorRT engines are built with a "max batch size" parameter, which means 
an engine can serve inputs with any batch size from 1 up to that maximum. 
Previously, we built a new TensorRT engine for each unique batch size 
encountered at runtime. With this PR, when we encounter a new batch size, we 
first try to match it to an already built engine with an equal or higher max 
batch size. This reduces the number of engines created at runtime.
   
   2. Because of the first change, an engine can now be used for multiple batch 
sizes, so we have to rethink how the GPU device buffers are allocated. This PR 
decouples the device buffers from the engine, so there is only one set of 
device buffers per subgraph, allocated for the largest batch size encountered 
so far. This further reduces memory usage, since only one buffer per input is 
allocated, whereas previously each engine had its own set of buffers. It also 
fixes the issue from 
https://github.com/apache/tvm/pull/7162.
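   The engine-matching logic in change 1 can be sketched as follows. This is a 
hedged Python sketch, not the actual TVM C++ runtime code; `cache`, 
`build_engine`, and `get_or_build_engine` are hypothetical names. The cache 
maps each engine's max batch size to the engine, and a lookup prefers the 
smallest existing engine that can still cover the requested batch:

   ```python
   def get_or_build_engine(cache, batch_size, build_engine):
       """cache: dict mapping max_batch_size -> engine.

       Reuse the smallest cached engine whose max batch size covers the
       requested batch; otherwise build (and cache) a new engine whose
       max batch size equals the requested batch size.
       """
       candidates = [b for b in cache if b >= batch_size]
       if candidates:
           # An existing engine already covers this batch size: no rebuild.
           return cache[min(candidates)]
       engine = build_engine(batch_size)  # built with max_batch_size = batch_size
       cache[batch_size] = engine
       return engine
   ```

   For example, after serving a batch of 4, a later batch of 2 reuses the same 
engine, and only a batch larger than 4 triggers a new build.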
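   The buffer policy in change 2 can be sketched similarly. Again this is a 
hedged illustration in Python rather than the real CUDA allocation code, and 
`buffers` and `get_device_buffer` are hypothetical names: one buffer is kept 
per input name, and it is reallocated only when a batch larger than any seen 
so far arrives, so the buffer ends up sized for the largest batch encountered:

   ```python
   def get_device_buffer(buffers, name, batch_size, bytes_per_sample):
       """buffers: dict mapping input name -> (capacity_in_batches, buffer).

       Grow-only allocation: a smaller batch reuses the existing buffer,
       a larger batch replaces it with one sized for the new batch.
       """
       cap, buf = buffers.get(name, (0, None))
       if batch_size > cap:
           # Stand-in for a device allocation sized for the new, larger batch.
           buf = bytearray(batch_size * bytes_per_sample)
           buffers[name] = (batch_size, buf)
       return buffers[name][1]
   ```

   Because the buffers live outside any engine, several engines built for 
different max batch sizes can share the same per-input buffers instead of each 
holding its own copies.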
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]