trevor-m opened a new pull request #8172: URL: https://github.com/apache/tvm/pull/8172
This PR makes two changes that help reduce GPU memory usage by TensorRT for models that use a dynamic batch dimension (tensors with shapes like `(relay.Any(), 3, 224, 224)`):

1. TensorRT engines are built with a "max batch size" parameter, which means an engine can serve inputs with any batch size from 1 up to that maximum. Previously, we built a new TensorRT engine for each unique batch size encountered at runtime. With this PR, when we encounter a new batch size, we first try to match it to an already built engine with an equal or higher max batch size. This reduces the number of engines created at runtime.

2. Because of the first change, we have to rethink how the GPU device buffers are allocated, since an engine can now be used for multiple batch sizes. This PR decouples the device buffers from the engines, so there is only one set of device buffers per subgraph, allocated for the largest batch size encountered. This further reduces memory usage: only one buffer per input is allocated, whereas previously each engine had its own set of buffers. This also fixes the issue from https://github.com/apache/tvm/pull/7162.
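The engine-reuse and buffer-decoupling strategy above can be sketched roughly as follows. This is a hypothetical illustration in Python (the actual TVM TensorRT runtime is C++, and the class and helper names here are invented for clarity), showing the lookup order: reuse the smallest cached engine whose max batch size covers the request, and grow the single per-subgraph buffer set only when a larger batch appears.

```python
class TrtSubgraphCache:
    """Illustrative sketch of per-subgraph engine and buffer reuse."""

    def __init__(self):
        self.engines = {}           # max batch size -> engine handle
        self.buffer_batch_size = 0  # largest batch size allocated so far

    def get_engine(self, batch_size):
        # Prefer the smallest existing engine that can serve this batch,
        # so we avoid building a new engine for every unique batch size.
        candidates = [b for b in self.engines if b >= batch_size]
        if candidates:
            return self.engines[min(candidates)]
        # No cached engine is large enough: build one for this batch size.
        engine = self._build_engine(max_batch_size=batch_size)
        self.engines[batch_size] = engine
        return engine

    def get_buffers(self, batch_size):
        # Buffers are decoupled from engines: one set per subgraph,
        # reallocated only when a larger batch size is encountered.
        if batch_size > self.buffer_batch_size:
            self.buffer_batch_size = batch_size
            self._reallocate_buffers(batch_size)
        return self.buffer_batch_size

    def _build_engine(self, max_batch_size):
        # Placeholder for TensorRT engine construction.
        return f"engine(max_batch={max_batch_size})"

    def _reallocate_buffers(self, batch_size):
        # Placeholder for device-side buffer allocation.
        pass
```

For example, after serving a batch of 8, a later batch of 4 reuses the same engine and buffers, while a batch of 16 triggers a new engine build and a buffer reallocation.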
