Hi, we have identified that CUDA/cuDNN autotuning produces a significant spike in RAM usage while it searches for the best convolution algorithm. As far as we understand, this happens inside the cuDNN library, but on platforms like the TX1, where we only have 4 GB of memory, it is problematic: the spike comes close to 4 GB. Autotuning can be disabled with an environment variable, but on these platforms it might be more interesting to run the search once, save the chosen algorithms, and not repeat the search on every run; disabling autotuning outright means you are probably doing convolutions with slower kernels. A rough sketch of the idea is below.
The second topic I wanted to bring up: would it be a good idea to have configurable kernel launch parameters to optimize SM resource utilization? This could be done either at compile time, based on the target arch (see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640924/), or from a runtime profile; a sketch of the runtime variant follows.
Any thoughts on these topics?

Pedro.