For the past two days I've been researching ways to avoid expensive
allocations/deallocations in tight loops on the GPU by reusing memory.
Apparently this is also a recurring issue in game programming, so I'd like to
share my approach and ask for comments.
My use-case:
* I allocate and free memory through a custom allocator (cudaMalloc/cudaFree
in my case); a minimal pooling sketch is given after this list.
* Allocation/deallocation is very expensive and often occurs in tight loops.
* Memory is at a premium; we cannot hoard unused memory.
* Memory per object is huge. For example, the VGG neural network (one of the
earliest deep CNNs), fed a typical 224x224x3 RGB image (~150k input values),
has [138 million float32 parameters and needs about 93MB of memory per
image](https://cs231n.github.io/convolutional-networks/#case). A typical
batch size is 32~128 images (roughly 3~12GB per batch), and a new batch is
needed every 200ms~1s (look for "OxfordNet", VGG's other name, in this
[benchmark](https://github.com/soumith/convnet-benchmarks)).