I recently found that my application spends about 65% of its time in the garbage collector, so I'm looking for ways to reduce the amount of memory allocated for intermediate results. For example, I found that "A * B" can be replaced with "A_mul_B!(out, A, B)", which writes into a preallocated "out" buffer and thus almost eliminates the extra allocation. But my application still produces a lot of garbage in operations like matrix addition and subtraction, multiplication by a scalar, etc.
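For concreteness, here is a minimal sketch of the replacement described above, assuming Julia 0.6 or earlier (where "A_mul_B!" is available in Base). The matrix sizes and variable names are made up for illustration:

```julia
# Assumption: Julia 0.6 or earlier. In Julia 0.7 and later, A_mul_B! was
# renamed to mul! and lives in the LinearAlgebra standard library.

# Illustrative inputs; the sizes are arbitrary.
A = rand(100, 100)
B = rand(100, 100)

# Allocating version: every call creates a new 100x100 result matrix.
C = A * B

# In-place version: the buffer is allocated once, outside any hot loop...
out = zeros(100, 100)

# ...and each call overwrites it with the product instead of allocating.
A_mul_B!(out, A, B)
```

The saving comes from reusing "out" across calls, so the buffer has to be created once up front rather than next to each multiplication.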
Are there any other tricks that help decrease memory usage?
