Thanks for the detailed reply, Rasmus. I'll look into these points this week.
Pete

On Tue, May 28, 2019 at 6:40 PM Rasmus Munk Larsen <[email protected]> wrote:

> Hi Pete,
>
> The way to optimize the tensor library for hardware with limited cache
> sizes would be to:
>
> 1. Reduce the size of the buffer used for the ".block()" interface. I
> believe we currently try to fit them in L1, but perhaps the detection
> doesn't work correctly on your hardware.
> 2. Reduce the block sizes used in TensorContraction.
>
> 1. By default the block size is chosen such that the blocks fit in L1:
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorExecutor.h#lines-166
>
> Each evaluator in an expression reports how much scratch memory it needs to
> compute a block's worth of data through the getResourceRequirements() API,
> e.g.:
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorShuffling.h#lines-230
>
> These values are then merged by the executor in the calls here:
>
> https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/TensorExecutor.h#lines-185
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorExecutor.h#lines-324
>
> 2. The tensor contraction blocking uses a number of heuristics to choose
> block sizes and the level of parallelism. In particular, it tries to pack
> the lhs into L2 and the rhs into L3.
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorContractionThreadPool.h#lines-127
>
> https://bitbucket.org/eigen/eigen/src/3cbfc2d75ecabbb0f17291d0153de6e41e568f15/unsupported/Eigen/CXX11/src/Tensor/TensorContractionThreadPool.h#lines-647
>
> https://bitbucket.org/eigen/eigen/src/default/unsupported/Eigen/CXX11/src/Tensor/TensorContractionThreadPool.h#lines-239
>
> I hope these pointers help.
> Rasmus
>
> On Tue, May 28, 2019 at 7:38 AM Pete Blacker <[email protected]>
> wrote:
>
>> Hi there,
>>
>> I'm currently using the Eigen::Tensor module on a relatively small
>> processor which has a very limited cache: 16 KB of level 1 and no level 2
>> at all! I've been looking for a way to optimise the blocking of operations
>> performed by Eigen for a particular cache size, but I can't find anything
>> so far.
>>
>> Is there a way to optimise the Tensor operations for this type of small
>> cache?
>>
>> Thanks,
>>
>> Pete
