Hi Everyone,

Would it be a good idea to arrange the data along the fastest-varying direction in the following manner, to make aligned loads and vector operations easier?
Total grid points = 4n, stored in the order:

0, n, 2n, 3n, 1, n+1, 2n+1, 3n+1, and so on.

Ref: "Tuning a Finite Difference Computation for Parallel Vector Processors" http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6341495

This change in the global memory layout would mix up the ghost zones in PETSc's DMDAs and, I guess, change the matrix structure, since adjacent grid points would be separated by a distance of 4 in memory. One could even make the distance 8 and load one full cache line in one go.

I was wondering whether this memory layout can be used for computations with PETSc's DMDAs, and whether the preconditioners would be OK with this kind of arrangement.

Thanks,
Mani
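
P.S. To make the ordering concrete, here is a minimal sketch of the index mapping I have in mind, in plain Python and independent of PETSc. The function names `storage_position` and `grid_index` and the `lanes` parameter are just my own illustrative choices, not anything from PETSc or from the paper.

```python
def storage_position(i, n, lanes=4):
    """Memory slot of natural grid index i when lanes*n points are interleaved."""
    block, offset = divmod(i, n)   # which of the `lanes` chunks, and where inside it
    return offset * lanes + block


def grid_index(p, n, lanes=4):
    """Inverse map: natural grid index stored at memory slot p."""
    offset, block = divmod(p, lanes)
    return block * n + offset


n = 3  # small hypothetical size, so the total is 4n = 12 points
order = [grid_index(p, n) for p in range(4 * n)]
print(order)  # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]
```

With this mapping, grid neighbours i and i+1 inside the same chunk of length n end up 4 slots apart, which is the distance of 4 mentioned above. Passing `lanes=8` gives the distance-8 variant.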
