On Tue, Sep 24, 2013 at 2:45 AM, Jed Brown <[email protected]> wrote:
> Karl Rupp <[email protected]> writes:
>
> >>> This can obviously be done incrementally, so storing a batch of
> >>> element matrices to global memory is not a problem.
> >>
> >> If you store element matrices to global memory, you're using a ton of
> >> bandwidth (about 20x the size of the matrix if using P1 tets).
> >>
> >> What if you do the sort/reduce thing within thread blocks, and only
> >> write the reduced version to global storage?
> >
> > My primary metric for GPU kernels is memory transfers from global memory
> > ('flops are free'), hence what I suggest for the assembly stage is to go
> > with something CSR-like rather than COO. Pure CSR may be too expensive
> > in terms of element lookup if there are several fields involved
> > (particularly 3d), so one could push (column-index, value) pairs for
> > each row and making the merge-by-key much cheaper than for arbitrary COO
> > matrices.
>
> I think CSR vs. COO is a second-order optimization to be considered
> after the 20x redundancy has been eliminated and a synchronization
> strategy has been chosen (e.g., coloring vs redundant storage and later
> compression).

Yes. I do not understand Karl's suggestion about CSR/COO. My take-away
from Owens' talk at Brown was that synchronization is too
expensive/complex and that we should always do redundant
storage+compression.

Please please please let's have an example where this takes > 5% of
simulation time. I do not really believe it is alright to work on
something that takes < 50%.

   Matt

> > This, of course, requires the knowledge of the nonzero pattern and
> > couplings among elements, yet this is reasonably cheap to extract for a
> > large number of problems (for example, (non)linear PDEs without
> > adaptivity). Also, the nonzero pattern is rather cheap to obtain if one
> > uses coloring for avoiding expensive atomic writes to global memory.
>
> At this point, I don't mind having the nonzero pattern set ahead of time
> using CPU code. It's reassembly in time-dependent problems with no
> adaptivity or occasional adaptivity that I'm more concerned with.
>

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener
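
For anyone following the thread, here is a minimal CPU sketch of what "redundant storage + compression" could look like. It is not code from anyone quoted above, and all names (`element_triples`, `compress`, the tiny 1D mesh) are hypothetical. Each element writes its full local matrix as (row, col, value) triples with no synchronization, and a later pass sorts by key and sums duplicates. On a GPU the second step would presumably be a parallel sort-by-key followed by a reduce-by-key.

```python
from itertools import groupby


def element_triples(elements, element_matrix):
    """Redundant storage: every element emits all of its local entries as
    (row, col, value) triples, without coordinating with other elements."""
    triples = []
    for nodes in elements:
        for a, i in enumerate(nodes):
            for b, j in enumerate(nodes):
                triples.append((i, j, element_matrix[a][b]))
    return triples


def compress(triples):
    """Compression: sort by the (row, col) key, then sum each run of
    triples that share the same key."""
    def key(t):
        return (t[0], t[1])

    ordered = sorted(triples, key=key)
    return [(i, j, sum(v for _, _, v in run))
            for (i, j), run in groupby(ordered, key=key)]


# Assumed toy problem: a 1D mesh of three two-node elements, each with the
# same 2x2 element matrix (P1 Laplacian on a unit-length element).
elements = [(0, 1), (1, 2), (2, 3)]
ke = [[1.0, -1.0], [-1.0, 1.0]]

redundant = element_triples(elements, ke)  # one triple per local entry
coo = compress(redundant)                  # one triple per distinct (row, col)
```

In this toy case the redundant list has 12 triples and the compressed one has 10, so the overhead is small. The ratio grows with the number of elements sharing each node, which is the bandwidth concern raised in the quoted text for P1 tets.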
