On Sat, Dec 5, 2009 at 4:25 PM, Jed Brown <jed at 59a2.org> wrote:

> On Sat, 5 Dec 2009 16:02:38 -0600, Matthew Knepley <knepley at gmail.com> wrote:
> > I need to understand better. You are asking about the case where we have
> > many GPUs and one CPU? If it's always one or two GPUs per CPU I do not
> > see the problem.
>
> Barry initially proposed one Python thread per node, then distributing
> the kernels over many CPU cores on that node, or to one or more GPUs.
> With some abuse of terminology, let's call them all worker threads,
> perhaps dozens if running on multicore CPUs, or hundreds/thousands when
> on a GPU. The physics, such as FEM integration, has to be done by those
> worker threads. But unless every thread is its own subdomain
> (i.e. Block Jacobi/ASM with very small subdomains), we still need to
> assemble a small number of matrices per node. So we would need a
> lock-free concurrent MatSetValues, otherwise we'll only scale to a few
> worker threads before everything is blocked on MatSetValues.
I imagined that this kind of assembly will be handled similarly to what we
do in the FMM. You assign a few threads per element to calculate the FEM
integral. You could maintain this unassembled if you only need actions.
However, if you want an actual sparse matrix, there are a couple of options:

1) Store the unassembled matrix, and run assembly after integration is
   complete. This needs more memory, but should perform well.

2) Use atomic operations to update. I have not seen this yet, so I am
   unsure how it will perform. (A rough sketch of what this could look
   like on a GPU is appended at the end of this message.)

3) Use some locking scheme (monitor) to update. This will have terrible
   performance.

Can you think of other options?

   Matt

> > Hmm, still not quite getting this problem. We need concurrency on the
> > GPU, but why would we need it on the CPU?
>
> Only if we were doing real work on the many CPU cores per node.
>
> > On the GPU, triangular solve will be just as crappy as it currently
> > is, but will look even worse due to the large number of cores.
>
> It could be worse because a single GPU thread is likely slower than a
> CPU core.
>
> > It is not the only smoother. For instance, polynomial smoothers would
> > be more concurrent.
>
> Yup.
>
> > > I have trouble finding decent preconditioning algorithms suitable for
> > > the fine granularity of GPUs. Matt thinks we can get rid of all the
> > > crappy sparse matrix kernels and precondition everything with FMM.
> >
> > That is definitely my view, or at least my goal. And I would say this:
> > if we are just starting out on these things, I think it makes sense to
> > do the home runs first. If we just try and reproduce things, people
> > might say "That is nice, but I can already do that pretty well".
>
> Agreed, but it's also important to have something good to offer people
> who aren't ready to throw out everything they know and design a new
> algorithm based on a radically different approach that may or may not be
> any good for their physics.
>
> Jed

-- 
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener
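
To make option 2 concrete, here is a minimal CUDA sketch of what an atomic
scatter could look like: each GPU thread integrates one element and adds its
contribution into a shared CSR value array with atomicAdd, so no locking is
needed. Everything in it (the CSR arrays, the connectivity layout, the
csr_index helper, the constant element matrix) is an illustrative assumption,
not PETSc code, and native double-precision atomicAdd only exists on newer
devices.

#define NODES_PER_ELEM 3  /* e.g. linear triangles */

/* Find the position of column j within row i of a CSR matrix. */
__device__ int csr_index(const int *rowptr, const int *colidx, int i, int j)
{
  for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
    if (colidx[k] == j) return k;
  return -1;  /* column was not preallocated */
}

__global__ void assemble_atomic(int nelem, const int *conn,
                                const int *rowptr, const int *colidx,
                                double *vals)
{
  int e = blockIdx.x * blockDim.x + threadIdx.x;
  if (e >= nelem) return;

  /* Placeholder element matrix; a real kernel would evaluate the FEM
     integral here (quadrature, basis functions, Jacobian, ...). */
  double Ke[NODES_PER_ELEM][NODES_PER_ELEM];
  for (int a = 0; a < NODES_PER_ELEM; ++a)
    for (int b = 0; b < NODES_PER_ELEM; ++b)
      Ke[a][b] = (a == b) ? 2.0 : -1.0;

  /* Neighboring elements share matrix rows/columns, so the updates must
     be atomic.  Native double-precision atomicAdd needs a fairly new
     device; older hardware would emulate it with a compare-and-swap loop. */
  for (int a = 0; a < NODES_PER_ELEM; ++a) {
    int i = conn[e * NODES_PER_ELEM + a];
    for (int b = 0; b < NODES_PER_ELEM; ++b) {
      int j = conn[e * NODES_PER_ELEM + b];
      int k = csr_index(rowptr, colidx, i, j);
      if (k >= 0) atomicAdd(&vals[k], Ke[a][b]);
    }
  }
}

Whether something like this beats option 1 presumably depends on how much
contention there is between neighboring elements and on how expensive the
atomics are on a given device.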