Hi,

We have been doing more experiments with threaded MPI/OpenMP algorithms on our branch of petsc-3.3:
https://bitbucket.org/ggorman/petsc-3.3-omp

(all the usual health warnings etc. - more than a few times we thought "Jed isn't going to like this")

We just got a paper accepted to ISC which goes through the general benchmarking; preprint: http://arxiv.org/abs/1303.5275

In a nutshell, the message for strong scaling seems to be:

- A task/thread-based approach works well, because one thread can be dedicated to the MPI communication (giving genuinely asynchronous communication) - see the sketch at the end of this message.
- Balancing the load per thread is important, since it is load imbalance that begins to limit scalability.

We still have quite a bit of profiling to do. In particular, we need to look at the hardware counters and really make sure we understand what is happening on the machine. Nevertheless, we think the results are interesting and would welcome any suggestions/feedback. It seems the same approach could be implemented in petsc-dev, since threadcomm already provides most of the support required.

Cheers,
Gerard
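
For anyone curious, the dedicated-communication-thread pattern we mean is roughly the following. This is just a toy sketch, not code from the branch or the paper; the 1D halo layout and the N, HALO, work_interior and work_halo names are placeholders:

/* Sketch: thread 0 progresses the MPI halo exchange while the other
 * OpenMP threads compute on the interior, overlapping communication
 * with computation. Buffer layout (per rank):
 *   [0 .. HALO-1]            left ghost cells
 *   [HALO .. N+HALO-1]       interior
 *   [N+HALO .. N+2*HALO-1]   right ghost cells
 */
#include <mpi.h>
#include <omp.h>
#include <stdlib.h>

#define N    1000000
#define HALO 100

static void work_interior(double *u, int tid, int nworkers) { /* interior-only kernel */ }
static void work_halo(double *u)                            { /* halo-dependent kernel */ }

int main(int argc, char **argv)
{
  int provided, rank, size;
  /* FUNNELED is enough: only thread 0 of the parallel region calls MPI. */
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  double *u = malloc((N + 2 * HALO) * sizeof(double));
  int left  = (rank - 1 + size) % size;
  int right = (rank + 1) % size;

  #pragma omp parallel
  {
    int tid      = omp_get_thread_num();
    int nthreads = omp_get_num_threads();

    if (tid == 0) {
      /* Communication thread: post and complete the halo exchange. */
      MPI_Request req[4];
      MPI_Irecv(u,            HALO, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
      MPI_Irecv(u + N + HALO, HALO, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
      MPI_Isend(u + HALO,     HALO, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
      MPI_Isend(u + N,        HALO, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
      MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    } else {
      /* Worker threads: points that do not depend on the halo. */
      work_interior(u, tid - 1, nthreads - 1);
    }

    /* Wait until both the halo data and the interior work are done,
     * then update the halo-dependent points. */
    #pragma omp barrier
    #pragma omp single
    work_halo(u);
  }

  free(u);
  MPI_Finalize();
  return 0;
}

The load-balance point from the second bullet shows up here too: the computation has to be partitioned so the worker threads finish at roughly the same time the communication thread does.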
