Hi Jed et al.,

Just wanted to report back on the resolution of this issue. The computing support people at HLRN in Germany submitted a test case to Cray regarding performance on their XC30. Cray has finally gotten back with a solution, which is to use the run-time option -vecscatter_alltoall. Apparently this is a known issue, and according to the HLRN folks passing this command-line option to PETSc seems to work nicely.
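For anyone on the list who hits the same thing, here is a minimal sketch of how the option can be supplied (the application name, launcher arguments, and the programmatic variant below are just placeholders/assumptions, not what we actually ran; the two-argument PetscOptionsSetValue() form matches the 3.4 series):

    #include <petscsys.h>

    /* PetscInitialize() reads the options database, so the flag can simply
       be appended to the launch line, e.g.
           aprun -n 1024 ./myapp -vecscatter_alltoall
       (myapp and -n 1024 are placeholders). Alternatively, it can be set
       programmatically before any VecScatters are created. */
    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      ierr = PetscOptionsSetValue("-vecscatter_alltoall", NULL); CHKERRQ(ierr);

      /* ... set up vectors/matrices and solve as usual; scatters created
         after this point should use MPI_Alltoall-based communication ... */

      ierr = PetscFinalize();
      return ierr;
    }

Putting the flag on the command line (or in ~/.petscrc) is equivalent; the programmatic form is just for completeness.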
Thanks again for your help.

Samar

On Apr 11, 2014, at 7:44 AM, Jed Brown <[email protected]> wrote:

> Samar Khatiwala <[email protected]> writes:
>
>> Hello,
>>
>> This is a somewhat vague query but I and a colleague have been running
>> PETSc (3.4.3.0) on a Cray XC30 in Germany
>> (https://www.hlrn.de/home/view/System3/WebHome) and the system
>> administrators alerted us to some anomalies with our jobs that may or may
>> not be related to PETSc but I thought I'd ask here in case others have
>> noticed something similar.
>>
>> First, there was a large variation in run-time for identical jobs,
>> sometimes as much as 50%. We didn't really pick up on this but other
>> users complained to the IT people that their jobs were taking a
>> performance hit with a similar variation in run-time. At that point we're
>> told the IT folks started monitoring jobs and carrying out tests to see
>> what was going on. They discovered that (1) this always happened when we
>> were running our jobs and (2) the problem got worse with physical
>> proximity to the nodes on which our jobs were running (what they
>> described as a "strong interaction" between our jobs and others
>> presumably through the communication network).
>
> It sounds like you are strong scaling (smallish subdomains) so that your
> application is sensitive to network latency. I see significant
> performance variability on XC-30 with this Full Multigrid solver that is
> not using PETSc.
>
> http://59a2.org/files/hopper-vs-edison.3semilogx.png
>
> See the factor of 2 performance variability for the samples of the ~15M
> element case. This operation is limited by instruction issue rather
> than bandwidth (indeed, it is several times faster than doing the same
> operations with assembled matrices). Here the variability is within the
> same application performing repeated solves. If you get a different
> partition on a different run, you can see larger variation.
>
> If your matrices are large enough, your performance will be limited by
> memory bandwidth. (This is the typical case, but sufficiently small
> matrices can fit in cache.) I once encountered a batch system that did
> not properly reset nodes between runs, leaving a partially-filled
> ramdisk distributed asymmetrically across the memory busses. This led
> to 3x performance reduction on 4-socket nodes because much of the memory
> demanded by the application would be faulted onto one memory bus.
> Presumably your machine has a resource manager that would not allow such
> things to happen.
