Hi, Barry -- I suppose that each of the runs could be thought of as a particular "realization" or something like that. In my case, I figure they are all "equally wrong." ;-) I have a hunch that the discrepancies between runs are exacerbated by under-resolution, but I haven't confirmed this to be the case.
What got me started on this was looking into the effect of changing a convergence threshold: how tight do I need to make my convergence tolerance before I start getting the "same" answer? I was distressed to discover that no matter how tight the tolerance, the results were always different. Then I noticed that they were different even for the same tolerance. This made it a little difficult to choose the tolerance!

I'd be happy to supply the code for inclusion in petsc-dev. I use C++ STL vectors and sorting algorithms in VecAssemblyEnd, however. Can I toss you all the code with C++ stuff in it, or would it really only be useful if I turn it into C-only code first?

-- Boyce

Barry Smith wrote:
>
> Boyce,
>
> As you know, any of the runs give equally valid answers (all right or all wrong depending on how you look at it). I assume you want reproducibility to help track down other bugs in the code and to see if changes in places that are not supposed to affect the solution are in fact inadvertently changing it (which is hard to detect if each run gives a different answer).
>
> We would be interested in including your additions with petsc-dev with a config/configure.py option (say --with-reproducibility) that turns on the more precise (but slower) version if you would like to contribute it.
>
> Barry
>
> We see this issue every once in a while and never got the energy to comprehensively go through the code and add support to make the computations much more reproducible.
>
> On Jan 28, 2009, at 11:55 AM, Boyce Griffith wrote:
>
>> Hi, Folks --
>>
>> I recently noticed an irreproducibility issue with some parallel code which uses PETSc. Basically, I can run the same parallel simulation on the same number of processors and get different answers. The discrepancies were only apparent to the eye after very long integration times (e.g., approx. 20k timesteps), but in fact the computations start diverging after only a few handfuls (30--40) of timesteps. (Running the same code in serial yields reproducible results, but changing the MPI library or running the code on a different system did not make any difference.)
>>
>> I believe that I have tracked this issue down to VecSetValues/VecSetValuesBlocked. A part of the computation uses VecSetValues to accumulate the values of forces at the nodes of a mesh. At some nodes, force contributions come from multiple processors. It appears that there is some "randomness" in the way that this accumulation is performed in parallel, presumably related to the order in which values are communicated.
>>
>> I have found that I can make modified versions of VecSetValues_MPI(), VecSetValuesBlocked_MPI(), and VecAssemblyEnd_MPI() to ensure reproducible results. I make the following modifications:
>>
>> (1) VecSetValues_MPI() and VecSetValuesBlocked_MPI() place all values into the stash instead of immediately adding the local components.
>>
>> (2) VecAssemblyEnd_MPI() "extracts" all of the values provided by the stash and bstash, placing these values into lists corresponding to each of the local entries in the vector. Next, once all values have been extracted, I sort each of these lists (e.g., by descending or ascending magnitude).
>>
>> Making these changes appears to yield exactly reproducible results, although I am still performing tests to try to shake out any other problems.
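To make modification (2) a little more concrete: here is roughly what the sorted accumulation amounts to. This is a stripped-down illustration rather than the actual patch -- Contribution, compare_magnitude, and accumulate_sorted are names I made up for the example, and the real code works directly on the stash/bstash data structures inside VecAssemblyEnd_MPI().

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <map>
    #include <vector>

    // One (local index, value) pair received during assembly.  In the
    // real code these come out of the stash; here they are a flat list.
    struct Contribution
    {
        int    row;   // assumed to already be a valid local index
        double value;
    };

    // Order values by ascending magnitude, breaking ties on the value
    // itself so that equal-magnitude entries of opposite sign are also
    // ordered deterministically.
    bool compare_magnitude(double a, double b)
    {
        if (std::fabs(a) != std::fabs(b)) return std::fabs(a) < std::fabs(b);
        return a < b;
    }

    // Accumulate ADD_VALUES contributions into the local array x in an
    // order that does not depend on which process sent which value:
    // group the contributions by entry, sort each group, then sum in
    // that fixed order.
    void accumulate_sorted(const std::vector<Contribution>& recvd, double* x)
    {
        std::map<int, std::vector<double> > per_row;
        for (std::size_t k = 0; k < recvd.size(); ++k)
        {
            per_row[recvd[k].row].push_back(recvd[k].value);
        }
        for (std::map<int, std::vector<double> >::iterator it = per_row.begin();
             it != per_row.end(); ++it)
        {
            std::vector<double>& vals = it->second;
            std::sort(vals.begin(), vals.end(), compare_magnitude);
            for (std::size_t k = 0; k < vals.size(); ++k)
            {
                x[it->first] += vals[k];
            }
        }
    }

Modification (1) is what makes this work: the locally generated values go into the stash as well, so every contribution to a given entry passes through the same sorted sum.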
>> Another approach which seems to work is to perform the final summation in higher precision (e.g., using "double-double precision" arithmetic). Using double-double precision allows me to skip the sorting step, although since the number of values to sort is relatively small, it may be cheaper to sort.
>>
>> Using more accurate summation methods (e.g., compensated summation) does not appear to fix the lack of reproducibility problem.
>>
>> I was wondering if anyone has run into similar issues with PETSc and has a better solution.
>>
>> Thanks!
>>
>> -- Boyce
>>
>> PS: These tests were done using PETSc 2.3.3. I have just finished upgrading the code to PETSc 3.0.0 and will re-run them. However, looking at VecSetValues_MPI()/VecSetValuesBlocked_MPI()/etc, it looks like this code is essentially the same in PETSc 2.3.3 and 3.0.0.
>
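For completeness, the double-double accumulation I mentioned above is essentially error-free ("two-sum") addition carried through the whole sum, rounding back to double only at the end. Again just a sketch with made-up names (dd, two_sum, dd_add, dd_sum), not code lifted from any particular package:

    #include <cstddef>
    #include <vector>

    // A double-double value represented as an unevaluated sum hi + lo.
    struct dd
    {
        double hi;
        double lo;
    };

    // Knuth's two-sum: computes s and err such that s + err == a + b
    // exactly, with s = fl(a + b).
    void two_sum(double a, double b, double& s, double& err)
    {
        s = a + b;
        double bv = s - a;
        err = (a - (s - bv)) + (b - bv);
    }

    // Add an ordinary double to a double-double accumulator.
    dd dd_add(dd x, double y)
    {
        double s, e;
        two_sum(x.hi, y, s, e);
        e += x.lo;
        dd r;
        two_sum(s, e, r.hi, r.lo);  // renormalize so hi carries the leading bits
        return r;
    }

    // Sum the contributions to one vector entry in roughly twice the
    // working precision, then round back to double.
    double dd_sum(const std::vector<double>& vals)
    {
        dd acc = { 0.0, 0.0 };
        for (std::size_t k = 0; k < vals.size(); ++k)
        {
            acc = dd_add(acc, vals[k]);
        }
        return acc.hi + acc.lo;
    }

In my tests the extra precision makes the rounded result insensitive to the order in which the contributions arrive, which is why the sorting step can be skipped; plain compensated (Kahan) summation was not enough to recover reproducibility.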
