Jed,

Thank you very much.
They made some observations and may be making some progress; I can at least do some runs now. They also say it has something to do with ordering/rendezvous: there may be too many messages, messages that are too long, or both.

On Wed, Oct 23, 2013 at 4:22 PM, Jed Brown <[email protected]> wrote:

> Fande Kong <[email protected]> writes:
>
> > Hi Barry,
> >
> > I contacted the supercomputer center, and they asked me for a test case so
> > that they can forward it to IBM. Is it possible that we write a test case
> > only using MPI? It is not a good idea that we send them the whole petsc
> > code and my code.
>
> This may be possible, but this smells of a message ordering/rendezvous
> problem, in which case you'll have to reproduce essentially the same
> communication pattern. The fact that you don't see the error sooner in
> your program execution (and that it doesn't affect lots of other people)
> indicates that the bug may be very specific to your communication
> pattern. In fact, it is likely that changing your element distribution
> algorithm, or some similar changes, may make the bug vanish.
> Reproducing all this matching context in a stand-alone code is likely to
> be quite a lot of effort.
>
> I would configure the system to dump core on errors and package up the
> test case.
