> -----Original Message-----
> From: Ira Weiny
> Sent: Tuesday, June 13, 2006 8:12 PM
> A co-worker here was seeing the following MPI error from his job:
>
> [1] Abort: [ldev2:1] Got completion with error, code=1
> at line 2148 in file viacheck.c
>
> After some tracking down he found that apparently if he used a "system" call
> [int system(const char *string)] the next MPI command will fail.
>
> I have been able to reproduce this with the attached simple "hello" program.

I saw this type of problem a couple of years ago with our proprietary stack, and it took a bit of work to correct. Here is what it could be:

This sounds like a conflict between fork() and the VMA handling in OpenIB for registered memory. system() is a fork(), exec(), wait() sequence. fork() generally shares the VMAs and marks the pages as copy-on-write. In your case it sounds like one of the pages written by the child process includes memory previously registered by the main process, and the child ended up with the original page. The result is that the virtual address in the main process now points to the wrong physical page, that is, not the page that was registered.

It sounds like you happened on a "magic sequence" that demonstrates the problem.

Do you have information on the OS version, CPU type, and server config?

Todd Rimmer

_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
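
To make the fork(), exec(), wait() sequence behind system() concrete, here is a minimal sketch in C. The name sketch_system is hypothetical and used only for illustration. A real system() also adjusts signal handling (SIGCHLD, SIGINT, SIGQUIT), which is left out here. The point is only that a fork() happens, so the child briefly shares the parent's pages copy-on-write, including any registered buffers.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: roughly the sequence system(cmd) performs.
 * Returns the wait status of the shell, or -1 on failure. */
int sketch_system(const char *cmd)
{
    pid_t pid = fork();   /* child shares the parent's pages copy-on-write */
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: replace this process image with a shell running cmd. */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127);       /* reached only if execl() failed */
    }

    /* Parent: wait for the child, retrying if interrupted by a signal. */
    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return status;
}
```

Calling sketch_system("exit 3") should return a status for which WIFEXITED is true and WEXITSTATUS is 3, the same as system() would report.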
