Dear DMTCP team,
I am trying to use DMTCP with MPI applications that run in
MPI_THREAD_SERIALIZED mode. As you know, this requires initializing the
application with MPI_Init_thread instead of MPI_Init. Unfortunately,
dmtcp_launch has not succeeded a single time for these applications. The only
error I get is "Bus error".
Replacing MPI_Init_thread with MPI_Init (which uses MPI_THREAD_SINGLE) fixes
the problem. However, this is not a practical mode for my applications; I have
to use either MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE.
My configuration:
- DMTCP 2.4.5
- OpenMPI-ULFM (which is based on OpenMPI 1.7)
- A network file system is used.
dmtcp_launch trace:
[5209] TRACE at dmtcp_launch.cpp:442 in main; REASON='dmtcp_launch starting new
program:'
argv[0] = mpirun
[5209] TRACE at coordinatorapi.cpp:566 in connectToCoordOnStartup;
REASON='sending coordinator handshake'
UniquePid::ThisProcess() = 216034594ce6503-5209-5800a2ed
[5209] TRACE at coordinatorapi.cpp:573 in connectToCoordOnStartup; REASON='Got
virtual pid from coordinator'
hello_remote.virtualPid = 281000
[5209] TRACE at shareddata.cpp:193 in initialize; REASON='Shared area mapped'
sharedDataHeader = 0x7ff19f819000
[5209] TRACE at dmtcp_launch.cpp:771 in setLDPreloadLibs; REASON='getting value
of LD_PRELOAD'
getenv("LD_PRELOAD") = <removed>
preloadLibs = <removed>
preloadLibs32 =
libdmtcp_alloc.so:libdmtcp_dl.so:libdmtcp_ipc.so:libdmtcp_svipc.so:libdmtcp_timer.so:libdmtcp.so:libdmtcp_pid.so:
[281000] TRACE at shareddata.cpp:193 in initialize; REASON='Shared area mapped'
sharedDataHeader = 0x7f7a5984d000
[281000] TRACE at dmtcpworker.cpp:260 in
prepareLogAndProcessdDataFromSerialFile; REASON='Root of processes tree'
[281000] TRACE at dmtcpworker.cpp:315 in DmtcpWorker; REASON='libdmtcp.so:
Running '
jalib::Filesystem::GetProgramName() = orterun
getenv ("LD_PRELOAD") = <removed>
[281000] TRACE at dmtcpworker.cpp:111 in restoreUserLDPRELOAD;
REASON='LD_PRELOAD'
preload =
userPreload =
[281000] TRACE at coordinatorapi.cpp:127 in init;
REASON='Informing coordinator of new process'
UniquePid::ThisProcess() = 216034594ce6503-281000-5800a2ed
[281000] TRACE at processinfo.cpp:180 in growStack; REASON='Original stack area'
(void*)area.addr = 0x7fff33fdf000
area.size = 90112
[281000] TRACE at processinfo.cpp:218 in growStack; REASON='New stack size'
(void*)area.addr = 0x7fff30004000
area.size = 67047424
[281000] TRACE at fileconnlist.cpp:385 in scanForPreExisting; REASON='scanning
pre-existing device'
fd = 0
device = /dev/pts/18
Bus error
dmtcp_coordinator trace:
[26364] TRACE at dmtcp_coordinator.cpp:962 in onConnect; REASON='accepting new
connection'
remote.sockfd() = 5
(strerror((*__errno_location ()))) = No such file or directory
[26364] TRACE at dmtcp_coordinator.cpp:971 in onConnect; REASON='Reading from
incoming connection...'
[26364] TRACE at dmtcp_coordinator.cpp:1263 in validateNewWorkerProcess;
REASON='First process connected. Creating new computation group.'
compId = 216034594ce6503-281000-5800a2ed
[26364] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 216034594ce6503-5209-5800a2ed
[26364] TRACE at dmtcp_coordinator.cpp:1084 in onConnect; REASON='END'
clients.size() = 1
[26364] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process
Information after exec()'
progname = orterun
msg.from = 216034594ce6503-281000-5800a2ed
client->identity() = 216034594ce6503-5209-5800a2ed
[26364] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 216034594ce6503-281000-5800a2ed
client->progname() = orterun
[26364] TRACE at dmtcp_coordinator.cpp:892 in removeStaleSharedAreaFile;
REASON='Removing sharedArea file.'
o.str() = <tmp dir
path>/dmtcpSharedArea.216034594ce6503-281000-5800a2ed.5800a2ed7
I have correlated the "Bus error" with MPI_Init_thread, but that does not
necessarily mean it is the cause. I hope the logs above help you identify the
root cause of this error.
Best Regards,
Sara
Sara S. Hamouda
PhD Candidate (Computer Systems Group)
College of Engineering and Computer Science
The Australian National University