Re: [OMPI users] numactl with torque cpusets
For the web archives...

Brock and I talked about this in person at SC. The conversation was much more involved than this seemingly simple question implied. :-)

The short version is:

- numactl does both memory and processor binding
- hwloc is the new numactl :-)
  - e.g., see the hwloc-bind(1) command
- OMPI does both memory and processor binding
  - OMPI 1.5.5 will have an MCA parameter for process-wide memory binding policy
- Torque cpusets probably do what is desired: restrict MPI processes to a subset of the processors on a given server (e.g., if multiple Torque jobs are running on the same server)

On Nov 9, 2011, at 1:46 PM, Brock Palen wrote:

> Question,
> If we are using torque with TM with cpusets enabled for pinning should we not
> enable numactl? Would they conflict with each other?
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> bro...@umich.edu
> (734)936-1985
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
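On Linux, one way to see which processors a cpuset leaves available to a process is the Cpus_allowed_list field in /proc/self/status, which uses a list format such as "0-3,8". The sketch below only parses that format and checks whether a desired binding stays inside it. The helper names parse_cpu_list and binding_fits are made up for illustration; they are not part of numactl, hwloc, Torque, or Open MPI.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Parse a Linux CPU list such as "0-3,8" into the set of CPU ids it names.
// This is the format of the Cpus_allowed_list field in /proc/self/status.
// parse_cpu_list is a hypothetical helper written for this example.
std::set<int> parse_cpu_list(const std::string& list) {
    std::set<int> cpus;
    std::istringstream in(list);
    std::string item;
    while (std::getline(in, item, ',')) {
        if (item.empty()) continue;
        std::size_t dash = item.find('-');
        int first = std::stoi(item.substr(0, dash));
        int last = (dash == std::string::npos)
                       ? first
                       : std::stoi(item.substr(dash + 1));
        for (int cpu = first; cpu <= last; ++cpu) cpus.insert(cpu);
    }
    return cpus;
}

// True if every CPU in `wanted` lies inside the `allowed` set.
bool binding_fits(const std::set<int>& allowed, const std::set<int>& wanted) {
    for (int cpu : wanted) {
        if (allowed.count(cpu) == 0) return false;
    }
    return true;
}
```

For example, with an allowed list of "0-3", a binding to "1,2" fits, while a binding to "3-4" does not because CPU 4 lies outside the cpuset.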
Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
Hi,

I have placed the source in \Program Files\openmpi-1.5.4, the build dir in \Program Files\openmpi.build, and the install dir in \Program Files\openmpi.

I could not find config.log in any of the three directories, nor in the directory from which I run mpirun. The attached build log is a zip of all the .log files under \Program Files\openmpi.build.

First, I installed the provided binaries on 32-bit XP and successfully ran the program in Release mode. In Debug mode, there was that error about a function missing in the kernel, which you fixed in svn.

Second, I downloaded the source and built the static libraries with cmake according to README.windows. Against these home-built libs, the same program runs neither in Debug nor in Release, because of the error below.

How can I generate the config.log?

About Debug/Release: thinking about it now, I don't really need the debug libs of openmpi. But to be able to link against vs2010 Release libs of openmpi, I need them to be linked against the Release C runtime, so I might as well link against the debug version of the openmpi libs.

Your help is much appreciated,
MM

-Original Message-
From: Shiqing Fan [mailto:f...@hlrs.de]
Sent: 21 November 2011 12:48
To: Open MPI Users
Cc: MM
Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

Hi,

Could you please send your config and build log to me? Have you tried with a simpler program? Does this error always happen?

Regards,
Shiqing

On 2011-11-19 4:24 PM, MM wrote:
> Trying to run my program linked against debug 1.5.4 on vs2010 fails:
>
> mpirun -np 1 .\nhui\Debug\nhui.exe : -np 1 .\nhcomp\Debug\nhcomp.exe
> [PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file C:\Program
> Files\openmpi-1.5.4\orte\mca\ess\hnp\ess_hnp_module.c at line 536
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_debugger_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> [PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file C:\Program
> Files\openmpi-1.5.4\orte\runtime\orte_init.c at line 128
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_ess_set_name failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> [LLDNRATDHY9H4J:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file
> C:\Program Files\openmpi-1.5.4\orte\tools\orterun\orterun.c at line 616
>
> any help is appreciated,
> MM

--
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234
Fax: ++49(0)711-685-65832
Nobelstrasse 19
70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de
Re: [OMPI users] UDP like messaging with MPI
On Mon, 21 Nov 2011, Mudassar Majeed wrote:

> Thank you for your answer. Actually, I used the term UDP to show the non-connection oriented messaging. TCP creates connection between two parties (who communicate) but in UDP a message can be sent to any IP/port where a process/thread is listening to, and if the process is busy in doing something, all the received messages are queued for it and whenever it calls the recv function one message is taken from the queue.

That is how MPI message matching works; messages sit in a queue until you call MPI_Irecv (or MPI_Recv or MPI_Probe, etc.) to get them. Unlike a UDP send, an MPI send is not required to complete on the sender until the message has been received, so you will probably need to use MPI_Isend to avoid deadlocks.

> I am implementing a distributed algorithm that will provide communication sensitive load balancing for computational loads. For example, if we have 10 nodes each containing 10 cores (100 cores in total). So when MPI application will start (let's say with 1000) processes (more than 1 process per core) then I will run my distributed algorithm MPI_Balance (sorry for giving MPI_ prefix as it is not a part of MPI, but I am trying to make it the part of MPI ;) ). So that algorithm will take those processes that communicate more in the same node (keeping the computational load on 10 cores on that node balanced). So that was the little bit explanation. So for that my distributed algorithm requires that some processes communicate with each other to collaborate on something. So I need a kind of messaging that I explained above. It is kind of UDP messaging (no connection before sending a message, and message is always queued on the receiver's side and sender is not blocked, it just sends the message and the receiver takes it when it gets free from other task).

The one difficulty in doing this is to manage the MPI requests from the sends and poll them with MPI_Test periodically. You can just keep the requests in an array (std::vector in C++) which can be expanded when needed; to send a message, call MPI_Isend and put the request into the array, and periodically call MPI_Testany or MPI_Testsome on the array to find completed requests. Note that you will need to keep the data being sent intact in its buffer until the request completes. Here's a naive version that does extra copies and doesn't clean out its arrays of requests or buffers:

    #include <mpi.h>
    #include <vector>
    using std::vector;

    class message_send_engine {
      vector<MPI_Request> requests;
      vector<vector<char> > buffers;  // one private copy per pending send

    public:
      void send(void* buf, int byte_len, int dest, int tag) {
        size_t buf_num = buffers.size();
        // Growing the outer vector must not relocate the storage of buffers
        // that still have sends in flight; this assumes the inner vectors
        // are moved rather than copied (typical for C++11 and later).
        buffers.resize(buf_num + 1);
        buffers[buf_num].assign((char*)buf, (char*)buf + byte_len);
        requests.resize(buf_num + 1);
        MPI_Isend(buffers[buf_num].data(), byte_len, MPI_BYTE, dest, tag,
                  MPI_COMM_WORLD, &requests[buf_num]);
      }

      void poll() { // Call this periodically
        while (!requests.empty()) {
          int index, flag;
          MPI_Testany((int)requests.size(), &requests[0], &index, &flag,
                      MPI_STATUS_IGNORE);
          if (flag && index != MPI_UNDEFINED) {
            vector<char>().swap(buffers[index]); // Free memory
          } else {
            break;
          }
        }
      }
    };

    bool test_for_message(void* buf, int max_len, MPI_Status& st) {
      int flag;
      MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &st);
      return (flag != 0);
    }

If test_for_message returns true, you can then use MPI_Recv to get the message.

> I have tried to use the combination of MPI_Send, MPI_Recv, MPI_Iprobe, MPI_Isend, MPI_Irecv, MPI_Test etc, but I am not getting that thing that I am looking for. I think MPI should also provide that way. Maybe it is not in my knowledge. That's why I am asking the experts. I am still looking for it :(

-- Jeremiah Willcock
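The queueing and matching behaviour described above can be pictured with a small toy model in plain C++. This is only a sketch: the names Mailbox, Message, deliver, probe, receive, and ANY are invented for the illustration and are not MPI API; ANY merely stands in for MPI_ANY_SOURCE and MPI_ANY_TAG. It shows messages waiting in arrival order until a receive asks for a matching source and tag.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>

// Toy model only: none of these names are MPI API.
const int ANY = -1;  // stands in for MPI_ANY_SOURCE / MPI_ANY_TAG

struct Message {
    int source;
    int tag;
    std::string payload;
};

class Mailbox {
    std::deque<Message> queue_;  // arrival order is preserved

    static bool matches(const Message& m, int source, int tag) {
        return (source == ANY || m.source == source) &&
               (tag == ANY || m.tag == tag);
    }

public:
    // Sender side: the message is queued whether or not anyone is waiting.
    void deliver(int source, int tag, const std::string& payload) {
        queue_.push_back(Message{source, tag, payload});
    }

    // Like a probe: is there a queued message matching (source, tag)?
    bool probe(int source, int tag) const {
        for (const Message& m : queue_) {
            if (matches(m, source, tag)) return true;
        }
        return false;
    }

    // Like a receive: remove and return the oldest matching message.
    // Returns false if nothing matches; a real blocking receive would wait.
    bool receive(int source, int tag, Message& out) {
        for (std::size_t i = 0; i < queue_.size(); ++i) {
            if (matches(queue_[i], source, tag)) {
                out = queue_[i];
                queue_.erase(queue_.begin() + i);
                return true;
            }
        }
        return false;
    }

    std::size_t pending() const { return queue_.size(); }
};
```

A process that is busy simply does not call receive for a while; whatever was delivered in the meantime is still there, in order, when it next probes.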
Re: [OMPI users] UDP like messaging with MPI
Thank you for your answer. Actually, I used the term UDP to mean connectionless messaging. TCP creates a connection between the two communicating parties, but in UDP a message can be sent to any IP/port where a process/thread is listening. If that process is busy doing something, all the received messages are queued for it, and whenever it calls the recv function one message is taken from the queue.

I am implementing a distributed algorithm that will provide communication-sensitive load balancing for computational loads. For example, say we have 10 nodes each containing 10 cores (100 cores in total). When the MPI application starts with, let's say, 1000 processes (more than 1 process per core), I will run my distributed algorithm MPI_Balance (sorry for giving it the MPI_ prefix as it is not a part of MPI, but I am trying to make it a part of MPI ;) ). That algorithm will place the processes that communicate more with each other on the same node (while keeping the computational load on the 10 cores of that node balanced). So that was a little bit of explanation.

For that, my distributed algorithm requires that some processes communicate with each other to collaborate on something. So I need the kind of messaging that I explained above. It is a kind of UDP messaging: no connection before sending a message, the message is always queued on the receiver's side, and the sender is not blocked; it just sends the message and the receiver takes it when it gets free from its other task.

I have tried to use combinations of MPI_Send, MPI_Recv, MPI_Iprobe, MPI_Isend, MPI_Irecv, MPI_Test etc., but I am not getting what I am looking for. I think MPI should also provide that way. Maybe I just don't know about it. That's why I am asking the experts. I am still looking for it :(

thanks and regards,
Mudassar Majeed
PhD Student
Linkoping University
PhD Topic: Parallel Computing (Optimal composition of parallel programs and runtime support).
From: Jeff Squyres
To: mudassar...@yahoo.com; Open MPI Users
Cc: "li...@razik.name"
Sent: Monday, November 21, 2011 6:07 PM
Subject: Re: [OMPI users] UDP like messaging with MPI

MPI defines only reliable communications -- it's not quite the same thing as UDP. Hence, if you send something, it is guaranteed to be able to be received. UDP may drop packets whenever it feels like it (e.g., when it is out of resources).

Most MPI implementations will do some form of buffering of unexpected receives. So if process A sends message X to process B, if B hasn't posted a matching receive for message X yet, B will likely silently accept the message under the covers and buffer it (or at least buffer part of it). Hence, when you finally post the matching X receive in B, whatever of X was already received will already be there, but B may need to send a clear-to-send to A to get the rest of the message.

Specifically: if X is "short", A may eagerly send the whole message to B. If X is "long", A may only send the first part of X to B and wait for a CTS before sending the rest of it.

MPI implementations typically do this in order to conserve buffer space -- i.e., if A sends a 10MB message, there's no point in buffering it at B until the matching receive is made and the message can be received directly into the destination 10MB buffer that B has made available. If B accepted the 10MB X early, it would cost an additional 10MB to buffer it. Ick.

Alternatively, what I think Lukas was trying to suggest was that you can post non-blocking receives and simply test for completion later. This allows MPI to receive straight into the target buffer without intermediate copies or additional buffers. Then you can just check to see when the receive(s) is(are) done.

On Nov 19, 2011, at 10:47 AM, Mudassar Majeed wrote:

> I know about these functions; they have special requirements, like the mpi_irecv
> call should be made in every process. My processes should not look for
> messages or implicitly receive them. But messages should go into their
> msg queues and be retrieved when needed. Just like udp communication.
>
> Regards

--
Jeff Squyres
jsquy...@cisco.com
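The eager versus clear-to-send behaviour described in the quoted reply can be made concrete with a toy calculation. This is a sketch under stated assumptions, not how any particular MPI implementation works: SendPlan, plan_send, and the eager_limit parameter are hypothetical names, and the model assumes the first fragment of a long message is exactly eager_limit bytes.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of the eager / rendezvous choice. All names and the limit are
// hypothetical; real MPI implementations tune this per transport.
struct SendPlan {
    std::size_t sent_immediately;      // bytes the sender pushes right away
    std::size_t sent_after_cts;        // bytes held back until clear-to-send
    std::size_t buffered_at_receiver;  // extra bytes the receiver must hold
};

SendPlan plan_send(std::size_t message_bytes, std::size_t eager_limit,
                   bool receive_posted) {
    SendPlan plan;
    if (message_bytes <= eager_limit) {
        // "Short" message: sent eagerly in full.
        plan.sent_immediately = message_bytes;
        plan.sent_after_cts = 0;
    } else {
        // "Long" message: only a first part goes out; the rest waits
        // for the receiver's clear-to-send.
        plan.sent_immediately = eager_limit;
        plan.sent_after_cts = message_bytes - eager_limit;
    }
    // Data arriving before a matching receive exists needs a temporary
    // buffer; with the receive already posted it can land in place.
    plan.buffered_at_receiver = receive_posted ? 0 : plan.sent_immediately;
    return plan;
}
```

With a hypothetical 4096-byte limit, an unexpected 10MB message costs the receiver only 4096 bytes of temporary buffering instead of an additional 10MB, which is the saving the explanation points to.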
Re: [OMPI users] openmpi and mingw32?
On 11/21/2011 5:43 AM, Shiqing Fan wrote:

> Hi John,
>
> Yes, there will be an initial build support for MinGW, but a few runtime issues still need to be fixed. If you want to try the current one, please download one of the latest 1.5 nightly tarballs. Please just let me know if you got problems on that. Feedback would be helpful and appreciated.

Hi Shiqing,

I went ahead and tried the svn trunk. I configured with

    cmake \
      -DCMAKE_INSTALL_PREFIX:PATH=C:/winsame/contrib-mingw/openmpi-try \
      -DCMAKE_BUILD_TYPE:STRING=Release \
      -DCMAKE_VERBOSE_MAKEFILE:BOOL=TRUE \
      -DCMAKE_COLOR_MAKEFILE:BOOL=FALSE \
      -G 'NMake Makefiles JOM' \
      -DCMAKE_C_COMPILER:FILEPATH='mingw32-gcc' \
      -DCMAKE_CXX_COMPILER:FILEPATH='mingw32-g++' \
      -DCMAKE_Fortran_COMPILER:FILEPATH='mingw32-gfortran' \
      ..

It fails right away at

C:\MinGW\bin\mingw32-gcc.exe -Dlibopen_pal_EXPORTS -D_USRDLL -DOPAL_EXPORTS -O3 -DNDEBUG @CMakeFiles/libopen-pal.dir/includes_C.rsp -o CMakeFiles\libopen-pal.dir\class\opal_list.obj -c C:\winsame\builds-mingw\facetsall-mingw\ompi-trunk\opal\class\opal_list.c
cd C:\winsame\builds-mingw\facetsall-mingw\ompi-trunk\try
In file included from C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/try/opal/include/opal_config.h:1495:0,
from C:\winsame\builds-mingw\facetsall-mingw\ompi-trunk\opal\class\opal_atomic_lifo.c:19:
C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/include/opal_config_bottom.h:561:0: warning: "PF_UNSPEC" redefined
c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include/winsock2.h:368:0: note: this is the location of the previous definition
C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/include/opal_config_bottom.h:564:0: warning: "AF_INET6" redefined
c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include/winsock2.h:329:0: note: this is the location of the previous definition
C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/include/opal_config_bottom.h:567:0: warning: "PF_INET6" redefined
c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include/winsock2.h:392:0: note: this is the location of the previous definition In file included from C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/try/opal/include/opal_config.h:1495:0, from C:\winsame\builds-mingw\facetsall-mingw\ompi-trunk\opal\class\opal_free_list.c:20: C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/include/opal_config_bottom.h:561:0: warning: "PF_UNSPEC" redefined c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include/winsock2.h:368:0: note: this is the location of the previous definition C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/include/opal_config_bottom.h:564:0: warning: "AF_INET6" redefined c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include/winsock2.h:329:0: note: this is the location of the previous definition C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/include/opal_config_bottom.h:567:0: warning: "PF_INET6" redefined c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include/winsock2.h:392:0: note: this is the location of the previous definition In file included from C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/class/opal_free_list.h:25:0, from C:\winsame\builds-mingw\facetsall-mingw\ompi-trunk\opal\class\opal_free_list.c:22: C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/threads/condition.h:116:57: warning: 'struct timespec' declared inside parameter list C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/threads/condition.h:116:57: warning: its scope is only this definition or declaration, which is probably not what you want C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/threads/condition.h: In function 'opal_condition_timedwait': C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/threads/condition.h:140:34: error: dereferencing pointer to incomplete type C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/threads/condition.h:141:35: error: dereferencing pointer to incomplete type 
C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/threads/condition.h:155:34: error: dereferencing pointer to incomplete type C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/threads/condition.h:156:35: error: dereferencing pointer to incomplete type command failed with exit code 1 In file included from C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/try/opal/include/opal_config.h:1495:0, from C:\winsame\builds-mingw\facetsall-mingw\ompi-trunk\opal\class\opal_hash_table.c:19: C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/include/opal_config_bottom.h:561:0: warning: "PF_UNSPEC" redefined c:\mingw\bin\../lib/gcc/mingw32/4.5.2/../../../../include/winsock2.h:368:0: note: this is the location of the previous definition C:/winsame/builds-mingw/facetsall-mingw/ompi-trunk/opal/include/opal_config_bottom.h:564:0: warning: "AF_INET6" redefined
Re: [OMPI users] UDP like messaging with MPI
MPI defines only reliable communications -- it's not quite the same thing as UDP. Hence, if you send something, it is guaranteed to be able to be received. UDP may drop packets whenever it feels like it (e.g., when it is out of resources).

Most MPI implementations will do some form of buffering of unexpected receives. So if process A sends message X to process B, if B hasn't posted a matching receive for message X yet, B will likely silently accept the message under the covers and buffer it (or at least buffer part of it). Hence, when you finally post the matching X receive in B, whatever of X was already received will already be there, but B may need to send a clear-to-send to A to get the rest of the message.

Specifically: if X is "short", A may eagerly send the whole message to B. If X is "long", A may only send the first part of X to B and wait for a CTS before sending the rest of it.

MPI implementations typically do this in order to conserve buffer space -- i.e., if A sends a 10MB message, there's no point in buffering it at B until the matching receive is made and the message can be received directly into the destination 10MB buffer that B has made available. If B accepted the 10MB X early, it would cost an additional 10MB to buffer it. Ick.

Alternatively, what I think Lukas was trying to suggest was that you can post non-blocking receives and simply test for completion later. This allows MPI to receive straight into the target buffer without intermediate copies or additional buffers. Then you can just check to see when the receive(s) is(are) done.

On Nov 19, 2011, at 10:47 AM, Mudassar Majeed wrote:

> I know about these functions; they have special requirements, like the mpi_irecv
> call should be made in every process. My processes should not look for
> messages or implicitly receive them. But messages should go into their
> msg queues and be retrieved when needed. Just like udp communication.
>
> Regards

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI users] How are the Open MPI processes spawned?
No real ideas, I'm afraid. We regularly launch much larger jobs than that using ssh without problem, so it is likely something about the local setup of that node that is causing the problem. Offhand, it sounds like either the mapper isn't getting things right, or for some reason the daemon on 005 isn't properly getting or processing the launch command.

What you could try is adding --display-map to see if the map is being correctly generated. If that works, then (using a debug build) try adding --leave-session-attached and see if any daemons are outputting an error.

You could add -mca odls_base_verbose 5 --leave-session-attached to your cmd line. You'll see debug output from each daemon as it receives and processes the launch command. See if the daemon on 005 is behaving differently than the others.

You should also try putting that long list of nodes in a hostfile - see if that makes a difference. It will process the nodes thru a different code path, so if there is some problem in --host, this will tell us.

On Nov 21, 2011, at 9:33 AM, Paul Kapinos wrote:

> Hello Open MPI folks,
>
> We use OpenMPI 1.5.3 on our pretty new 1800+ node InfiniBand cluster, and we
> have some strange hangs when starting OpenMPI processes.
>
> The nodes are named linuxbsc001,linuxbsc002,... (with some gaps due to
> offline nodes). Each node is accessible from every other over SSH (without
> password); MPI programs between any two nodes have also been checked to run.
>
> So far, I tried to start a larger number of processes, one process per
> node:
> $ mpiexec -np NN --host linuxbsc001,linuxbsc002,... MPI_FastTest.exe
>
> Now the problem: there are some constellations of names in the host list on
> which mpiexec reproducibly hangs forever; and more surprising: another
> *permutation* of the *same* node names may run without any errors!
>
> Example: the command in laueft.txt runs OK, the command in haengt.txt hangs.
> Note: the only difference is that the node linuxbsc025 is put at the end of
> the host list. Amazed, too?
>
> Looking at the particular nodes while the above mpiexec hangs, we found the
> orted daemons started on *each* node and the binary on all but one node
> (orted.txt, MPI_FastTest.txt).
> Again amazing: the node with no user process started (leading to a hang
> in MPI_Init of all processes and thus to the overall hang, I believe) was
> always the same, linuxbsc005, which is NOT the permuted item linuxbsc025...
>
> This behaviour is reproducible. The hang only occurs if the started
> application is an MPI application ("hostname" does not hang).
>
> Any idea what is going on?
>
> Best,
>
> Paul Kapinos
>
> P.S: no alias names used, all names are real ones
>
> --
> Dipl.-Inform. Paul Kapinos - High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23, D 52074 Aachen (Germany)
> Tel: +49 241/80-24915
> linuxbsc001: STDOUT: 24323 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc002: STDOUT: 2142 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc003: STDOUT: 69266 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc004: STDOUT: 58899 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc006: STDOUT: 68255 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc007: STDOUT: 62026 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc008: STDOUT: 54221 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc009: STDOUT: 55482 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc010: STDOUT: 59380 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc011: STDOUT: 58312 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc014: STDOUT: 56013 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc016: STDOUT: 58563 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc017: STDOUT: 54693 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc018: STDOUT: 54187 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc020: STDOUT: 55811 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc021: STDOUT: 54982 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc022: STDOUT: 50032 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc023: STDOUT: 54044 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc024: STDOUT: 51247 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc025: STDOUT: 18575 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc027: STDOUT: 48969 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc028: STDOUT: 52397 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc029: STDOUT: 52780 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc030: STDOUT: 47537 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc031: STDOUT: 54609 ? SLl 0:00 MPI_FastTest.exe
> linuxbsc032: STDOUT: 52833 ? SLl 0:00 MPI_FastTest.exe
> $ timex /opt/MPI/openmpi-1.5.3/linux/intel/bin/mpiexec -np 27 --host
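For the hostfile suggestion above, the node names just need to be written one per line. A minimal sketch of that conversion follows; host_list_to_hostfile is a hypothetical helper name, and the output uses only the simplest hostfile form (a bare node name per line, no slot counts).

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Turn the comma-separated argument given to --host into hostfile text,
// one node name per line. host_list_to_hostfile is a hypothetical helper
// written for this example.
std::string host_list_to_hostfile(const std::string& host_list) {
    std::istringstream in(host_list);
    std::ostringstream out;
    std::string host;
    while (std::getline(in, host, ',')) {
        if (!host.empty()) out << host << '\n';  // skip empty entries
    }
    return out.str();
}
```

The resulting text can be saved to a file and passed to mpiexec with --hostfile in place of the long --host list, keeping the node order unchanged so the two code paths are compared on the same input.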
[OMPI users] How are the Open MPI processes spawned?
Hello Open MPI folks,

We use OpenMPI 1.5.3 on our pretty new 1800+ node InfiniBand cluster, and we have some strange hangs when starting OpenMPI processes.

The nodes are named linuxbsc001,linuxbsc002,... (with some gaps due to offline nodes). Each node is accessible from every other over SSH (without password); MPI programs between any two nodes have also been checked to run.

So far, I tried to start a larger number of processes, one process per node:

$ mpiexec -np NN --host linuxbsc001,linuxbsc002,... MPI_FastTest.exe

Now the problem: there are some constellations of names in the host list on which mpiexec reproducibly hangs forever; and more surprising: another *permutation* of the *same* node names may run without any errors!

Example: the command in laueft.txt runs OK, the command in haengt.txt hangs. Note: the only difference is that the node linuxbsc025 is put at the end of the host list. Amazed, too?

Looking at the particular nodes while the above mpiexec hangs, we found the orted daemons started on *each* node and the binary on all but one node (orted.txt, MPI_FastTest.txt). Again amazing: the node with no user process started (leading to a hang in MPI_Init of all processes and thus to the overall hang, I believe) was always the same, linuxbsc005, which is NOT the permuted item linuxbsc025...

This behaviour is reproducible. The hang only occurs if the started application is an MPI application ("hostname" does not hang).

Any idea what is going on?

Best,

Paul Kapinos

P.S: no alias names used, all names are real ones

--
Dipl.-Inform. Paul Kapinos - High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23, D 52074 Aachen (Germany)
Tel: +49 241/80-24915

linuxbsc001: STDOUT: 24323 ? SLl 0:00 MPI_FastTest.exe
linuxbsc002: STDOUT: 2142 ? SLl 0:00 MPI_FastTest.exe
linuxbsc003: STDOUT: 69266 ? SLl 0:00 MPI_FastTest.exe
linuxbsc004: STDOUT: 58899 ? SLl 0:00 MPI_FastTest.exe
linuxbsc006: STDOUT: 68255 ? SLl 0:00 MPI_FastTest.exe
linuxbsc007: STDOUT: 62026 ? SLl 0:00 MPI_FastTest.exe
linuxbsc008: STDOUT: 54221 ? SLl 0:00 MPI_FastTest.exe
linuxbsc009: STDOUT: 55482 ? SLl 0:00 MPI_FastTest.exe
linuxbsc010: STDOUT: 59380 ? SLl 0:00 MPI_FastTest.exe
linuxbsc011: STDOUT: 58312 ? SLl 0:00 MPI_FastTest.exe
linuxbsc014: STDOUT: 56013 ? SLl 0:00 MPI_FastTest.exe
linuxbsc016: STDOUT: 58563 ? SLl 0:00 MPI_FastTest.exe
linuxbsc017: STDOUT: 54693 ? SLl 0:00 MPI_FastTest.exe
linuxbsc018: STDOUT: 54187 ? SLl 0:00 MPI_FastTest.exe
linuxbsc020: STDOUT: 55811 ? SLl 0:00 MPI_FastTest.exe
linuxbsc021: STDOUT: 54982 ? SLl 0:00 MPI_FastTest.exe
linuxbsc022: STDOUT: 50032 ? SLl 0:00 MPI_FastTest.exe
linuxbsc023: STDOUT: 54044 ? SLl 0:00 MPI_FastTest.exe
linuxbsc024: STDOUT: 51247 ? SLl 0:00 MPI_FastTest.exe
linuxbsc025: STDOUT: 18575 ? SLl 0:00 MPI_FastTest.exe
linuxbsc027: STDOUT: 48969 ? SLl 0:00 MPI_FastTest.exe
linuxbsc028: STDOUT: 52397 ? SLl 0:00 MPI_FastTest.exe
linuxbsc029: STDOUT: 52780 ? SLl 0:00 MPI_FastTest.exe
linuxbsc030: STDOUT: 47537 ? SLl 0:00 MPI_FastTest.exe
linuxbsc031: STDOUT: 54609 ? SLl 0:00 MPI_FastTest.exe
linuxbsc032: STDOUT: 52833 ? SLl 0:00 MPI_FastTest.exe

$ timex /opt/MPI/openmpi-1.5.3/linux/intel/bin/mpiexec -np 27 --host linuxbsc001,linuxbsc002,linuxbsc003,linuxbsc004,linuxbsc005,linuxbsc006,linuxbsc007,linuxbsc008,linuxbsc009,linuxbsc010,linuxbsc011,linuxbsc014,linuxbsc016,linuxbsc017,linuxbsc018,linuxbsc020,linuxbsc021,linuxbsc022,linuxbsc023,linuxbsc024,linuxbsc025,linuxbsc027,linuxbsc028,linuxbsc029,linuxbsc030,linuxbsc031,linuxbsc032 MPI_FastTest.exe

$ timex
/opt/MPI/openmpi-1.5.3/linux/intel/bin/mpiexec -np 27 --host linuxbsc001,linuxbsc002,linuxbsc003,linuxbsc004,linuxbsc005,linuxbsc006,linuxbsc007,linuxbsc008,linuxbsc009,linuxbsc010,linuxbsc011,linuxbsc014,linuxbsc016,linuxbsc017,linuxbsc018,linuxbsc020,linuxbsc021,linuxbsc022,linuxbsc023,linuxbsc024,linuxbsc027,linuxbsc028,linuxbsc029,linuxbsc030,linuxbsc031,linuxbsc032,linuxbsc025 MPI_FastTest.exe linuxbsc001: STDOUT: 24322 ?Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 1 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh linuxbsc002: STDOUT: 2141 ?Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 2 -mca orte_ess_num_procs 28 --hnp-uri 751435776.0;tcp://134.61.194.2:33210 -mca plm rsh linuxbsc003: STDOUT: 69265 ?Ss 0:00 /opt/MPI/openmpi-1.5.3/linux/intel/bin/orted --daemonize -mca ess env -mca orte_ess_jobid 751435776 -mca orte_ess_vpid 3 -mca
Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
Hi,

Could you please send your config and build log to me? Have you tried with a simpler program? Does this error always happen?

Regards,
Shiqing

On 2011-11-19 4:24 PM, MM wrote:
> Trying to run my program linked against debug 1.5.4 on vs2010 fails:
>
> mpirun -np 1 .\nhui\Debug\nhui.exe : -np 1 .\nhcomp\Debug\nhcomp.exe
> [PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file C:\Program
> Files\openmpi-1.5.4\orte\mca\ess\hnp\ess_hnp_module.c at line 536
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_debugger_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> [PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file C:\Program
> Files\openmpi-1.5.4\orte\runtime\orte_init.c at line 128
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_ess_set_name failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> [LLDNRATDHY9H4J:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file
> C:\Program Files\openmpi-1.5.4\orte\tools\orterun\orterun.c at line 616
>
> any help is appreciated,
> MM

--
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234
Fax: ++49(0)711-685-65832
Nobelstrasse 19
70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de
Re: [OMPI users] openmpi and mingw32?
Hi John,

Yes, there will be an initial build support for MinGW, but a few runtime issues still need to be fixed. If you want to try the current one, please download one of the latest 1.5 nightly tarballs.

Please just let me know if you got problems on that. Feedback would be helpful and appreciated.

Regards,
Shiqing

On 2011-11-20 10:13 PM, John R. Cary wrote:
> Are there plans for mingw32 support in openmpi? If so, any time scale?
>
> I configured with cmake and errored out at
>
> In file included from C:/winsame/builds-mingw/facetsall-mingw/openmpi-1.5.4/opal/include/opal_config_bottom.h:258:0,
> from C:/winsame/builds-mingw/facetsall-mingw/openmpi-1.5.4/try/opal/include/opal_config.h:1423,
> from C:\winsame\builds-mingw\facetsall-mingw\openmpi-1.5.4\try\opal\datatype\opal_datatype_pack_checksum.c:21:
> C:/winsame/builds-mingw/facetsall-mingw/openmpi-1.5.4/opal/win32/win_compat.h:104:14: error: conflicting types for 'ssize_t'
>
> Thx...John

--
Shiqing Fan
High Performance Computing Center Stuttgart (HLRS)
Tel: ++49(0)711-685-87234
Fax: ++49(0)711-685-65832
Nobelstrasse 19
70569 Stuttgart
http://www.hlrs.de/organization/people/shiqing-fan/
email: f...@hlrs.de