Re: [OMPI devel] 0.9.1rc2 is available
- "Jeff Squyres" wrote: > Sweet! :-) > And -- your reply tells me that, for the 2nd time in a single day, I > posted to the wrong list. :-) Ah well, if you'd posted to the right list I wouldn't have seen this. > I'll forward your replies to the hwloc-devel list. Not a problem - I'll go subscribe now. cheers! Chris -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] 0.9.1rc2 is available
On Thu, Oct 22, 2009 at 10:29:36AM +1100, Chris Samuel wrote:
> Dual socket, dual core Power5 (SMT disabled) running SLES9
> (2.6.9 based kernel):
>
>   System(15GB)
>     Node#0(7744MB)
>       P#0
>       P#2
>     Node#1(8000MB)
>       P#4
>       P#6

PowerPC kernels that old do not have the topology information needed (in
/sys or /proc/cpuinfo), so for the short term that's the best we can do.

FWIW, I'm looking at how we can pull more (if not all) of the same info
from the device tree on these kernels.

Yours
Tony
Re: [OMPI devel] 0.9.1rc2 is available
Sweet!

And -- your reply tells me that, for the 2nd time in a single day, I posted
to the wrong list. :-)

I'll forward your replies to the hwloc-devel list.

Thanks!

On Oct 21, 2009, at 7:37 PM, Chris Samuel wrote:

> - "Chris Samuel" wrote:
>
>> Some sample results below for configs not represented
>> on the current website.
>
> A final example of a more convoluted configuration: a Torque job
> requesting 5 CPUs on a dual Shanghai node which has been given a
> non-contiguous configuration.
>
> [csamuel@tango069 ~]$ cat /dev/cpuset/`cat /proc/$$/cpuset`/cpus
> 0,4-7
> [csamuel@tango069 ~]$ ~/local/hwloc/0.9.1rc2/bin/lstopo
> System(31GB)
>   Node#0(15GB) + Socket#0 + L3(6144KB) + L2(512KB) + L1(64KB) + Core#0 + P#0
>   Node#1(16GB) + Socket#1 + L3(6144KB)
>     L2(512KB) + L1(64KB) + Core#0 + P#4
>     L2(512KB) + L1(64KB) + Core#1 + P#5
>     L2(512KB) + L1(64KB) + Core#2 + P#6
>     L2(512KB) + L1(64KB) + Core#3 + P#7
>
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] 0.9.1rc2 is available
- "Chris Samuel" wrote: > Some sample results below for configs not represented > on the current website. A final example of a more convoluted configuration with a Torque job requesting 5 CPUs on a dual Shanghai node and has been given a non-contiguous configuration. [csamuel@tango069 ~]$ cat /dev/cpuset/`cat /proc/$$/cpuset`/cpus 0,4-7 [csamuel@tango069 ~]$ ~/local/hwloc/0.9.1rc2/bin/lstopo System(31GB) Node#0(15GB) + Socket#0 + L3(6144KB) + L2(512KB) + L1(64KB) + Core#0 + P#0 Node#1(16GB) + Socket#1 + L3(6144KB) L2(512KB) + L1(64KB) + Core#0 + P#4 L2(512KB) + L1(64KB) + Core#1 + P#5 L2(512KB) + L1(64KB) + Core#2 + P#6 L2(512KB) + L1(64KB) + Core#3 + P#7 -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
Re: [OMPI devel] 0.9.1rc2 is available
- "Jeff Squyres" wrote: > Give it a whirl: Nice - built without warnings with GCC 4.4.2. Some sample results below for configs not represented on the current website. Dual socket Shanghai: System(31GB) Node#0(15GB) + Socket#0 + L3(6144KB) L2(512KB) + L1(64KB) + Core#0 + P#0 L2(512KB) + L1(64KB) + Core#1 + P#1 L2(512KB) + L1(64KB) + Core#2 + P#2 L2(512KB) + L1(64KB) + Core#3 + P#3 Node#1(16GB) + Socket#1 + L3(6144KB) L2(512KB) + L1(64KB) + Core#0 + P#4 L2(512KB) + L1(64KB) + Core#1 + P#5 L2(512KB) + L1(64KB) + Core#2 + P#6 L2(512KB) + L1(64KB) + Core#3 + P#7 Dual socket single core Opteron: System(3961MB) Node#0(2014MB) + Socket#0 + L2(1024KB) + L1(1024KB) + Core#0 + P#0 Node#1(2017MB) + Socket#1 + L2(1024KB) + L1(1024KB) + Core#0 + P#1 Dual socket, dual core Power5 (SMT disabled) running SLES9 (2.6.9 based kernel): System(15GB) Node#0(7744MB) P#0 P#2 Node#1(8000MB) P#4 P#6 Inside a single CPU Torque job (using cpusets) on a dual socket Shanghai: System(31GB) Node#0(15GB) + Socket#0 + L3(6144KB) + L2(512KB) + L1(64KB) + Core#0 + P#0 Node#1(16GB) -- Christopher Samuel - (03) 9925 4751 - Systems Manager The Victorian Partnership for Advanced Computing P.O. Box 201, Carlton South, VIC 3053, Australia VPAC is a not-for-profit Registered Research Agency
[OMPI devel] MPI_Group_{incl|excl} with nranks=0 and ranks=NULL
Currently (trunk, just svn update'd), the following call fails (because of
the ranks=NULL pointer):

  MPI_Group_{incl|excl}(group, 0, NULL, &newgroup)

BTW, MPI_Group_translate_ranks() has similar issues...

Provided that Open MPI accepts the combination (int_array_size=0,
int_array_ptr=NULL) in other calls, I think it should also accept the NULLs
in the calls above... What do you think?

--
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
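A minimal reproducer for the zero-rank case, as a sketch only; it assumes
the behaviour the MPI standard specifies for n = 0 (MPI_GROUP_EMPTY from
incl, a group identical to the input from excl) and should not error merely
because the ranks pointer is NULL:

    #include <stdio.h>
    #include <mpi.h>

    /* Sketch: zero-length rank lists passed as NULL. */
    int main(int argc, char **argv)
    {
        MPI_Group world_group, g_incl, g_excl;

        MPI_Init(&argc, &argv);
        MPI_Comm_group(MPI_COMM_WORLD, &world_group);

        /* n = 0, ranks = NULL: expected to succeed, not to raise an error. */
        MPI_Group_incl(world_group, 0, NULL, &g_incl);
        MPI_Group_excl(world_group, 0, NULL, &g_excl);

        printf("zero-rank incl/excl returned\n");

        /* Avoid freeing the predefined empty group handle. */
        if (g_incl != MPI_GROUP_EMPTY) MPI_Group_free(&g_incl);
        if (g_excl != MPI_GROUP_EMPTY) MPI_Group_free(&g_excl);
        MPI_Group_free(&world_group);

        MPI_Finalize();
        return 0;
    }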
Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?
On Oct 21, 2009, at 3:32 PM, Brice Goglin wrote:

> George Bosilca wrote:
>> On Oct 21, 2009, at 13:42 , Scott Atchley wrote:
>>> On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:
>>>> Because MX doesn't provide a real RMA protocol, we created a fake one
>>>> on top of point-to-point. The two peers have to agree on a unique tag,
>>>> then the receiver posts it before the sender starts the send. However,
>>>> as this is integrated with the real RMA protocol, where only one side
>>>> knows about the completion of the RMA operation, we still exchange the
>>>> ACK at the end. Therefore, the receiver doesn't need to know when the
>>>> receive is completed, as it will get an ACK from the sender. At least
>>>> this was the original idea.
>>>>
>>>> But I can see how this might fail if the short ACK from the sender
>>>> manages to pass the RMA operation on the wire. I was under the
>>>> impression (based on the fact that MX respects ordering) that mx_send
>>>> will trigger the completion only when all data is on the wire/NIC
>>>> memory, so I supposed there is _absolutely_ no way for the ACK to
>>>> bypass the last RMA fragments and reach the receiver before the recv
>>>> is really completed. If my supposition is not correct, then we should
>>>> remove the mx_forget and make sure that before we mark a fragment as
>>>> completed we got both completions (the one from mx_recv and the remote
>>>> one).
>>>
>>> When is the ACK sent? After the "PUT" completion returns (via
>>> mx_test(), etc.) or simply after calling mx_isend() for the "PUT" but
>>> before the completion?
>>
>> The ACK is sent by the PML layer. If I'm not mistaken, it is sent when
>> the completion callback is triggered, which should happen only when the
>> MX BTL detects the completion of the mx_isend (using mx_test).
>> Therefore, I think the ACK is sent in response to the completion of the
>> mx_isend.
>
> Before or after mx_test() doesn't actually matter if it's a small/medium.
> Even if the send(PUT) completes in mx_test(), the data could still be on
> the wire in case of packet loss or so: if it's a tiny/small/medium
> message (it was a medium in my crash), the MX lib opportunistically
> completes the request on the sender before it's actually acked by the
> receiver. Matching is in order, request completion is not. There's no
> strong delivery guarantee here.
>
> Brice

Yes, I was thinking of the rendezvous case (>32 kB) only.

Scott
Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?
George Bosilca wrote:
> On Oct 21, 2009, at 13:42 , Scott Atchley wrote:
>> On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:
>>> Because MX doesn't provide a real RMA protocol, we created a fake one
>>> on top of point-to-point. The two peers have to agree on a unique tag,
>>> then the receiver posts it before the sender starts the send. However,
>>> as this is integrated with the real RMA protocol, where only one side
>>> knows about the completion of the RMA operation, we still exchange the
>>> ACK at the end. Therefore, the receiver doesn't need to know when the
>>> receive is completed, as it will get an ACK from the sender. At least
>>> this was the original idea.
>>>
>>> But I can see how this might fail if the short ACK from the sender
>>> manages to pass the RMA operation on the wire. I was under the
>>> impression (based on the fact that MX respects ordering) that mx_send
>>> will trigger the completion only when all data is on the wire/NIC
>>> memory, so I supposed there is _absolutely_ no way for the ACK to
>>> bypass the last RMA fragments and reach the receiver before the recv is
>>> really completed. If my supposition is not correct, then we should
>>> remove the mx_forget and make sure that before we mark a fragment as
>>> completed we got both completions (the one from mx_recv and the remote
>>> one).
>>
>> When is the ACK sent? After the "PUT" completion returns (via mx_test(),
>> etc.) or simply after calling mx_isend() for the "PUT" but before the
>> completion?
>
> The ACK is sent by the PML layer. If I'm not mistaken, it is sent when
> the completion callback is triggered, which should happen only when the
> MX BTL detects the completion of the mx_isend (using mx_test). Therefore,
> I think the ACK is sent in response to the completion of the mx_isend.

Before or after mx_test() doesn't actually matter if it's a small/medium.
Even if the send(PUT) completes in mx_test(), the data could still be on the
wire in case of packet loss or so: if it's a tiny/small/medium message (it
was a medium in my crash), the MX lib opportunistically completes the
request on the sender before it's actually acked by the receiver. Matching
is in order, request completion is not. There's no strong delivery guarantee
here.

Brice
[OMPI devel] 0.9.1rc2 is available
Give it a whirl:

  http://www.open-mpi.org/software/hwloc/v0.9/

I updated the docs, too:

  http://www.open-mpi.org/projects/hwloc/doc/

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] Trunk is broken?
Thanks - it's impossible to know what explicit includes are required for
every environment. We have been building the trunk without problem on our
systems. Appreciate the fix!

On Oct 21, 2009, at 10:30 AM, Pavel Shamis (Pasha) wrote:

> It was broken :-(
> I fixed it - r22119
>
> Pasha
>
> Pavel Shamis (Pasha) wrote:
>> On my systems I see the following error:
>>
>>   gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include
>>     -I../../../../orte/include -I../../../../ompi/include
>>     -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../..
>>     -O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare
>>     -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
>>     -Werror-implicit-function-declaration -finline-functions
>>     -fno-strict-aliasing -pthread -fvisibility=hidden -MT sensor_pru.lo
>>     -MD -MP -MF .deps/sensor_pru.Tpo -c sensor_pru.c -fPIC -DPIC
>>     -o .libs/sensor_pru.o
>>   sensor_pru_component.c: In function 'orte_sensor_pru_open':
>>   sensor_pru_component.c:77: error: implicit declaration of function
>>     'opal_output'
>>
>> Looks like the sensor code is broken.
>>
>> Thanks,
>> Pasha
Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?
On Oct 21, 2009, at 13:42 , Scott Atchley wrote:

> On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:
>> Brice,
>>
>> Because MX doesn't provide a real RMA protocol, we created a fake one on
>> top of point-to-point. The two peers have to agree on a unique tag, then
>> the receiver posts it before the sender starts the send. However, as
>> this is integrated with the real RMA protocol, where only one side knows
>> about the completion of the RMA operation, we still exchange the ACK at
>> the end. Therefore, the receiver doesn't need to know when the receive
>> is completed, as it will get an ACK from the sender. At least this was
>> the original idea.
>>
>> But I can see how this might fail if the short ACK from the sender
>> manages to pass the RMA operation on the wire. I was under the
>> impression (based on the fact that MX respects ordering) that mx_send
>> will trigger the completion only when all data is on the wire/NIC
>> memory, so I supposed there is _absolutely_ no way for the ACK to bypass
>> the last RMA fragments and reach the receiver before the recv is really
>> completed. If my supposition is not correct, then we should remove the
>> mx_forget and make sure that before we mark a fragment as completed we
>> got both completions (the one from mx_recv and the remote one).
>
> George,
>
> When is the ACK sent? After the "PUT" completion returns (via mx_test(),
> etc.) or simply after calling mx_isend() for the "PUT" but before the
> completion?

The ACK is sent by the PML layer. If I'm not mistaken, it is sent when the
completion callback is triggered, which should happen only when the MX BTL
detects the completion of the mx_isend (using mx_test). Therefore, I think
the ACK is sent in response to the completion of the mx_isend.

  george.

> If the former, the ACK cannot pass the data. If the latter, it is easily
> possible, especially if there is a lot of contention (and thus a lot of
> route dispersion). MX only guarantees order of matching (two identical
> tags will match in order), not order of completion.
>
> Scott
Re: [OMPI devel] trac ticket emails
Blah; wrong list -- sorry!

On Oct 21, 2009, at 2:03 PM, Jeff Squyres wrote:

> The IU sysadmins fixed something with trac today such that we should now
> get mails for trac ticket actions (to the hwloc-bugs list).
>
> --
> Jeff Squyres
> jsquy...@cisco.com

--
Jeff Squyres
jsquy...@cisco.com
[OMPI devel] trac ticket emails
The IU sysadmins fixed something with trac today such that we should now get
mails for trac ticket actions (to the hwloc-bugs list).

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?
On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:

> Brice,
>
> Because MX doesn't provide a real RMA protocol, we created a fake one on
> top of point-to-point. The two peers have to agree on a unique tag, then
> the receiver posts it before the sender starts the send. However, as this
> is integrated with the real RMA protocol, where only one side knows about
> the completion of the RMA operation, we still exchange the ACK at the
> end. Therefore, the receiver doesn't need to know when the receive is
> completed, as it will get an ACK from the sender. At least this was the
> original idea.
>
> But I can see how this might fail if the short ACK from the sender
> manages to pass the RMA operation on the wire. I was under the impression
> (based on the fact that MX respects ordering) that mx_send will trigger
> the completion only when all data is on the wire/NIC memory, so I
> supposed there is _absolutely_ no way for the ACK to bypass the last RMA
> fragments and reach the receiver before the recv is really completed. If
> my supposition is not correct, then we should remove the mx_forget and
> make sure that before we mark a fragment as completed we got both
> completions (the one from mx_recv and the remote one).

George,

When is the ACK sent? After the "PUT" completion returns (via mx_test(),
etc.) or simply after calling mx_isend() for the "PUT" but before the
completion?

If the former, the ACK cannot pass the data. If the latter, it is easily
possible, especially if there is a lot of contention (and thus a lot of
route dispersion). MX only guarantees order of matching (two identical tags
will match in order), not order of completion.

Scott
Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?
Brice,

Because MX doesn't provide a real RMA protocol, we created a fake one on top
of point-to-point. The two peers have to agree on a unique tag, then the
receiver posts it before the sender starts the send. However, as this is
integrated with the real RMA protocol, where only one side knows about the
completion of the RMA operation, we still exchange the ACK at the end.
Therefore, the receiver doesn't need to know when the receive is completed,
as it will get an ACK from the sender. At least this was the original idea.

But I can see how this might fail if the short ACK from the sender manages
to pass the RMA operation on the wire. I was under the impression (based on
the fact that MX respects ordering) that mx_send will trigger the completion
only when all data is on the wire/NIC memory, so I supposed there is
_absolutely_ no way for the ACK to bypass the last RMA fragments and reach
the receiver before the recv is really completed. If my supposition is not
correct, then we should remove the mx_forget and make sure that before we
mark a fragment as completed we got both completions (the one from mx_recv
and the remote one).

  george.

On Oct 21, 2009, at 04:33 , Brice Goglin wrote:

> Hello,
>
> I am debugging a crash with OMPI 1.3.3 BTL over Open-MX. It's crashing
> while trying to store incoming data in the OMPI receive buffer, but OMPI
> seems to have already freed the buffer even if the MX request is not
> complete yet. It looks like this is caused by mca_btl_mx_prepare_dst()
> posting the receive and then calling mx_forget() immediately. The OMPI
> r17452 commit by George introduced this. The commit log says "Improve the
> performance of the MX BTL. Correct the fake PUT protocol."
>
> I don't understand how this works. mx_forget() is supposed to be used
> when you don't care anymore about a message or a request, not really for
> performance purposes. It should not help much in "normal" cases since you
> usually need to know when the receive request is completed before you can
> actually use the received data. And completion order is not guaranteed
> anyway, so it's hard to guess when a request will complete if mx_forget()
> disabled the actual completion notification.
>
> Are you calling mx_forget() because you have another way to know when the
> message will be received? If so, how? When does OMPI free the fragment
> that is passed to mx_irecv in mca_btl_mx_prepare_dst?
>
> thanks,
> Brice
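For reference, a rough sketch of the exchange described above: the receiver
posts the tagged receive and forgets the request, and the sender's local
completion of the matching send is what ultimately drives the PML-level ACK.
This is only an illustration under assumptions: it uses the standard MX
point-to-point calls (mx_irecv, mx_isend, mx_test, mx_forget) with an
exact-match mask, and the header name, function names and variables are
hypothetical, not the actual Open MPI BTL symbols.

    #include <myriexpress.h>   /* assumption: standard MX header name */

    /* Receiver side (the mca_btl_mx_prepare_dst role): post the receive for
     * the agreed tag, then forget the request. The local completion is never
     * consumed; only the sender-driven PML ACK tells the receiver the data
     * has landed. */
    static void fake_put_post_recv(mx_endpoint_t ep, void *buf, uint32_t len,
                                   uint64_t agreed_tag)
    {
        mx_segment_t seg = { .segment_ptr = buf, .segment_length = len };
        mx_request_t req;

        mx_irecv(ep, &seg, 1, agreed_tag,
                 0xffffffffffffffffULL /* exact-match mask */, NULL, &req);
        mx_forget(ep, &req);
    }

    /* Sender side: send with the same tag; once mx_test() reports the send
     * complete, the BTL completion callback fires and the PML sends the ACK. */
    static void fake_put_send(mx_endpoint_t ep, mx_endpoint_addr_t dest,
                              void *buf, uint32_t len, uint64_t agreed_tag)
    {
        mx_segment_t seg = { .segment_ptr = buf, .segment_length = len };
        mx_request_t req;
        mx_status_t status;
        uint32_t done = 0;

        mx_isend(ep, &seg, 1, dest, agreed_tag, NULL, &req);
        while (!done)
            mx_test(ep, &req, &status, &done);
        /* -> completion callback -> PML ACK to the receiver */
    }

The thread's concern is exactly the gap this sketch exposes: for small and
medium messages the sender-side completion can be reported before the data
is actually delivered, so the ACK can beat the data to the receiver.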
Re: [OMPI devel] Trunk is broken?
It was broken :-(
I fixed it - r22119

Pasha

Pavel Shamis (Pasha) wrote:
> On my systems I see the following error:
>
>   gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include
>     -I../../../../orte/include -I../../../../ompi/include
>     -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../..
>     -O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare
>     -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
>     -Werror-implicit-function-declaration -finline-functions
>     -fno-strict-aliasing -pthread -fvisibility=hidden -MT sensor_pru.lo
>     -MD -MP -MF .deps/sensor_pru.Tpo -c sensor_pru.c -fPIC -DPIC
>     -o .libs/sensor_pru.o
>   sensor_pru_component.c: In function 'orte_sensor_pru_open':
>   sensor_pru_component.c:77: error: implicit declaration of function
>     'opal_output'
>
> Looks like the sensor code is broken.
>
> Thanks,
> Pasha
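For context, this class of error means the function was called without a
declaration in scope, which -Werror-implicit-function-declaration turns into
a hard error on some compilers/environments. The usual fix is simply to add
the header that declares the symbol; the sketch below assumes opal_output()
is declared in opal/util/output.h and is not a verbatim copy of r22119.

    /* sensor_pru_component.c -- presumed shape of the fix: pull in the
     * declaration of opal_output() so the call is no longer an implicit
     * declaration (assumption: the prototype lives in opal/util/output.h). */
    #include "opal/util/output.h"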
[OMPI devel] Trunk is broken?
On my systems I see the following error:

  gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include
    -I../../../../orte/include -I../../../../ompi/include
    -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../..
    -O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare
    -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
    -Werror-implicit-function-declaration -finline-functions
    -fno-strict-aliasing -pthread -fvisibility=hidden -MT sensor_pru.lo
    -MD -MP -MF .deps/sensor_pru.Tpo -c sensor_pru.c -fPIC -DPIC
    -o .libs/sensor_pru.o
  sensor_pru_component.c: In function 'orte_sensor_pru_open':
  sensor_pru_component.c:77: error: implicit declaration of function
    'opal_output'

Looks like the sensor code is broken.

Thanks,
Pasha
[OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?
Hello,

I am debugging a crash with OMPI 1.3.3 BTL over Open-MX. It's crashing while
trying to store incoming data in the OMPI receive buffer, but OMPI seems to
have already freed the buffer even if the MX request is not complete yet. It
looks like this is caused by mca_btl_mx_prepare_dst() posting the receive
and then calling mx_forget() immediately. The OMPI r17452 commit by George
introduced this. The commit log says "Improve the performance of the MX BTL.
Correct the fake PUT protocol."

I don't understand how this works. mx_forget() is supposed to be used when
you don't care anymore about a message or a request, not really for
performance purposes. It should not help much in "normal" cases since you
usually need to know when the receive request is completed before you can
actually use the received data. And completion order is not guaranteed
anyway, so it's hard to guess when a request will complete if mx_forget()
disabled the actual completion notification.

Are you calling mx_forget() because you have another way to know when the
message will be received? If so, how? When does OMPI free the fragment that
is passed to mx_irecv in mca_btl_mx_prepare_dst?

thanks,
Brice
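For contrast with the earlier sketch of the fake PUT exchange, this is what
the receive side would look like without the mx_forget(), i.e. the direction
suggested earlier in the thread of only considering a fragment complete once
the local receive completion has actually been observed. Same caveats as
before: standard MX calls are assumed, and the names are illustrative rather
than the real mca_btl_mx_prepare_dst code.

    #include <myriexpress.h>   /* assumption: standard MX header name */

    /* Keep the receive request and wait for its completion before the buffer
     * is freed or reused, instead of forgetting the request up front. */
    static void recv_then_wait(mx_endpoint_t ep, void *buf, uint32_t len,
                               uint64_t agreed_tag)
    {
        mx_segment_t seg = { .segment_ptr = buf, .segment_length = len };
        mx_request_t req;
        mx_status_t status;
        uint32_t done = 0;

        mx_irecv(ep, &seg, 1, agreed_tag,
                 0xffffffffffffffffULL /* exact-match mask */, NULL, &req);

        /* Without mx_forget(), the completion stays observable: only release
         * the fragment once the request has really completed. */
        while (!done)
            mx_test(ep, &req, &status, &done);
    }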