Re: [OMPI devel] 0.9.1rc2 is available

2009-10-21 Thread Chris Samuel

- "Jeff Squyres"  wrote:

> Sweet!

:-)

> And -- your reply tells me that, for the 2nd time in a single day, I 
> posted to the wrong list.  :-)

Ah well, if you'd posted to the right list I wouldn't
have seen this.

> I'll forward your replies to the hwloc-devel list.

Not a problem - I'll go subscribe now.

cheers!
Chris
-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] 0.9.1rc2 is available

2009-10-21 Thread Tony Breeds
On Thu, Oct 22, 2009 at 10:29:36AM +1100, Chris Samuel wrote:

> Dual socket, dual core Power5 (SMT disabled) running SLES9
> (2.6.9 based kernel):
> 
> System(15GB)
>   Node#0(7744MB)
> P#0
> P#2
>   Node#1(8000MB)
> P#4
> P#6

Powerpc kernels that old do not have the topology information needed (in /sys
or /proc/cpuinfo), so for the short term that's the best we can do.  FWIW I'm
looking at how we can pull more (if not all) of the same info from the device
tree on these kernels.
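
As a starting point, here is a rough sketch (plain C, not the eventual hwloc
code) that just lists the CPU nodes the flattened device tree exposes under
/proc/device-tree/cpus; parsing the per-CPU properties is the part still to
be worked out:

  #include <dirent.h>
  #include <stdio.h>

  int main(void)
  {
      /* Each CPU shows up as a node under /proc/device-tree/cpus. */
      DIR *d = opendir("/proc/device-tree/cpus");
      struct dirent *ent;

      if (d == NULL) {
          perror("opendir(/proc/device-tree/cpus)");
          return 1;
      }
      while ((ent = readdir(d)) != NULL) {
          if (ent->d_name[0] != '.')
              printf("cpu node: %s\n", ent->d_name);
      }
      closedir(d);
      return 0;
  }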

Yours Tony


Re: [OMPI devel] 0.9.1rc2 is available

2009-10-21 Thread Jeff Squyres

Sweet!

And -- your reply tells me that, for the 2nd time in a single day, I  
posted to the wrong list.  :-)


I'll forward your replies to the hwloc-devel list.

Thanks!


On Oct 21, 2009, at 7:37 PM, Chris Samuel wrote:



- "Chris Samuel"  wrote:

> Some sample results below for configs not represented
> on the current website.

A final example of a more convoluted configuration, where a
Torque job requesting 5 CPUs on a dual Shanghai node has been
given a non-contiguous allocation.

[csamuel@tango069 ~]$ cat /dev/cpuset/`cat /proc/$$/cpuset`/cpus
0,4-7

[csamuel@tango069 ~]$ ~/local/hwloc/0.9.1rc2/bin/lstopo
System(31GB)
  Node#0(15GB) + Socket#0 + L3(6144KB) + L2(512KB) + L1(64KB) + Core#0 + P#0
  Node#1(16GB) + Socket#1 + L3(6144KB)
L2(512KB) + L1(64KB) + Core#0 + P#4
L2(512KB) + L1(64KB) + Core#1 + P#5
L2(512KB) + L1(64KB) + Core#2 + P#6
L2(512KB) + L1(64KB) + Core#3 + P#7

--
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] 0.9.1rc2 is available

2009-10-21 Thread Chris Samuel

- "Chris Samuel"  wrote:

> Some sample results below for configs not represented
> on the current website.

A final example of a more convoluted configuration, where a
Torque job requesting 5 CPUs on a dual Shanghai node has been
given a non-contiguous allocation.

[csamuel@tango069 ~]$ cat /dev/cpuset/`cat /proc/$$/cpuset`/cpus
0,4-7

[csamuel@tango069 ~]$ ~/local/hwloc/0.9.1rc2/bin/lstopo
System(31GB)
  Node#0(15GB) + Socket#0 + L3(6144KB) + L2(512KB) + L1(64KB) + Core#0 + P#0
  Node#1(16GB) + Socket#1 + L3(6144KB)
L2(512KB) + L1(64KB) + Core#0 + P#4
L2(512KB) + L1(64KB) + Core#1 + P#5
L2(512KB) + L1(64KB) + Core#2 + P#6
L2(512KB) + L1(64KB) + Core#3 + P#7
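
For reference, the cpuset one-liner above in C (just a sketch, assuming the
cpuset pseudo-filesystem is mounted at /dev/cpuset as on this cluster):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      char name[1024], path[2048], cpus[1024];

      /* Name of the cpuset the current task belongs to. */
      FILE *f = fopen("/proc/self/cpuset", "r");
      if (f == NULL || fgets(name, sizeof(name), f) == NULL) {
          perror("/proc/self/cpuset");
          return 1;
      }
      fclose(f);
      name[strcspn(name, "\n")] = '\0';

      /* The "cpus" file of that cpuset lists the allowed processors. */
      snprintf(path, sizeof(path), "/dev/cpuset%s/cpus", name);
      f = fopen(path, "r");
      if (f == NULL || fgets(cpus, sizeof(cpus), f) == NULL) {
          perror(path);
          return 1;
      }
      fclose(f);

      printf("%s", cpus);   /* e.g. "0,4-7" for the job above */
      return 0;
  }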

-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


Re: [OMPI devel] 0.9.1rc2 is available

2009-10-21 Thread Chris Samuel

- "Jeff Squyres"  wrote:

> Give it a whirl:

Nice - built without warnings with GCC 4.4.2.

Some sample results below for configs not represented
on the current website.


Dual socket Shanghai:

System(31GB)
  Node#0(15GB) + Socket#0 + L3(6144KB)
L2(512KB) + L1(64KB) + Core#0 + P#0
L2(512KB) + L1(64KB) + Core#1 + P#1
L2(512KB) + L1(64KB) + Core#2 + P#2
L2(512KB) + L1(64KB) + Core#3 + P#3
  Node#1(16GB) + Socket#1 + L3(6144KB)
L2(512KB) + L1(64KB) + Core#0 + P#4
L2(512KB) + L1(64KB) + Core#1 + P#5
L2(512KB) + L1(64KB) + Core#2 + P#6
L2(512KB) + L1(64KB) + Core#3 + P#7


Dual socket single core Opteron:

System(3961MB)
  Node#0(2014MB) + Socket#0 + L2(1024KB) + L1(1024KB) + Core#0 + P#0
  Node#1(2017MB) + Socket#1 + L2(1024KB) + L1(1024KB) + Core#0 + P#1


Dual socket, dual core Power5 (SMT disabled) running SLES9
(2.6.9 based kernel):

System(15GB)
  Node#0(7744MB)
P#0
P#2
  Node#1(8000MB)
P#4
P#6


Inside a single CPU Torque job (using cpusets) on a dual socket Shanghai:

System(31GB)
  Node#0(15GB) + Socket#0 + L3(6144KB) + L2(512KB) + L1(64KB) + Core#0 + P#0
  Node#1(16GB)
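
For anyone wanting the same information programmatically, a small sketch
against the hwloc C API (some names have changed since 0.9.x, so treat this
as illustrative rather than something tested against 0.9.1rc2):

  #include <hwloc.h>
  #include <stdio.h>

  int main(void)
  {
      hwloc_topology_t topo;
      int ncores, npus, i;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
      npus   = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
      printf("%d cores, %d logical processors\n", ncores, npus);

      /* Print the OS index of every logical processor, as in the P#n lines */
      for (i = 0; i < npus; i++) {
          hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
          printf("P#%u\n", pu->os_index);
      }

      hwloc_topology_destroy(topo);
      return 0;
  }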


-- 
Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency


[OMPI devel] MPI_Group_{incl|exc} with nranks=0 and ranks=NULL

2009-10-21 Thread Lisandro Dalcin
Currently (trunk, just svn update'd), the following call fails
(because of the ranks=NULL pointer)

MPI_Group_{incl|excl}(group, 0, NULL, &newgroup)

BTW, MPI_Group_translate_ranks() has similar issues...


Provided that Open MPI accepts the combination (int_array_size=0,
int_array_ptr=NULL) in other calls, I think it should also accept the
NULLs in the calls above... What do you think?
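
A minimal reproducer, as a sketch (error checking omitted):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      MPI_Group world_group, empty_incl, full_excl;

      MPI_Init(&argc, &argv);
      MPI_Comm_group(MPI_COMM_WORLD, &world_group);

      /* Both calls pass a zero-length ranks array as NULL. */
      MPI_Group_incl(world_group, 0, NULL, &empty_incl);  /* -> MPI_GROUP_EMPTY   */
      MPI_Group_excl(world_group, 0, NULL, &full_excl);   /* -> same as the group */

      /* MPI_Group_translate_ranks(world_group, 0, NULL, full_excl, NULL);
         has the same n=0 / NULL question. */

      printf("calls returned\n");

      /* (Frees of the derived groups omitted: with n=0, MPI_Group_incl may
         return the predefined MPI_GROUP_EMPTY, which must not be freed.) */
      MPI_Group_free(&world_group);
      MPI_Finalize();
      return 0;
  }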


-- 
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594



Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?

2009-10-21 Thread Scott Atchley

On Oct 21, 2009, at 3:32 PM, Brice Goglin wrote:


> George Bosilca wrote:
>> On Oct 21, 2009, at 13:42 , Scott Atchley wrote:
>>> On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:
>>>> Because MX doesn't provide a real RMA protocol, we created a fake
>>>> one on top of point-to-point. The two peers have to agree on a
>>>> unique tag, then the receiver posts it before the sender starts the
>>>> send. However, as this is integrated with the real RMA protocol,
>>>> where only one side knows about the completion of the RMA operation,
>>>> we still exchange the ACK at the end. Therefore, the receiver
>>>> doesn't need to know when the receive is completed, as it will get
>>>> an ACK from the sender. At least this was the original idea.
>>>>
>>>> But I can see how this might fail if the short ACK from the sender
>>>> manages to pass the RMA operation on the wire. I was under the
>>>> impression (based on the fact that MX respects ordering) that the
>>>> mx_send will trigger the completion only when all data is on the
>>>> wire/nic memory, so I supposed there is _absolutely_ no way for the
>>>> ACK to bypass the last RMA fragments and to reach the receiver
>>>> before the recv is really completed. If my supposition is not
>>>> correct, then we should remove the mx_forget and make sure that
>>>> before we mark a fragment as completed we got both completions (the
>>>> one from mx_recv and the remote one).
>>>
>>> When is the ACK sent? After the "PUT" completion returns (via
>>> mx_test(), etc) or simply after calling mx_isend() for the "PUT" but
>>> before the completion?
>>
>> The ACK is sent by the PML layer. If I'm not mistaken, it is sent when
>> the completion callback is triggered, which should happen only when
>> the MX BTL detects the completion of the mx_isend (using mx_test).
>> Therefore, I think the ACK is sent in response to the completion of
>> the mx_isend.
>
> Before or after mx_test() doesn't actually matter if it's a
> small/medium. Even if the send(PUT) completes in mx_test(), the data
> could still be on the wire in case of packet loss or so: if it's a
> tiny/small/medium message (it was a medium in my crash), the MX lib
> opportunistically completes the request on the sender before it's
> actually acked by the receiver. Matching is in order, request completion
> is not. There's no strong delivery guarantee here.
>
> Brice


Yes, I was thinking of the rendezvous case (>32 kB) only.

Scott


Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?

2009-10-21 Thread Brice Goglin
George Bosilca wrote:
> On Oct 21, 2009, at 13:42 , Scott Atchley wrote:
>> On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:
>>> Because MX doesn't provide a real RMA protocol, we created a fake
>>> one on top of point-to-point. The two peers have to agree on a
>>> unique tag, then the receiver posts it before the sender starts the
>>> send. However, as this is integrated with the real RMA protocol,
>>> where only one side knows about the completion of the RMA operation,
>>> we still exchange the ACK at the end. Therefore, the receiver
>>> doesn't need to know when the receive is completed, as it will get
>>> an ACK from the sender. At least this was the original idea.
>>>
>>> But I can see how this might fail if the short ACK from the sender
>>> manages to pass the RMA operation on the wire. I was under the
>>> impression (based on the fact that MX respects ordering) that the
>>> mx_send will trigger the completion only when all data is on the
>>> wire/nic memory, so I supposed there is _absolutely_ no way for the
>>> ACK to bypass the last RMA fragments and to reach the receiver
>>> before the recv is really completed. If my supposition is not
>>> correct, then we should remove the mx_forget and make sure that
>>> before we mark a fragment as completed we got both completions (the
>>> one from mx_recv and the remote one).
>>
>> When is the ACK sent? After the "PUT" completion returns (via
>> mx_test(), etc) or simply after calling mx_isend() for the "PUT" but
>> before the completion?
>
> The ACK is sent by the PML layer. If I'm not mistaken, it is sent when
> the completion callback is triggered, which should happen only when
> the MX BTL detects the completion of the mx_isend (using mx_test).
> Therefore, I think the ACK is sent in response to the completion of
> the mx_isend.

Before or after mx_test() doesn't actually matter if it's a
small/medium. Even if the send(PUT) completes in mx_test(), the data
could still be on the wire in case of packet loss or so: if it's a
tiny/small/medium message (it was a medium in my crash), the MX lib
opportunistically completes the request on the sender before it's
actually acked by the receiver. Matching is in order, request completion
is not. There's no strong delivery guarantee here.

Brice



[OMPI devel] 0.9.1rc2 is available

2009-10-21 Thread Jeff Squyres

Give it a whirl:

http://www.open-mpi.org/software/hwloc/v0.9/

I updated the docs, too:

http://www.open-mpi.org/projects/hwloc/doc/

--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] Trunk is brokem ?

2009-10-21 Thread Ralph Castain
Thanks - impossible to know what explicit includes are required for  
every environment. We have been building the trunk without problem on  
our systems.


Appreciate the fix!

On Oct 21, 2009, at 10:30 AM, Pavel Shamis (Pasha) wrote:


It was broken :-(
I fixed it - r22119

Pasha

Pavel Shamis (Pasha) wrote:

On my systems I see the following error:

gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include -I../../../../orte/include
-I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa
-I../../../.. -O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
-Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing
-pthread -fvisibility=hidden -MT sensor_pru.lo -MD -MP -MF .deps/sensor_pru.Tpo
-c sensor_pru.c -fPIC -DPIC -o .libs/sensor_pru.o

sensor_pru_component.c: In function 'orte_sensor_pru_open':
sensor_pru_component.c:77: error: implicit declaration of function  
'opal_output'


Looks like the sensor code is broken.

Thanks,
Pasha







Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?

2009-10-21 Thread George Bosilca


On Oct 21, 2009, at 13:42 , Scott Atchley wrote:


> On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:
>
>> Brice,
>>
>> Because MX doesn't provide a real RMA protocol, we created a fake
>> one on top of point-to-point. The two peers have to agree on a
>> unique tag, then the receiver posts it before the sender starts the
>> send. However, as this is integrated with the real RMA protocol,
>> where only one side knows about the completion of the RMA operation,
>> we still exchange the ACK at the end. Therefore, the receiver
>> doesn't need to know when the receive is completed, as it will get
>> an ACK from the sender. At least this was the original idea.
>>
>> But I can see how this might fail if the short ACK from the sender
>> manages to pass the RMA operation on the wire. I was under the
>> impression (based on the fact that MX respects ordering) that the
>> mx_send will trigger the completion only when all data is on the
>> wire/nic memory, so I supposed there is _absolutely_ no way for the
>> ACK to bypass the last RMA fragments and to reach the receiver
>> before the recv is really completed. If my supposition is not
>> correct, then we should remove the mx_forget and make sure that
>> before we mark a fragment as completed we got both completions (the
>> one from mx_recv and the remote one).
>
> George,
>
> When is the ACK sent? After the "PUT" completion returns (via
> mx_test(), etc) or simply after calling mx_isend() for the "PUT" but
> before the completion?

The ACK is sent by the PML layer. If I'm not mistaken, it is sent when
the completion callback is triggered, which should happen only when
the MX BTL detects the completion of the mx_isend (using mx_test).
Therefore, I think the ACK is sent in response to the completion of
the mx_isend.

  george.

> If the former, the ACK cannot pass the data. If the latter, it is
> easily possible especially if there is a lot of contention (and thus
> a lot of route dispersion).
>
> MX only guarantees order of matching (two identical tags will match
> in order), not order of completion.
>
> Scott




Re: [OMPI devel] trac ticket emails

2009-10-21 Thread Jeff Squyres

Blah; wrong list -- sorry!

On Oct 21, 2009, at 2:03 PM, Jeff Squyres wrote:

The IU sysadmins fixed something with trac today such that we should  
now get mails for trac ticket actions (to the hwloc-bugs list).


--
Jeff Squyres
jsquy...@cisco.com




--
Jeff Squyres
jsquy...@cisco.com



[OMPI devel] trac ticket emails

2009-10-21 Thread Jeff Squyres
The IU sysadmins fixed something with trac today such that we should  
now get mails for trac ticket actions (to the hwloc-bugs list).


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?

2009-10-21 Thread Scott Atchley

On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:


> Brice,
>
> Because MX doesn't provide a real RMA protocol, we created a fake
> one on top of point-to-point. The two peers have to agree on a
> unique tag, then the receiver posts it before the sender starts the
> send. However, as this is integrated with the real RMA protocol,
> where only one side knows about the completion of the RMA operation,
> we still exchange the ACK at the end. Therefore, the receiver
> doesn't need to know when the receive is completed, as it will get
> an ACK from the sender. At least this was the original idea.
>
> But I can see how this might fail if the short ACK from the sender
> manages to pass the RMA operation on the wire. I was under the
> impression (based on the fact that MX respects ordering) that the
> mx_send will trigger the completion only when all data is on the
> wire/nic memory, so I supposed there is _absolutely_ no way for the
> ACK to bypass the last RMA fragments and to reach the receiver
> before the recv is really completed. If my supposition is not
> correct, then we should remove the mx_forget and make sure that
> before we mark a fragment as completed we got both completions (the
> one from mx_recv and the remote one).


George,

When is the ACK sent? After the "PUT" completion returns (via  
mx_test(), etc) or simply after calling mx_isend() for the "PUT" but  
before the completion?


If the former, the ACK cannot pass the data. If the latter, it is  
easily possible especially if there is a lot of contention (and thus a  
lot of route dispersion).


MX only guarantees order of matching (two identical tags will match in  
order), not order of completion.


Scott


Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?

2009-10-21 Thread George Bosilca

Brice,

Because MX doesn't provide a real RMA protocol, we created a fake one  
on top of point-to-point. The two peers have to agree on a unique tag,  
then the receiver posts it before the sender starts the send. However,  
as this is integrated with the real RMA protocol, where only one side  
knows about the completion of the RMA operation, we still exchange the  
ACK at the end. Therefore, the receiver doesn't need to know when the  
receive is completed, as it will get an ACK from the sender. At least  
this was the original idea.


But I can see how this might fail if the short ACK from the sender
manages to pass the RMA operation on the wire. I was under the
impression (based on the fact that MX respects ordering) that the
mx_send will trigger the completion only when all data is on the
wire/nic memory, so I supposed there is _absolutely_ no way for the
ACK to bypass the last RMA fragments and to reach the receiver before
the recv is really completed. If my supposition is not correct, then
we should remove the mx_forget and make sure that before we mark a
fragment as completed we got both completions (the one from mx_recv
and the remote one).
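
For what it's worth, here is a minimal sketch in plain C (not the real BTL
data structures) of that "wait for both completions" alternative: the
fragment is only handed back once both the local mx_irecv completion and the
PML-level ACK have been seen, in whichever order they arrive.

  #include <stdbool.h>
  #include <stdio.h>

  struct rma_frag {
      bool recv_done;   /* local MX receive completion observed (e.g. via mx_test) */
      bool ack_done;    /* PML-level ACK from the sender has arrived               */
  };

  /* Only report the fragment as complete once BOTH events have been seen. */
  static void frag_maybe_complete(struct rma_frag *frag)
  {
      if (frag->recv_done && frag->ack_done)
          printf("fragment complete: safe to return the buffer\n");
  }

  static void on_mx_recv_complete(struct rma_frag *frag)
  {
      frag->recv_done = true;
      frag_maybe_complete(frag);
  }

  static void on_remote_ack(struct rma_frag *frag)
  {
      frag->ack_done = true;
      frag_maybe_complete(frag);
  }

  int main(void)
  {
      struct rma_frag frag = { false, false };

      /* The problematic ordering: the ACK shows up first... */
      on_remote_ack(&frag);          /* not complete yet                 */
      on_mx_recv_complete(&frag);    /* ...and only now is the frag done */
      return 0;
  }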


  george.

On Oct 21, 2009, at 04:33 , Brice Goglin wrote:


Hello,

I am debugging a crash with OMPI 1.3.3 BTL over Open-MX. It's crashing
while trying to store incoming data in the OMPI receive buffer, but OMPI
seems to have already freed the buffer even if the MX request is not
complete yet. It looks like this is caused by mca_btl_mx_prepare_dst()
posting the receive and then calling mx_forget() immediately. The OMPI
r17452 by George introduced this. Commit log says "Improve the
performance of the MX BTL. Correct the fake PUT protocol." I don't
understand how this works.

mx_forget() is supposed to be used when you don't care anymore about a
message or a request, not really for performance purposes. It should not
help much in "normal" cases since you usually need to know when the
receive request is completed before you can actually use the received
data. And completion order is not guaranteed anyway, so it's hard to
guess when a request will complete if mx_forget() disabled the actual
completion notification.

Are you calling mx_forget() because you have another way to know when
the message will be received? If so, how?

When does OMPI free the fragment that is passed to mx_irecv in
mca_btl_mx_prepare_dst?

thanks,
Brice





Re: [OMPI devel] Trunk is brokem ?

2009-10-21 Thread Pavel Shamis (Pasha)

It was broken :-(
I fixed it - r22119
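
For the record, "implicit declaration of function 'opal_output'" just means
the prototype isn't visible in that translation unit, so the fix is
essentially nothing more than adding the missing include to
sensor_pru_component.c, along the lines of:

  /* opal_output() is declared here */
  #include "opal/util/output.h"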

Pasha

Pavel Shamis (Pasha) wrote:

On my systems I see the following error:

gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include 
-I../../../../orte/include -I../../../../ompi/include 
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. 
-O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare 
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic 
-Werror-implicit-function-declaration -finline-functions 
-fno-strict-aliasing -pthread -fvisibility=hidden -MT sensor_pru.lo 
-MD -MP -MF .deps/sensor_pru.Tpo -c sensor_pru.c -fPIC -DPIC -o 
.libs/sensor_pru.o

sensor_pru_component.c: In function 'orte_sensor_pru_open':
sensor_pru_component.c:77: error: implicit declaration of function 
'opal_output'


Looks like the sensor code is broken.

Thanks,
Pasha





[OMPI devel] Trunk is brokem ?

2009-10-21 Thread Pavel Shamis (Pasha)

On my systems I see the following error:

gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include 
-I../../../../orte/include -I../../../../ompi/include 
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. 
-O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare 
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic 
-Werror-implicit-function-declaration -finline-functions 
-fno-strict-aliasing -pthread -fvisibility=hidden -MT sensor_pru.lo -MD 
-MP -MF .deps/sensor_pru.Tpo -c sensor_pru.c  -fPIC -DPIC -o 
.libs/sensor_pru.o

sensor_pru_component.c: In function 'orte_sensor_pru_open':
sensor_pru_component.c:77: error: implicit declaration of function 
'opal_output'


Looks like the sensor code is broken.

Thanks,
Pasha


[OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?

2009-10-21 Thread Brice Goglin
Hello,

I am debugging a crash with OMPI 1.3.3 BTL over Open-MX. It's crashing
while trying to store incoming data in the OMPI receive buffer, but OMPI
seems to have already freed the buffer even if the MX request is not
complete yet. It looks like this is caused by mca_btl_mx_prepare_dst()
posting the receive and then calling mx_forget() immediately. The OMPI
r17452 by George introduced this. Commit log says "Improve the
performance of the MX BTL. Correct the fake PUT protocol." I don't
understand how this works.

mx_forget() is supposed to be used when you don't care anymore about a
message or a request, not really for performance purposes. It should not
help much in "normal" cases since you usually need to know when the
receive request is completed before you can actually use the received
data. And completion order is not guaranteed anyway, so it's hard to
guess when a request will complete if mx_forget() disabled the actual
completion notification.

Are you calling mx_forget() because you have another way to know when
the message will be received? If so, how?

When does OMPI free the fragment that is passed to mx_irecv in
mca_btl_mx_prepare_dst?

thanks,
Brice