Re: [OMPI devel] iprobe and opal_progress

2008-06-18 Thread Jeff Squyres

Perhaps we did that as a latency optimization...?

George / Brian / Galen -- do you guys know/remember why this was done?

On the surface, it looks like it would be ok to call progress and  
check again to see if it found the match.  Can anyone think of a  
deeper reason not to?



On Jun 17, 2008, at 11:43 AM, Terry Dontje wrote:

I've run into an issue while running hpl where a message has been sent
(in shared memory in this case) and the receiver calls iprobe but doesn't
see said message on the first call to iprobe (even though it is there),
yet does see it on the second call.  Looking at the mca_pml_ob1_iprobe
function and the calls it makes, it looks like it checks the unexpected
queue for matches and, if it doesn't find one, sets the flag to 0 (no
matches), then calls opal_progress and returns.  This seems wrong to me,
since I would expect that the call to opal_progress would likely pull in
the very message the iprobe is waiting for.


Am I correct in my reading of the code?  It seems that maybe some  
sort of check needs to be done after the call to opal_progress in  
mca_pml_ob1_iprobe.
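
In other words, the ordering I think I'm seeing is roughly the following
(a stubbed-out sketch for discussion; these are not the real ob1 symbols,
just stand-ins for the unexpected-queue search and for opal_progress):

#include <stdio.h>
#include <stdbool.h>

/* Toy stand-ins: msg_delivered models "the message is sitting in shared
 * memory", match_unexpected() models the unexpected-queue search, and
 * make_progress() models opal_progress() pulling the message in. */
static bool msg_delivered = false;
static bool match_unexpected(void) { return msg_delivered; }
static void make_progress(void)    { msg_delivered = true; }

/* The ordering as I read mca_pml_ob1_iprobe (paraphrased). */
static void iprobe_as_read(int *flag)
{
    *flag = match_unexpected();  /* 1. search the unexpected queue            */
    make_progress();             /* 2. progress may deliver the message ...   */
}                                /* 3. ... but the result is never re-checked */

int main(void)
{
    int flag;
    iprobe_as_read(&flag); printf("first  iprobe: flag=%d\n", flag);  /* 0 */
    iprobe_as_read(&flag); printf("second iprobe: flag=%d\n", flag);  /* 1 */
    return 0;
}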


Attached is a simple program that shows the issue I am running into:

#include <stdio.h>   /* printf */
#include <unistd.h>  /* sleep */
#include <mpi.h>

int main() {
    int rank, src[2], dst[2], flag = 0;
    int nxfers;
    MPI_Status status;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        for (nxfers = 0; nxfers < 5; nxfers++)
            MPI_Send(src, 2, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (1 == rank) {
        for (nxfers = 0; nxfers < 5; nxfers++) {
            sleep(5);
            flag = 0;
            while (!flag) {
                printf("iprobe...");
                MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
            }
            printf("\n");
            MPI_Recv(dst, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        }
    }
    MPI_Finalize();
}

--td



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] iprobe and opal_progress

2008-06-18 Thread Brian W. Barrett
I'm sure it was a latency optimization, just like the old test behavior.
Personally, I'd call opal_progress blindly, then walk through the queue.
The walk-the-queue, call-opal_progress, walk-the-queue-again approach seems
like too much work for iprobe.  Test, sure.  Iprobe...  eh.
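
To be concrete, the shape I mean is just the following (toy stand-ins for
the queue search and for opal_progress; nothing here is the real ob1 code):

#include <stdio.h>
#include <stdbool.h>

static bool msg_delivered = false;                            /* toy state  */
static bool match_unexpected(void) { return msg_delivered; }  /* queue walk */
static void make_progress(void)    { msg_delivered = true; }  /* progress   */

/* Progress blindly up front, then a single walk of the queue. */
static void iprobe_progress_first(int *flag)
{
    make_progress();
    *flag = match_unexpected();
}

int main(void)
{
    int flag;
    iprobe_progress_first(&flag);
    printf("first iprobe: flag=%d\n", flag);   /* already 1 on the first call */
    return 0;
}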


Brian


On Wed, 18 Jun 2008, Jeff Squyres wrote:


Perhaps we did that as a latency optimization...?

George / Brian / Galen -- do you guys know/remember why this was done?

On the surface, it looks like it would be ok to call progress and check again 
to see if it found the match.  Can anyone think of a deeper reason not to?









Re: [OMPI devel] iprobe and opal_progress

2008-06-18 Thread Terry Dontje

Jeff Squyres wrote:

Perhaps we did that as a latency optimization...?

George / Brian / Galen -- do you guys know/remember why this was done?

On the surface, it looks like it would be ok to call progress and 
check again to see if it found the match.  Can anyone think of a 
deeper reason not to?


If it is ok to check again, my next question is going to be how?
After looking at the code some more, I found that iprobe requests are not
actually queued.  So can I just do another MCA_PML_OB1_RECV_REQUEST_START
on the init'd IPROBE_REQUEST after the call to opal_progress, to force a
search of the unexpected queue, or do I need to FINI the request and
regenerate it?


--td









Re: [OMPI devel] iprobe and opal_progress

2008-06-18 Thread Brian W. Barrett

On Wed, 18 Jun 2008, Terry Dontje wrote:


Jeff Squyres wrote:

Perhaps we did that as a latency optimization...?

George / Brian / Galen -- do you guys know/remember why this was done?

On the surface, it looks like it would be ok to call progress and check 
again to see if it found the match.  Can anyone think of a deeper reason 
not to?


If it is ok to check again, my next question is going to be how?  After
looking at the code some more, I found that iprobe requests are not actually
queued.  So can I just do another MCA_PML_OB1_RECV_REQUEST_START on the
init'd IPROBE_REQUEST after the call to opal_progress, to force a search of
the unexpected queue, or do I need to FINI the request and regenerate it?


I think you'd have to re-init the request at a minimum.  In other words,
just always call opal_progress at the top of iprobe and be done :).


Brian


Re: [OMPI devel] iprobe and opal_progress

2008-06-18 Thread George Bosilca
I kind of remember that we had a discussion about this long ago, and
that we decided to have it this way for latency. Looking at the code now,
it seems way too ugly to me. I think Brian has a point. MPI_Probe and
MPI_Iprobe are MPI functions, and they are expected to make progress all
the time. So calling opal_progress and then doing the probe seems like
the smartest and simplest approach.


However, if you want to do this, then it's better if we do it the right
way. What we have today in the PML OB1 for probe is horribly expensive:
initializing a complete request that will never be used for anything
other than matching is overkill. The only fields that you really need
are the flags and the matching information. How about creating a
request, setting these flags, and then calling the matching directly?
This way we can create a special path for probes, and that will remove
some ifs from the critical path for receives ...
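
Roughly, the shape I have in mind is the following (purely illustrative;
the type and function names are made up for discussion and are not the
actual ob1 symbols):

#include <stdio.h>
#include <stdbool.h>

/* A probe only needs the matching signature plus a "this is a probe"
 * flag, so there is no reason to construct and tear down a full receive
 * request just to ask "is there a match?". */
typedef struct {
    int src, tag, comm_id;   /* the matching information */
    int is_probe;            /* the only flag such a request needs */
} probe_match_t;

/* Placeholder for handing the signature straight to the matching logic. */
static bool match_unexpected_queue(const probe_match_t *m, int *count_out)
{
    (void)m;
    *count_out = 0;
    return false;
}

static int probe_fast_path(int src, int tag, int comm_id,
                           int *flag, int *count)
{
    probe_match_t m = { src, tag, comm_id, 1 };   /* no request init/fini */
    *flag = match_unexpected_queue(&m, count);
    return 0;
}

int main(void)
{
    int flag, count;
    probe_fast_path(0, 0, 0, &flag, &count);
    printf("match: %d (count %d)\n", flag, count);
    return 0;
}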


  george.

On Jun 18, 2008, at 3:57 PM, Brian W. Barrett wrote:




I think you'd have to re-init the request at a minimum.  In other words,
just always call opal_progress at the top of iprobe and be done :).


Brian






Re: [OMPI devel] iprobe and opal_progress

2008-06-18 Thread Terry Dontje
Ok, I'll see if I can figure out the below.  Though is this really 
something that can be used in both MPI_Iprobe and MPI_Probe?  One other 
question,  is the use of opal_progress in MPI_Iprobe the right thing to 
do?  Is there something a little lighter weight (bml_progress maybe)?


--td

George Bosilca wrote:
I kind of remember that we had a discussion about this long ago, and
that we decided to have it this way for latency. Looking at the code now,
it seems way too ugly to me. I think Brian has a point. MPI_Probe and
MPI_Iprobe are MPI functions, and they are expected to make progress all
the time. So calling opal_progress and then doing the probe seems like
the smartest and simplest approach.


However, if you want to do this, then it's better if we do it the right
way. What we have today in the PML OB1 for probe is horribly expensive:
initializing a complete request that will never be used for anything
other than matching is overkill. The only fields that you really need
are the flags and the matching information. How about creating a
request, setting these flags, and then calling the matching directly?
This way we can create a special path for probes, and that will remove
some ifs from the critical path for receives ...


  george.





Re: [OMPI devel] iprobe and opal_progress

2008-06-18 Thread George Bosilca
No, please call opal_progress. Otherwise, you will create different
behavior based on the available networks, basically the networks that
register a socket and those that don't. It might not be a big deal today
(except if the user calls MPI_Iprobe to progress communications), as TCP
is the only network that uses file descriptors, but it will be in the
case of multithreaded applications.


  george.

On Jun 18, 2008, at 4:25 PM, Terry Dontje wrote:

Ok, I'll see if I can figure out the below.  Though is this really  
something that can be used in both MPI_Iprobe and MPI_Probe?  One  
other question,  is the use of opal_progress in MPI_Iprobe the right  
thing to do?  Is there something a little lighter weight  
(bml_progress maybe)?


--td







[OMPI devel] OpenMPI multiple ethernet questions ...

2008-06-18 Thread Muhammad Atif
Hi again... I was on a break from the Xensocket stuff. This time, some
general questions...

Forgive me for the question; it's a quick one and related to some of my
development work on Xen. I will explain the rationale after the question.
What if I have multiple Ethernet cards (say 5) on two of my quad-core
machines?  The IP addresses (and the subnets, of course) are:

        Machine A    Machine B
  eth0  y.y.1.a      y.y.1.z
  eth1  y.y.4.b      y.y.4.y
  eth2  y.y.4.c      ...
  eth3  y.y.4.d      ...
  ...
Now, from the FAQ (item 9: "How does Open MPI know which TCP addresses are
routable to each other?") it is clear that if I want to run a job over
multiple Ethernet interfaces, I can use --mca btl_tcp_if_include eth0,eth1.
This will run the job across the two subnets, utilizing both Ethernet
cards. Is it doing some sort of load balancing, or some round-robin
mechanism? What part of the code is responsible for this?
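
For reference, the kind of command line I mean is something like the
following (the hostfile and binary names are just placeholders):

  mpirun -np 8 --hostfile myhosts \
      --mca btl_tcp_if_include eth0,eth1 ./my_app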

Now, what if I want to run the job with --mca btl_tcp_if_include
eth1,eth2,eth3,eth4? Notice that all of these ethNs are on the same subnet.
Even in the FAQ (which answers most of our lame questions) it's not
entirely clear how communication will be done.  Each process will have
tcp_num_btls equal to the number of interfaces, but then what? Is it some
sort of load balancing or similar that just isn't visible in tcpdump?

Another related question: what if I want to run an 8-process job (on the
2x4 cluster) and pin each process to a network interface? Open MPI, to my
understanding, does not give any control over allocating an IP to a
process (like MPICH does), or is there some magical --mca thingie? I think
the only way to go is adding routing table entries... am I thinking in the
right direction? If so, the performance of my boxes decreases when I try
to force the routing (obviously something terrible with my configuration).

It's related to my Xen (virtualization) work. We are in a scenario where
all the virtual machines on one Xen host need to use eth2 (which is
virtualized but optimized for intra-domain communication), while for
communication outside the physical machine (i.e., to other Xen hosts) we
want to use eth1. Is 'route add' the only way again?
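
What I have been trying with the routing tables is roughly along these
lines (addresses follow the y.y notation above; illustrative, not my
exact setup):

  # on machine A: reach machine B's y.y.4.y address only via eth1
  route add -host y.y.4.y dev eth1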

I will ask Xensocket BTL related questions later :)

Best Regards and thanks in advance,
Muhammad Atif





Re: [OMPI devel] iprobe and opal_progress

2008-06-18 Thread Terry Dontje
Ok; however, I've seen a 40-150us hit from calling opal_progress, which is
why I was hoping for something lighter weight.
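
For what it's worth, a loop along these lines shows the per-call cost
(a rough sketch, not the exact code behind those numbers):

#include <stdio.h>
#include <mpi.h>

/* Measure the average cost of MPI_Iprobe (and the progress it triggers).
 * Run with 2 ranks; rank 1 probes a source that never sends. */
int main(int argc, char **argv)
{
    const int iters = 100000;
    int rank, flag, i;
    double t0, t1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (1 == rank) {
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++) {
            MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
        }
        t1 = MPI_Wtime();
        printf("avg MPI_Iprobe: %.2f us\n", 1.0e6 * (t1 - t0) / iters);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}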


--td
George Bosilca wrote:
No, please call opal_progress. Otherwise, you will create different
behavior based on the available networks, basically the networks that
register a socket and those that don't. It might not be a big deal today
(except if the user calls MPI_Iprobe to progress communications), as TCP
is the only network that uses file descriptors, but it will be in the
case of multithreaded applications.


  george.





[OMPI devel] multiple GigE interfaces...

2008-06-18 Thread Muhammad Atif


Hi again... I was on a break from the Xensocket stuff. This time, some
general questions...

Forgive me for the question; it's a quick one and related to some of my
development work on Xen. I will explain the rationale after the question.
What if I have multiple Ethernet cards (say 5) on two of my quad-core
machines?  The IP addresses (and the subnets, of course) are:

        Machine A    Machine B
  eth0  y.y.1.a      y.y.1.z
  eth1  y.y.4.b      y.y.4.y
  eth2  y.y.4.c      ...
  eth3  y.y.4.d      ...
  ...

Now, from the FAQ and some emails on the users list, it is clear that if I
want to run a job over multiple Ethernet interfaces, I can use
--mca btl_tcp_if_include eth0,eth1. This will run the job across the two
subnets, utilizing both Ethernet cards. Is it doing some sort of load
balancing, or some round-robin mechanism? What part of the code is
responsible for this?

Now, what if I want to run the job with --mca btl_tcp_if_include
eth1,eth2,eth3,eth4? Notice that all of these ethNs are on the same subnet.
Even in the FAQ (which answers most of our lame questions) it's not
entirely clear how communication will be done.  Each process will have
tcp_num_btls equal to the number of interfaces, but then what? Is it some
sort of load balancing or similar that just isn't visible in tcpdump?

Another related question: what if I want to run an 8-process job (on the
2x4 cluster) and pin each process to a network interface? Open MPI, to my
understanding, does not give any control over allocating an IP to a
process (like MPICH does), or is there some magical --mca thingie? I think
the only way to go is adding routing table entries... am I thinking in the
right direction? If so, the performance of my boxes decreases when I try
to force the routing (obviously something terrible with my configuration).

It's related to my Xen (virtualization) work. We are in a scenario where
all the virtual machines on one Xen host need to use eth2 (which is
virtualized but optimized for intra-domain communication), while for
communication outside the physical machine (i.e., to other Xen hosts) we
want to use eth1. Is 'route add' the only way again?

I will ask Xensocket BTL related questions later :)

Best Regards and thanks in advance,
Muhammad Atif

PS: Sorry if you receive multiple messages. I think my previous message did 
not go through.