Re: [OMPI devel] iprobe and opal_progress
Perhaps we did that as a latency optimization...?

George / Brian / Galen -- do you guys know/remember why this was done?

On the surface, it looks like it would be ok to call progress and check again to see if it found the match. Can anyone think of a deeper reason not to?

On Jun 17, 2008, at 11:43 AM, Terry Dontje wrote:

> I've run into an issue while running hpl where a message has been sent (in shared memory in this case) and the receiver calls iprobe but doesn't see said message on the first call to iprobe (even though it is there), but does see it on the second call to iprobe. Looking at the mca_pml_ob1_iprobe function and the calls it makes, it looks like it checks the unexpected queue for matches and, if it doesn't find one, sets the flag to 0 (no matches), then calls opal_progress and returns. This seems wrong to me, since I would expect that the call to opal_progress probably would pull in the message that the iprobe is waiting for.
>
> Am I correct in my reading of the code? It seems that maybe some sort of check needs to be done after the call to opal_progress in mca_pml_ob1_iprobe.
>
> Attached is a simple program that shows the issue I am running into:
>
>     /* headers required by the calls below */
>     #include <mpi.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     int main()
>     {
>         int rank, src[2] = {0, 0}, dst[2], flag = 0;
>         int nxfers;
>         MPI_Status status;
>
>         MPI_Init(NULL, NULL);
>         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>         if (0 == rank) {
>             /* rank 0 sends five small messages to rank 1 */
>             for (nxfers = 0; nxfers < 5; nxfers++)
>                 MPI_Send(src, 2, MPI_INT, 1, 0, MPI_COMM_WORLD);
>         } else if (1 == rank) {
>             for (nxfers = 0; nxfers < 5; nxfers++) {
>                 sleep(5);  /* give the message time to arrive */
>                 flag = 0;
>                 /* one "iprobe..." is printed per MPI_Iprobe call */
>                 while (!flag) {
>                     printf("iprobe...");
>                     MPI_Iprobe(0, 0, MPI_COMM_WORLD, &flag, &status);
>                 }
>                 printf("\n");
>                 MPI_Recv(dst, 2, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
>             }
>         }
>         MPI_Finalize();
>         return 0;
>     }
>
> --td
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] iprobe and opal_progress
I'm sure it was a latency optimization, just like the old test behavior. Personally, I'd call opal_progress blindly, then walk through the queue. Doing the "walk the queue, call opal_progress, walk the queue" thing seems like too much work for iprobe. Test, sure. Iprobe... eh.

Brian

On Wed, 18 Jun 2008, Jeff Squyres wrote:

> Perhaps we did that as a latency optimization...? George / Brian / Galen -- do you guys know/remember why this was done? On the surface, it looks like it would be ok to call progress and check again to see if it found the match. Can anyone think of a deeper reason not to?
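To make the ordering question concrete, here is a toy model in C. It is not Open MPI code: toy_pml_t, toy_progress and the two toy_iprobe_* functions are made-up names, and the "wire" counter is only a stand-in for messages that have arrived but that a progress call has not yet moved onto the unexpected queue. Under those assumptions, the sketch shows why checking before progressing misses a message on the first call, while progressing first finds it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy state (hypothetical, not an Open MPI structure). */
typedef struct {
    int wire;        /* messages that arrived but were not progressed yet */
    int unexpected;  /* messages visible to a probe */
} toy_pml_t;

/* Stand-in for opal_progress(): pull arrived messages into the queue. */
static void toy_progress(toy_pml_t *p)
{
    assert(p != NULL);
    p->unexpected += p->wire;
    p->wire = 0;
}

/* Ordering described in the thread: check the queue, and if nothing
 * matches, report "no match" and only then call progress. */
static bool toy_iprobe_check_then_progress(toy_pml_t *p)
{
    bool flag = p->unexpected > 0;
    if (!flag)
        toy_progress(p);  /* the result is only seen by the next call */
    return flag;
}

/* Ordering Brian suggests: progress first, then check the queue. */
static bool toy_iprobe_progress_then_check(toy_pml_t *p)
{
    toy_progress(p);
    return p->unexpected > 0;
}
```

With one message waiting on the "wire", the first function returns false on the first call and true on the second, which matches the behavior Terry reports; the second function returns true on the first call.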
Re: [OMPI devel] iprobe and opal_progress
Jeff Squyres wrote:

> Perhaps we did that as a latency optimization...? George / Brian / Galen -- do you guys know/remember why this was done? On the surface, it looks like it would be ok to call progress and check again to see if it found the match. Can anyone think of a deeper reason not to?

If it is ok to check again, my next question is going to be: how? After looking at the code some more, I found that iprobe requests are not actually queued. So can I just do another MCA_PML_OB1_RECV_REQUEST_START on the init'd IPROBE_REQUEST after the call to opal_progress to force a search of the unexpected queue, or do I need to FINI the request and regenerate it?

--td
Re: [OMPI devel] iprobe and opal_progress
On Wed, 18 Jun 2008, Terry Dontje wrote:

> If it is ok to check again, my next question is going to be how? Because after looking at the code some more I found iprobe requests are not actually queued. So can I just do another MCA_PML_OB1_RECV_REQUEST_START on the init'd IPROBE_REQUEST after the call opal_progress to force a search on the unexpected queue or do I need to FINI the request and regenerate it again?

I think you'd have to re-init the request at a minimum. In other words, just always call opal_progress at the top of iprobe and be done :).

Brian
Re: [OMPI devel] iprobe and opal_progress
I kind of remember that we had a discussion about this long ago, and that we decided to have it this way for latency. Now, looking at the code, it seems way too ugly to me. I think Brian has a point. MPI_Probe and MPI_Iprobe are MPI functions, and they are expected to make progress all the time. So calling opal_progress and then doing the probe seems like the smartest and simplest approach.

However, if you want to do this, then it's better if we do it the right way. What we have today in the PML OB1 for probe is horribly expensive. Initializing a complete request that will never be used for anything other than matching is overkill. The only fields that you really need are the flags and the matching information. How about creating a request, setting these flags, and then calling the matching directly? This way, we can create a special path for probes, and this will remove some ifs from the critical path for receives...

george.

On Jun 18, 2008, at 3:57 PM, Brian W. Barrett wrote:

> I think you'd have to re-init the request at a minimum. In other words, just always call opal_progress at the top of iprobe and be done :).
>
> Brian
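As an illustration of the "only the matching information is needed" idea, here is a minimal sketch in C. Everything in it is hypothetical (toy_hdr_t, toy_probe_t, toy_match, toy_probe_queue, and the TOY_ANY_* wildcards, which merely play the role of MPI_ANY_SOURCE and MPI_ANY_TAG); it is not how OB1 represents requests or its unexpected queue, and it leaves out the communicator part of matching. It only shows that a probe can be answered by walking a queue of message headers with a two-field descriptor, without building a full receive request and without consuming the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wildcard values for this sketch only. */
#define TOY_ANY_SOURCE (-1)
#define TOY_ANY_TAG    (-1)

/* Header of a message sitting on a toy unexpected queue. */
typedef struct {
    int src;
    int tag;
    int len;
} toy_hdr_t;

/* A probe descriptor carrying nothing but matching information. */
typedef struct {
    int src;
    int tag;
} toy_probe_t;

/* True if the header satisfies the probe, honoring wildcards. */
static bool toy_match(const toy_probe_t *p, const toy_hdr_t *h)
{
    bool src_ok = (p->src == TOY_ANY_SOURCE) || (p->src == h->src);
    bool tag_ok = (p->tag == TOY_ANY_TAG) || (p->tag == h->tag);
    return src_ok && tag_ok;
}

/* Return the index of the first matching header, or -1 if none.
 * The queue is left untouched: a probe does not consume the message. */
static int toy_probe_queue(const toy_probe_t *p,
                           const toy_hdr_t *queue, size_t n)
{
    assert(p != NULL && (queue != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        if (toy_match(p, &queue[i]))
            return (int)i;
    }
    return -1;
}
```

Walking the queue in order and returning the first hit reflects the usual expectation that the earliest matching message is the one reported; a real implementation would also fill in the status from the matched header.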
Re: [OMPI devel] iprobe and opal_progress
Ok, I'll see if I can figure out the below. Though is this really something that can be used in both MPI_Iprobe and MPI_Probe?

One other question: is the use of opal_progress in MPI_Iprobe the right thing to do? Is there something a little lighter weight (bml_progress maybe)?

--td

George Bosilca wrote:

> However, if you want to do this, then it's better if we do it in the right way. What we have today in the PML OB1 for probe is horribly expensive. Initializing a complete request that will never be used for anything other than matching is overkill. The only fields that you really need are the flags and the matching information. How about creating a request, setting these flags, and then calling the matching directly? This way, we can create a special path for probes, and this will remove some ifs from the critical path for receives...
Re: [OMPI devel] iprobe and opal_progress
No, please call opal_progress. Otherwise, you will create different behavior based on the available networks: basically, the networks that register a socket and those that don't. It might not be a big deal today (except if the user calls MPI_Iprobe to progress communications), as TCP is the only network that uses file descriptors, but it will be in the case of multithreaded applications.

george.

On Jun 18, 2008, at 4:25 PM, Terry Dontje wrote:

> One other question, is the use of opal_progress in MPI_Iprobe the right thing to do? Is there something a little lighter weight (bml_progress maybe)?
[OMPI devel] OpenMPI multiple ethernet questions ...
Hi again... I was on a break from the Xensocket stuff. This time, some general questions.

Forgive me for the question; it's a quick one and related to some of my development work on Xen. I will explain the rationale after the question.

What if I have multiple Ethernet cards (say 5) on two of my quad-core machines? The IP addresses (and the subnets, of course) are:

         Machine A   Machine B
  eth0   y.y.1.a     y.y.1.z
  eth1   y.y.4.b     y.y.4.y
  eth2   y.y.4.c     ...
  eth3   y.y.4.d     ...
  ...

Now, from the FAQ (refer to 9: "How does Open MPI know which TCP addresses are routable to each other?") it is clear that if I want to run a job on multiple Ethernet interfaces, I can use --mca btl_tcp_if_include eth0,eth1. This will run the job on two of the subnets, utilizing both Ethernet cards. Is it doing some sort of load balancing, or some round-robin mechanism? What part of the code is responsible for this work?

Now what if I want to run the job with --mca btl_tcp_if_include eth1,eth2,eth3,eth4? Notice that all of these ethNs are on the same subnet. Even in the FAQ (which mostly answers our lame questions) it's not entirely clear how communication will be done. Each process will have tcp_num_btls equal to the number of interfaces, but then what? Is it some sort of load balancing or similar that is not visible in tcpdump?

Another related question: what if I want to run an 8-process job (on a 2x4 cluster) and want to pin a process to a network interface? OpenMPI, to my understanding, does not give any control over allocating an IP to a process (like MPICH), or is there some magical --mca thingie? I think the only way to go is adding routing tables... am I thinking in the right direction? If yes, then the performance of my boxes decreases when I try to force the routing (obviously something is terribly wrong with my configuration).

It's related to my Xen (virtualization) work. We are in a scenario where all the virtual machines on one Xen host need to use eth2 (which is virtualized but optimized for intra-domain communication), and for communication outside the physical machine (i.e. to other Xen hosts) we want to use eth1. Is 'route add' the only way again?

I will ask Xensocket BTL related questions later :)

Best Regards and thanks in advance,
Muhammad Atif
Re: [OMPI devel] iprobe and opal_progress
Ok. However, I've seen a 40-150us hit from calling opal_progress, which is why I was hoping for something lighter weight.

--td

George Bosilca wrote:

> No, please call opal_progress. Otherwise, you will create different behavior based on the available networks: basically, the networks that register a socket and those that don't. It might not be a big deal today (except if the user calls MPI_Iprobe to progress communications), as TCP is the only network that uses file descriptors, but it will be in the case of multithreaded applications.
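For anyone wanting to put a number on a per-call cost like the one mentioned above, here is one possible way to measure it, sketched in plain C. The names toy_fn_t, toy_avg_call_usec and toy_noop are invented for this example, and toy_noop is only a placeholder for the call under test; inside an MPI program one would more naturally wrap the MPI_Iprobe call and use MPI_Wtime as the timer. This is not how the 40-150us figure in the thread was obtained.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: the operation whose cost is measured. */
typedef void (*toy_fn_t)(void);

/* Average wall-clock cost of one call to fn, in microseconds,
 * measured over iters back-to-back calls with C11 timespec_get(). */
static double toy_avg_call_usec(toy_fn_t fn, int iters)
{
    struct timespec t0, t1;

    assert(fn != NULL && iters > 0);
    timespec_get(&t0, TIME_UTC);
    for (int i = 0; i < iters; i++)
        fn();
    timespec_get(&t1, TIME_UTC);

    double nsec = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                + (double)(t1.tv_nsec - t0.tv_nsec);
    return nsec / 1e3 / (double)iters;
}

/* Placeholder for the call under test. */
static void toy_noop(void)
{
}
```

Averaging over many iterations hides the timer's own overhead and resolution; comparing the average with and without the progress call gives a rough estimate of the extra cost per probe.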