George Bosilca wrote:
Terry,
We had a discussion about this a few weeks ago. I have a version that
modifies this behavior (SM progress will not return as long as there are
pending acks). There was no benefit from doing so (even if one might
think that fewer calls to opal_progress would improve performance).
But my concern is not the raw performance of MPI_Iprobe in this case but
more the interaction between MPI and an application. The concern is
that if it takes two MPI_Iprobes to get to the real message (instead of
one), could this induce a synchronization delay in an application? That
is, by not receiving the "real" message in the first MPI_Iprobe, the
application may decide to do other work while the other processes are
potentially blocked waiting for it to do some communication.
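For illustration, here is a sketch of the kind of polling loop in
question (the handle_message/do_other_work helpers are hypothetical
stand-ins for application code, and the 256-byte buffer just matches the
small-message case discussed here):

/* Sketch of the polling pattern in question; the helpers are
 * hypothetical stand-ins for real application code. */
#include <mpi.h>

static void do_other_work(void)
{
    /* e.g. local computation the rank falls back to */
}

static void handle_message(MPI_Status *status, MPI_Comm comm)
{
    char buf[256];   /* assumes small messages, as in the case above */
    int count = 0;
    MPI_Get_count(status, MPI_CHAR, &count);
    MPI_Recv(buf, count, MPI_CHAR, status->MPI_SOURCE,
             status->MPI_TAG, comm, MPI_STATUS_IGNORE);
}

void poll_and_work(MPI_Comm comm)
{
    int flag = 0;
    MPI_Status status;

    /* Ask MPI whether a message is already waiting. */
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (flag) {
        /* The message is visible: receive and handle it right away. */
        handle_message(&status, comm);
    } else {
        /* Nothing visible yet, so go do other (possibly long) work,
         * even though the real message may already be sitting in the
         * shared-memory fifo behind a pending ack. */
        do_other_work();
    }
}

If the first MPI_Iprobe comes back empty only because progress stopped
at an ack, this rank wanders off into do_other_work() while its peer may
be blocked waiting for the matching communication.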
In fact TCP has the potential to exhibit the same behavior. However,
after each successful poll TCP empties the socket, so it might read
more than one message. Because we have to empty the temporary buffer,
we interpret most of the messages inside it, and this is why TCP
exhibits a different behavior.
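To make the contrast concrete, here is a generic sketch of that
drain-on-poll style of progress (purely illustrative: the length-prefix
framing and the deliver() callback are invented, this is not the actual
TCP BTL code):

/* Generic "drain on each poll" sketch: read whatever the non-blocking
 * socket currently has, then dispatch every complete length-prefixed
 * message found in the buffer, not just the first one. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static char   buf[65536];
static size_t buffered = 0;

/* Hypothetical upper-layer callback, one call per complete message. */
static void deliver(const char *msg, uint32_t len) { (void) msg; (void) len; }

void drain_socket(int fd)
{
    /* Pull in everything the socket has right now. */
    for (;;) {
        if (buffered == sizeof(buf))
            break;                        /* buffer full: dispatch first */
        ssize_t n = read(fd, buf + buffered, sizeof(buf) - buffered);
        if (n <= 0)
            break;                        /* closed, error, or would block */
        buffered += (size_t) n;
    }

    /* Interpret every complete message sitting in the buffer. */
    size_t off = 0;
    while (buffered - off >= sizeof(uint32_t)) {
        uint32_t len;
        memcpy(&len, buf + off, sizeof(len));
        if (buffered - off - sizeof(len) < len)
            break;                        /* partial message: wait for more */
        deliver(buf + off + sizeof(len), len);
        off += sizeof(len) + len;
    }
    memmove(buf, buf + off, buffered - off);
    buffered -= off;
}

The point of the sketch is only that a single poll can hand several
messages (and any acknowledgements mixed in with them) to the upper
layer, which is why the TCP path does not hide a real message behind a
control message the way the SM fifo can.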
I guess this difference in behavior between the SM BTL and the TCP BTL
is what disturbs me. Does processing just one fifo entry per sm_progress
call per connection buy us any performance? Would draining the acks be
detrimental to performance? Wouldn't delivering the messages at the time
they arrive meet the rule of obviousness for application writers?
I know there is a slippery slope here of saying that, OK, once you've
read one message you should keep reading until there are none left on
the fifo. I believe that is really debatable and could go either way
depending on the application. But ack messages are not visible to the
users, which is why I was only asking about draining the ack packets.
--td
george.
On Jun 19, 2008, at 2:16 PM, Terry Dontje wrote:
Galen, George and others that might have SM BTL interest.
In my quest of looking at MPI_Iprobe performance I found what I think
is an issue. If an application that is using the SM BTL does a small
message send (<= 256 bytes) followed by an MPI_Iprobe, the
mca_btl_sm_component progress function that is eventually called as a
result of opal_progress will receive an ack message from its own send
and then return. The net effect is that the real message, which sits
behind the ack message, doesn't get read until a second MPI_Iprobe is
made.
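A minimal sketch of a two-rank test for this (the sleep, tags, and
message sizes are arbitrary illustration values; on a transport that is
not affected, the reply would normally be visible on the first probe):

/* Sketch: rank 0 does a small send, then counts how many MPI_Iprobe
 * calls it takes before rank 1's reply becomes visible. */
#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank;
    char msg[64] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        int flag = 0, probes = 0;
        MPI_Status status;

        /* Small (<= 256 byte) send whose ack comes back over the fifo. */
        MPI_Send(msg, sizeof(msg), MPI_CHAR, 1, 1, MPI_COMM_WORLD);
        sleep(1);   /* crude way to let rank 1's reply land in the fifo */

        do {        /* count the probes needed to see the reply */
            probes++;
            MPI_Iprobe(1, 2, MPI_COMM_WORLD, &flag, &status);
        } while (!flag);
        printf("reply visible after %d MPI_Iprobe call(s)\n", probes);

        MPI_Recv(msg, sizeof(msg), MPI_CHAR, 1, 2, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else if (1 == rank) {
        MPI_Recv(msg, sizeof(msg), MPI_CHAR, 0, 1, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 2, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}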
It seems to me that mca_btl_sm_component should read all ack messages
from a particular fifo until it either finds a real send fragment or
there are no more messages on the fifo. Otherwise, we are forcing calls
like MPI_Iprobe to not return messages that are really there. I am not
sure about IB, but I know that the TCP BTL does not show this issue
(which doesn't surprise me, since I imagine the BTL relies on TCP to
handle this type of protocol work).
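To be concrete, here is a rough sketch of the control flow I have in
mind (the fragment/fifo types and the helper bodies below are
placeholders, not the actual sm BTL data structures or functions):

/* Sketch: drain any number of acks in one progress call, but still
 * deliver at most one real send fragment per call.  All types and
 * helpers here are placeholders for illustration only. */
#include <stddef.h>

typedef enum { FRAG_ACK, FRAG_SEND } frag_type_t;

typedef struct frag {
    frag_type_t  type;
    struct frag *next;
    /* ... header, payload, etc. ... */
} frag_t;

typedef struct {
    frag_t *head;                 /* placeholder fifo: a simple list */
} fifo_t;

static frag_t *fifo_pop(fifo_t *fifo)
{
    frag_t *frag = fifo->head;
    if (NULL != frag)
        fifo->head = frag->next;
    return frag;
}

/* Placeholder actions: return the fragment to its sender / hand the
 * data up to the next layer. */
static void process_ack(frag_t *frag)  { (void) frag; }
static void deliver_frag(frag_t *frag) { (void) frag; }

int progress_one_connection(fifo_t *fifo)
{
    frag_t *frag;
    int delivered = 0;

    while (NULL != (frag = fifo_pop(fifo))) {
        if (FRAG_ACK == frag->type) {
            /* Acks are invisible to the user: keep draining them so
             * they cannot hide a real message from the next probe. */
            process_ack(frag);
            continue;
        }
        deliver_frag(frag);
        delivered = 1;
        break;                    /* still one real message per call */
    }
    return delivered;
}

Note that real send fragments still end the loop, so this only swallows
the user-invisible acks rather than draining everything on the fifo.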
Before I go munging with the code I wanted to make sure I am not
overlooking something here. One concern: if I change the code to drain
all the ack messages, is that going to disrupt performance elsewhere?
--td
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel