Sorry- "rendezvous" is the wrong terminology here for what is happening
within bmi_tcp at the individual message level. It doesn't implicitly
exchange control messages before putting each buffer on the wire.
bmi_tcp will send any size message without using control messages to
handshake within bmi_tcp. It assumes that someone at a higher level
has already performed handshaking and agreed that both sides are going
to post the appropriate matching operations.
Unexpected messages are the only things BMI is permitted to send without
a guarantee that the other side is going to post a recv.
The difference between small and large "normal" messages in bmi_tcp is
not that larger ones will wait to transmit. Both are basically sent
the same way. The difference is on the recv side. Small messages are
allowed to be temporarily buffered by the receiver until a matching recv
is posted, while large messages will not be read into memory until a
matching receive buffer is posted. So the actual network transfer will
not _complete_ until both sides have posted, but it can definitely begin
before the recv is posted.
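To make the recv-side difference concrete, here is a toy sketch (Python, not actual bmi_tcp code; the eager cutoff value and all names are made up for illustration) of how a receiver can buffer small messages that arrive before a matching recv, while leaving large ones sitting in the socket:

```python
# Toy model of receive-side message handling, loosely patterned on the
# behavior described above. EAGER_LIMIT is an assumed cutoff, not the
# real bmi_tcp value.
EAGER_LIMIT = 16 * 1024

class Receiver:
    def __init__(self):
        self.eager_buffers = {}  # tag -> payload buffered before recv posted
        self.posted_recvs = {}   # tag -> expected size

    def post_recv(self, tag, size):
        # If a small message already arrived, deliver it immediately.
        if tag in self.eager_buffers:
            return self.eager_buffers.pop(tag)
        self.posted_recvs[tag] = size
        return None

    def on_socket_data(self, tag, payload):
        """Called when a message header + payload reach the socket."""
        if tag in self.posted_recvs:
            del self.posted_recvs[tag]
            return ("delivered", payload)
        if len(payload) <= EAGER_LIMIT:
            # Small: copy into a temporary buffer; a recv can match later.
            self.eager_buffers[tag] = payload
            return ("buffered", None)
        # Large: left in the socket; nothing behind it can be read
        # until the matching recv is posted.
        return ("stalled", None)
```

The transfer can begin before the recv is posted in both cases; only the large case refuses to complete (or let anything past it) without a posted buffer.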
Flows have the same sort of restriction as large buffers do in bmi_tcp.
When a flow is posted, it does not do any handshaking to make sure
both sides are ready to transmit before moving data. It assumes that
the request protocol has already sorted that out, so it just starts
transmitting.
I think that is what's leading to the problem here- the client has been
told to proceed with the flow before waiting to make sure that the
server is also ready to transmit a flow.
Once upon a time sys-io.sm did wait for write acks before starting write
flows, but that was changed at some point to try to improve performance.
We just didn't notice the case that it breaks for until now.
-Phil
Rob Ross wrote:
There's a fundamental issue here that I don't quite get: if we're in
rendezvous mode, why is there data on the wire if we aren't ready to
receive it? The whole point of rendezvous mode is to *not* send the data
until the matching receive has been posted.
What am I missing?
Thanks,
Rob
Phil Carns wrote:
Ok, I think I _might_ see what the problem is with the BMI messaging.
I haven't 100% confirmed yet, but it looks like we have the following
scenario:
On the client side:
--------------------
- pvfs2-client-core starts an I/O operation (write) to server X
- a send (for the request) is posted, which is a small buffer
- the flow is posted before an ack is received
- the flow itself posts another send for data, which is a large buffer
- ...
A quick note- I think the above is a performance
optimization; we try to go ahead and get the flow going before
receiving a positive ack from the server. It will be canceled if we
get a negative ack (or fail to get an ack altogether)
- while the above is in progress, pvfs2-client-core starts another
write operation to server X (from another application that is hitting
the same server)
- a send for this second request is posted
- another flow is posted before an ack is received
- depending on the timing, it may manage to post a send for data as
well, which is another large buffer
- this traffic is interleaved on the same socket as is being used
for the first flow, which is still running at this point
On the server side:
--------------------
- the first I/O request arrives
- it gets past the request scheduler
- a flow is started and receives the first (large) data buffer
- a different request for the same handle arrives
- getattr would be a good example, could be from any client
- this getattr gets queued in the request scheduler behind the write
- the second I/O request arrives
- it gets queued behind the getattr in the request scheduler
At this point on the server side, we have a flow in progress that is
waiting on a data buffer. However, the next message is for a
different flow (the tags don't match). Since this message is
relatively large (256K), it is in rendezvous mode within bmi_tcp and
cannot be pulled out of the socket until a matching receive is
posted. The flow that is expected to post that receive is not running
yet because the second I/O request is stuck in the scheduler.
... so we have a deadlock. The socket is filled with data that the
server isn't allowed to recv yet, and the data that it really needs
next is stuck behind it.
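The head-of-line blocking can be shown with a small simulation (Python, purely illustrative; the tags, sizes, and eager cutoff are made up, and the socket is modeled as a simple in-order FIFO):

```python
# Toy model of the deadlock: one socket carrying two interleaved flows,
# where the server has only posted receives for flow 1 because flow 2's
# request is still stuck in the request scheduler.
from collections import deque

EAGER_LIMIT = 16 * 1024  # assumed eager/rendezvous-style cutoff

def drain(socket_fifo, posted_tags):
    """Pull what we can from an in-order socket; return delivered tags."""
    delivered = []
    while socket_fifo:
        tag, size = socket_fifo[0]
        if tag in posted_tags or size <= EAGER_LIMIT:
            socket_fifo.popleft()
            delivered.append(tag)
        else:
            # Large message with no matching recv posted: everything
            # behind it in the stream is stuck, including flow 1's
            # next buffer.
            break
    return delivered

socket_fifo = deque([
    (1, 256 * 1024),  # flow 1, first data buffer (recv is posted)
    (2, 256 * 1024),  # flow 2, large data buffer (no recv posted)
    (1, 256 * 1024),  # flow 1's next buffer: unreachable -> deadlock
])
```

Draining with only tag 1 posted delivers the first buffer and then stalls on flow 2's data, leaving flow 1's next buffer trapped behind it.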
I'm not sure I described that very well. At a high level we
have two flows sharing the same socket. The client started both of
them and the messages got interleaved. The server only started one of
them, but is now stuck because it can't deal with data arriving for
the second one.
I am going to try to find a brute force way to serialize I/O from each
pvfs2-client-core just to see if that solves the problem (maybe only
allowing one buffer between pvfs2-client-core and kernel, rather than
5). If that does look like it fixed the problem, then we need a more
elegant solution. Maybe waiting for acks before starting flows, or
just somehow serializing flows that share sockets.
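The second option mentioned above- serializing flows that share a socket- could look something like this sketch (Python, hypothetical; the class and method names are invented, not PVFS2 code):

```python
# Hypothetical per-socket flow serialization: a second flow posted on
# the same socket waits until the first one finishes, so its data never
# enters the stream ahead of the active flow's buffers.
from collections import deque, defaultdict

class PerSocketFlowQueue:
    def __init__(self):
        self.active = {}                   # socket -> running flow id
        self.waiting = defaultdict(deque)  # socket -> queued flow ids

    def post_flow(self, sock, flow_id):
        if sock not in self.active:
            self.active[sock] = flow_id
            return "started"
        self.waiting[sock].append(flow_id)
        return "queued"

    def flow_done(self, sock):
        # Start the next queued flow on this socket, if any.
        if self.waiting[sock]:
            self.active[sock] = self.waiting[sock].popleft()
            return self.active[sock]
        del self.active[sock]
        return None
```

Waiting for acks before starting flows would solve the same problem from the other direction, at some cost in latency.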
-Phil
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers