There's a fundamental issue here that I don't quite get: if we're in
rendezvous mode, why is there data on the wire if we aren't ready to
receive it? The whole point of rendezvous mode is to *not* send the data
until the matching receive has been posted.
What am I missing?
Thanks,
Rob
Phil Carns wrote:
Ok, I think I _might_ see what the problem is with the BMI messaging.
I haven't 100% confirmed yet, but it looks like we have the following
scenario:
On the client side:
--------------------
- pvfs2-client-core starts an I/O operation (write) to server X
- a send (for the request) is posted, which is a small buffer
- the flow is posted before an ack is received
- the flow itself posts another send for data, which is a large buffer
- ...
A quick note: I think the above is a performance optimization; we try
to go ahead and get the flow going before receiving a positive ack from
the server. It will be canceled if we get a negative ack (or fail to
get an ack altogether).
- while the above is in progress, pvfs2-client-core starts another write
operation to server X (from another application that is hitting the same
server)
- a send for this second request is posted
- another flow is posted before an ack is received
- depending on the timing, it may manage to post a send for data as
well, which is another large buffer
- this traffic is interleaved on the same socket as is being used for
the first flow, which is still running at this point
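The interleaving above can be sketched with a toy model (a hedged
sketch only; the names here are hypothetical and not the actual BMI or
flow APIs). Two operations post tagged sends onto the same
socket, modeled as a FIFO:

```python
from collections import deque

# Hypothetical model: each message is (tag, kind, size);
# the socket is a simple FIFO.
socket_fifo = deque()

def post_send(tag, kind, size):
    socket_fifo.append((tag, kind, size))

# First write operation: small request, then large flow data (tag 1).
post_send(1, "request", 1024)
post_send(1, "flow-data", 256 * 1024)  # rendezvous-sized

# Second write starts before the first finishes (tag 2).
post_send(2, "request", 1024)
post_send(2, "flow-data", 256 * 1024)  # interleaved on the same socket

# The second flow's large data buffer now sits on the wire
# behind the first flow's traffic.
print([m[:2] for m in socket_fifo])
```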
On the server side:
--------------------
- the first I/O request arrives
- it gets past the request scheduler
- a flow is started and receives the first (large) data buffer
- a different request for the same handle arrives
- a getattr would be a good example; it could be from any client
- this getattr gets queued in the request scheduler behind the write
- the second I/O request arrives
- it gets queued behind the getattr in the request scheduler
At this point on the server side, we have a flow in progress that is
waiting on a data buffer. However, the next message is for a different
flow (the tags don't match). Since this message is relatively large
(256K), it is in rendezvous mode within bmi_tcp and cannot be pulled out
of the socket until a matching receive is posted. The flow that is
expected to post that receive is not running yet because the second I/O
request is stuck in the scheduler.
... so we have a deadlock. The socket is filled with data that the
server isn't allowed to recv yet, and the data that it really needs next
is stuck behind it.
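To make the head-of-line blocking concrete, here is a minimal sketch
(hypothetical names again, and the 64K rendezvous cutoff is an
assumption for illustration, not the actual bmi_tcp threshold): the
server can only pull the head of the socket FIFO if the message is
small or a matching receive has been posted:

```python
from collections import deque

RENDEZVOUS_THRESHOLD = 64 * 1024  # assumed cutoff for illustration

# Head of the socket: large data for the second flow (tag 2);
# behind it: the data the running flow (tag 1) actually needs.
socket_fifo = deque([
    (2, "flow-data", 256 * 1024),
    (1, "flow-data", 256 * 1024),
])
posted_receives = {1}  # only flow 1 is running; flow 2 is stuck in the scheduler

def try_recv():
    tag, kind, size = socket_fifo[0]
    if size >= RENDEZVOUS_THRESHOLD and tag not in posted_receives:
        return None  # rendezvous message with no matching receive: can't drain
    return socket_fifo.popleft()

print(try_recv())  # None -> deadlock: tag-1 data is stuck behind tag-2 data
```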
I'm not sure that I described that all that well. At a high level we
have two flows sharing the same socket. The client started both of them
and the messages got interleaved. The server only started one of them,
but is now stuck because it can't deal with data arriving for the second
one.
I am going to try to find a brute-force way to serialize I/O from each
pvfs2-client-core just to see if that solves the problem (maybe by
allowing only one buffer between pvfs2-client-core and the kernel,
rather than 5). If that does fix the problem, then we need a more
elegant solution: maybe waiting for acks before starting flows, or
somehow serializing flows that share sockets.
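The "serialize flows that share sockets" idea could look roughly like
this (a hypothetical sketch, not the actual flow/BMI code): hold a
per-socket lock for the duration of a flow's data transfer, so a second
flow's chunks can't interleave with the first's:

```python
import threading
import time

# Hypothetical sketch: one lock per socket; a flow holds it
# for its whole data transfer.
socket_locks = {"srvX": threading.Lock()}
wire = []  # messages in the order they would hit the wire

def run_flow(socket_id, tag, nchunks):
    with socket_locks[socket_id]:  # second flow on this socket waits here
        for i in range(nchunks):
            wire.append((tag, i))
            time.sleep(0.001)  # give the other thread a chance to interleave

threads = [threading.Thread(target=run_flow, args=("srvX", t, 3)) for t in (1, 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(wire)  # each tag's chunks are contiguous; no interleaving on the socket
```

The trade-off, of course, is that this gives up whatever concurrency we
were getting from overlapping flows on one socket.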
-Phil
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers