On Jun 14, 2006, at 8:21 AM, Phil Carns wrote:
If we did this within BMI, we would be paying an extra round trip
latency time for each large TCP message, which we should probably
try to avoid.
I vote for just changing the ordering of sys-io.sm so that it does
not post write flows until a positive write ack is received from
the server. That is basically equivalent to performing one
handshake (or rendezvous) for the whole flow rather than one per
BMI message.
I remember changing that last year to improve the performance of IO
ops that are larger than 16K, but still small enough that waiting for
the ack mattered. We discussed this a bit offline and came to the
conclusion that the path of least resistance is to change the sys-io
state machine back to the way it was (wait for response before
posting flow).
If this is a significant performance hit, we might consider
increasing the buffer size for eager mode in tcp, so that the small
io state machine gets triggered for larger IOs (right now it's 16K I
think). Kind of a cheap fix for the performance penalty if there is
one.
If that doesn't work we could implement rendezvous in bmi_tcp and put
the sys-io state machine back to the way it is now.
I'll try to send a patch of the sys-io state machine waiting for the
response later today.
-sam
The sys-io.sm state machine already does that for reads. Read
flows do not get posted until an ack is received from the server.
-Phil
Rob Ross wrote:
Ok. Well we screwed up here. We've either got to be able to pull
that data off the wire (presumably at the BMI layer) or we've got
to ACK for large messages (either in BMI or flow or elsewhere).
Suggestions on which approach to take and where to implement?
Probably most straightforward to do this in BMI, but likely not
the most efficient.
Rob
Phil Carns wrote:
Sorry- "rendezvous" is the wrong terminology here for what is
happening within bmi_tcp at the individual message level. It
doesn't implicitly exchange control messages before putting each
buffer on the wire.
bmi_tcp will send any size message without using control messages
to handshake within bmi_tcp. It is making the assumption that
someone at a higher level has already performed handshaking and
agreed that both sides are going to post the appropriate
matching operations.
Unexpected messages are the only things BMI is permitted to send
without a guarantee that the other side is going to post a recv.
The difference between small and large "normal" messages in
bmi_tcp is not that larger ones will wait to transmit. Both are
basically sent the same way. The difference is on the recv
side. Small messages are allowed to be temporarily buffered by
the receiver until a matching recv is posted, while large
messages will not be read into memory until a matching receive
buffer is posted. So the actual network transfer will not
_complete_ until both sides have posted, but it can definitely
begin before the recv is posted.
Flows have the same sort of restriction as large buffers do in
bmi_tcp. When a flow is posted, it does not do any handshaking
to make sure both sides are ready to transmit before moving
data. It assumes that the request protocol has already sorted
that out, so it just starts transmitting.
I think that is what's leading to the problem here- the client
has been told to proceed with the flow before waiting to make
sure that the server is also ready to transmit a flow.
Once upon a time sys-io.sm did wait for write acks before
starting write flows, but that was changed at some point to try
to improve performance. We just didn't notice the case that it
breaks for until now.
-Phil
Rob Ross wrote:
There's a fundamental issue here that I don't quite get: if
we're in rendezvous mode, why is there data on the wire if we
aren't ready to receive it? The whole point of rendezvous mode
is to *not* send the data until the matching receive has been
posted.
What am I missing?
Thanks,
Rob
Phil Carns wrote:
Ok, I think I _might_ see what the problem is with the BMI
messaging.
I haven't 100% confirmed yet, but it looks like we have the
following scenario:
On the client side:
--------------------
- pvfs2-client-core starts an I/O operation (write) to server X
- a send (for the request) is posted, which is a small buffer
- the flow is posted before an ack is received
- the flow itself posts another send for data, which is a
large buffer
- ...
A few notes real quick- I think the above is a performance
optimization; we try to go ahead and get the flow going before
receiving a positive ack from the server. It will be canceled
if we get a negative ack (or fail to get an ack altogether)
- while the above is in progress, pvfs2-client-core starts
another write operation to server X (from another application
that is hitting the same server)
- a send for this second request is posted
- another flow is posted before an ack is received
- depending on the timing, it may manage to post a send for
data as well, which is another large buffer
- this traffic is interleaved on the same socket as is being
used for the first flow, which is still running at this point
On the server side:
--------------------
- the first I/O request arrives
- it gets past the request scheduler
- a flow is started and receives the first (large) data buffer
- a different request for the same handle arrives
- getattr would be a good example, could be from any client
- this getattr gets queued in the request scheduler behind
the write
- the second I/O request arrives
- it gets queued behind the getattr in the request scheduler
At this point on the server side, we have a flow in progress
that is waiting on a data buffer. However, the next message is
for a different flow (the tags don't match). Since this
message is relatively large (256K), it is in rendezvous mode
within bmi_tcp and cannot be pulled out of the socket until a
matching receive is posted. The flow that is expected to post
that receive is not running yet because the second I/O request
is stuck in the scheduler.
... so we have a deadlock. The socket is filled with data that
the server isn't allowed to recv yet, and the data that it
really needs next is stuck behind it.
I'm not sure that I described that all that well. At a high
level we have two flows sharing the same socket. The client
started both of them and the messages got interleaved. The
server only started one of them, but is now stuck because it
can't deal with data arriving for the second one.
I am going to try to find a brute force way to serialize I/O
from each pvfs2-client-core just to see if that solves the
problem (maybe only allowing one buffer between pvfs2-client-
core and kernel, rather than 5). If that does look like it
fixed the problem, then we need a more elegant solution.
Maybe waiting for acks before starting flows, or just somehow
serializing flows that share sockets.
-Phil
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers