If we did this within BMI, we would be paying an extra round-trip latency for each large TCP message, which we should probably try to avoid.

I vote for just changing the ordering of sys-io.sm so that it does not post write flows until a positive write ack is received from the server. That is basically equivalent to performing one handshake (or rendezvous) for the whole flow rather than one per BMI message.

The sys-io.sm state machine already does that for reads. Read flows do not get posted until an ack is received from the server.
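
Very roughly, the change would make the client write path look like the sketch below (plain C just to show the ordering; the function names are made up, not the actual states in sys-io.sm):

#include <stdio.h>

/* Stand-ins for the real request/ack/flow machinery -- these are
 * hypothetical names, not the actual sys-io.sm states.             */
static int  send_write_request(void) { printf("send write request\n"); return 0; }
static int  recv_write_ack(void)     { printf("recv write ack\n");     return 0; }
static void post_write_flow(void)    { printf("post write flow\n"); }

int main(void)
{
    /* proposed ordering, same as the read path already uses */
    if (send_write_request() != 0)
        return 1;
    if (recv_write_ack() != 0)    /* wait for a positive ack from the server  */
        return 1;                 /* negative/missing ack: no flow ever posted */
    post_write_flow();            /* only now start moving data                */
    return 0;
}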

-Phil

Rob Ross wrote:
Ok. Well we screwed up here. We've either got to be able to pull that data off the wire (presumably at the BMI layer) or we've got to ACK for large messages (either in BMI or flow or elsewhere).

Suggestions on which approach to take and where to implement it? Doing this in BMI is probably the most straightforward, but likely not the most efficient.

Rob

Phil Carns wrote:

Sorry- "rendezvous" is the wrong terminology here for what is happening within bmi_tcp at the individual message level. It doesn't implicitly exchange control messages before putting each buffer on the wire.

bmi_tcp will send a message of any size without using control messages to handshake internally. It is making the assumption that someone at a higher level has already performed handshaking and agreed that both sides are going to post the appropriate matching operations.

Unexpected messages are the only things BMI is permitted to send without a guarantee that the other side is going to post a recv.

The difference between small and large "normal" messages in bmi_tcp is not that larger ones will wait to transmit. Both are basically sent the same way. The difference is on the recv side. Small messages are allowed to be temporarily buffered by the receiver until a matching recv is posted, while large messages will not be read into memory until a matching receive buffer is posted. So the actual network transfer will not _complete_ until both sides have posted, but it can definitely begin before the recv is posted.
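
In toy C form, that receive-side policy is something like this (not the real bmi_tcp code; EAGER_LIMIT and all of the names here are made up):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy sketch of the receive-side policy -- not the real bmi_tcp code;
 * EAGER_LIMIT and all of the names here are hypothetical.              */

#define EAGER_LIMIT (16 * 1024)   /* hypothetical small-message cutoff */

struct incoming { int tag; size_t size; };

static bool matching_recv_posted(int tag) { (void)tag; return false; }

static void handle_incoming(const struct incoming *m)
{
    if (matching_recv_posted(m->tag)) {
        /* both sides have posted: move the data and complete */
        printf("tag %d: recv posted, transfer completes\n", m->tag);
    } else if (m->size <= EAGER_LIMIT) {
        /* small: pull it off the socket into a temporary buffer and
         * hand it over whenever the matching recv shows up            */
        printf("tag %d: small, buffer it until a recv is posted\n", m->tag);
    } else {
        /* large: leave it sitting in the socket until a matching recv
         * is posted -- nothing behind it on this socket can be read
         * until then                                                  */
        printf("tag %d: large, stays on the socket\n", m->tag);
    }
}

int main(void)
{
    struct incoming small = { .tag = 1, .size = 100 };
    struct incoming large = { .tag = 2, .size = 256 * 1024 };
    handle_incoming(&small);
    handle_incoming(&large);
    return 0;
}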

Flows have the same sort of restriction as large buffers do in bmi_tcp. When a flow is posted, it does not do any hand shaking to make sure both sides are ready to transmit before moving data. It assumes that the request protocol has already sorted that out, so it just starts transmitting.

I think that is what's leading to the problem here- the client has been told to proceed with the flow without waiting to make sure that the server is also ready to run its side of the flow.

Once upon a time sys-io.sm did wait for write acks before starting write flows, but that was changed at some point to try to improve performance. We just didn't notice the case that it breaks for until now.

-Phil

Rob Ross wrote:

There's a fundamental issue here that I don't quite get: if we're in rendezvous mode, why is there data on the wire if we aren't ready to receive it? The whole point of rendezvous mode is to *not* send the data until the matching receive has been posted.

What am I missing?

Thanks,

Rob

Phil Carns wrote:

Ok, I think I _might_ see what the problem is with the BMI messaging.

I haven't 100% confirmed yet, but it looks like we have the following scenario:

On the client side:
--------------------
- pvfs2-client-core starts an I/O operation (write) to server X
  - a send (for the request) is posted, which is a small buffer
  - the flow is posted before an ack is received
- the flow itself posts another send for data, which is a large buffer
  - ...

A few notes real quick- I think the above is a performance optimization; we try to go ahead and get the flow going before receiving a positive ack from the server. It will be canceled if we get a negative ack (or fail to get an ack altogether).

- while the above is in progress, pvfs2-client-core starts another write operation to server X (from another application that is hitting the same server)
  - a send for this second request is posted
  - another flow is posted before an ack is received
- depending on the timing, it may manage to post a send for data as well, which is another large buffer. This traffic is interleaved on the same socket that is being used for the first flow, which is still running at this point

On the server side:
--------------------
- the first I/O request arrives
  - it gets past the request scheduler
  - a flow is started and receives the first (large) data buffer
- a different request for the same handle arrives
  - getattr would be a good example, could be from any client
  - this getattr gets queued in the request scheduler behind the write
- the second I/O request arrives
  - it gets queued behind the getattr in the request scheduler

At this point on the server side, we have a flow in progress that is waiting on a data buffer. However, the next message is for a different flow (the tags don't match). Since this message is relatively large (256K), it is in rendezvous mode within bmi_tcp and cannot be pulled out of the socket until a matching receive is posted. The flow that is expected to post that receive is not running yet because the second I/O request is stuck in the scheduler.

... so we have a deadlock. The socket is filled with data that the server isn't allowed to recv yet, and the data that it really needs next is stuck behind it.

I'm not sure that I described that all that well. At a high level we have two flows sharing the same socket. The client started both of them and the messages got interleaved. The server only started one of them, but is now stuck because it can't deal with data arriving for the second one.
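
To make that concrete, here is a toy model of the server side of that socket (made-up structures, not PVFS2 code; the message ordering is just the interleaving described above):

#include <stdio.h>

/* Toy model of the stuck socket -- hypothetical structures, not PVFS2
 * code.  The socket is a FIFO of messages; a large message can only be
 * drained once a matching recv has been posted for its flow.            */

#define EAGER_LIMIT (16 * 1024)

struct msg { int flow_tag; int size; const char *what; };

int main(void)
{
    /* what the client put on the wire, in order */
    struct msg socket_fifo[] = {
        { 1, 256 * 1024, "flow 1, data buffer #1" },
        { 2,        200, "I/O request #2 (small)" },
        { 2, 256 * 1024, "flow 2, data buffer #1" },
        { 1, 256 * 1024, "flow 1, data buffer #2" },
    };
    int running_flow = 1;   /* only flow 1 made it past the request scheduler */
    int n = sizeof(socket_fifo) / sizeof(socket_fifo[0]);

    for (int i = 0; i < n; i++) {
        struct msg *m = &socket_fifo[i];
        if (m->size <= EAGER_LIMIT || m->flow_tag == running_flow) {
            printf("drained: %s\n", m->what);
        } else {
            /* large buffer for a flow the server hasn't started: it stays
             * in the socket, and everything behind it is stuck -- including
             * the buffer that flow 1 is waiting for                        */
            printf("stuck on: %s (flow %d never gets its next buffer)\n",
                   m->what, running_flow);
            break;
        }
    }
    return 0;
}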

I am going to try to find a brute force way to serialize I/O from each pvfs2-client-core just to see if that solves the problem (maybe only allowing one buffer between pvfs2-client-core and the kernel, rather than 5). If that does look like it fixes the problem, then we need a more elegant solution. Maybe waiting for acks before starting flows, or just somehow serializing flows that share sockets.
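
For what it's worth, the kind of brute force serialization I have in mind looks roughly like this (hypothetical sketch, not the actual pvfs2-client-core code): only one write is in flight at a time, so two flows can never interleave on the same server socket:

#include <pthread.h>
#include <stdio.h>

/* Sketch of the brute force idea -- hypothetical, not the actual
 * pvfs2-client-core code.  Instead of letting up to 5 buffers be in
 * flight between the kernel module and pvfs2-client-core, allow only
 * one I/O at a time.                                                  */

static pthread_mutex_t io_lock = PTHREAD_MUTEX_INITIALIZER;

static void submit_and_wait(int opnum)
{
    /* stands in for: post request, wait for ack, run the flow, wait */
    printf("write op %d runs alone\n", opnum);
}

static void *writer(void *arg)
{
    int opnum = *(int *)arg;
    pthread_mutex_lock(&io_lock);    /* serialize all I/O operations */
    submit_and_wait(opnum);
    pthread_mutex_unlock(&io_lock);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = { 1, 2 };

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, writer, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}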

-Phil


