On Jun 14, 2006, at 8:21 AM, Phil Carns wrote:
If we did this within BMI, we would be paying an extra round trip
latency time for each large TCP message, which we should probably
try to avoid.
I vote for just changing the ordering of sys-io.sm so that it does
not post write flows until a positive write ack is received from
the server. That is basically equivalent to performing one
handshake (or rendezvous) for the whole flow rather than one per
BMI message.
I remember changing that last year to improve the performance of IO
ops that are larger than 16K, but still small enough that waiting for
the ack mattered. We discussed this a bit offline and came to the
conclusion that the path of least resistance is to change the sys-io
state machine back to the way it was (wait for response before
posting flow).
If this is a significant performance hit, we might consider
increasing the buffer size for eager mode in tcp, so that the small
io state machine gets triggered for larger IOs (right now it's 16K I
think). Kind of a cheap fix for the performance penalty if there is
one.
If that doesn't work we could implement rendezvous in bmi_tcp and put
the sys-io state machine back to the way it is now.
I'll try to send a patch of the sys-io state machine waiting for the
response later today.
-sam
The sys-io.sm state machine already does that for reads. Read
flows do not get posted until an ack is received from the server.
-Phil
Rob Ross wrote:
Ok. Well we screwed up here. We've either got to be able to pull
that data off the wire (presumably at the BMI layer) or we've got
to ACK for large messages (either in BMI or flow or elsewhere).
Suggestions on which approach to take and where to implement?
Probably most straightforward to do this in BMI, but likely not
the most efficient.
Rob
Phil Carns wrote:
Sorry- "rendezvous" is the wrong terminology here for what is
happening within bmi_tcp at the individual message level. It
doesn't implicitly exchange control messages before putting each
buffer on the wire.
bmi_tcp will send any size message without using control messages
to handshake within bmi_tcp. It is making the assumption that
someone at a higher level has already performed handshaking and
agreed that both sides are going to post the appropriate
matching operations.
Unexpected messages are the only things BMI is permitted to send
without a guarantee that the other side is going to post a recv.
The difference between small and large "normal" messages in
bmi_tcp is not that larger ones will wait to transmit. Both are
basically sent the same way. The difference is on the recv
side. Small messages are allowed to be temporarily buffered by
the receiver until a matching recv is posted, while large
messages will not be read into memory until a matching receive
buffer is posted. So the actual network transfer will not
_complete_ until both sides have posted, but it can definitely
begin before the recv is posted.
Flows have the same sort of restriction as large buffers do in
bmi_tcp. When a flow is posted, it does not do any handshaking
to make sure both sides are ready to transmit before moving
data. It assumes that the request protocol has already sorted
that out, so it just starts transmitting.
I think that is what's leading to the problem here- the client
has been told to proceed with the flow before waiting to make
sure that the server is also ready to transmit a flow.
Once upon a time sys-io.sm did wait for write acks before
starting write flows, but that was changed at some point to try
to improve performance. We just didn't notice the case that it
breaks for until now.
-Phil
Rob Ross wrote:
There's a fundamental issue here that I don't quite get: if
we're in rendezvous mode, why is there data on the wire if we
aren't ready to receive it? The whole point of rendezvous mode
is to *not* send the data until the matching receive has been
posted.
What am I missing?
Thanks,
Rob
Phil Carns wrote:
Ok, I think I _might_ see what the problem is with the BMI
messaging.
I haven't 100% confirmed yet, but it looks like we have the
following scenario:
On the client side:
--------------------
- pvfs2-client-core starts an I/O operation (write) to server X
- a send (for the request) is posted, which is a small buffer
- the flow is posted before an ack is received
- the flow itself posts another send for data, which is a
large buffer
- ...
A few notes real quick- I think the above is a performance
optimization; we try to go ahead and get the flow going before
receiving a positive ack from the server. It will be canceled
if we get a negative ack (or fail to get an ack altogether)
- while the above is in progress, pvfs2-client-core starts
another write operation to server X (from another application
that is hitting the same server)
- a send for this second request is posted
- another flow is posted before an ack is received
- depending on the timing, it may manage to post a send for
data as well, which is another large buffer
- this traffic is interleaved on the same socket as is being
used for the first flow, which is still running at this point
On the server side:
--------------------
- the first I/O request arrives
- it gets past the request scheduler
- a flow is started and receives the first (large) data buffer
- a different request for the same handle arrives
- getattr would be a good example, could be from any client
- this getattr gets queued in the request scheduler behind
the write
- the second I/O request arrives
- it gets queued behind the getattr in the request scheduler
At this point on the server side, we have a flow in progress
that is waiting on a data buffer. However, the next message is
for a different flow (the tags don't match). Since this
message is relatively large (256K), it is in rendezvous mode
within bmi_tcp and cannot be pulled out of the socket until a
matching receive is posted. The flow that is expected to post
that receive is not running yet because the second I/O request
is stuck in the scheduler.
... so we have a deadlock. The socket is filled with data that
the server isn't allowed to recv yet, and the data that it
really needs next is stuck behind it.
I'm not sure that I described that all that well. At a high
level we have two flows sharing the same socket. The client
started both of them and the messages got interleaved. The
server only started one of them, but is now stuck because it
can't deal with data arriving for the second one.
I am going to try to find a brute force way to serialize I/O
from each pvfs2-client-core just to see if that solves the
problem (maybe only allowing one buffer between pvfs2-client-
core and kernel, rather than 5). If that does look like it
fixed the problem, then we need a more elegant solution.
Maybe waiting for acks before starting flows, or just somehow
serializing flows that share sockets.
-Phil
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers