On Nov 30, 2006, at 6:58 PM, Scott Atchley wrote:

On Nov 30, 2006, at 4:31 PM, Sam Lang wrote:

Right now all our operations (or transactions, as you call them) start with an unexpected message from the client and end with an expected message from the server. I don't know if that's a design requirement of BMI, though, or just an artifact of how we use it in PVFS. I _think_ the BMI interfaces were meant to allow expected messages in either direction in any order, and it's left up to the upper layers to make sure they get posted right, but again, I would have to defer to one of the BMI sages.

Hmmm. I assumed that for any operation, there would be a back and forth between client and server, ending with an expected send from server to the client:

Client          Server
   |     unex      |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |

with a minimum of an unexpected message from client to server followed by an expected message from server to client. If this is the case, I might be able to do a simple flow control on the client using a reference count (increment on send to server S and decrement on receive from S).

Are you saying that a single operation may not ping-pong back and forth, but may instead have multiple expected sends in a single direction?

Client          Server
   |     unex      |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |

If so, would each of the receives (and matching sends) use different tags? Also, this case presents a resource starvation risk. Since the BMI method does not know about the entire operation (how many sends/receives), it could start the operation but then be unable to get the additional resources for the subsequent sends/receives needed to complete it.

Your example above is currently how writes work. The client sends an unexpected message to the server (a control message for the IO: file info, size of the IO, etc.); the server posts an expected receive and then sends an expected message back to the client. The client posts a receive for that expected message before sending the unexpected one. After the receive of the expected message completes at the client (this is a 'ready for IO' message from the server), it posts a send of the actual IO data (up to FlowBufferSize). Once that send completes, it posts another one, and assumes that the server has already posted another receive (based on the size of the entire IO). Once all the IO has completed at the server (including pushing the data to disk), the server sends a response ack message, for which the client posted a receive before doing any of the actual IO.

I think the ordering of posts goes something like this for a write:

client:                            server:
---------------------------------  ---------------------------------
                                   post_unexp
post_recv(ready_ack)
post_send(IO_request)
                                   wait(IO_request)
                                   post_recv(IO1)
                                   post_send(ready_ack)
wait(ready_ack)
post_send(IO1)
post_recv(write_ack)
                                   wait(IO1)
                                   post_recv(IO2)
wait_for_send_completion(IO1)
post_send(IO2)
                                   wait(IO2)
                                   post_recv(IO3)
...                                ...
post_send(ION)
                                   wait(ION)
                                   post_send(write_ack)
wait(write_ack)


It looks like the flow code on the server doesn't actually post the next recv of IO (IO2) until the first recv (IO1) has completed, so it's possible (though probably unlikely) that the client posts and starts the next send before the server posts the next receive. Once the first recv completes, the server posts the next recv (IO2), and it also posts another recv (IO3), if necessary, after the write to disk of the data from IO1. So receives begin to be posted before the current receive completes, allowing the server to post receives ahead of the client's associated sends. This is essentially what the flow looks like on the server:

time
-------------->

[---BMI RECV IO1---][----BMI RECV IO2----][----------BMI RECV IO4----------][-------------------BMI RECV IO7----------------]
                    [-DISK WRITE IO1-][------BMI RECV IO3------][--------BMI RECV IO5--------][---BMI RECV...
                                      [-DISK WRITE IO2-][------------------BMI RECV IO6---------------][---BMI RECV...
                                                        [-DISK WRITE IO3-][--- BMI RECV...
                                                                          [-DISK WRITE IO4-][---BMI RECV....
                                                                                            [-DISK WRITE IO5-]
                                                                                                        [-DISK WRITE IO6-]
                                                                                                                   [- DISK WRITE IO7-]

(I hope the columns match up ok there, you may need to resize your window for best viewing :-)).

The [---] show the post and completion times of BMI receive operations and the associated writes of the received data to disk. Each BMI receive uses a separate buffer (up to a max of 8 buffers). Every time a bmi recv completes, two things happen: the associated trove write is posted, and a new bmi recv is posted. So over time, bmi receives will get posted at the server before bmi sends get posted at the client, but the second and maybe third bmi receives may be posted after the bmi sends at the client.

To answer your specific questions:

The same bmi tag is passed to each of the post_send and post_recv calls for the entire IO operation.

As to hitting resource limits, the client doesn't post the next send until the previous send has completed. I think with enough IO operations from different clients happening concurrently, it may be possible to run into the resource issues you speak of, but I need to verify that.


Are you able to do some kind of pre-posting if you know there's always an expected coming back?

-sam

I assumed that BMI always posted a receive for an expected incoming send? Does it not? I would hope that BMI or a higher layer would pre-post the receive before calling the send function. If not, let me know.

Yes, it always posts a receive for an expected message. For most expected messages the receive is guaranteed to be posted before the peer posts the send. That doesn't appear to be guaranteed in the IO case, though, as I mentioned above.

Hope this helps.

-sam


Scott


_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
