On Nov 30, 2006, at 6:58 PM, Scott Atchley wrote:
On Nov 30, 2006, at 4:31 PM, Sam Lang wrote:
Right now all our operations (or transactions, as you call them)
start with an unexpected message from the client and end with an
expected message from the server. I don't know if that's a design
requirement of BMI, though, or just an artifact of how we use it in
PVFS. I _think_ the BMI interfaces were meant to allow expected
messages in either direction in any order, and it's left up to the
upper layers to make sure they get posted correctly, but again, I
would have to defer to one of the BMI sages.
Hmmm. I assumed that for any operation there would be a back and
forth between client and server, ending with an expected send from
server to client:
Client           Server
   |     unex      |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
with, at a minimum, an unexpected message from client to server
followed by an expected message from server to client. If this is
the case, I might be able to do a simple flow control on the client
using a reference count (increment on send to server S and
decrement on receive from S).
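The reference-count scheme could look something like this; a minimal
sketch, assuming one counter per server and a fixed limit on
outstanding exchanges. The type and function names (flow_t, etc.) are
hypothetical, not part of the BMI API.

```c
/* Per-server flow control: increment on each send posted to server S,
 * decrement when the matching receive from S completes.
 * All names here are hypothetical illustrations, not BMI calls. */
#include <assert.h>

typedef struct {
    int inflight;   /* sends posted minus receives completed */
    int limit;      /* max outstanding exchanges allowed */
} flow_t;

/* Returns 1 if the send may be posted now, 0 if the caller must wait
 * for an outstanding receive to complete first. */
static int flow_try_send(flow_t *f)
{
    if (f->inflight >= f->limit)
        return 0;
    f->inflight++;
    return 1;
}

/* Called when a receive from server S completes. */
static void flow_recv_complete(flow_t *f)
{
    assert(f->inflight > 0);
    f->inflight--;
}
```

This only works if every send to S is eventually answered by exactly
one receive from S, which is the assumption being questioned below.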
Are you saying that a single operation might not ping-pong back and
forth, but instead have multiple expected sends in a single direction?
Client           Server
   |     unex      |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
If so, would each of the receives (and matching sends) use
different tags? Also, this case presents a resource-starvation
risk: since the BMI method does not know about the entire operation
(how many sends/receives), it could start the operation but then be
unable to get the additional resources needed for the subsequent
sends/receives to complete it.
Your example above is currently how writes work. The client sends an
unexpected message to the server (a control message for the IO: file
info, size of the IO, etc.), which posts an expected receive and
then sends an expected message back to the client. The client posts
a receive for that expected message before sending the unexpected
one. After the receive of the expected message completes at the
client (this is a 'ready for IO' message from the server), it posts
a send of the actual IO (this will be up to FlowBufferSize). Once
that send completes, it posts another one, and assumes that the
server has already posted another receive (based on the size of the
entire IO). Once all the IO has completed at the server (including
pushing the data to disk), the server sends a response ack message,
for which the client posted a receive before doing any of the actual
IO.
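The chunking above ("up to FlowBufferSize") amounts to splitting the
total IO into fixed-size pieces, one expected send/recv pair per
piece. A sketch of the arithmetic; the function names are
hypothetical, not BMI API.

```c
/* How many flow buffers a write of `total` bytes needs, and the size
 * of each piece.  Hypothetical helpers for illustration only. */
#include <assert.h>

static int io_chunk_count(long total, long flow_buffer_size)
{
    /* number of FlowBufferSize pieces, rounding up */
    return (int)((total + flow_buffer_size - 1) / flow_buffer_size);
}

static long io_chunk_size(long total, long flow_buffer_size, int i)
{
    /* size of the i-th (0-based) piece: full buffers, then remainder */
    long off = (long)i * flow_buffer_size;
    long rem = total - off;
    return rem < flow_buffer_size ? rem : flow_buffer_size;
}
```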
I think the ordering of posts goes something like this for a write:
client:                          server:
--------------------------------------------------------------
                                 post_unexp
post_recv(ready_ack)
post_send(IO_request)
                                 wait(IO_request)
                                 post_recv(IO1)
                                 post_send(ready_ack)
wait(ready_ack)
post_send(IO1)
post_recv(write_ack)
                                 wait(IO1)
                                 post_recv(IO2)
wait_for_send_completion(IO1)
post_send(IO2)
                                 wait(IO2)
                                 post_recv(IO3)
...                              ...
post_send(ION)
                                 wait(ION)
                                 post_send(write_ack)
wait(write_ack)
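The client side of that ordering can be traced in a few lines of
code. This is a simulation only: the helpers just record events so
the ordering invariants can be checked, and none of the names here
are the real BMI API.

```c
/* Trace the client-side post ordering for a write of nchunks flow
 * buffers, matching the table above.  Illustration only. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_EV 64
static char ev[MAX_EV][32];
static int nev;

static void log_ev(const char *e)
{
    assert(nev < MAX_EV);
    snprintf(ev[nev++], sizeof ev[0], "%s", e);
}

static int ev_index(const char *e)
{
    for (int i = 0; i < nev; i++)
        if (strcmp(ev[i], e) == 0)
            return i;
    return -1;
}

static void client_write(int nchunks)
{
    char buf[32];
    log_ev("post_recv(ready_ack)");   /* before the unexpected send */
    log_ev("post_send(IO_request)");  /* the unexpected message */
    log_ev("wait(ready_ack)");
    log_ev("post_send(IO1)");
    log_ev("post_recv(write_ack)");   /* posted before the IO finishes */
    for (int i = 1; i < nchunks; i++) {
        /* next send is posted only after the previous one completes */
        snprintf(buf, sizeof buf, "wait_send(IO%d)", i);
        log_ev(buf);
        snprintf(buf, sizeof buf, "post_send(IO%d)", i + 1);
        log_ev(buf);
    }
    snprintf(buf, sizeof buf, "wait_send(IO%d)", nchunks);
    log_ev(buf);
    log_ev("wait(write_ack)");
}
```

The key invariants the text relies on are visible in the trace: every
receive is posted before the message that satisfies it is awaited,
and IOk+1 is never sent before IOk's send completes.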
It looks like the flow code on the server doesn't actually post the
next recv of IO (IO2) until the first recv (IO1) has completed, so
it's possible that the client posts (and starts) the next send
before the server posts the next receive, although that's probably
unlikely. The server posts the next recv (IO2) once the first recv
completes, and posts another recv (IO3), if necessary, after the
completed IO1 data has been written to disk. So receives begin to be
posted before the current receive completes, allowing the server to
post receives before the client posts the associated sends. This is
essentially what the flow looks like on the server:
time
-------------->
[--BMI RECV IO1--][--BMI RECV IO2--][----BMI RECV IO4----][-------BMI RECV IO7-------]
   [-DISK WRITE IO1-][---BMI RECV IO3---][----BMI RECV IO5----][--BMI RECV...
                     [-DISK WRITE IO2-][------BMI RECV IO6------][--BMI RECV...
                                       [-DISK WRITE IO3-][--BMI RECV...
                                                         [-DISK WRITE IO4-][--BMI RECV...
                                                                           [-DISK WRITE IO5-]
                                                                              [-DISK WRITE IO6-]
                                                                                 [-DISK WRITE IO7-]
(I hope the columns match up ok there, you may need to resize your
window for best viewing :-)).
The [---] show the post and completion times of BMI receive
operations and the associated writes of the received data to disk.
Each BMI receive uses a separate buffer (up to a max of 8 buffers).
Every time a bmi recv completes, two things happen: the associated
trove write is posted, and a new bmi recv is posted. So over time,
bmi receives will get posted at the server before the bmi sends get
posted at the client, though the second and maybe third bmi receives
may be posted after the corresponding bmi sends at the client.
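The server-side pipeline above can be sketched with a few counters:
a pool of 8 buffers, where each recv completion posts the trove write
for that buffer plus a new recv, and each write completion frees a
buffer for another recv. This is a counting model only; the struct
and function names are hypothetical, not the PVFS flow code.

```c
/* Counting model of the server-side flow: how the number of
 * outstanding receives grows as recvs and disk writes complete.
 * Hypothetical names, for illustration only. */
#include <assert.h>

#define NBUFFERS 8

typedef struct {
    int free_buffers;   /* pool slots not holding in-flight data */
    int posted_recvs;   /* bmi receives currently outstanding */
    int remaining;      /* chunks not yet posted for receive */
} flow_state_t;

static void post_recv_if_possible(flow_state_t *s)
{
    if (s->remaining > 0 && s->free_buffers > 0) {
        s->free_buffers--;
        s->remaining--;
        s->posted_recvs++;
    }
}

static void on_recv_complete(flow_state_t *s)
{
    s->posted_recvs--;
    /* the trove write for this buffer is posted here (the buffer
     * stays busy until the write completes), and a new recv is
     * posted into a free buffer if one is available */
    post_recv_if_possible(s);
}

static void on_write_complete(flow_state_t *s)
{
    s->free_buffers++;   /* buffer can be reused for another recv */
    post_recv_if_possible(s);
}
```

Note that each completion event posts at most one new recv, which is
why the first couple of receives can trail the client's sends before
the pipeline fills.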
To answer your specific questions:
The same bmi tag is passed to each of the post_send and post_recv
calls for the entire IO operation.
As to hitting resource limits, the client doesn't post the next send
until the previous send has completed. I think with enough IO
operations from different clients happening concurrently, it may be
possible to run into the resource issues you speak of, but I need to
verify that.
Are you able to do some kind of pre-posting if you know there's
always an expected coming back?
-sam
I assumed that BMI always posts a receive for an expected incoming
send. Does it not? I would hope that BMI or a higher layer would
pre-post the receive before calling the send function. If not, let
me know.
Yes, it always posts a receive for an expected message. For most
expected messages the receive is guaranteed to be posted before the
peer posts the send. That doesn't appear to be guaranteed in the IO
case, though, as I mentioned above.
Hope this helps.
-sam
Scott
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers