On Nov 30, 2006, at 6:58 PM, Scott Atchley wrote:
On Nov 30, 2006, at 4:31 PM, Sam Lang wrote:
Right now all our operations (or transactions, as you call them)
start with an unexpected message from the client and end with an
expected message from the server. I don't know if that's a design
requirement of BMI, though, or just an artifact of how we use it in
PVFS. I _think_ the BMI interfaces were meant to allow expected
messages in either direction in any order, and it's left up to the
upper layers to make sure they get posted correctly, but again, I
would have to defer to one of the BMI sages.
Hmmm. I assumed that for any operation there would be a back and
forth between client and server, ending with an expected send from
server to client:
Client           Server
   |     unex      |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
with, at a minimum, an unexpected message from client to server
followed by an expected message from server to client. If this is
the case, I might be able to do a simple flow control on the client
using a reference count (increment on send to server S and
decrement on receive from S).
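The reference-count scheme could look something like this; a minimal
sketch, assuming one counter per server and a fixed limit on
outstanding exchanges. The type and function names (flow_t, etc.) are
hypothetical, not part of the BMI API.

```c
/* Per-server flow control: increment on each send posted to server S,
 * decrement when the matching receive from S completes.
 * All names here are hypothetical illustrations, not BMI calls. */
#include <assert.h>

typedef struct {
    int inflight;   /* sends posted minus receives completed */
    int limit;      /* max outstanding exchanges allowed */
} flow_t;

/* Returns 1 if the send may be posted now, 0 if the caller must wait
 * for an outstanding receive to complete first. */
static int flow_try_send(flow_t *f)
{
    if (f->inflight >= f->limit)
        return 0;
    f->inflight++;
    return 1;
}

/* Called when a receive from server S completes. */
static void flow_recv_complete(flow_t *f)
{
    assert(f->inflight > 0);
    f->inflight--;
}
```

This only works if every send to S is eventually answered by exactly
one receive from S, which is the assumption being questioned below.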
Are you saying that a single operation might not ping-pong back and
forth, but instead have multiple expected sends in a single direction?
Client           Server
   |     unex      |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |-------------->|
   |               |
   |      ex       |
   |<--------------|
   |               |
If so, would each of the receives (and matching sends) use
different tags? Also, this case presents a resource-starvation
risk: since the BMI method does not know about the entire operation
(how many sends/receives), it could start the operation but then be
unable to get the additional resources needed for the subsequent
sends/receives to complete it.
Your example above is currently how writes work. The client sends an
unexpected message to the server (a control message for the IO: file
info, size of the IO, etc.), which posts an expected receive and
then sends an expected message back to the client. The client posts
a receive for that expected message before sending the unexpected
one. After the receive of the expected message completes at the
client (this is a 'ready for IO' message from the server), it posts
a send of the actual IO (this will be up to FlowBufferSize). Once
that send completes, it posts another one, and assumes that the
server has already posted another receive (based on the size of the
entire IO). Once all the IO has completed at the server (including
pushing the data to disk), the server sends a response ack message,
for which the client posted a receive before doing any of the actual
IO.
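The chunking above ("up to FlowBufferSize") amounts to splitting the
total IO into fixed-size pieces, one expected send/recv pair per
piece. A sketch of the arithmetic; the function names are
hypothetical, not BMI API.

```c
/* How many flow buffers a write of `total` bytes needs, and the size
 * of each piece.  Hypothetical helpers for illustration only. */
#include <assert.h>

static int io_chunk_count(long total, long flow_buffer_size)
{
    /* number of FlowBufferSize pieces, rounding up */
    return (int)((total + flow_buffer_size - 1) / flow_buffer_size);
}

static long io_chunk_size(long total, long flow_buffer_size, int i)
{
    /* size of the i-th (0-based) piece: full buffers, then remainder */
    long off = (long)i * flow_buffer_size;
    long rem = total - off;
    return rem < flow_buffer_size ? rem : flow_buffer_size;
}
```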
I think the ordering of posts goes something like this for a write:
client:                          server:
--------------------------------------------------------------
                                 post_unexp
post_recv(ready_ack)
post_send(IO_request)
                                 wait(IO_request)
                                 post_recv(IO1)
                                 post_send(ready_ack)
wait(ready_ack)
post_send(IO1)
post_recv(write_ack)
                                 wait(IO1)
                                 post_recv(IO2)
wait_for_send_completion(IO1)
post_send(IO2)
                                 wait(IO2)
                                 post_recv(IO3)
...                              ...
post_send(ION)
                                 wait(ION)
                                 post_send(write_ack)
wait(write_ack)
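The client side of that ordering can be traced in a few lines of
code. This is a simulation only: the helpers just record events so
the ordering invariants can be checked, and none of the names here
are the real BMI API.

```c
/* Trace the client-side post ordering for a write of nchunks flow
 * buffers, matching the table above.  Illustration only. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_EV 64
static char ev[MAX_EV][32];
static int nev;

static void log_ev(const char *e)
{
    assert(nev < MAX_EV);
    snprintf(ev[nev++], sizeof ev[0], "%s", e);
}

static int ev_index(const char *e)
{
    for (int i = 0; i < nev; i++)
        if (strcmp(ev[i], e) == 0)
            return i;
    return -1;
}

static void client_write(int nchunks)
{
    char buf[32];
    log_ev("post_recv(ready_ack)");   /* before the unexpected send */
    log_ev("post_send(IO_request)");  /* the unexpected message */
    log_ev("wait(ready_ack)");
    log_ev("post_send(IO1)");
    log_ev("post_recv(write_ack)");   /* posted before the IO finishes */
    for (int i = 1; i < nchunks; i++) {
        /* next send is posted only after the previous one completes */
        snprintf(buf, sizeof buf, "wait_send(IO%d)", i);
        log_ev(buf);
        snprintf(buf, sizeof buf, "post_send(IO%d)", i + 1);
        log_ev(buf);
    }
    snprintf(buf, sizeof buf, "wait_send(IO%d)", nchunks);
    log_ev(buf);
    log_ev("wait(write_ack)");
}
```

The key invariants the text relies on are visible in the trace: every
receive is posted before the message that satisfies it is awaited,
and IOk+1 is never sent before IOk's send completes.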
It looks like the flow code on the server doesn't actually post the
next recv of IO (IO2) until the first recv (IO1) has completed, so
it's possible that the client posts (and starts) the next send
before the server posts the next receive, although that's probably
unlikely. The server posts the next recv (IO2) once the first recv
completes, and posts another recv (IO3), if necessary, after the
completed IO1 data has been written to disk. So receives begin to be
posted before the current receive completes, allowing the server to
post receives before the client posts the associated sends. This is
essentially what the flow looks like on the server:
time
-------------->
[--BMI RECV IO1--][--BMI RECV IO2--][----BMI RECV IO4----][-------BMI RECV IO7-------]
   [-DISK WRITE IO1-][---BMI RECV IO3---][----BMI RECV IO5----][--BMI RECV...
                     [-DISK WRITE IO2-][------BMI RECV IO6------][--BMI RECV...
                                       [-DISK WRITE IO3-][--BMI RECV...
                                                         [-DISK WRITE IO4-][--BMI RECV...
                                                                           [-DISK WRITE IO5-]
                                                                              [-DISK WRITE IO6-]
                                                                                 [-DISK WRITE IO7-]
(I hope the columns match up ok there, you may need to resize your
window for best viewing :-)).
The [---] show the post and completion times of BMI receive
operations and the associated writes of the received data to disk.
Each BMI receive uses a separate buffer (up to a max of 8 buffers).
Every time a bmi recv completes, two things happen: the associated
trove write is posted, and a new bmi recv is posted. So over time,
bmi receives will get posted at the server before the bmi sends get
posted at the client, though the second and maybe third bmi receives
may be posted after the corresponding bmi sends at the client.
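The server-side pipeline above can be sketched with a few counters:
a pool of 8 buffers, where each recv completion posts the trove write
for that buffer plus a new recv, and each write completion frees a
buffer for another recv. This is a counting model only; the struct
and function names are hypothetical, not the PVFS flow code.

```c
/* Counting model of the server-side flow: how the number of
 * outstanding receives grows as recvs and disk writes complete.
 * Hypothetical names, for illustration only. */
#include <assert.h>

#define NBUFFERS 8

typedef struct {
    int free_buffers;   /* pool slots not holding in-flight data */
    int posted_recvs;   /* bmi receives currently outstanding */
    int remaining;      /* chunks not yet posted for receive */
} flow_state_t;

static void post_recv_if_possible(flow_state_t *s)
{
    if (s->remaining > 0 && s->free_buffers > 0) {
        s->free_buffers--;
        s->remaining--;
        s->posted_recvs++;
    }
}

static void on_recv_complete(flow_state_t *s)
{
    s->posted_recvs--;
    /* the trove write for this buffer is posted here (the buffer
     * stays busy until the write completes), and a new recv is
     * posted into a free buffer if one is available */
    post_recv_if_possible(s);
}

static void on_write_complete(flow_state_t *s)
{
    s->free_buffers++;   /* buffer can be reused for another recv */
    post_recv_if_possible(s);
}
```

Note that each completion event posts at most one new recv, which is
why the first couple of receives can trail the client's sends before
the pipeline fills.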
To answer your specific questions:
The same bmi tag is passed to each of the post_send and post_recv
calls for the entire IO operation.
As to hitting resource limits, the client doesn't post the next send
until the previous send has completed. I think with enough IO
operations from different clients happening concurrently, it may be
possible to run into the resource issues you speak of, but I need to
verify that.
Are you able to do some kind of pre-posting if you know there's
always an expected coming back?
-sam
I assumed that BMI always posts a receive for an expected incoming
send. Does it not? I would hope that BMI or a higher layer would
pre-post the receive before calling the send function. If not, let
me know.
Yes, it always posts a receive for an expected message. For most
expected messages the receive is guaranteed to be posted before the
peer posts the send. That doesn't appear to be guaranteed in the IO
case, though, as I mentioned above.
Hope this helps.
-sam
Scott
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers