Hi Julian,
I've included comments inline.
-sam
On Dec 17, 2005, at 9:35 AM, Julian Martin Kunkel wrote:
Hi,
I am trying to document how some operations are handled.
However, the IO handling with flows is a bit complicated ;-)
I will summarize only the important steps of the IO process (including
the states of the state machines), and I would be very happy if you could
have a look and tell me if something is wrong or if I forgot an important
state. Especially important for me are the messages sent between client
and server and the Trove operations.
C means client state, S means server state.
C: pvfs2_client_io_sm
C1) Get file attributes and size using pvfs2_client_getattr_sm
C2) Find target datafiles using the distribution function
C3) Send a message (PVFS_SERV_IO) to every server participating in the IO
operation to initiate a flow.
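To make C2 concrete, here is a minimal C sketch of mapping a request onto its target datafiles, assuming a simple round-robin stripe layout with a fixed strip size. The struct stripe_dist and the dist_* helpers are hypothetical names for illustration, not the actual PVFS2 distribution code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical round-robin ("simple stripe") layout. The struct and the
 * function names are illustrative, not the real PVFS2 distribution API. */
struct stripe_dist {
    uint64_t strip_size;    /* bytes per strip (assumed, e.g. 64 KiB) */
    uint32_t num_datafiles; /* datafiles the file is striped over */
};

/* Index of the datafile that holds logical byte `off`. */
static uint32_t dist_datafile(const struct stripe_dist *d, uint64_t off)
{
    assert(d->strip_size > 0 && d->num_datafiles > 0);
    return (uint32_t)((off / d->strip_size) % d->num_datafiles);
}

/* Offset of logical byte `off` inside that datafile. */
static uint64_t dist_physical_offset(const struct stripe_dist *d, uint64_t off)
{
    uint64_t strip = off / d->strip_size;    /* global strip number */
    uint64_t row = strip / d->num_datafiles; /* strips already in this datafile */
    return row * d->strip_size + off % d->strip_size;
}

/* C2: flag the datafiles touched by [off, off + len) and return how many.
 * `touched` must have num_datafiles entries. */
static uint32_t dist_target_datafiles(const struct stripe_dist *d,
                                      uint64_t off, uint64_t len,
                                      int *touched)
{
    uint32_t count = 0;
    uint64_t end = off + len;
    for (uint32_t i = 0; i < d->num_datafiles; i++)
        touched[i] = 0;
    while (off < end && count < d->num_datafiles) {
        uint32_t df = dist_datafile(d, off);
        if (!touched[df]) {
            touched[df] = 1;
            count++;
        }
        off = (off / d->strip_size + 1) * d->strip_size; /* next strip */
    }
    return count;
}
```

Only the datafiles flagged here would receive a PVFS_SERV_IO request in C3.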
Looks good so far. I've just recently committed some changes to the
client IO state machine that check the size of the IO to be done. If
the size fits within the transport layer's limit for an unexpected
message (16K for TCP), then instead of starting a flow to each
server, the IO data is packed into the request (for writes) or the
response (for reads). All this is done by a separate "small IO"
operation and state machine.
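A rough sketch of the small IO decision, assuming the check compares one server's payload plus the encoded request/response header against the unexpected-message limit. TCP_UNEXPECTED_LIMIT, choose_io_path, and the header_bytes parameter are hypothetical; the real state machine may account for the header differently.

```c
#include <assert.h>
#include <stddef.h>

/* 16K is the TCP unexpected-message limit mentioned above; the name is
 * hypothetical. */
#define TCP_UNEXPECTED_LIMIT (16 * 1024)

enum io_path { IO_PATH_SMALL, IO_PATH_FLOW };

/* Pick "small IO" (data packed into the request for writes or the response
 * for reads) when payload plus header fits in one unexpected message;
 * otherwise fall back to a flow per server. */
static enum io_path choose_io_path(size_t payload_bytes, size_t header_bytes,
                                   size_t unexpected_limit)
{
    assert(header_bytes <= unexpected_limit);
    return payload_bytes <= unexpected_limit - header_bytes ? IO_PATH_SMALL
                                                             : IO_PATH_FLOW;
}
```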
S: pvfs2_io_sm
S1) prelude_sm
S2) Send a positive acknowledgement if permissions allow access, or a
negative one if an error occurs.
C4) If we get a positive ACK from a server, start a flow for that server.
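The S2/C4 handshake can be sketched as below, assuming a status of 0 means a positive ACK and a negative value carries the server's error code. struct server_io and start_flows_for_acks are hypothetical; setting flow_started stands in for actually posting a flow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-server state for one IO operation (illustrative only). */
struct server_io {
    int ack_status;   /* 0 = positive ack from S2, negative = error code */
    int flow_started; /* set once a flow has been posted for this server */
};

/* C4: start a flow for each server that acknowledged positively and
 * return the first negative status seen (0 if all acks were positive). */
static int start_flows_for_acks(struct server_io *servers, size_t n)
{
    int first_error = 0;
    assert(servers != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (servers[i].ack_status == 0)
            servers[i].flow_started = 1; /* stand-in for posting the client flow */
        else if (first_error == 0)
            first_error = servers[i].ack_status;
    }
    return first_error;
}
```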
S3) Set up a flow (job_flow).
Post the flow, probe for the flow protocol that handles the specified
transfer type, and call flowproto_post for that protocol.
In our case, flowproto-multiqueue initializes several buffers (currently 8),
performs its setup depending on the two endpoints of the flow, and runs the
appropriate callback function to start the flow.
A flow is now established between client and server, which transfers at
most 256 KB of data per message.
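The arithmetic of splitting a flow into messages of at most 256 KB over 8 buffers can be sketched like this. The constants come from the text above; cycling the buffers in order is an assumption about the multiqueue scheduling, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FLOW_BUFFER_COUNT 8            /* "currently 8" buffers */
#define FLOW_BUFFER_SIZE (256 * 1024)  /* max data per message */

/* Number of messages needed to move `total` bytes through the flow. */
static uint64_t flow_message_count(uint64_t total)
{
    return (total + FLOW_BUFFER_SIZE - 1) / FLOW_BUFFER_SIZE;
}

/* Size of message `i` (0-based); only the last one may be short. */
static uint64_t flow_message_size(uint64_t total, uint64_t i)
{
    uint64_t n = flow_message_count(total);
    assert(i < n);
    if (i + 1 < n)
        return FLOW_BUFFER_SIZE;
    return total - (n - 1) * FLOW_BUFFER_SIZE;
}

/* Buffer reused for message `i` if buffers are cycled in order (assumed). */
static unsigned flow_buffer_slot(uint64_t i)
{
    return (unsigned)(i % FLOW_BUFFER_COUNT);
}
```

For example, a 1,000,000-byte transfer needs four messages: three full 256 KB ones and a final one of 213,568 bytes.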
If the operation is a write (SRC=BMI, TARGET=TROVE):
trove_write_callback_fn is called when a trove_write completes; it
initializes the BMI receive and updates a performance counter. Currently
only one buffer is used at a time.
bmi_recv_callback_fn is called when BMI receives data and calls
trove_bstream_write_list.
If the operation is a read (SRC=TROVE, TARGET=BMI): bmi_send_callback_fn is
started for every buffer; it initiates the communication, updates a
performance counter, and calls trove_bstream_read_list to read the data.
trove_read_callback_fn is executed when a Trove read completes and starts a
BMI send operation for the data read.
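The read path's callback chain can be modeled with a small synchronous event loop: the send callback is first called once per buffer, each call posts a Trove read, and each completed Trove read posts a BMI send whose completion refills the same buffer. This is a hypothetical model, not flowproto-multiqueue code; completions are queued immediately instead of arriving from Trove and BMI, and the names only mirror the functions mentioned above. The write path is the mirror image.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical synchronous model of the server read path
 * (SRC=TROVE, TARGET=BMI). */
enum ev_kind { EV_TROVE_READ_DONE, EV_BMI_SEND_DONE };

struct event {
    enum ev_kind kind;
    unsigned buf; /* buffer index */
    uint64_t len; /* bytes carried by this completion */
};

#define MAX_EVENTS 64

struct read_flow {
    uint64_t remaining;   /* bytes still to read from Trove */
    uint64_t chunk;       /* max bytes per buffer/message */
    uint64_t bytes_sent;  /* performance counter */
    unsigned trove_reads; /* stand-ins for trove_bstream_read_list */
    unsigned bmi_sends;   /* stand-ins for BMI sends */
    struct event q[MAX_EVENTS];
    unsigned head, tail;
};

static void flow_post_event(struct read_flow *f, enum ev_kind k,
                            unsigned buf, uint64_t len)
{
    assert(f->tail - f->head < MAX_EVENTS);
    f->q[f->tail++ % MAX_EVENTS] = (struct event){ k, buf, len };
}

/* A BMI send finished (or the buffer is being started): count the bytes,
 * then refill this buffer with the next Trove read. */
static void bmi_send_callback(struct read_flow *f, unsigned buf, uint64_t len)
{
    f->bytes_sent += len;
    if (f->remaining == 0)
        return; /* nothing left; this buffer goes idle */
    uint64_t n = f->remaining < f->chunk ? f->remaining : f->chunk;
    f->remaining -= n;
    f->trove_reads++;
    flow_post_event(f, EV_TROVE_READ_DONE, buf, n);
}

/* A Trove read finished: send its data over BMI. */
static void trove_read_callback(struct read_flow *f, unsigned buf, uint64_t len)
{
    f->bmi_sends++;
    flow_post_event(f, EV_BMI_SEND_DONE, buf, len);
}

static void run_read_flow(struct read_flow *f, unsigned buffers)
{
    assert(buffers > 0 && buffers <= MAX_EVENTS && f->chunk > 0);
    for (unsigned b = 0; b < buffers; b++)
        bmi_send_callback(f, b, 0); /* start every buffer */
    while (f->head != f->tail) {
        struct event e = f->q[f->head++ % MAX_EVENTS];
        if (e.kind == EV_TROVE_READ_DONE)
            trove_read_callback(f, e.buf, e.len);
        else
            bmi_send_callback(f, e.buf, e.len);
    }
}
```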
S4) Flow ends: send an ack to the client if it was a write operation.
C4) The client stays in this state until the transfer completes or a
transfer error occurs during the flow; on a transfer error it retries
the IO from step C3.
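A sketch of the decision at the end of C4, assuming a bounded number of retries and that only transfer errors (not, for example, a negative ACK from S2) are worth retrying. IO_MAX_RETRIES, the enums, and both functions are hypothetical.

```c
#include <assert.h>

/* Hypothetical retry limit and outcome codes; the real client's policy
 * may differ. */
#define IO_MAX_RETRIES 5

enum attempt_result { ATTEMPT_OK, ATTEMPT_TRANSFER_ERROR, ATTEMPT_FATAL };
enum next_step { STEP_ANALYZE, STEP_RETRY_FROM_C3, STEP_FAIL };

/* What the client does once the flows of one attempt have finished. */
static enum next_step after_flow(enum attempt_result r, int retries_done)
{
    assert(retries_done >= 0);
    if (r == ATTEMPT_OK)
        return STEP_ANALYZE;       /* continue with C5 */
    if (r == ATTEMPT_TRANSFER_ERROR && retries_done < IO_MAX_RETRIES)
        return STEP_RETRY_FROM_C3; /* resend the PVFS_SERV_IO requests */
    return STEP_FAIL;              /* fatal error or retries exhausted */
}

/* Replay a scripted list of attempt outcomes; returns the attempts used
 * and stores the final step in *final_step. */
static int simulate_attempts(const enum attempt_result *outcomes, int n,
                             enum next_step *final_step)
{
    int retries = 0;
    for (int i = 0; i < n; i++) {
        enum next_step s = after_flow(outcomes[i], retries);
        if (s != STEP_RETRY_FROM_C3) {
            *final_step = s;
            return i + 1;
        }
        retries++;
    }
    *final_step = STEP_FAIL; /* script ended while still retrying */
    return n;
}
```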
C5) Analyze whether the transfer was successful or an IO error occurred,
and determine the amount of data transferred using the distribution
function.
C6) For a read request it can be necessary to get the sizes of all
datafiles to determine the correct file size, and thus how much data was
read. This happens when the requested file region contains a hole.
With the small IO changes, I also committed some changes to the way we
zero the memory regions where holes exist. Previously we zeroed the
entire memory region at the beginning of the IO request. With my changes,
the actual hole regions are determined during the analyze-results phase
and only those regions are zeroed.
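Zeroing only the holes, rather than the whole buffer up front, can be sketched as follows, assuming the analyze step yields a sorted list of non-overlapping buffer ranges actually filled by datafile reads, plus a valid length already capped at EOF. struct filled_range and zero_holes are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A byte range of the user buffer that was filled by a datafile read. */
struct filled_range {
    size_t off;
    size_t len;
};

/* Zero only the parts of buf[0, valid_len) not covered by `filled`.
 * `filled` must be sorted by offset and non-overlapping.
 * Returns the number of bytes zeroed. */
static size_t zero_holes(char *buf, size_t valid_len,
                         const struct filled_range *filled, size_t n)
{
    size_t pos = 0, zeroed = 0;
    for (size_t i = 0; i < n && pos < valid_len; i++) {
        assert(filled[i].off >= pos); /* sorted, non-overlapping */
        size_t gap_end = filled[i].off < valid_len ? filled[i].off : valid_len;
        if (gap_end > pos) {
            memset(buf + pos, 0, gap_end - pos); /* hole before this range */
            zeroed += gap_end - pos;
        }
        pos = filled[i].off + filled[i].len;
    }
    if (pos < valid_len) {
        memset(buf + pos, 0, valid_len - pos); /* trailing hole */
        zeroed += valid_len - pos;
    }
    return zeroed;
}
```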
Note: The performance counter values are processed and stored by a
different state machine. They can be used to analyze the data transferred
within a period, for example by the karma tool.
Another question: Why is it necessary to get the sizes of the datafiles
when reading a hole?
Remember that C2 was "find the target datafiles", so we are operating
on potentially just a subset of the datafiles. Using the sizes we
have from that subset, the analyze step looks for a size (mapped to
the logical domain) that is past the end of the file request. If we
find one, we know the request is not past EOF, so the total file size
ends at the end of the request. If we don't find one, we have to get
the other datafile sizes not in the subset and check those for a size
that is past the end of the file request.
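This check can be sketched for an assumed round-robin stripe layout: map each known datafile size to the logical file size it implies, then see whether any of them reaches the end of the request. If none does, all datafile sizes are needed and the logical EOF is the largest mapped size. The layout, the helper names, and treating a size equal to the request end as sufficient are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical round-robin stripe layout, for illustration only. */
struct stripe_dist {
    uint64_t strip_size;
    uint32_t num_datafiles;
};

/* Map datafile `idx`'s physical size to the logical file size it implies. */
static uint64_t logical_size_from_datafile(const struct stripe_dist *d,
                                           uint32_t idx, uint64_t phys_size)
{
    assert(d->strip_size > 0 && idx < d->num_datafiles);
    if (phys_size == 0)
        return 0;                        /* empty datafile */
    uint64_t last = phys_size - 1;       /* last physical byte */
    uint64_t row = last / d->strip_size; /* its strip within this datafile */
    uint64_t in_strip = last % d->strip_size;
    uint64_t logical_last =
        (row * d->num_datafiles + idx) * d->strip_size + in_strip;
    return logical_last + 1;
}

/* Analyze step on the subset from C2: 1 if some known datafile size reaches
 * the request end (request not past EOF), 0 if the other datafiles' sizes
 * must be fetched. */
static int subset_proves_not_past_eof(const struct stripe_dist *d,
                                      const uint32_t *idx, const uint64_t *phys,
                                      size_t n, uint64_t req_end)
{
    for (size_t i = 0; i < n; i++)
        if (logical_size_from_datafile(d, idx[i], phys[i]) >= req_end)
            return 1;
    return 0;
}

/* With every datafile size known (phys_all has num_datafiles entries),
 * EOF is the largest mapped size; return the bytes the read really covers. */
static uint64_t bytes_read_all_known(const struct stripe_dist *d,
                                     const uint64_t *phys_all,
                                     uint64_t req_off, uint64_t req_len)
{
    uint64_t eof = 0;
    for (uint32_t i = 0; i < d->num_datafiles; i++) {
        uint64_t s = logical_size_from_datafile(d, i, phys_all[i]);
        if (s > eof)
            eof = s;
    }
    if (eof <= req_off)
        return 0;                        /* request starts at or after EOF */
    uint64_t end = req_off + req_len;
    return (eof < end ? eof : end) - req_off;
}
```

For a 650-byte file striped in 100-byte strips over 4 datafiles (physical sizes 200, 200, 150, 100), a read of [600, 700) touches only datafile 2, whose size maps to 650 < 700, so the other sizes must be fetched; the largest mapped size is still 650, so 50 bytes are read.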
Does that clarify the problem?
-sam
Thanks a lot for your help,
Julian
_______________________________________________
PVFS2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers