Hi Julian,

I've included comments inline.

-sam

On Dec 17, 2005, at 9:35 AM, Julian Martin Kunkel wrote:

Hi,
I am trying to document the handling of some operations.
However, the IO handling with flows is a bit complicated ;-)
I try to summarize only the important steps of the IO process (including states of the state machines) and would be very happy if you could have a look and give
me hints if something is wrong or if I forgot an important state.
For me, the messages sent between client and server and the trove operations
are especially important.

C means client state, S means server state.

pvfs2_client_io_sm
C1) Get file attributes and size using pvfs2_client_getattr_sm
C2) Find target datafiles using the distribution function
C3) Send a message (PVFS_SERV_IO) to every server participating in the IO
operation to initiate the flow

Looks good so far. I've just recently committed some changes to the client IO state machine that check the size of the IO to be done. If the size fits within the transport layer's limit for an unexpected message size (for TCP this is 16K), then instead of starting a flow to each server, the IO is packed into the request (for writes) or response (for reads). All this is done from a separate "small IO" operation and state machine.
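
To make the size check concrete, here is a minimal sketch in Python. It is not the actual PVFS2 code: the function name, the constant name, and the header_bytes parameter are hypothetical, and the only value taken from the text above is the 16K TCP limit.

```python
# Assumption: 16K is the TCP unexpected-message limit quoted above.
TCP_UNEXPECTED_LIMIT = 16 * 1024


def choose_io_path(io_bytes, unexpected_limit=TCP_UNEXPECTED_LIMIT, header_bytes=0):
    """Return "small_io" if the payload fits in one unexpected message, else "flow".

    header_bytes is a hypothetical allowance for the request/response header
    that shares the message with the payload; it defaults to 0 here.
    """
    if io_bytes + header_bytes <= unexpected_limit:
        return "small_io"  # pack the data into the request or response
    return "flow"          # fall back to a flow per server


if __name__ == "__main__":
    for size in (4096, 16 * 1024, 16 * 1024 + 1):
        print(size, choose_io_path(size))
```

The real check presumably has to account for the space the message header takes, which is why the sketch exposes that as a parameter rather than guessing a value.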

        S: pvfs2_io_sm
        S1) prelude_sm
        S2) Send a positive acknowledgment if permissions allow access, or a
                negative one if an error occurs
C4) If we get a positive ACK from a server, start the flow for that server.
        S3) Set up a flow (job_flow)
                Post the flow, probe for the flow protocol that handles the
                specified transfer type, and call flowproto_post for that protocol.
                In our case, flowproto-multiqueue initializes several buffers
                (currently 8), does a setup depending on the two endpoints of the
                flow, and runs the appropriate callback function to start the flow.
                Now a flow is established between client and server, which
                transfers at most 256 KByte of data per message.
                
                If the operation is a write (SRC=BMI, TARGET=TROVE):
                trove_write_callback_fn initializes the BMI receive connection;
                it is also called when a trove write is done and updates a
                performance counter. Currently only one buffer is used at a time.
                bmi_recv_callback_fn is called when BMI receives data and calls
                trove_bstream_write_list.

                A read operation starts bmi_send_callback_fn for every buffer,
                which initiates a communication, updates a performance counter,
                and calls trove_bstream_read_list to read data.
                trove_read_callback_fn is executed when a trove read is completed
                and starts a BMI send operation for the data read.
                        
        S4) Flow ends: send an ack to the client if it was a write operation.
C4) The client stays in this state until the transmission is completed. If a
        transfer error occurred during the flow, retry the IO from step C3.
C5) Analyze whether the transfer was successful and the amount of data
        transferred (using the distribution function), or whether an IO error occurred.
C6) For a read request it can be necessary to get the sizes of all datafiles to
        determine the correct size read; this happens when a hole is within the
        requested file area.


With the small IO changes I also committed some changes to the way we zero the memory regions where holes exist. Previously we were zeroing the entire memory region at the beginning of the IO request. The changes I've made determine the actual regions during this analyze results phase and zero only those regions.
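
As a rough illustration of "zero only those regions", here is a Python sketch. It is not the PVFS2 implementation. It assumes the analyze phase knows which (start, length) extents of the client buffer were actually filled with file data, and treats everything else in the buffer as a hole; the function name zero_holes and the extent representation are hypothetical.

```python
def zero_holes(buf, filled):
    """Zero the parts of buf not covered by any (start, length) extent in filled.

    Returns the list of (start, length) hole regions that were zeroed.
    Assumption: filled describes where real file data landed in the buffer.
    """
    end = len(buf)
    pos = 0          # first byte not yet known to be filled or zeroed
    holes = []
    for start, length in sorted(filled):
        start = min(start, end)           # clamp so the buffer never grows
        stop = min(start + length, end)
        if start > pos:                   # gap before this extent is a hole
            buf[pos:start] = bytes(start - pos)
            holes.append((pos, start - pos))
        pos = max(pos, stop)
    if pos < end:                         # trailing hole after the last extent
        buf[pos:end] = bytes(end - pos)
        holes.append((pos, end - pos))
    return holes


if __name__ == "__main__":
    buf = bytearray(b"\xff" * 10)
    print(zero_holes(buf, [(0, 3), (6, 2)]), buf)
```

Compared with zeroing the whole buffer up front, this touches only the bytes that no server response covered.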

Note: The values of the performance counters are processed and stored by a different state machine. They can be used to analyze the data transferred within
a period, for example by the karma tool.

Another question: Why is it necessary to get the sizes of the datafiles when
reading a hole?


Remember that C2 was to find the target datafiles, so we are operating on potentially just a subset of the datafiles. Using the sizes we have from that subset, the analyze step looks for a size (mapped to the logical domain) that is past the end of the file request. If we find one, we know the request is not past EOF, so the total file size ends at the end of the request. If we don't find one, we have to get the other datafile sizes not in the subset and check those for a size that is past the end of the file request.
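
Here is a small Python sketch of that check. It is only an illustration under assumptions: it uses a plain round-robin striping layout with a fixed strip size to map a datafile's physical size into the logical domain, it treats "at or past the end of the request" as sufficient, and the function names are hypothetical rather than taken from the PVFS2 source.

```python
def logical_size(datafile_index, physical_size, num_datafiles, strip_size):
    """Logical file size implied by one datafile's physical size.

    Assumption: round-robin striping, strip k of the file lives on
    datafile k % num_datafiles.
    """
    if physical_size == 0:
        return 0
    full_stripes, within = divmod(physical_size - 1, strip_size)
    last_logical_byte = (full_stripes * num_datafiles * strip_size
                         + datafile_index * strip_size + within)
    return last_logical_byte + 1


def request_within_eof(request_end, known_sizes, num_datafiles, strip_size):
    """True if some known datafile size maps to or past the end of the request.

    known_sizes maps datafile index -> physical size in bytes.
    """
    return any(
        logical_size(i, size, num_datafiles, strip_size) >= request_end
        for i, size in known_sizes.items()
    )


if __name__ == "__main__":
    # 3 datafiles, strip size 10; data at logical [0,10) and [20,25), hole at [10,20).
    subset = {1: 0}                    # a read of [10,20) only targets datafile 1
    everything = {0: 10, 1: 0, 2: 5}   # sizes after asking the other servers too
    print(request_within_eof(20, subset, 3, 10))
    print(request_within_eof(20, everything, 3, 10))
```

In the example the read lands entirely in a hole, so the subset alone cannot tell a hole from EOF; only the size of datafile 2 shows that the file extends past the request.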

Does that clarify the problem?

-sam


Thanks a lot for your help,
Julian
_______________________________________________
PVFS2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

