Hi,

I wanted to continue this conversation, re-raising topics of I/O parallelism 
and data consistency, and opening the general topic of error detection and 
recovery.

This note looks forward to the later mail describing the signatures for the new 
begin/end I/O routines (e.g., RXAFS_StartAsyncFetch), as well as back to Tom's 
original list of issues.

First, some function prototypes have been sent to this list, but not, I think, 
the full set of what you, Hartmut, have been working on.  Also, I don't think 
you, Tom, have yet provided feedback on what has been posted.

Second, I think it is clear that we are trying to have a discussion that covers 
rxOSD in the form in which it is incorporated into OpenAFS, while potentially 
opening up a forward-looking discussion of future protocol concepts and, where 
possible, of overlap that will take us further in that direction.

I think it was clear that there was some consensus on the Begin/End IO 
operations, but I still have questions, on reviewing the thread.

I. I may be leaving things out (help solicited) or seem way-out-there (sorry), 
but I'll attempt to formalize my current questions and reactions about what's 
been specified so far:

1. considering the protocol post begin/end I/O transactions (which include 
offset and range information), can we clarify that there is an ability for 
different clients (or the same client) to carry out non-overlapping I/O 
operations in parallel?  I think it was implicitly clear that this is provided 
for--is that self-evident?

a) if yes, then I would like to raise the question -when- should such 
operations be allowed (see II.3)

2. final disposition on data version;  I think it is agreed that the 
coordinating file server assigns successive data versions as and when (possibly 
parallel) mutating I/O operations complete--is that agreed/self-evident?

3. the current proposal (read as above) allows I/O operations on a contiguous 
byte range to be atomic, but not others--for example, I could imagine a 
structured I/O description which allows for a sequence of operations, on one or 
more files, to constitute a transaction.  I can imagine file server/OSD 
implementations which would make a fuller transactional semantics useful.  
Would it be out of the question to future proof in this direction?
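
To make questions 1 and 2 concrete, here is a minimal sketch of one possible 
reading: the coordinating file server admits a mutating operation only if its 
byte range does not overlap one already in flight, and hands out data versions 
in completion order.  Every name here (ByteRange, Coordinator, begin_write, 
end_write) is hypothetical and not part of rxOSD or any posted prototype, and 
only write/write conflicts on a single file are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ByteRange:
    offset: int
    length: int

    @property
    def end(self) -> int:
        return self.offset + self.length


def overlaps(a: ByteRange, b: ByteRange) -> bool:
    # Half-open intervals [offset, end): adjacent ranges do not overlap.
    return a.offset < b.end and b.offset < a.end


class Coordinator:
    """Hypothetical per-file state on the coordinating file server."""

    def __init__(self, data_version: int = 0) -> None:
        self.data_version = data_version
        self._active = {}      # txn id -> ByteRange of a write in flight
        self._next_txn = 0

    def begin_write(self, rng: ByteRange):
        """Return a txn id, or None if rng overlaps a write in flight."""
        if any(overlaps(rng, held) for held in self._active.values()):
            return None
        self._next_txn += 1
        self._active[self._next_txn] = rng
        return self._next_txn

    def end_write(self, txn: int) -> int:
        """Complete a write; data versions follow completion order."""
        del self._active[txn]
        self.data_version += 1
        return self.data_version


coord = Coordinator(data_version=7)
first = coord.begin_write(ByteRange(0, 100))
second = coord.begin_write(ByteRange(100, 50))    # adjacent, so admitted
blocked = coord.begin_write(ByteRange(50, 100))   # overlaps both, refused
```

In this reading the second write may finish first and receive the lower data 
version, which is exactly the "as and when operations complete" behavior that 
question 2 asks about.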

II.  Hartmut, one of your most recent mails raised the issue of handling 
non-completing I/O operations, and I think that provides a good segue to my 
next questions:

1. what is the overall data consistency guarantee for current OSD volumes, and 
looking forward, extended ones we might define?

a. since the coordinating file server allocates data versions post hoc, a 
server acting as an OSD no longer has a mechanism to track them;  I can imagine 
ways forward, though I am not certain I understand all the issues

1) clients could send data version information with every component I/O 
operation--this costs nothing, and provides information which may be used for 
reliable I/O strategies, error state identification, and recovery

2) an additional operation finalizing -component- I/O operations could be 
added, which transfers the final data version to OSDs
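
The two options above can be sketched together, under the assumption that an 
OSD keeps a small per-object record.  The class and method names below are 
invented for illustration and do not correspond to any existing rxosd RPC.  
Option 1 is the base data version carried with each component write; option 2 
is the finalize step that delivers the version assigned by the file server.

```python
class OsdObjectState:
    """Hypothetical per-object bookkeeping on an OSD."""

    def __init__(self, committed_dv: int) -> None:
        self.committed_dv = committed_dv
        self.pending = {}      # txn id -> base data version sent by client

    def component_write(self, txn: int, base_dv: int) -> None:
        # Option 1: remember which version the client believed it was
        # modifying; the data transfer itself is omitted here.
        self.pending[txn] = base_dv

    def finalize(self, txn: int, final_dv: int) -> None:
        # Option 2: learn the version the coordinating server assigned.
        self.pending.pop(txn)
        self.committed_dv = max(self.committed_dv, final_dv)

    def unfinalized(self):
        # Writes that never saw a finalize are candidates for recovery.
        return sorted(self.pending)


state = OsdObjectState(committed_dv=7)
state.component_write(txn=1, base_dv=7)
state.component_write(txn=2, base_dv=7)
state.finalize(txn=2, final_dv=8)
```

After this sequence the OSD knows the object is at version 8 and that 
transaction 1 is still open, which is the kind of information that could feed 
error state identification.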

b. any mutating I/O operation which fails to complete, or to complete 
successfully, puts the distributed system in an inconsistent state

1) a not uncommon problem is likely to be component I/O operation failure due 
to network partitioning--recovery for this case seems possible, in that the 
client could re-try the operation on the coordinating fileserver, and if 
successful, complete the I/O transaction as normal;  perhaps it already does 
this?

(but what if it fails?  see II.3)

2) Integrity checking

2a) Tom raised the question of data checksumming.  Currently, I believe we rely 
on rx packet checksums and integrity checking to accomplish reliability 
guarantees.  Newer filesystems such as ZFS have raised the visibility of data 
checksum operations.  Logically it seems possible to identify approaches which 
checksum data only in component I/O, and others which involve the coordinating 
file servers.  Is it possible that we should incorporate placeholders for this?  
Tom, do you have ideas about what would be required?

(and, what if it fails?  see II.3)

2b) Tom raised the question of parity computation in support of RAID-n OSD 
implementations.  Clearly this could make the protocol substantially more 
useful for some applications.  Is it possible that we should incorporate 
placeholders for this?  Tom, do you have ideas about what would be required?

(and, what if it fails?  see II.3)

3) Tom raised the question of support for data journalling.  Perhaps this seems 
way-out-there, but in fact, the current generation of local file system 
technology actually supports low-cost point-in-time snapshot functionality.  In 
that context, I think I can imagine (achievable) implementations supporting a 
protocol in which the transaction boundary is extended to the component I/O 
servers (OSDs) and commit/rollback semantics were implemented on it.  I think 
it would be valuable to distinguish between strong and weak transactions--the 
former supporting full nesting and isolation guarantees (which possibly are not 
provided by any local file system implementation today), and the latter rolling 
back to a consistent state but potentially lacking isolation (and so rolling 
back concurrent operations), and being much easier and less costly (both 
senses) to implement.  I sense that rollback would be controlled by the 
coordinating file server, either at the request of a client holding an open 
transaction or on its own initiative, on detecting failed mutating operations.
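
As a rough illustration of the "weak" variant only, the sketch below takes one 
point-in-time snapshot when the first transaction opens and restores it on 
rollback, so every transaction that ran since the snapshot is undone together.  
It is an in-memory toy with invented names (WeakTxnStore and its methods), a 
dictionary copy stands in for a file system snapshot, and a commit only becomes 
final once no other transaction remains open.

```python
class WeakTxnStore:
    """Hypothetical OSD-side store with snapshot-based weak rollback."""

    def __init__(self) -> None:
        self.data = {}
        self._snapshot = None
        self._open = set()
        self._group = set()    # every txn started since the snapshot

    def begin(self, txn: str) -> None:
        if not self._open:
            # First txn of a group: take the point-in-time snapshot.
            self._snapshot = dict(self.data)
            self._group = set()
        self._open.add(txn)
        self._group.add(txn)

    def write(self, txn: str, key: str, value: int) -> None:
        if txn not in self._open:
            raise KeyError(txn)
        self.data[key] = value

    def commit(self, txn: str) -> None:
        self._open.remove(txn)
        if not self._open:
            # Group drained cleanly: the snapshot is no longer needed.
            self._snapshot = None
            self._group = set()

    def rollback(self, txn: str):
        """Restore the snapshot; return every txn whose work was undone."""
        if txn not in self._open:
            raise KeyError(txn)
        self.data = self._snapshot
        undone = self._group
        self._open, self._group, self._snapshot = set(), set(), None
        return undone


store = WeakTxnStore()
store.data["x"] = 1
store.begin("a")
store.begin("b")
store.write("a", "x", 2)
store.write("b", "y", 3)
undone = store.rollback("a")
```

Rolling back "a" also discards the write made by "b", which is the lack of 
isolation described above; a strong variant would need per-transaction undo 
information instead of a single shared snapshot.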


Thanks,


Matt


(Thanks to Tom for read-through and feedback, all errors mine.)


----- "Hartmut Reuter" <[email protected]> wrote:

> Tom's idea to have a Start-of-I/O-rpc and a Stop-I/O-rpc to enforce data
> consistency is great. I think it would not be very difficult to
> implement this.
>
> Caching of the information returned by GetOSDlocation could reduce
> traffic on the wire, but is not really essential. So if we still do one
> GetOSDlocation per I/O we can use GetOSDlocation as Start-of-I/O-rpc.
>
> So for write I would propose that the fileserver has to keep the
> information about Fid, offset, length, host, and time in a table or
> chain and keep it there until the storeMini has happened. So also
> extended callbacks for file ranges would become possible. For the write
> case storeMini would function as End-of-I/O-rpc.
>
> While the entry for write exists all incoming GetOSDlocation RPCs have
> to wait. This is the same behavior as happens for FetchData or StoreData
> while another StoreData has the write lock on the vnode.
>
> Up to this point everything would work fine also with the clients out
> here in our cell.
>
> However, there could be reads still under way while a new write is
> starting. It's not that probable because unfortunately reads are always
> for single chunks only, but it's still possible. To protect also these
> reads requires an End-of-I/O-rpc for read. A new bit in the flag used in
> GetOSDlocation could indicate that the client promises to send at the
> end of a read operation an appropriate rpc.
>
> With this flag set GetOSDlocation would also create (or find an
> existing) entry in the before mentioned table or chain. A field readers
> would be incremented and after the I/O is finished decremented by the
> End-of-I/O-rpc. As long as there are readers write requests have to
> wait.
>
> The legacy interface for non rxosd prepared clients, of course, would
> have to honor this table as well. But here things are easier because
> everything happens within a single rpc (FetchData or StoreData).
>
> An open question is how the fileserver should handle missing
> End-of-I/O-rpcs. Therefore the timestamp field. The FiveMinuteCheckLWP
> could look for out-timed transactions....
>
> -Hartmut
> -----------------------------------------------------------------
> Hartmut Reuter                  e-mail                 [email protected]
>                                    phone                  +49-89-3299-1328
>                                    fax                    +49-89-3299-1301
> RZG (Rechenzentrum Garching)           web    http://www.rzg.mpg.de/~hwr
> Computing Center of the Max-Planck-Gesellschaft (MPG) and the
> Institut fuer Plasmaphysik (IPP)
> -----------------------------------------------------------------
> _______________________________________________
> OpenAFS-devel mailing list
> [email protected]
> https://lists.openafs.org/mailman/listinfo/openafs-devel

-- 

Matt Benjamin

The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309

