Hi,

I wanted to continue this conversation, re-raising the topics of I/O
parallelism and data consistency, and opening the general topic of
error detection and recovery.
This note looks forward to the later mail in which the signatures for
the new begin/end I/O routines are described (e.g.,
RXAFS_StartAsyncFetch), as well as back to Tom's original list of
issues. First, some function prototypes have been sent to this list,
but not, I think, the full set of what you, Hartmut, have been working
on; and I don't think you, Tom, have yet provided feedback on what has
been posted. Second, I think we are clearly trying to have a
discussion that includes rxOSD in the form in which it is incorporated
into OpenAFS, while potentially opening up forward-looking discussion
of future protocol concepts and, where possible, of overlap that will
take us further in that direction. I think it was clear that there was
some consensus on the Begin/End I/O operations, but on reviewing the
thread I still have questions.

I. I may be leaving things out (help solicited) or seem way-out-there
(sorry), but I'll attempt to formalize my current questions and
reactions about what's been specified so far:

1. Considering the protocol once begin/end I/O transactions (which
   include offset and range information) are in place, can we clarify
   that different clients (or the same client) are able to carry out
   non-overlapping I/O operations in parallel? I think it was
   implicitly clear that this is provided for--is that self-evident?

   a) If yes, then I would like to raise the question of -when- such
      operations should be allowed (see II.3).

2. Final disposition on data version: I think it is agreed that the
   coordinating file server assigns successive data versions as and
   when (possibly parallel) mutating I/O operations complete--is that
   agreed/self-evident?

3. The current proposal (read as above) allows I/O operations on a
   contiguous byte range to be atomic, but not others--for example, I
   could imagine a structured I/O description which allows a sequence
   of operations, on one or more files, to constitute a transaction.
   I can imagine file server/OSD implementations which would make
   fuller transactional semantics useful. Would it be out of the
   question to future-proof in this direction?

II. Hartmut, one of your most recent mails raised the issue of
handling non-completing I/O operations, and I think that provides a
good segue to my next questions:

1. What is the overall data consistency guarantee for current OSD
   volumes and, looking forward, for extended ones we might define?

   a. Since the coordinating file server allocates data versions post
      hoc, a server acting as an OSD no longer has a mechanism to
      track them. I can imagine ways forward, though I am not certain
      I understand all the issues:

      1) Clients could send data version information with every
         component I/O operation--this costs nothing, and provides
         information which may be used for reliable I/O strategies,
         error state identification, and recovery.

      2) An additional operation finalizing -component- I/O
         operations could be added, which transfers the final data
         version to OSDs.

   b. Any mutating I/O operation which fails to complete, or to
      complete successfully, puts the distributed system in an
      inconsistent state.

      1) A not uncommon problem is likely to be component I/O
         operation failure due to network partitioning--recovery for
         this case seems possible, in that the client could retry
         the operation on the coordinating fileserver and, if
         successful, complete the I/O transaction as normal; perhaps
         it already does this? (But what if it fails? See II.3.)

      2) Integrity checking

         2a) Tom raised the question of data checksumming. Currently,
             I believe we rely on rx packet checksums and integrity
             checking to accomplish reliability guarantees. Newer
             filesystems such as ZFS have raised the visibility of
             data checksum operations. Logically it seems possible to
             identify approaches which checksum data only in
             component I/O, and others which involve the coordinating
             file servers. Is it possible that we should incorporate
             placeholders for this?
             Tom, do you have ideas about what would be required?
             (And, what if it fails? See II.3.)

         2b) Tom raised the question of support for parity
             computation in support of RAID-n OSD implementations.
             Clearly this could make the protocol substantially more
             useful for some applications. Is it possible that we
             should incorporate placeholders for this? Tom, do you
             have ideas about what would be required? (And, what if
             it fails? See II.3.)

      3) Tom raised the question of support for data journalling.
         Perhaps this seems way-out-there, but in fact the current
         generation of local file system technology supports
         low-cost point-in-time snapshot functionality. In that
         context, I think I can imagine (achievable) implementations
         supporting a protocol in which the transaction boundary is
         extended to the component I/O servers (OSDs) and
         commit/rollback semantics are implemented on it. I think it
         would be valuable to distinguish between strong and weak
         transactions--the former supporting full nesting and
         isolation guarantees (which possibly are not provided by any
         local file system implementation today), and the latter
         rolling back to a consistent state but potentially lacking
         isolation (and so rolling back concurrent operations), and
         being much easier and less costly (in both senses) to
         implement. I sense that rollback would be controlled by the
         coordinating file server, either at the request of a client
         holding an open transaction or on its own initiative, on
         detecting failed mutating operations.

Thanks,

Matt

(Thanks to Tom for read-through and feedback, all errors mine.)

----- "Hartmut Reuter" <[email protected]> wrote:

> Tom's idea to have a Start-of-I/O-rpc and a Stop-I/O-rpc to enforce
> data consistency is great. I think it would not be very difficult to
> implement this.
>
> Caching of the information returned by GetOSDlocation could reduce
> traffic on the wire, but is not really essential. So if we still do
> one GetOSDlocation per I/O we can use GetOSDlocation as
> Start-of-I/O-rpc.
>
> So for write I would propose that the fileserver has to keep the
> information about Fid, offset, length, host, and time in a table or
> chain and keep it there until the storeMini has happened. So also
> extended callbacks for file ranges would become possible. For the
> write case storeMini would function as End-of-I/O-rpc.
>
> While the entry for write exists all incoming GetOSDlocation RPCs
> have to wait. This is the same behavior as happens for FetchData or
> StoreData while another StoreData has the write lock on the vnode.
>
> Up to this point everything would work fine also with the clients out
> here in our cell.
>
> However, there could be reads still under way while a new write is
> starting. It's not that probable because unfortunately reads are
> always for single chunks only, but it's still possible. To protect
> also these reads requires an End-of-I/O-rpc for read. A new bit in
> the flag used in GetOSDlocation could indicate that the client
> promises to send at the end of a read operation an appropriate rpc.
>
> With this flag set GetOSDlocation would also create (or find an
> existing) entry in the aforementioned table or chain. A field
> readers would be incremented and after the I/O is finished
> decremented by the End-of-I/O-rpc. As long as there are readers write
> requests have to wait.
>
> The legacy interface for non rxosd prepared clients, of course, would
> have to honor this table as well. But here things are easier because
> everything happens within a single rpc (FetchData or StoreData).
>
> An open question is how the fileserver should handle missing
> End-of-I/O-rpcs. Therefore the timestamp field. The
> FiveMinuteCheckLWP could look for timed-out transactions....
>
> -Hartmut
> -----------------------------------------------------------------
> Hartmut Reuter                  e-mail [email protected]
>                                 phone  +49-89-3299-1328
>                                 fax    +49-89-3299-1301
> RZG (Rechenzentrum Garching)    web    http://www.rzg.mpg.de/~hwr
> Computing Center of the Max-Planck-Gesellschaft (MPG) and the
> Institut fuer Plasmaphysik (IPP)
> -----------------------------------------------------------------
> _______________________________________________
> OpenAFS-devel mailing list
> [email protected]
> https://lists.openafs.org/mailman/listinfo/openafs-devel

--
Matt Benjamin

The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104

http://linuxbox.com

tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
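
To make the table in Hartmut's quoted proposal a little more concrete, here is a minimal sketch of how the bookkeeping might look. It is written in Python for brevity, not in the fileserver's C, and it is not taken from any rxOSD or OpenAFS code: the names IoTable, Entry, start_io, end_io and expire are all hypothetical. It also assumes that conflicts are decided per byte range, so that non-overlapping operations may proceed in parallel (question I.1 above); the quoted proposal has all GetOSDlocation calls wait while a write entry exists. The 300-second timeout is likewise only a stand-in for whatever a periodic check such as FiveMinuteCheckLWP would use.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # compare entries by identity, not by field values
class Entry:
    fid: str
    offset: int
    length: int
    host: str          # host that created the entry
    time: float        # when the entry was created or last joined
    writer: bool       # True for a write entry
    readers: int = 0   # in-flight reads sharing this entry

    def overlaps(self, offset, length):
        # Half-open ranges [offset, offset + length) intersect.
        return self.offset < offset + length and offset < self.offset + self.length


class IoTable:
    """Hypothetical in-memory table of in-flight I/O, one per fileserver."""

    def __init__(self, timeout=300.0):
        self.timeout = timeout
        self.entries = []

    def start_io(self, fid, offset, length, host, write, now):
        """Start-of-I/O: return an entry, or None if the caller must wait."""
        shared = None
        for e in self.entries:
            if e.fid != fid or not e.overlaps(offset, length):
                continue
            if write or e.writer:
                return None  # write vs. anything, or read vs. write
            if (e.offset, e.length) == (offset, length):
                shared = e   # existing read entry for the same range
        if write:
            entry = Entry(fid, offset, length, host, now, writer=True)
            self.entries.append(entry)
            return entry
        if shared is None:
            shared = Entry(fid, offset, length, host, now, writer=False)
            self.entries.append(shared)
        shared.readers += 1
        shared.time = now
        return shared

    def end_io(self, entry):
        """End-of-I/O: return False if the entry is unknown (e.g. expired)."""
        if entry not in self.entries:
            return False
        if not entry.writer:
            entry.readers -= 1
            if entry.readers > 0:
                return True  # other readers still in flight
        self.entries.remove(entry)
        return True

    def expire(self, now):
        """Drop and return entries whose End-of-I/O never arrived in time."""
        stale = [e for e in self.entries if now - e.time > self.timeout]
        for e in stale:
            self.entries.remove(e)
        return stale
```

In this sketch a caller that gets None back from start_io is expected to wait and retry. What should happen to the data, and to the data version, when expire() discards a write entry is exactly the open recovery question raised in section II above, and the sketch does not try to answer it.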
