Tom Keiser wrote:
> Hi All,
>
> The other day, we had a small discussion on Gerrit patch #70 regarding
> adoption of several RxOSD protocol bits by OpenAFS. Simon and Derrick
> both suggested that I should move the discussion over to -devel so
> that it reaches a broader audience. As a bit of background, I'm
> writing a white paper regarding the RxOSD protocol. It's not quite
> ready for release, but I get the sense that we need to start the
> discussion sooner rather than later. Here are several key issues from
> my forthcoming paper, which I outlined in the Gerrit commentary:
>
> 1) extended callbacks cannot be implemented

Why not?

> 2) mandatory locking semantics cannot be implemented

Why not? Once it is implemented for FetchData and StoreData, the same
code could be used in GetOSDlocation.
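Just to make the idea concrete -- the following is only a sketch, none
of these names exist in the current code -- the same range-conflict
check could sit behind all three handlers:

/*
 * Illustration only: one shared conflict check for byte-range locks.
 * All names here are invented for the example.
 */
#include <stdint.h>

struct byte_range_lock {
    uint64_t offset;
    uint64_t length;                /* 0 means "to end of file" */
    int      exclusive;             /* write lock */
    struct byte_range_lock *next;
};

/*
 * Return 1 if the requested range conflicts with an existing lock,
 * 0 otherwise.  A real implementation would of course also compare
 * lock owners.
 */
static int
range_conflicts(struct byte_range_lock *locks,
                uint64_t offset, uint64_t length, int for_write)
{
    uint64_t req_end = length ? offset + length : UINT64_MAX;
    struct byte_range_lock *l;

    for (l = locks; l; l = l->next) {
        uint64_t lock_end = l->length ? l->offset + l->length : UINT64_MAX;
        if (offset < lock_end && l->offset < req_end
            && (for_write || l->exclusive))
            return 1;
    }
    return 0;
}

FetchData and StoreData would refuse (or queue) the RPC when this
returns 1, and GetOSDlocation would make exactly the same call before
handing out OSD locations to the client.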
> 3) lack of read-atomicity means that read-only clone fetches can now
>    return intermediate values, thus clone transitions are no longer
>    atomic

The fileserver does a copy-on-write for files in OSD exactly in the
same way as it does for normal AFS files. Also, the rxosd doesn't
allow write RPCs from clients to files with a link count > 1. During
the volserver operation GetOSDmetadata returns busy, and after an
update the usual callbacks are sent to the client and honored also for
OSD files (VerifyCache).

> 4) lack of write-atomicity entirely changes the meaning of DV in the
>    protocol

If you want to read and write from different clients at the same time
in a way that can produce inconsistencies, you should use advisory
locking. Also, even without OSD, an intermediate StoreData caused by
the cache becoming full may lead to an intermediate data version which
from the application's point of view is inconsistent.

> 5) the data "mirroring" capability is race-prone, and thus not
>    consistent

> 6) the GetOSDlocation RPC conflates a number of operations, and
>    should be split

Basically it is used to prepare I/O to OSDs for both cases, read and
write; I don't know whether this is what you mean. It also has a kind
of debug interface for "fs osdmetadata -cm <file>" to allow you to see
what the client gets returned on GetOSDlocation.

The program get_osd_location, which is called by GetOSDlocation, is
also used by the legacy interface to serve I/O to OSDs for old
clients. If you use the HSM functionality of AFS/OSD, a file which is
presently off-line (on tape in an underlying HSM system) must be
brought back to an on-line OSD before the client can access it. In the
OSD-aware client this is done already when the file is opened by the
application, letting the application wait in the open system call. For
the legacy interface it can only be done during the FetchData or
StoreData RPC; therefore this functionality was also put into
get_osd_location. Is that what you mean?

> 7) insufficient information is communicated on the wire to support
>    the distributed transaction management necessary to ensure
>    atomicity of various operations

I admit that reading and writing data from/to OSDs is not as atomic as
doing the same with a classical AFS fileserver, and I think making it
atomic would require substantial overhead and even more complexity.
Therefore I think files stored in OSDs never should and never will
replace normal AFS files totally; this technique should be used where
the usage pattern of the files does not require such atomicity. In our
case, e.g., all user home directories, software, ... remain in normal
AFS files. But our cell contains very many long-term archives for data
produced by experiments and other sources (digitized photo libraries,
audio and video documents) which typically are written only once. For
this kind of data atomicity is not required at all.

> 8) there is no means to support real-time parity computation in
>    support of data redundancy

How should that look? BTW, AFS/OSD keeps md5 checksums for archival
copies of files in OSD. This feature already proved to be very useful
when we had a problem with the underlying HSM system DSM-HSM.
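Just as an illustration -- this is a sketch, not code from the AFS/OSD
tree, and how the stored checksum is looked up is left out -- verifying
a restored archival copy boils down to something like:

/*
 * Sketch only: recompute the md5 of a restored archival copy and
 * compare it with the checksum kept in the OSD metadata (stored_md5).
 * Uses the classical OpenSSL md5 routines.
 */
#include <stdio.h>
#include <string.h>
#include <openssl/md5.h>

static int
verify_archival_md5(const char *path,
                    const unsigned char stored_md5[MD5_DIGEST_LENGTH])
{
    unsigned char buf[65536], digest[MD5_DIGEST_LENGTH];
    MD5_CTX ctx;
    size_t n;
    FILE *f = fopen(path, "rb");

    if (!f)
        return -1;                  /* cannot read the copy */
    MD5_Init(&ctx);
    while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
        MD5_Update(&ctx, buf, n);
    MD5_Final(digest, &ctx);
    fclose(f);

    /* 0 = checksum matches, 1 = the copy is damaged */
    return memcmp(digest, stored_md5, MD5_DIGEST_LENGTH) ? 1 : 0;
}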
> 9) osd location metadata should be cacheable

It is implemented only for embedded shared filesystems (GPFS or Lustre
/vicep partitions accessed directly from the client).

I admit that, especially in the case of reading files, it could reduce
the number of RPCs sent to the fileserver, because each chunk still
requires a separate RPC. However, my idea on this point is that it
would be better to allow the client to prefetch a reasonable number of
chunks in a single RPC.

> 10) the wire protocol is insufficient to support any notion of data
>     journalling

What kind of journalling do you have in mind here?

> Many of these issues will eventually need to be discussed on
> afs3-standardization. Lacking a formal internet draft, I suspect
> there may be some value in starting a discussion here. At the very
> least, it may help us with dependency analysis of the major
> enhancements in the pipeline. Coming out of these ten points, I see a
> few major classes of issues that will require discussion and planning:
>
> a) convergence of RxOSD with other protocol changes (XCB, byte-range
>    locking, perhaps others)
> b) changes to cache coherence, especially DV semantics
> c) tackling the thorny issue of distributed transactions
> d) future-proofing (distributed RAID-n, journalling, rw repl, etc.)
> e) protocol design issues (RXAFS_GetOSDlocation, the means by which
>    location data is XDR encoded, etc.)
> f) reference implementation code issues (DAFS integration, MP-fastness
>    of metadata server extensions, etc.)
>
> -Tom

-Hartmut
-----------------------------------------------------------------
Hartmut Reuter                  e-mail [email protected]
                                phone  +49-89-3299-1328
                                fax    +49-89-3299-1301
RZG (Rechenzentrum Garching)    web    http://www.rzg.mpg.de/~hwr
Computing Center of the Max-Planck-Gesellschaft (MPG) and the
Institut fuer Plasmaphysik (IPP)
-----------------------------------------------------------------
_______________________________________________
OpenAFS-devel mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-devel
