-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Attached please find a draft proposal proposing interfaces for
byte-range locking and delegation, and supporting semantic discussion.
The draft has seen 2 rounds of early review, and concepts in the
document were discussed at the OpenAFS Hackathon at Ohiolinux. Thanks
to all who have assisted.
Please assist further :)
Thanks,
Matt
- --
Matt Benjamin
The Linux Box
206 South Fifth Ave. Suite 150
Ann Arbor, MI 48104
http://linuxbox.com
tel. 734-761-4689
fax. 734-769-8938
cel. 734-216-5309
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
iD8DBQFJBl7kJiSUUSaRdSURCLYnAJ9u6u/kLigBiuZicbHa9NiOldklHACcCDBF
mmTT0GnNBdD4a1JGpi94S7U=
=2/zW
-----END PGP SIGNATURE-----
AFS Byte-Range Locking and Delegation
Matt Benjamin <[EMAIL PROTECTED]>
10/26/2008
Status of this Memo
This document specifies a standards track protocol extension for
the OpenAFS community, and requests discussion and suggestions
for improvements.
Key Words
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL
NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" in this document are to be interpreted as described in
Internet Engineering Task Force RFC 2119.
Abstract
The AFS-3 protocol supports file locks, but only on whole files,
only in advisory mode, and using an inefficient protocol.
Efficient support for byte-range file locking, together with the
stronger semantics associated with it, is required
to improve the suitability of AFS as a LAN file-sharing protocol
for both Unix and Windows clients. Applications on the Windows
platform in particular (e.g., Microsoft Office) require
byte-range locking to function correctly. Emulation in
the client has alleviated the most serious problems, albeit with
reduced semantics.
We propose protocol enhancements facilitating server-coordinated
byte-range locks, atomic lock up/down-grade support, improved
semantics for files under byte-range lock control, protocol
support for wait-on-lock with fairness, and mandatory lock
enforcement for clients on request. A conditional strengthened
callback semantics (``delegation''), governing file data and
locks, is proposed to reduce network and file-server workload for
uncontested file lock operations.
Table of Contents
Status of this Memo
Key Words
Abstract
1 AFS-3 File Locking
1.1 Analysis
2 Byte-Range Locking Interfaces
2.1 Dependencies
2.2 Backward Compatibility
2.3 Concepts
2.3.1 General
2.3.2 Lock Management
2.3.3 Deferred Locks
2.4 Constants
2.4.1 Lock Flags
AFSLock_Flag_Mand
AFSLock_Flag_Wait
2.4.2 Lock Status
AFSLock_Flag_Extend_Ok
AFSLock_Flag_Undelegate_Ok
2.4.3 Callback Constants
2.4.4 Callback Result Constants
AFSCB_Cancel_ExtendLocks
AFSCB_Cancel_RevokeLocks
AFSCB_Flag_ExtendLocks
AFSCB_Flag_RevokeLocks
2.5 Data Types
2.5.1 AFSByteRangeLock
Fid
Type
Owner
Uniq
Offset
Length
ExpirationTime
2.5.2 AFSByteRangeLockSeq
2.5.3 AFSLockFlagsSeq
2.5.4 HostIdentifierSeq
2.5.5 AFSCB_ResultData Redefinition
AFSCB_Result_ReturnLocks
AFSCB_Result_ResponseDeferred
2.6 Procedures
2.6.1 SetByteRangeLock
Notes
Error Codes
EACCES
EWOULDBLOCK
EDEADLK
EINVAL
ENOLCK
2.6.2 ReleaseByteRangeLock
Notes
Error Codes
EINVAL
2.6.3 UpgradeByteRangeLock
Error Codes
EINVAL
EWOULDBLOCK
EDEADLK
2.6.4 DowngradeByteRangeLock
Notes
Error Codes
EINVAL
2.6.5 GetByteRangeLockStatus
Error Codes
EACCES
2.6.6 CancelByteRangeLock
2.6.7 AssertExtendLocks
2.7 Windows & Unix Lock Semantics
2.7.1 Byte-Range Locking
2.7.2 Read/Write vs. Shared/Exclusive
2.7.3 Atomic Lock Open
2.8 Mandatory Enforcement
2.8.1 Governing Ideas
2.8.2 Enforcement Rules
3 Delegation
3.1 Dependencies
3.2 Backward Compatibility
3.3 Lock Delegation
3.4 File Delegation
3.4.1 Semantic Changes
3.4.2 Delegation
3.4.3 Revocation
3.5 Constants
3.5.1 Delegation Types
AFS_DType_General
3.5.2 Callback Constants
AFSCB_Flag_Delegation
AFSCB_Cancel_RevokeDelegation
AFSCB_Flag_RevokeDelegation
AFSCB_Flag_ExtremePrejudice
3.6 DataTypes
3.6.1 AFSDelegation
Fid
Type
Flags
Offset
Length
ExpirationTime
3.6.2 AFSExtendedCallBack
3.7 Procedures
3.7.1 RequestDelegation
Fid
Type
Flags
Offset
Length
Delegation
Error Codes
EACCES
EWOULDBLOCK
EINVAL
3.7.2 UndelegateReturningLocks
4 Appendix A: XDR Grammar (afsint.xg)
5 Appendix B: XDR Grammar (afscbint.xg)
1 AFS-3 File Locking
While AFS-3 does support file locking, it permits locking of
whole files only, and provides this support inefficiently. AFS
clients can take locks on any file object, with the granularity
of an entire file, using the RXAFS_SetLock procedure, and release
them with the RXAFS_ReleaseLock procedure. AFS uses a poll-based
locking model. AFS file locks, once issued, are considered to
persist only for 5 minutes, unless extended by the requesting
client using the RXAFS_ExtendLock procedure. This simplifies the
AFS file server, but complicates clients and wastes network
capacity. The OpenAFS file server implementation, based on the
original Transarc AFS file server, tracks locks directly in its
on-disk volume structures. Considering the 5-minute duration
asserted for file locks, the reason for this decision is clearly
not to support lock persistence for long periods, although it may
have been intended to allow locks to persist through server
restarts (or crashes). The disk package tracks lock type
(LockRead or LockWrite), numbers of clients holding locks, and a
timestamp. Lock ownership, while it may in many cases be reliably
inferred, is not recorded. Hence, a broken or malicious client
might release locks it never set (i.e., locks set by other
clients). The AFS protocol also does not permit atomic lock
upgrades (or downgrades).
1.1 Analysis
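The AFS locking protocol is unfair, in that its poll-based model
gives waiting clients no ordering guarantee, and wasteful of
client and network resources, in that held locks must be extended
every few minutes. This proposal addresses both the fairness and
the efficiency problems.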
2 Byte-Range Locking Interfaces
2.1 Dependencies
The byte-range lock feature depends on support for extended
callback notifications and extended host tracking support in
client and server.
2.2 Backward Compatibility
AFS clients and servers will indicate their support for
byte-range locking through new client and file server capability
flags:
const CLIENT_CAPABILITY_BYTE_RANGE_LOCK = 0x0008;
const VICED_CAPABILITY_BYTE_RANGE_LOCK = 0x0010;
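As an illustration only, a client might gate its use of the new
RPCs on both capability bits being present. The sketch below is
in C rather than the draft's XDR. The helper name
byte_range_locks_usable, and the assumption that each peer's
capability word is available as a 32-bit integer, are hypothetical
and not part of this proposal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the draft; client and file server flags live
 * in separate capability words. */
#define CLIENT_CAPABILITY_BYTE_RANGE_LOCK 0x0008u
#define VICED_CAPABILITY_BYTE_RANGE_LOCK  0x0010u

/* Hypothetical helper: true only when both peers advertise the
 * byte-range lock feature. */
static bool byte_range_locks_usable(uint32_t client_caps,
                                    uint32_t server_caps)
{
    return (client_caps & CLIENT_CAPABILITY_BYTE_RANGE_LOCK) != 0 &&
           (server_caps & VICED_CAPABILITY_BYTE_RANGE_LOCK) != 0;
}
```

A peer lacking the bit would presumably continue to use the
traditional whole-file SetLock/ExtendLock/ReleaseLock procedures.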
2.3 Concepts
2.3.1 General
An AFS file server is responsible for coordinating byte-range
locking requests and, optionally, enforcing mandatory locking
semantics relative to file operations initiated at different
clients. By contrast with the traditional AFS file locking
protocol, the proposed byte-range locking protocol makes an
attempt to associate locks with a unique subject, specifically, a
ViceID and unique identifier which could correspond to a unique
session or process executing on the client machine. Clients
(cache-manager processes not co-located in memory) request and
release byte-range locks through a pair of interfaces
(SetByteRangeLock, ReleaseByteRangeLock) similar to those
provided by the traditional AFS locking implementation. The same
lock types (read and write, in general regarded as ``shared'' and
``exclusive'') are defined as in traditional AFS locking.
Additional arguments and flags are provided to permit selection
of desired lock ranges, intention to ``wait'' on the lock (i.e.,
willing to accept a deferred issue of the lock at such time as
the file server can grant the lock, if it cannot be granted
immediately), and desired special semantics--currently, the
client may request mandatory enforcement. Clients already holding
a read or write lock on a range may atomically upgrade or
downgrade the lock to the opposite type, i.e., they need not
release a lock of one type before requesting the other type,
avoiding the race condition present in the traditional AFS
locking protocol. Byte-range locks are permanently associated
with an owner, the client which requested the lock. A lock may
not be released by a client which never owned it. Administrative
users may under various circumstances have need to identify the
owner and state of locks on a locked file, and to revoke file
locks administratively. This proposal includes RPCs allowing
administrative users to perform these operations, and suggests
exposure through new AFS pioctls and the fs command.
2.3.2 Lock Management
Lock management in the proposed interface is completely redefined
relative to the file locking in AFS-3. Concepts are borrowed from
AFS cache management, including the callback concept. A
byte-range lock may be regarded as a special-purpose callback. A
file server may use the ExtendedCallBack interface to request
re-assertion of existing locks, revoke file delegations (which
may include client-issued byte-range locks), or cancel locks
completely.
2.3.3 Deferred Locks
Where possible, locks are granted immediately with the completion
of the SetByteRangeLock request. A file server MAY, on explicit
request and subject to client capability, agree to prospectively
issue a lock to an interested client at a future time, when the
requested lock becomes available. Such deferred locks constitute
a promise to issue the lock with best-effort consideration of
fairness. A new procedure in the client RPC interface
(AsyncIssueByteRangeLock) is provided to effect asynchronous
issue of a deferred lock to a waiting client. Deferred locks may
themselves be canceled.
2.4 Constants
2.4.1 Lock Flags
The following flag constants are defined for use in the Flags
member of the AFSByteRangeLock structure and equivalently in the
Flags argument of the SetByteRangeLock procedure, with the same
semantics:
const AFSLock_Flag_Mand = 1; /* req. enforcement */
const AFSLock_Flag_Wait = 2; /* req. async wait on lock */
AFSLock_Flag_Mand
Requests mandatory enforcement when sent with a SetByteRangeLock
request or in a deferred AFSByteRangeLock instance. Asserts
mandatory enforcement in an AFSByteRangeLock instance.
AFSLock_Flag_Wait
Requests deferred lock if immediate lock cannot be granted when
sent with a SetByteRangeLock request. Indicates deferred lock in
an AFSByteRangeLock instance. The SetByteRangeLock procedure may
return locks in this state, subject to client capability and if
so requested in the Flags argument.
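A minimal sketch of how a client might read the Flags value the
server returns in Lock follows. It is written in C, and the names
lock_outcome, classify_reply and mandatory_in_effect are
hypothetical; only the two flag values come from this section.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values from section 2.4.1. */
#define AFSLock_Flag_Mand 1u
#define AFSLock_Flag_Wait 2u

/* Hypothetical client-side view of a SetByteRangeLock reply. */
enum lock_outcome { LOCK_GRANTED, LOCK_DEFERRED };

/* A returned lock carrying the Wait flag is only a promise. */
static enum lock_outcome classify_reply(uint32_t returned_flags)
{
    return (returned_flags & AFSLock_Flag_Wait) ? LOCK_DEFERRED
                                                : LOCK_GRANTED;
}

/* Mandatory enforcement applies only if it was requested and the
 * server echoed the flag back in the returned lock. */
static bool mandatory_in_effect(uint32_t requested_flags,
                                uint32_t returned_flags)
{
    return (requested_flags & AFSLock_Flag_Mand) != 0 &&
           (returned_flags & AFSLock_Flag_Mand) != 0;
}
```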
2.4.2 Lock Status
The following flag constants are provided to coordinate advanced
lock-management operations:
const AFSLock_Flag_Extend_Ok = 4; /* extended */
const AFSLock_Flag_Undelegate_Ok = 8; /* undelegated, asserted */
AFSLock_Flag_Extend_Ok
Returned from AssertExtendLocks in the OutResult array; indicates
lock confirmation.
AFSLock_Flag_Undelegate_Ok
Returned from UndelegateReturningLocks in OutStatus array,
indicates server agreement to assert undelegated lock.
2.4.3 Callback Constants
The following callback cancellation types and flags are provided,
to facilitate lock management through the ExtendedCallback
interface:
const AFSCB_Cancel_ExtendLocks = 7; /* re-assert locks, or lose them */
const AFSCB_Cancel_RevokeLocks = 8; /* locks on Fid revoked */
2.4.4 Callback Result Constants
The following constant is provided as a discriminator for the
AFSCB_ResultData member of AFSCBExtendedCallbackResult, allowing
clients to indicate their intention to defer returning locks or
delegations to a subsequent RPC on the file server:
const AFSCB_Result_ResponseDeferred = 2;
The following constant is provided as a discriminator for the
AFSCB_ResultData member of AFSCBExtendedCallbackResult, allowing
clients to indicate their intention to return locks in the
CallBack_Result_Array OUT parameter:
const AFSCB_Result_ReturnLocks = 3;
AFSCB_Cancel_ExtendLocks
When sent as the reason for cancellation in an ExtendedCallback
notification, indicates the server requires re-assertion of all
locks on FID using the file server's AssertExtendLocks procedure.
The client MUST execute the procedure for all locks it asserts on
FID prior to the ExpirationTime in the callback, else it MUST
consider any locks it held on FID to be canceled.
AFSCB_Cancel_RevokeLocks
When sent as the reason for cancellation in an ExtendedCallback
notification, indicates administrative cancellation of all locks
on FID.
The following flags are provided for use with arbitrary extended
callback messages:
const AFSCB_Flag_AssertLocks = 4; /* request ExtendLock */
const AFSCB_Flag_RevokeLocks = 8; /* locks cancelled, sorry */
AFSCB_Flag_ExtendLocks
Has the same meaning and effect as AFSCB_Cancel_ExtendLocks, but
may be sent with an arbitrary extended callback message.
AFSCB_Flag_RevokeLocks
Has the same meaning and effect as AFSCB_Cancel_RevokeLocks, but
may be sent with an arbitrary extended callback message.
2.5 Data Types
2.5.1 AFSByteRangeLock
The AFSByteRangeLock data type represents a byte-range lock
issued by an AFS file server:
struct AFSByteRangeLock {
AFSFid Fid;
afs_uint32 Type;
afs_uint32 Owner;
afs_uint32 Uniq;
afs_uint32 Flags;
afs_uint64 Offset;
afs_uint64 Length;
afs_uint64 ExpirationTime;
};
Fid
The Fid on which the lock is held.
Type
The type of lock requested, LockRead or LockWrite. A byte-range
read lock is a non-exclusive read assertion on the stated range,
which may be shared by any number of readers and no writers. A
byte-range write lock is an exclusive write assertion on the
stated range.
Owner
The ViceID in use by the client requesting the lock.
Uniq
Value uniquely identifying a session or process context at the
client.
Offset
The distance in bytes from beginning-of-file to the start of the
locked range.
Length
Length in bytes of the locked range.
ExpirationTime
AFSByteRangeLock instances may be regarded as a special-purpose
callback. Instances persist until canceled, or until
ExpirationTime is reached.
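The shared/exclusive rule above can be sketched as a conflict
test over two locks. This is an illustration in C under stated
assumptions: the struct range_lock is a hypothetical subset of
AFSByteRangeLock, LockRead and LockWrite are assumed to keep their
existing AFS-3 values, the pair (Owner, Uniq) is taken to identify
the holder, and a zero-length range is treated as empty. None of
these choices is mandated by the draft.

```c
#include <stdbool.h>
#include <stdint.h>

enum { LockRead = 0, LockWrite = 1 };   /* assumed AFS-3 values */

/* Hypothetical subset of AFSByteRangeLock. */
struct range_lock {
    uint32_t type;
    uint32_t owner;    /* ViceID */
    uint32_t uniq;     /* session or process context */
    uint64_t offset;
    uint64_t length;
};

/* Overlap test for half-open ranges [off, off+len), written so
 * that off+len is never computed and cannot overflow. */
static bool ranges_overlap(uint64_t off_a, uint64_t len_a,
                           uint64_t off_b, uint64_t len_b)
{
    if (len_a == 0 || len_b == 0)
        return false;
    if (off_a <= off_b)
        return off_b - off_a < len_a;
    return off_a - off_b < len_b;
}

/* Two locks conflict when they belong to different holders,
 * overlap, and at least one of them is a write lock. */
static bool locks_conflict(const struct range_lock *a,
                           const struct range_lock *b)
{
    if (a->owner == b->owner && a->uniq == b->uniq)
        return false;
    if (a->type == LockRead && b->type == LockRead)
        return false;
    return ranges_overlap(a->offset, a->length,
                          b->offset, b->length);
}
```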
2.5.2 AFSByteRangeLockSeq
A variable-length array of type AFSByteRangeLock used for bulk
calls for asserting and returning locks recalled from delegation.
const AFS_LOCK_SEQ_MAX = 10000;
typedef AFSByteRangeLock AFSByteRangeLockSeq <AFS_LOCK_SEQ_MAX>;
2.5.3 AFSLockFlagsSeq
An array of flags used in parallel with AFSByteRangeLockSeq,
above.
const AFS_LOCK_SEQ_MAX = 10000;
typedef afs_int32 AFSLockFlagsSeq <AFS_LOCK_SEQ_MAX>;
2.5.4 HostIdentifierSeq
const AFS_LOCK_SEQ_MAX = 10000;
typedef HostIdentifier AFSLockHostIdentifierSeq <AFS_LOCK_SEQ_MAX>;
An array of HostIdentifier structures used by the
GetByteRangeLockStatus procedure to report client machines
holding locks.
2.5.5 AFSCB_ResultData Redefinition
The AFSCB_ResultData union defined in the Callback Extended
Information draft is redefined (upward compatibly), as the
following:
union AFSCB_ResultData switch (afs_uint32 Result_Type) {
case AFSCB_Result_NoResult:
void;
case AFSCB_Result_ResponseDeferred:
void;
case AFSCB_Result_ReturnLocks:
AFSByteRangeLockSeq AssertedLocks_Array;
};
AFSCB_Result_ReturnLocks
The result is used to return (synchronously, in the
ExtendedCallBack RPC) a list of byte-range locks being extended
in response to an extended callback notification of type
AFSCB_Flag_AssertLocks, or asserted in response to one of type
AFSCB_Cancel_RevokeDelegation or sent with the flag
AFSCB_Flag_RevokeDelegation.
AFSCB_Result_ResponseDeferred
The result is used to indicate that the client will not assert or
return locks synchronously in the ExtendedCallBack RPC (and will
instead assert or return locks using the asynchronous RPCs
provided.)
2.6 Procedures
2.6.1 SetByteRangeLock
Requests a lock of type Type on Fid, on the range [Offset,
Offset+Length). Type must be one of LockRead or LockWrite. Owner
shall be set to the ViceID corresponding to the requesting
process or equivalent, or to 0 if this is not known. Uniq shall
be set to a value uniquely identifying the requesting process or
equivalent. On Unix-like systems, Uniq could be set to the PID of
the requesting process.
proc SetByteRangeLock(
IN AFSFid *Fid,
afs_uint32 Type,
afs_uint32 Flags,
afs_uint32 Owner,
afs_uint32 Uniq,
afs_uint64 Offset,
afs_uint64 Length,
OUT AFSByteRangeLock *Lock
) = 65601;
Notes
On successful return the file server has granted the requested
lock, and Lock points to the server's asserted AFSByteRangeLock
structure. If the client has requested and the server agrees to
issue a deferred lock, Lock points to the server's asserted
deferred AFSByteRangeLock structure. The client may safely
determine if it has been granted a deferred lock by inspecting
the value of Lock->Flags.
The returned Lock structure MUST NOT differ from the request with
respect to range, except in the case where the requested lock
would overlap with a lock of the same type already held by the
same client, in which case, the locks are merged and the merged
range returned in Lock. The returned Lock structure MAY differ
from request with respect to Flags.
The value of the Flags argument may alter the semantics and/or
processing of the call:
• if (Flags & AFSLock_Flag_Mand), the file server is requested to
enforce mandatory locking on writes to, or truncations
overlapping with, the locked range. If the file server is
willing to provide mandatory enforcement, it MAY set the
corresponding flag in Lock, and if so MUST restrict writes on
the asserted range to the holding client for the duration of
the lock
• if (Flags & AFSLock_Flag_Wait), the file server is requested to
issue a deferred lock if the requested lock cannot be
immediately granted. The file server MAY grant a deferred lock
in response to this request, indicating its agreement by
setting the corresponding flag in Lock. Lock is in this
instance an indicator only of the deferred lock promise
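The merge rule in the notes above (overlapping locks of the same
type held by the same client are combined, and the merged range is
returned) might be computed as in the following C sketch. The
struct byte_range and the function merge_if_overlapping are
hypothetical names. The sketch merges only ranges that actually
overlap, since the draft does not say whether merely adjacent
ranges are merged, and it assumes each input range satisfies
offset + length <= 2^64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical half-open range [offset, offset + length). */
struct byte_range {
    uint64_t offset;
    uint64_t length;
};

/* If a and b overlap, store their union in *out and return true.
 * Otherwise return false and leave *out untouched. */
static bool merge_if_overlapping(struct byte_range a,
                                 struct byte_range b,
                                 struct byte_range *out)
{
    if (a.length == 0 || b.length == 0)
        return false;

    /* Order the pair so that a starts first. */
    if (b.offset < a.offset) {
        struct byte_range tmp = a;
        a = b;
        b = tmp;
    }

    uint64_t gap = b.offset - a.offset;   /* bytes of a before b */
    if (gap >= a.length)
        return false;                     /* disjoint or adjacent */

    uint64_t tail_a = a.length - gap;     /* bytes of a from b on */
    out->offset = a.offset;
    out->length = (b.length > tail_a) ? gap + b.length : a.length;
    return true;
}
```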
Error Codes
EACCES
The caller does not have the necessary rights.
EWOULDBLOCK
The server is unable to grant the request due to conflicting
locks. If a deferred lock was requested, a Flags value of
AFSLock_Flag_Wait indicates the deferred lock is granted.
EDEADLK
The server declines to grant the requested lock (or deferred
lock) because granting it would cause a deadlock.
EINVAL
An illegal lock type was specified.
ENOLCK
The server has insufficient resources to grant the lock, or the
requesting client or file has too many locks outstanding. (No
specific limits are mandated or suggested by this document.)
2.6.2 ReleaseByteRangeLock
Releases the byte-range lock represented in Lock, asserted to be
held by the calling client.
proc ReleaseByteRangeLock(
IN AFSByteRangeLock *Lock
) = 65602;
Notes
When an AFS client intends to release a byte-range write lock, it
MUST ensure that any changed data in the affected range has been
sent to the file server with the appropriate StoreData RPC, and
that the RPC completed successfully. This requirement is based on
an implied assertion that holding a lock on some region of a file
implies, invariantly, an up-to-date view on the locked region.
Error Codes
EINVAL
The caller does not own the corresponding lock.
2.6.3 UpgradeByteRangeLock
Upgrades the byte-range lock represented in Lock, asserted to be
held by the calling client, from its current type (which should
be LockRead) to LockWrite. The upgrade is executed atomically (no
opportunity exists for another client to set a conflicting lock
in the upgraded range while the upgrade is being executed).
proc UpgradeByteRangeLock(
IN AFSByteRangeLock *Lock,
afs_uint32 Type
) = 65603;
Error Codes
EINVAL
The caller does not own the corresponding lock or it is not of
the correct type.
EWOULDBLOCK
The lock could not be granted due to conflicting locks.
EDEADLK
The lock could not be granted because granting it, with deferral,
would cause deadlock.
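The checks behind these error codes could look roughly like the C
sketch below. It is only an illustration: the flat array standing
in for the server's lock table, the struct held_lock, and the
function upgrade_lock are hypothetical, and deadlock detection
(EDEADLK) is omitted. A real file server would have to run the
whole function while holding whatever mutex protects the table,
which is what makes the upgrade atomic.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum { LockRead = 0, LockWrite = 1 };   /* assumed AFS-3 values */

/* Hypothetical entry in a per-file lock table. */
struct held_lock {
    uint32_t type, owner, uniq;
    uint64_t offset, length;
};

static int overlaps(const struct held_lock *a,
                    const struct held_lock *b)
{
    if (a->length == 0 || b->length == 0)
        return 0;
    return a->offset <= b->offset ? b->offset - a->offset < a->length
                                  : a->offset - b->offset < b->length;
}

/* Upgrade table[idx] from LockRead to LockWrite on behalf of
 * (owner, uniq). Returns 0, EINVAL or EWOULDBLOCK. */
static int upgrade_lock(struct held_lock *table, size_t n, size_t idx,
                        uint32_t owner, uint32_t uniq)
{
    if (idx >= n)
        return EINVAL;

    struct held_lock *mine = &table[idx];
    if (mine->owner != owner || mine->uniq != uniq ||
        mine->type != LockRead)
        return EINVAL;          /* not the owner, or wrong type */

    for (size_t i = 0; i < n; i++) {
        const struct held_lock *other = &table[i];
        if (i == idx)
            continue;
        if (other->owner == owner && other->uniq == uniq)
            continue;           /* the caller's own locks */
        if (overlaps(mine, other))
            return EWOULDBLOCK; /* another holder is in the range */
    }

    mine->type = LockWrite;     /* no window for a competing lock */
    return 0;
}
```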
2.6.4 DowngradeByteRangeLock
Downgrades the byte-range lock represented in Lock, asserted to
be held by the calling client, from its current type (which
should be LockWrite) to LockRead. The downgrade is executed
atomically (no opportunity exists for another client to set a
conflicting lock in the downgraded range while the downgrade is
being executed).
proc DowngradeByteRangeLock(
IN AFSByteRangeLock *Lock,
afs_uint32 Type
) = 65604;
Notes
When an AFS client intends to downgrade a byte-range write lock,
it MUST ensure that any changed data in the affected range has
been sent to the file server with the appropriate StoreData RPC,
and that the RPC completed successfully. This requirement is
based on an implied assertion that holding a lock on some region
of a file implies, invariantly, an up-to-date view on the locked
region.
(Allowing the store obligation to be transferred to the release of
the read lock that should result from the DowngradeByteRangeLock
call is theoretically justified, but weakens consistency, and
does not seem to entail any strong benefit to the client.)
Error Codes
EINVAL
The caller does not own the corresponding lock or it is not of
the correct type.
2.6.5 GetByteRangeLockStatus
Diagnostic procedure provided to permit system administrators to
identify client machines and software running on those clients
that are currently holding locks on a file. Fid is the file to
report on. The call returns parallel variable-length arrays of
locks and their associated hosts. The procedure may only be
executed by the AFS super user or members of the
system:administrators group.
proc GetByteRangeLockStatus(
IN AFSFid *Fid,
OUT AFSByteRangeLockSeq *AssertedLocks_Array,
OUT AFSLockHostIdentifierSeq *Clients_Array
) = 65605;
Error Codes
EACCES
The caller does not have the necessary rights.
2.6.6 CancelByteRangeLock
The CancelByteRangeLock procedure permits system administrators
to revoke active locks that may be obstructing normal operations,
perhaps due to a system or network problem. Fid is the file on
which to revoke locks. If successful, all locks in range [Offset,
Offset+Length) are canceled. If a value of 0 is given for both
Offset and Length, the range is taken to span the entire file. The
procedure may only be executed by the AFS super user or members
of the system:administrators group.
proc CancelByteRangeLock(
IN AFSFid *Fid,
afs_uint64 Offset,
afs_uint64 Length
) = 65606;
2.6.7 AssertExtendLocks
On receipt of an AFSCB_Cancel_ExtendLocks or
AFSCB_Flag_ExtendLocks notification through the extended callback
interface, a client MUST either:
• return any locks it asserts in AssertedLocks_Array, the member
of union AFSCB_ResultData used for these calls
– if the server rejects any locks asserted by the client, it
will so notify the client in a subsequent cancellation message
• set a result of AFSCB_Result_ResponseDeferred, and execute the
AssertExtendLocks bulk call before the ExpirationTime in the
AFSExtendedCallback structure sent with the callback
Fid is the file for which locks are being extended. Flags
contains indication of special semantics (e.g., mandatory
enforcement) being asserted, if any. AssertedLocks_Array points
to a variable length array of AFSByteRangeLock structures the
client asserts to hold. At the completion of the call, the
parallel array OutResult indicates the server's confirmation (or
refusal) to extend each asserted lock--a value of (Flags &
AFSLock_Flag_Extend_Ok) indicates confirmation.
/* Assert locks on Fid, on request */
proc AssertExtendLocks(
IN AFSFid *Fid,
afs_uint32 Flags,
AFSByteRangeLockSeq *AssertedLocks_Array,
OUT AFSLockFlagsSeq *OutResult
) = 65607;
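On the client side, the parallel OutResult array might be applied
as in the following C sketch, which keeps only those locks the
server confirmed. The struct asserted_lock and the function
drop_unconfirmed are hypothetical; the flag value is the one given
in section 2.4.2.

```c
#include <stddef.h>
#include <stdint.h>

#define AFSLock_Flag_Extend_Ok 4   /* value from section 2.4.2 */

/* Hypothetical client-side record of a lock it believes it holds. */
struct asserted_lock {
    uint64_t offset;
    uint64_t length;
};

/* Compact locks[] in place, keeping entry i only when
 * out_result[i] carries AFSLock_Flag_Extend_Ok. Returns the
 * number of locks that remain. */
static size_t drop_unconfirmed(struct asserted_lock *locks,
                               const int32_t *out_result, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if (out_result[i] & AFSLock_Flag_Extend_Ok)
            locks[kept++] = locks[i];
    }
    return kept;
}
```

Locks dropped this way correspond to those the client must treat
as canceled.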
2.7 Windows & Unix Lock Semantics
Implementation of interoperable locking behavior presents
challenges for a distributed file system like AFS, which must
support clients on platforms which do not agree precisely on the
semantics desirable or possible to enforce.
2.7.1 Byte-Range Locking
As byte-range locking is effectively required for correct
behavior of Windows applications, the OpenAFS for Windows client
has been forced to implement a locally-enforced byte-range
locking mechanism. In the Windows client today, local byte-range
locks are shadowed by a whole-file lock in AFS. With the introduction
of server-coordinated byte-range locking, the Windows client is
expected to use server byte-range locks when possible.
2.7.2 Read/Write vs. Shared/Exclusive
In the current OpenAFS for Windows client, Windows shared
(whole-file) locks are mapped to AFS read locks, and Windows exclusive
(whole-file) locks are mapped to AFS write locks. This mapping
applies equally for byte-range locks.
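The mapping can be sketched in C as follows. This is not taken
from the OpenAFS for Windows sources: the function name is
hypothetical, the constant mirrors the value of the Win32
LOCKFILE_EXCLUSIVE_LOCK flag so that the fragment builds without
windows.h, and LockRead/LockWrite are assumed to keep their AFS-3
values.

```c
#include <stdint.h>

/* Same value as the Win32 LOCKFILE_EXCLUSIVE_LOCK flag passed to
 * LockFileEx; redefined here under a hypothetical name. */
#define SKETCH_LOCKFILE_EXCLUSIVE_LOCK 0x00000002u

enum { LockRead = 0, LockWrite = 1 };   /* assumed AFS-3 values */

/* Exclusive Windows locks map to AFS write locks, shared Windows
 * locks to AFS read locks. */
static uint32_t afs_lock_type_for_windows(uint32_t lockfileex_flags)
{
    return (lockfileex_flags & SKETCH_LOCKFILE_EXCLUSIVE_LOCK)
               ? (uint32_t)LockWrite
               : (uint32_t)LockRead;
}
```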
2.7.3 Atomic Lock Open
Windows provides the ability to open and lock a file in a single
operation, and key Windows applications such as Microsoft Office
rely on this behavior. Although this behavior has no direct
equivalent in the AFS protocol (which does not provide an OPEN
file operation) the correct behavior from the point of view of
Windows applications is already emulated by the Windows client.
2.8 Mandatory Enforcement
Mandatory enforcement of file locks is considered a requirement
for Windows interoperation. The rules proposed here reflect some
consideration and discussion of unique features in AFS, and also
compromises made in competing systems intended to support mixed
Windows and Unix clients, particularly NFSv4.
2.8.1 Governing Ideas
• Byte-range locks may be taken out on a file under the same
circumstances under which a whole-file lock might be taken out
in traditional AFS
• Clients asserting advisory locks on a file by definition do not
expect any special semantics from the file system; however, it
seems logically reasonable that advisory and mandatory locks
should interact equivalently as locks, and so where this
document asserts that in a given scenario a lock by a client A
would conflict with a lock held by a client B, it is not
considered relevant whether either client's lock is advisory or
mandatory
• The mechanism of lock enforcement is to fail the operation
being attempted; a hint as to the reason for failure shall be
sent in the return code
• An operation which fails due to conflict with an existing lock
fails completely
• Mandatory enforcement is taken to mean enforcement, generally,
of write denial in any locked range, including by clients not
observing any locking protocol
• Data intended to be written outside any conflicting locked
range on a file with at least one mandatory locked range,
considering the view of locks on the file at the fileserver
when the write request is processed, is not written
• Since applications exist, particularly for the command line
(e.g., tar), which know nothing about locks and may have
legitimate reason to read (though not write) data protected by
mandatory locks, relaxed semantics are enforced for reads by
clients reading outside any range they have themselves
locked. Such reads never conflict with lock enforcement; the
view of data provided to such a client shall be whatever is
available, conforming to regular AFS semantics
• Mandatory enforcement of a read or write lock is asserted to
govern only the StoreData operation (by other clients), and
not, e.g., the various directory change operations or FetchData
[footnote: Mandatory read lock enforcement is silly, Eisler 2006.
More importantly, it causes difficulties for the AFS cache
consistency model.]
2.8.2 Enforcement Rules
• If a client A has a mandatory lock of any type on a range R in
a file F, then any StoreData operation by another client B which
would alter data in any overlapping range, or truncate F so as
to reduce or eliminate R, fails
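The rule above can be sketched as a predicate a file server might
evaluate before applying a StoreData request. The C below is an
illustration under stated assumptions: the struct enforced_lock,
its client field (standing in for whatever identity the server
uses for the holding client), and the function store_allowed are
hypothetical, and zero-length ranges are treated as empty. Only
the AFSLock_Flag_Mand value comes from this document.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AFSLock_Flag_Mand 1u   /* value from section 2.4.1 */

/* Hypothetical server-side record of an issued lock. */
struct enforced_lock {
    uint32_t client;   /* identity of the holding client */
    uint32_t flags;
    uint64_t offset;
    uint64_t length;
};

/* True when the write [off, off+len) intersects the lock. */
static bool write_overlaps(const struct enforced_lock *l,
                           uint64_t off, uint64_t len)
{
    if (len == 0 || l->length == 0)
        return false;
    return off <= l->offset ? l->offset - off < len
                            : off - l->offset < l->length;
}

/* True when truncating to new_len reduces or eliminates the
 * locked range. */
static bool truncate_cuts(const struct enforced_lock *l,
                          uint64_t new_len)
{
    if (l->length == 0)
        return false;
    return new_len < l->offset || new_len - l->offset < l->length;
}

/* Returns false if the store must fail (completely) because of a
 * mandatory lock held by some client other than the writer. */
static bool store_allowed(const struct enforced_lock *locks, size_t n,
                          uint32_t writer, uint64_t off, uint64_t len,
                          bool truncates, uint64_t new_len)
{
    for (size_t i = 0; i < n; i++) {
        const struct enforced_lock *l = &locks[i];
        if (!(l->flags & AFSLock_Flag_Mand) || l->client == writer)
            continue;   /* advisory, or the writer's own lock */
        if (write_overlaps(l, off, len))
            return false;
        if (truncates && truncate_cuts(l, new_len))
            return false;
    }
    return true;
}
```

Reads are deliberately absent from the sketch, since the governing
ideas state that FetchData is not subject to enforcement.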
3 Delegation
3.1 Dependencies
The delegation feature depends on support for extended callback
notifications (and its dependencies) and on byte-range locking
support in client and server.
3.2 Backward Compatibility
AFS clients and servers will indicate their support for
delegation through new client and file server capability flags:
const CLIENT_CAPABILITY_DELEGATION = 0x0010;
const VICED_CAPABILITY_DELEGATION = 0x0020;
3.3 Lock Delegation
The concept of delegation is introduced to prevent the stronger
file semantics introduced by the proposed byte-range locking
mechanisms from introducing a performance degradation, in the
case of a single client making uncontested use of byte-range
locks. Since the Windows client (and also, less importantly, the
Linux client) currently provide locally-enforced byte-range locks
(shadowed by a whole file lock in AFS) to clients requesting
them, and since Windows applications in particular (e.g.,
Microsoft Office) make extensive use of such locks, this is in
fact a common and probably important case.
3.4 File Delegation
In developing the concepts in this proposal and the previously
submitted Callback Extended Information proposal we have
considered ideas from NFSv4 and other recent systems, such as the
(incomplete) CRFS system, and in particular, we have attempted to
suggest an evolutionary path for AFS which might provide the
stronger file semantics and efficient handling of mutable data
that we think a modern distributed file system should
provide--while not sacrificing the powerful caching features
which make AFS valuable and unique.
Reconsideration of NFSv4 delegation in the light of final drafts
of the Callback Extended Information proposal has influenced us
to think that a concept of AFS delegation might be developed in
which the lock delegation concept suggested in section 3.3,
combined with more deterministic semantics for files primarily
under client vs. files primarily under server control, would form
the key concepts.
Delegation is the NFSv4 file caching mechanism, and also supports
lock delegation. However, delegation has more deterministic
semantics in NFSv4 than caching presently has in AFS. Adding an
explicit delegation concept to AFS provides an opportunity to
tighten the semantics for delegated and undelegated files in AFS.
In particular, as in the Extended Callback Information proposal,
we are interested in improving AFS cache consistency with respect
to mutable data. AFS clients (e.g., the OpenAFS Windows client)
are already moving away from traditional AFS sync-on-close
behavior, toward a continuous, best-effort sync behavior. The
OpenAFS Roadmap contains language, with which we agree,
indicating that best-effort synchronisation is actually more
efficient than sync-on-close. We propose to formalize this
behavior and define it as specified behavior for clients
supporting delegation, and operating on a file without an
explicit byte-range delegation from the file server.
While clearly related to NFSv4 delegations (and also Oplocks in
the Microsoft CIFS protocol), the delegation concept proposed
here for AFS differs from NFSv4 delegation. In particular, since
the AFS protocol supports caching explicitly through existing
protocol mechanisms, the delegation concept is introduced to
strengthen AFS caching semantics in specific situations only, and
is in no sense a new caching mechanism.
NFSv4 supports read and write file delegations, concepts which
overlap but are inconsistent with the AFS caching model. An NFSv4
read delegation confers permission to cache a file (or byte
range, under the byte-range delegation proposal), and carries an
assertion that no client has a write delegation. An NFSv4 write
delegation confers permission to cache file writes. Since in AFS
caching is always permitted, and clients always notified of file
changes, an AFS client with a callback on a file by definition
always has the equivalent of an NFSv4 read delegation. In our
proposal, an AFS delegation somewhat resembles an NFSv4 write
delegation. A client with a delegation on a byte range may cache
writes in the range, at its discretion, until the delegation is
recalled. Read and write operations from contending clients will
induce the fileserver to recall overlapping delegations it may
have issued in the affected range. The contending operations will
not complete until the client whose delegation is being recalled
has had an opportunity to flush its changes and return any locks
it issued while the delegation was in effect.
NFSv4.1 (May 2008) supports directory delegations[8]. This
proposal does not include directory delegation. Experience gained
implementing and using AFS file delegations should help to
clarify whether directory delegations would be a useful addition
in future. (For example, to facilitate implementation of
hierarchical server-to-server replication as implemented for
NFSv4 in [11].)
Since 2005, an NFSv4 extension to support byte-range delegations
has been proposed[9].[footnote:
I do not find evidence in NFSv4.1 Draft 23 that Byte-Range
Delegations were included in NFSv4.1, but they may be a
NetApp-implemented extension.
] The stated motivation for NFSv4 byte-range delegations,
supported by analysis of the suggested protocol changes, is to
facilitate cache-coherent updates by multiple writers, or writers
and readers, on disjoint byte ranges in a file[10]. More
specifically, byte-range delegation is an NFSv4 mechanism to
permit partial file caching, which AFS has always supported
(range-based when using extended callback information), together
with a type of range-based invalidation.
Thus in the context of NFSv4, byte-range delegation significantly
overlaps in function with the general AFS caching model and with
extended callback information. Early versions of this proposal
defined whole-file delegation only, arguing that this would
provide best-effort visibility of changes across clients, with
good efficiency, and that it would be sufficient to efficiently
support the live multimedia stream example used to motivate
NFSv4 byte-range delegations in [10]. Early reviewers have argued
for inclusion of byte-range delegation, in consideration that it
is more expressive (not a whole-file caching hack) and would be
desirable for applications such as distributed databases or HPC
applications. Correspondingly, the current proposal now includes
a byte range delegation concept. Clients iteratively and
aggressively updating or locking in disjoint ranges of a file
would be eligible to operate in disjoint, byte-range delegations.
Further feedback from reviewers is requested. Feedback on
specific applications and usage models we should support would be
especially helpful.
3.4.1 Semantic Changes
For AFS, I suggest the following semantics and supporting
mechanisms for delegation:
• only files may be delegated
• with respect to file data, a file delegation, if mutually
accepted by client and file server, shall indicate a
strengthened semantics for file caching such that
- a byte range under delegation shall be regarded as under
exclusive control of one client, which may then observe any
synchronisation/flush semantics on the range for the duration
of the delegation
- a byte range not under delegation shall be regarded as under
server control, potentially shared by multiple readers and/or
writers, such that clients must observe stricter
synchronisation/flush semantics, defined to mean an
obligation to flush changes continuously at best effort, with
the special exception that
- revocation of a file delegation shall obligate the client to
whom the file was formerly delegated to store any data changed
during the period of delegation, and to ``return'' the
now-resynchronised byte range to the file server using its
UndelegateReturningLocks procedure, within a time window
provided by the server in its AFSCB_Cancel_RevokeDelegation
callback cancellation message
• with respect to byte-range locks,
- a byte range under delegation shall be regarded as under
exclusive control of one client, which may then issue
byte-range locks of any type within the range, without
consideration of the file server
- a byte range not under delegation shall be regarded as under
server control, such that all locking requests must be
executed at the file server, using the interfaces defined in
this proposal
- recall of a byte range delegation shall obligate the client
to whom the file was formerly delegated to ``return'' the
now-resynchronised byte range and all issued locks to the
file server using its UndelegateReturningLocks procedure,
within the time window provided by the server in its
AFSCB_Cancel_RevokeDelegation callback cancellation message
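The byte-range lock rules above imply a simple routing decision in
the client. The following is a minimal sketch only, assuming a
hypothetical client-side Delegation record and a helper named
lock_locally; neither name is part of this proposal. A lock is
issued locally only when its range lies wholly inside one range
delegated to the client; otherwise it goes to the file server.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Delegation:
    """Hypothetical client-side record of a delegated byte range."""
    offset: int
    length: int


def covers(d: Delegation, offset: int, length: int) -> bool:
    """True if [offset, offset + length) lies wholly inside d."""
    return d.offset <= offset and offset + length <= d.offset + d.length


def lock_locally(held: List[Delegation], offset: int, length: int) -> bool:
    """Decide where a byte-range lock request is executed.

    Returns True when the client may issue the lock itself, and False
    when the request must be executed at the file server.
    """
    return any(covers(d, offset, length) for d in held)
```

As a simplification, a lock that spans two adjacent delegations is
sent to the file server in this sketch.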
3.4.2 Delegation
It is reasonable for a file server to issue a byte range
delegation in response to any of several file operations on a
byte range which is not already delegated, and which is not known
to be unsuitable for delegation for operational reasons. This
proposal assumes that, presuming a client and server are mutually
capable of delegation, the general behavior of the file server
should be to issue a delegation if no rule or heuristic would
prevent it.
• The file server MAY issue a delegation in response to any of
the FetchData or StoreData operations
• The file server MUST NOT issue a delegation for any byte range
for which there is an existing delegation--and in fact, in any
case where it might do so, it MUST recall the conflicting
delegation(s) (see section 3.4.3).
• The file server SHOULD NOT issue a delegation if it has
heuristic information that would suggest delegating a
particular byte range would be inefficient, e.g., because a
given file is frequently operated on by a variety of clients
• The file server SHOULD NOT issue a delegation in response to
FetchStatus operations in absence of other supporting
information, as these are commonly issued by clients scanning
directories
It is expected that clients may request, or that a file server
may offer clients, a delegation on a range larger than the
smallest range compatible with the file operation or explicit
request which triggered the delegation.
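The issuing rules above can be sketched as a small decision
function. This is an illustration under stated assumptions, not a
prescribed algorithm: the function name, the string results, and
the two boolean inputs (standing in for the server's conflict check
and its contention heuristic) are all hypothetical.

```python
def grant_decision(op: str, conflicting: bool, contended: bool) -> str:
    """Return 'recall-first', 'decline', or 'grant' for a triggering operation.

    op          -- name of the file operation that triggered the check
    conflicting -- an existing delegation overlaps the byte range
    contended   -- heuristics suggest many clients operate on the file
    """
    if conflicting:
        # MUST NOT delegate over an existing delegation; recall it instead.
        return "recall-first"
    if contended:
        # SHOULD NOT delegate when delegation would be inefficient.
        return "decline"
    if op in ("FetchData", "StoreData"):
        # MAY delegate in response to data operations.
        return "grant"
    # e.g. FetchStatus with no other supporting information.
    return "decline"
```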
3.4.3 Revocation
A file server may recall file delegations at any time, for any
reason. A file server must recall file delegations when a client
other than the one to which a delegation has been issued performs
any of the following operations on the file:
• StoreData operations
• FetchData operations
• any other operation that strongly indicates likelihood of intent
to read or alter file contents (e.g., any Open indication,
should it be added to the AFS protocol)
When the file server wishes to recall a file delegation, it
issues an AFSCB_Cancel_RevokeDelegation notification to the
client via the ExtendedCallback interface. Alternatively, it may
set the AFSCB_Flag_RevokeDelegation flag in any other
ExtendedCallback notification message.
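A minimal sketch of how a file server might select the delegations
to recall when a contending operation arrives. The Issued record
and the to_recall helper are hypothetical names, and the sketch
assumes recall is limited to delegations that overlap the affected
byte range and are held by a different client.

```python
from typing import List, NamedTuple


class Issued(NamedTuple):
    """Hypothetical server-side record of an issued delegation."""
    client: str
    offset: int
    length: int


def to_recall(issued: List[Issued], client: str,
              offset: int, length: int) -> List[Issued]:
    """Delegations held by other clients that overlap [offset, offset + length)."""
    end = offset + length
    return [d for d in issued
            if d.client != client
            and d.offset < end
            and offset < d.offset + d.length]
```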
3.5 Constants
3.5.1 Delegation Types
The current proposal defines one delegation type. The possibility
to define new delegation types, with new semantics, is provided
for potential future proposals.
const AFS_DType_General = 0;
AFS_DType_General
Represents general delegation as defined in this proposal.
3.5.2 Callback Constants
The following extended callback event type is added:
const AFSCB_Event_Delegation = 13;
The following callback cancellation types and flags are provided,
to permit management of delegations through the ExtendedCallback
interface:
const AFSCB_Flag_Delegation = 2; /* file delegation */
const AFSCB_Cancel_RevokeDelegation = 9; /* delegation is revoked */
const AFSCB_Flag_RevokeDelegation = 16; /* delegation is revoked */
The following constant is provided as a discriminator for the
AFSCB_ResultData member of AFSCBExtendedCallbackResult, allowing
clients to indicate their intention to defer the return of locks
or delegations to a subsequent RPC on the file server:
const AFSCB_Result_ResponseDeferred = 2;
The following constant is provided as a discriminator for the
AFSCB_ResultData member of AFSCBExtendedCallbackResult allowing
clients to indicate their intention to return locks in the
CallBack_Result_Array OUT parameter:
const AFSCB_Result_ReturnLocks = 3;
AFSCB_Flag_Delegation
When set in the Flags member of an AFSExtendedCallback structure,
indicates that the callback promise includes file delegation. The
delegation persists for the life of the callback, unless recalled
through an ExtendedCallback notification.
AFSCB_Cancel_RevokeDelegation
When sent as the reason for cancellation in an ExtendedCallback
notification, indicates that the file delegation on FID has been
recalled. The client MUST store all data in FID which has changed
during the period of delegation, and then execute the file
server's UndelegateReturningLocks procedure for all locks it
asserts on FID, prior to the ExpirationTime in the extended
callback message.
AFSCB_Flag_RevokeDelegation
Has the same meaning and effect as AFSCB_Cancel_RevokeDelegation,
but may be sent with an arbitrary extended callback message.
AFSCB_Flag_ExtremePrejudice
Combined with AFSCB_Flag_RevokeDelegation, indicates that the
resync/lock return period for an already-recalled delegation is
over. The client is requested to stop lock-return activity.
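The delegation-related constants above are bit flags and
cancellation reasons, so a client can test them as sketched below.
The constant values are those defined in this section; the function
names and the cancel_reason and flags parameters are assumptions
about how a client would see an extended callback message.

```python
AFSCB_Flag_Delegation = 2            # callback promise includes delegation
AFSCB_Cancel_RevokeDelegation = 9    # cancellation reason: delegation revoked
AFSCB_Flag_RevokeDelegation = 16     # flag form of the same revocation


def delegation_granted(flags: int) -> bool:
    """True if the callback promise carries a file delegation."""
    return bool(flags & AFSCB_Flag_Delegation)


def delegation_revoked(cancel_reason: int, flags: int) -> bool:
    """True if either form of revocation is present in the message."""
    return (cancel_reason == AFSCB_Cancel_RevokeDelegation
            or bool(flags & AFSCB_Flag_RevokeDelegation))
```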
3.6 DataTypes
3.6.1 AFSDelegation
The AFSDelegation data type represents a delegation issued by a
fileserver to some client on a specific byte range in Fid.
struct AFSDelegation {
AFSFid Fid;
afs_uint32 Type;
afs_uint32 Flags;
afs_uint64 Offset;
afs_uint64 Length;
afs_uint64 ExpirationTime;
};
Fid
The Fid being delegated.
Type
The type of the delegation, currently restricted to
AFS_DType_General.
Flags
A bit field of flag values, provided for future extension.
Offset
The starting offset of the byte range being delegated.
Length
The length of the byte range being delegated.
ExpirationTime
Time in seconds since the Epoch after which the delegation must
be considered invalid. A server implementation MAY offer a new
AFSDelegation effectively extending the expiration time of an
existing delegation at any convenient time. (Clients may also
request a new delegation explicitly using the RequestDelegation
interface prior to ExpirationTime to request an extension.)
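For illustration, the layout of AFSDelegation under standard XDR
encoding (big-endian, 4-byte unsigned integers, 8-byte unsigned
hypers) can be sketched as follows. The helper names are
hypothetical, and the sketch assumes AFSFid is encoded as its three
afs_uint32 members (Volume, Vnode, Unique).

```python
import struct

# AFSFid (3 x afs_uint32), Type, Flags (afs_uint32 each), then
# Offset, Length, ExpirationTime (afs_uint64 each), all big-endian.
_FMT = ">5I3Q"


def pack_delegation(volume, vnode, unique, dtype, flags,
                    offset, length, expiration):
    """Encode the fields of an AFSDelegation in XDR order."""
    return struct.pack(_FMT, volume, vnode, unique, dtype, flags,
                       offset, length, expiration)


def unpack_delegation(buf):
    """Decode an encoded AFSDelegation into a tuple of its fields."""
    return struct.unpack(_FMT, buf)


def is_valid(expiration, now):
    """A delegation is invalid once the current time passes ExpirationTime."""
    return now <= expiration
```

Under these assumptions the encoded structure occupies 44 bytes.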
3.6.2 AFSExtendedCallBack
A new value of AFSCB_Event_Delegation is added to union
AFSCB_NotificationData used in struct AFSExtendedCallBack. The
type of the union at AFSCB_Event_Delegation is AFSDelegation. The
new extended callback notification is used by the file server to
indicate it has granted a file delegation on FID to the client.
3.7 Procedures
3.7.1 RequestDelegation
The RequestDelegation procedure is added to the fileserver
interface, permitting the client to request an explicit
delegation on a byte range. A client implementation MAY choose to
make an explicit delegation request based on a client application
fadvise or madvise API call, or similar mechanism appropriate to
its platform.
/* Request explicit delegation of a byte range */
RequestDelegation(
IN AFSFid Fid,
afs_uint32 Type,
afs_uint32 Flags,
afs_uint64 Offset,
afs_uint64 Length,
OUT AFSDelegation *Delegation
) = 65608;
Fid
The Fid being delegated.
Type
The type of the delegation, currently restricted to
AFS_DType_General.
Flags
A bit field of flag values, provided for future extension.
Offset
The starting offset of the byte range being delegated.
Length
The length of the byte range being delegated.
Delegation
The Delegation returned from the fileserver, if granted.
Error Codes
EACCES
The caller does not have the necessary rights.
EWOULDBLOCK
The server is unable to grant the request due to conflicting
delegation.
EINVAL
An illegal delegation type or range was specified.
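A minimal sketch of how a file server might map a RequestDelegation
call onto the error codes above. The function name, its boolean
inputs, the order of the checks, and the treatment of a range that
overflows 64 bits as illegal are all assumptions made for
illustration.

```python
import errno

AFS_DType_General = 0  # the only delegation type defined by this proposal


def check_request(dtype, offset, length, has_rights, conflicts):
    """Return 0 on success, or the errno value the call would fail with."""
    # Illegal type, or a range that does not fit in 64 bits (assumed illegal).
    if dtype != AFS_DType_General:
        return errno.EINVAL
    if offset < 0 or length < 0 or offset + length > 2 ** 64:
        return errno.EINVAL
    # The caller lacks the necessary rights.
    if not has_rights:
        return errno.EACCES
    # A conflicting delegation prevents the grant.
    if conflicts:
        return errno.EWOULDBLOCK
    return 0
```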
3.7.2 UndelegateReturningLocks
The UndelegateReturningLocks bulk call MUST be executed by
clients on receipt of an AFSCB_Cancel_RevokeDelegation or
AFSCB_Flag_RevokeDelegation notification through the extended
callback interface. The call must be executed before the
ExpirationTime in the AFSExtendedCallback structure sent with the
callback. Fid is the file whose delegation is being returned.
Flags contains indication of special semantics (e.g., mandatory
enforcement) being asserted, if any. AssertedLocks_Array points
to a variable length array of AFSByteRangeLock structures the
client asserts to hold. At the completion of the call, parallel
array OutResult indicates the server's confirmation (or refusal)
to assert each returned lock after undelegation--a value of
(Flags & AFSLock_Flag_Undelegate_Ok) indicates confirmation.
/* Confirm undelegation and req. assert locks, if any */
UndelegateReturningLocks(
IN AFSFid Fid,
afs_uint32 Flags,
AFSByteRangeLockSeq *AssertedLocks_Array,
OUT AFSLockFlagsSeq *OutResult
) = 65609;
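Because OutResult parallels AssertedLocks_Array, a client can pair
the two to learn which locks survived undelegation. The sketch
below assumes each OutResult element is an integer flag word; the
split_results helper is a hypothetical name.

```python
AFSLock_Flag_Undelegate_Ok = 8  # undelegated, asserted


def split_results(asserted_locks, out_result):
    """Pair each asserted lock with the server's verdict.

    Returns (confirmed, refused), preserving the order of the
    asserted locks.
    """
    if len(asserted_locks) != len(out_result):
        raise ValueError("OutResult must parallel AssertedLocks_Array")
    confirmed, refused = [], []
    for lock, flags in zip(asserted_locks, out_result):
        if flags & AFSLock_Flag_Undelegate_Ok:
            confirmed.append(lock)
        else:
            refused.append(lock)
    return confirmed, refused
```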
4 Appendix A: XDR Grammar (afsint.xg)
const VICED_CAPABILITY_BYTE_RANGE_LOCK = 0x0010;
const VICED_CAPABILITY_DELEGATION = 0x0020;
const AFSLock_Flag_Mand = 1; /* req. enforcement */
const AFSLock_Flag_Wait = 2; /* req. wait on lock */
const AFS_DType_General = 0;
const AFSCB_Event_Delegation = 13;
struct AFSByteRangeLock {
AFSFid Fid;
afs_uint32 Type;
afs_uint32 Flags;
afs_uint32 Owner;
afs_uint32 Uniq;
afs_uint64 Offset;
afs_uint64 Length;
afs_uint64 ExpirationTime;
};
struct AFSDelegation {
AFSFid Fid;
afs_uint32 Type;
afs_uint32 Flags;
afs_uint64 Offset;
afs_uint64 Length;
afs_uint64 ExpirationTime;
};
/* Request byte-range file lock */
proc SetByteRangeLock(
IN AFSFid *Fid,
afs_uint32 Type,
afs_uint32 Flags,
afs_uint32 Owner,
afs_uint32 Uniq,
afs_uint64 Offset,
afs_uint64 Length,
OUT AFSByteRangeLock *Lock
) = 65601;
/* Release byte-range file lock */
proc ReleaseByteRangeLock(
IN AFSByteRangeLock *Lock
) = 65602;
/* Upgrade byte-range file lock (i.e., from Read to Write) */
proc UpgradeByteRangeLock(
IN AFSByteRangeLock *Lock,
afs_uint32 Type
) = 65603;
/* Downgrade byte-range file lock (i.e., from Write to Read) */
proc DowngradeByteRangeLock(
IN AFSByteRangeLock *Lock,
afs_uint32 Type
) = 65604;
/* Request lock status report (system:administrators) */
proc GetByteRangeLockStatus(
IN AFSFid *Fid,
OUT AFSByteRangeLockSeq *AssertedLocks_Array,
AFSLockHostIdentifierSeq *Clients_Array
) = 65605;
/* administratively cancel locks (system:administrators) */
proc CancelByteRangeLocks(
IN AFSFid *Fid,
afs_uint64 Offset,
afs_uint64 Length
) = 65606;
const AFS_LOCK_SEQ_MAX = 10000;
typedef AFSByteRangeLock AFSByteRangeLockSeq <AFS_LOCK_SEQ_MAX>;
typedef afs_uint32 AFSLockFlagsSeq <AFS_LOCK_SEQ_MAX>;
const AFSLock_Flag_Extend_Ok = 4; /* extended */
const AFSLock_Flag_Undelegate_Ok = 8; /* undelegated, asserted */
/* Assert locks on Fid, on request */
proc AssertExtendLocks(
IN AFSFid Fid,
afs_uint32 Flags,
AFSByteRangeLockSeq *AssertedLocks_Array,
OUT AFSLockFlagsSeq *OutResult
) = 65607;
/* Request explicit delegation of a byte range */
proc RequestDelegation(
IN AFSFid Fid,
afs_uint32 Type,
afs_uint32 Flags,
afs_uint64 Offset,
afs_uint64 Length,
OUT AFSDelegation *Delegation
) = 65608;
/* Confirm undelegation and req. assert locks, if any */
proc UndelegateReturningLocks(
IN AFSFid Fid,
afs_uint32 Flags,
AFSByteRangeLockSeq *AssertedLocks_Array,
OUT AFSLockFlagsSeq *OutResult
) = 65609;
5 Appendix B: XDR Grammar (afscbint.xg)
const CLIENT_CAPABILITY_BYTE_RANGE_LOCK = 0x0008;
const CLIENT_CAPABILITY_DELEGATION = 0x008;
/* Revoke-Delegation Cancellation Type */
const AFSCB_Cancel_ExtendLocks = 7; /* re-assert locks, or lose them */
const AFSCB_Cancel_RevokeLocks = 8; /* locks on Fid revoked */
const AFSCB_Cancel_RevokeDelegation = 9; /* delegation is revoked */
/* Delegation Callback Flag */
const AFSCB_Flag_Delegation = 2; /* file delegation */
/* Cancellation Flags */
const AFSCB_Flag_AssertLocks = 4; /* request ExtendLock */
const AFSCB_Flag_RevokeLocks = 8; /* locks cancelled, sorry */
const AFSCB_Flag_RevokeDelegation = 16; /* delegation is revoked */
/* confirm issue of deferred lock requests */
proc AsyncIssueByteRangeLock(
IN HostIdentifier *Server,
AFSByteRangeLockSeq <AFS_LOCK_SEQ_MAX>
) = 65540;
/* extended callback expansion for delegation */
struct AFSCB_Data_Delegation {
AFSFid Fid;
afs_uint32 Flags;
afs_uint64 Offset;
afs_uint64 Length;
afs_uint64 ExpirationTime;
};
union AFSCB_NotificationData switch (afs_uint32 Event_Type) {
case AFSCB_Event_StoreData:
AFSCB_Data_StoreData u_store_data;
case AFSCB_Event_StoreACL:
void;
case AFSCB_Event_StoreStatus:
AFSCB_Data_StoreStatus u_store_status;
case AFSCB_Event_CreateFile:
AFSCB_Data_CreateFile u_create_file;
case AFSCB_Event_MakeDir:
AFSCB_Data_MakeDir u_make_dir;
case AFSCB_Event_Symlink:
AFSCB_Data_Symlink u_symlink;
case AFSCB_Event_Link:
AFSCB_Data_Link u_link;
case AFSCB_Event_RemoveFile:
AFSCB_Data_RemoveFile u_remove_file;
case AFSCB_Event_RemoveDir:
AFSCB_Data_RemoveDir u_remove_dir;
case AFSCB_Event_Rename:
AFSCB_Data_Rename u_rename;
case AFSCB_Event_Deleted:
void;
case AFSCB_Event_ReleaseLock:
AFSCB_Data_Lock u_lock;
case AFSCB_Event_Cancel:
void;
case AFSCB_Event_Delegation:
AFSCB_Data_Delegation u_delegation;
};
References
[1] Bradner, S., "Key words for use in RFCs to Indicate
Requirement Levels", BCP 14, RFC 2119, March 1997.
[2] Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame,
C., Eisler, M., and D. Noveck, "Network File System (NFS) version
4 Protocol", RFC 3530, April 2003.
[3] Edward R Zayas, "AFS-3 Programmer's Reference: File
Server/Cache Manager Interface", Transarc Corporation,
FS-00-D162, 20th August 1991
[4] Paul J. Leach, Dilip C. Naik. A Common Internet File System
(CIFS/1.0) Protocol
[http://www.tools.ietf.org/html/draft-leach-cifs-v1-spec-01],
1997.
[5] Jake Edge. CRFS and POHMELFS
[http://lwn.net/Articles/267896/].
[6] OpenAFS Roadmap [http://openafs.org/roadmap.html].
[7] S. Shepler, M. Eisler, D. Noveck. NFS Version 4 Minor Version 1
[http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-minorversion1-23.txt],
May 2008.
[8] T. Myklebust, J. Fields, W. Adamson, P. Honeyman. Network
File System (NFS) version 4 byte range delegations
[http://tools.ietf.org/html/draft-myklebust-nfsv4-byte-range-delegations-00],
October 2005.
[9] Trond Myklebust. Byte Range Delegations.
[https://www3.ietf.org/proceedings/05nov/slides/nfsv4-3.pdf],
November 2006.
[10] Jiaying Zhang and Peter Honeyman, "Reliable Replication at
Low Cost," CITI Technical Report 06-2, January 2006.