Hi there Linux file system developers,
Andreas pointed me at this thread, so I thought I'd stick my head in and
try to give a little perspective to parallel work occurring in the FreeBSD
world, and in the ex-POSIX.1e world. I'll also give my viewpoints on EA
and ACL semantics, although I should point out that my opinion is
certainly mutable, and possibly irrelevant :-). Also, this is a rather
long and verbose message that describes requirements, goals, and some
design decisions made.
When I started work on a POSIX.1e implementation for FreeBSD, the most
challenging aspect to deal with was where to store new file labels.
Adding label handling to kernel objects is straightforward, as are new
access control checks, etc. To attempt to address this in a general
manner, I chose Extended Attributes. Prior to beginning work on an EA
implementation, I did look carefully at the other options available, and
summarized a number of these over time on the posix1e mailing list
(archived on www.securityfocus.com). A number of members of other
developer communities have also talked about their experiences with
backing stores for new security labels there.
Developers from SGI's Trusted IRIX group considered and have documented a
number of techniques, including shadow inodes, "Plan G" file-backed
extended attributes, and making use of reserved fields in the inode under
EFS. Under XFS, they built support for EAs directly into the file system.
Other trusted operating system groups have backed attributes similarly:
often extra fields in the inode are absconded with either to contain the
data directly, or index into another database. Under at least one HP
implementation, for example, there's actually a userland database that
maintains labels and provides information to the kernel on demand. Solaris
uses a shadow inode technique to store ACLs, also. For those not familiar
with this technique, it involves bundling inodes in pairs: one inode
stores the traditional file and address space; the other stores attribute
data and is only accessible through well-defined system APIs.
As I document in my recent BSDCon paper, my primary goals in developing
extended attribute support lay in a number of requirements, some
imposed by the purpose of the EAs, others imposed by portability concerns:
1) EAs must be associated with the inode, and not with the name. This
allows security attributes to be maintained in a similar manner to
today's permissions, maximizing consistency between different
security mechanisms (mandatory, discretionary, etc).
2) Atomic updates of labels on an inode must be possible: via locking, or
fundamentally atomic operations. For an ACL or a MAC label, it is
unacceptable to have two updates be simultaneously and partially
applied resulting in an inconsistent label.
3) EAs used by system policies must be protected from arbitrary
modification by the owners of the object, as they may correspond with
security attributes that the owner may not be permitted to modify (MAC
labels, et al).
4) EAs used for system policies may also be private to the system itself:
for example, it may not be desirable for a file owner to access keying
material bound to the file system object.
I also optimized the semantics for a number of other goals that weren't
strict requirements:
1) Multiple EAs are desirable, based on a namespace permitting them to be
easily distinguished and managed. For example, it is desirable to
store file ACLs separately from capabilities, MLS, Type Enforcement
labels, etc.
2) It should be possible to distinguish the following types of situations:
EAs are not available for the file system, this EA is not defined for
this object, this EA is defined but 0-length, and this EA is defined
and has data.
3) The EA interface should be able to support a variety of backing stores.
In particular, the EA interface should not enforce particular syntax on
a particular attribute, but the underlying implementation may do so.
This is fundamental to the VFS concept. This allows some file systems
to support generic file system EAs for applications, but others to only
provide support for particular optimized attributes (e.g., exporting
permissions via an EA interface). This should be indistinguishable
from above the VFS except that some submissions of EA data might be
rejected based on formatting. This would allow future implementations
to optimize a particular attribute, such as storing the MAC label
directly in the inode where previously it had been in a less efficient
file-backed EA.
4) For the purposes of management, it should be possible to retrieve a
list of EAs on the file, relating to fs-specific backup mechanisms
(dump), and for non-fs-specific backup mechanisms in the application
namespace (tar).
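
As a rough illustration of goal (2) in the list above, the four situations
can be kept apart by giving each its own result code. This is only a
sketch: enum ea_status, struct ea_lookup, and ea_classify() are
hypothetical names invented for the example and are not part of the
FreeBSD code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes, one per situation listed in goal (2). */
enum ea_status {
	EA_UNSUPPORTED,		/* file system offers no EA support */
	EA_UNDEFINED,		/* this EA is not defined for this object */
	EA_EMPTY,		/* this EA is defined but 0-length */
	EA_HAS_DATA		/* this EA is defined and has data */
};

/* Hypothetical summary of what a backing store found for one name. */
struct ea_lookup {
	int	fs_supports_ea;	/* nonzero if the file system has EAs */
	int	defined;	/* nonzero if the name is defined */
	size_t	len;		/* data length, meaningful only if defined */
};

/* Map a lookup summary onto exactly one of the four situations. */
static enum ea_status
ea_classify(const struct ea_lookup *r)
{
	assert(r != NULL);
	if (!r->fs_supports_ea)
		return EA_UNSUPPORTED;
	if (!r->defined)
		return EA_UNDEFINED;
	return (r->len == 0) ? EA_EMPTY : EA_HAS_DATA;
}
```

The point of the sketch is only that "defined but empty" and "not defined"
never collapse into the same answer.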
For political reasons as well as technical, I opted not to try to make EAs
be the same as file forks, despite interest from Apple in doing so. In my
mind, EAs are operated on atomically, and not as complete address spaces.
We already have a nice abstraction for a name with multiple complete
addresses under it: the directory. Not only that, but we support a
complete hierarchical namespace under the name (sub-directories).
So, the semantics documented in my paper:
1) EAs on a file system inode are a set of (name, data) pairs. For each
inode, an attribute name may be defined or undefined, and if defined,
may be associated with zero or more bytes of data. This is similar to
environment variables in common shells.
2) The attribute data, unlike the file itself, does not comprise a
complete address space accessible through an independent file
descriptor: it is not a file fork. Instead, a relatively inflexible
set of API calls are provided to set, get, and remove the data
associated with a particular named attribute on the file. These
operations are atomic for a particular name on the inode, meaning that
it is not possible for a get operation to return an inconsistent view
of a particular attribute: writes and reads of a particular attribute
are serializable.
3) Protection models may exist protecting particular attributes, on the
basis of namespace, or discretionary/mandatory access control policies.
This protection should make it possible for EAs to be used to safely
store system attributes such as ACLs, mandatory access control labels,
and capabilities. As application writers may want to make use of
attributes from user-land, it may be desirable not to preclude this
use.
4) And (4) from above: it is possible to retrieve the list of accessible
EAs for the caller of a system call.
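
The (name, data) semantics above can be modelled with a small in-memory
table. This is a toy sketch, not the file-backed implementation: the
struct names, the fixed limits, and the ea_* functions are all assumptions
made up for the example, and it does not model locking or protection.

```c
#include <assert.h>
#include <string.h>

#define EA_MAX_ATTRS	8	/* hypothetical limits for the sketch */
#define EA_MAX_NAME	32
#define EA_MAX_DATA	64

struct ea_entry {
	int		used;
	char		name[EA_MAX_NAME];
	unsigned char	data[EA_MAX_DATA];
	size_t		len;	/* may legitimately be 0 */
};

struct ea_set {			/* stands in for the EAs of one inode */
	struct ea_entry	e[EA_MAX_ATTRS];
};

static struct ea_entry *
ea_find(struct ea_set *s, const char *name)
{
	for (int i = 0; i < EA_MAX_ATTRS; i++)
		if (s->e[i].used && strcmp(s->e[i].name, name) == 0)
			return &s->e[i];
	return NULL;
}

/* Replace the whole value, creating the attribute if it is undefined. */
static int
ea_set_value(struct ea_set *s, const char *name, const void *buf, size_t len)
{
	assert(s != NULL && name != NULL);
	if (len > EA_MAX_DATA || strlen(name) >= EA_MAX_NAME)
		return -1;
	struct ea_entry *e = ea_find(s, name);
	for (int i = 0; e == NULL && i < EA_MAX_ATTRS; i++)
		if (!s->e[i].used)
			e = &s->e[i];
	if (e == NULL)
		return -1;	/* no free slot */
	e->used = 1;
	strcpy(e->name, name);
	if (len > 0)
		memcpy(e->data, buf, len);
	e->len = len;
	return 0;
}

/* Length (possibly 0) on success; -1 if undefined or buf is too small. */
static long
ea_get_value(struct ea_set *s, const char *name, void *buf, size_t cap)
{
	struct ea_entry *e = ea_find(s, name);
	if (e == NULL || e->len > cap)
		return -1;
	if (e->len > 0)
		memcpy(buf, e->data, e->len);
	return (long)e->len;
}

/* Make the attribute undefined again; -1 if it was not defined. */
static int
ea_remove(struct ea_set *s, const char *name)
{
	struct ea_entry *e = ea_find(s, name);
	if (e == NULL)
		return -1;
	e->used = 0;
	return 0;
}
```

As with an environment variable, setting a zero-length value leaves the
name defined, while removing it makes later lookups fail.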
Based on these goals, the final optimizations for my current
implementation were:
1) Rapid development time -- I should be able to implement the mechanism
rapidly (three week prototyping time)
2) Optimize for a fixed set of attributes, with fixed maximum sizes: all
of the security labels I'm using fall into this, and it's a strict
subset of the general EA behavior defined above.
3) Optimize for not requiring people to modify their underlying file
system storage: while this is fine for a new file system, my goal was
to rapidly move from no code to a moderate code base, yet allow people
to experiment without replacing or toasting partitions, installing new
newfs/fsck versions, etc.
I selected a file-backed implementation, which is probably not relevant to
the discussion of EA semantics in and of themselves, but motivates the
interface I selected. This implementation is quite similar to that used
by SGI in "Plan G" prior to XFS, so there are at least two worked examples
of such a system.
I support two name spaces, much in the style of SGI's implementation,
although I'm currently considering moving to additional namespaces as
discussed in detail on the POSIX.1e mailing list. The two namespaces are
(a) the system namespace, modifiable only with appropriate privilege, and
(b) the user namespace, modifiable based on DAC and MAC restrictions on the
normal file contents. These are semantically very similar to SGI's "root"
and "user" namespaces, only I avoided the name "root" since TrustedBSD
doesn't have one of those, and it really connotes the use of privilege,
not a DAC-like uid mechanism. Based on a suggestion from Andreas, I
currently prefix system attributes with '$', and other names are in the
user namespace. This also differs from the SGI approach where there are
two separate namespaces based on a namespace flag. A hierarchical namespace
might also be a reasonable choice.
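
The '$' prefix convention could be expressed roughly as follows. This is a
sketch under assumptions: the enum, ea_namespace_of(), and ea_may_modify()
are hypothetical names, and the two integer flags stand in for whatever
privilege and DAC/MAC checks the kernel actually performs.

```c
#include <assert.h>
#include <stddef.h>

enum ea_namespace { EA_NS_SYSTEM, EA_NS_USER };

/* A leading '$' selects the system namespace; any other name is user. */
static enum ea_namespace
ea_namespace_of(const char *name)
{
	assert(name != NULL);
	return (name[0] == '$') ? EA_NS_SYSTEM : EA_NS_USER;
}

/*
 * Hypothetical modification check.  has_privilege is the outcome of the
 * privilege check; may_write_file is the outcome of the DAC and MAC
 * checks on the normal file contents.
 */
static int
ea_may_modify(const char *name, int has_privilege, int may_write_file)
{
	if (ea_namespace_of(name) == EA_NS_SYSTEM)
		return has_privilege != 0;
	return may_write_file != 0;
}
```

With this split, a file owner who can write the file may change
"comment" but not "$acl" unless the privilege check also succeeds.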
The namespaces proposed on the POSIX.1e list by Casey at SGI (and worked
on by Andreas also, I believe, with less helpful comments from myself),
consist of (and not identical names to these, I should point out):
  Kernel namespace     -- accessible only from within the implementation
  Privileged namespace -- accessible from userland with appropriate
                          privilege
  Owner namespace      -- accessible by the object owner
  User namespace       -- accessible subject to DAC restrictions on the file
These are details, but address a number of the desirable aspects: the need
for attributes managed purely in the kernel or the file system itself, and
exported from the kernel via more general interfaces. Also, the desire to
have a namespace used in userland, but requiring privilege to manage.
Etc.
I made different decisions from Andreas's implementation about the
level at which a conversion is performed from the EA binary blob to a
structured data format provided by POSIX calls such as acl_get_file(),
cap_get_file(), mac_get_file(), etc. In my implementation, ACLs are a
structured data type within the kernel, and defined syntactically at the
VFS layer, and expected to be understood by file systems (although the
semantics for each file system may vary). This also allows file systems
to not treat ACLs as having the same semantics as EAs, although
substantial differences in semantics might be bad. Andreas's
implementation makes this conversion in the userland libraries, using EAs
as a general way to pass extensible file system meta-data to and from the
kernel. While that is possible in FreeBSD also, my choice with ACLs
reflected my desire to support other ACL formats at the kernel level:
e.g., Coda/AFS ACLs, POSIX ACLs, etc.
For the purposes of syscall semantics, the available operations are atomic
get/replace/remove, where replace creates if the EA is currently
undefined. An array of iovecs is supported to set/get the values,
allowing scatter/gather operations, but always from offset 0 in the EA.
If multiple-operation "transactions" are required, file locking can be
employed from userland if applications agree on a locking protocol
appropriate for their use, or within kernel, an exclusive vnode lock can
make operations atomic with respect to other vnode consumers (although
this doesn't propagate it down the VFS as an atomic transaction).
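
To make the "array of iovecs, always from offset 0" point concrete, here
is a minimal sketch of the gather half: several caller buffers are
concatenated into one attribute value starting at offset 0. struct iovec
is the standard type from <sys/uio.h>; ea_gather() itself is a
hypothetical helper, not a FreeBSD interface.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Gather iovcnt segments into dst, starting at offset 0 of the value.
 * Returns the total length, or -1 if the segments exceed cap bytes.
 */
static ssize_t
ea_gather(const struct iovec *iov, int iovcnt, void *dst, size_t cap)
{
	size_t off = 0;

	assert(iov != NULL && dst != NULL);
	for (int i = 0; i < iovcnt; i++) {
		/* off never exceeds cap, so cap - off cannot wrap. */
		if (iov[i].iov_len > cap - off)
			return -1;
		if (iov[i].iov_len > 0)
			memcpy((char *)dst + off, iov[i].iov_base,
			    iov[i].iov_len);
		off += iov[i].iov_len;
	}
	return (ssize_t)off;
}
```

A caller could, for example, pass a header and a body as two segments and
have them stored as a single value; there is no way to ask for a write
that begins part-way into the attribute.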
For reference, I'm including here the vnode operations that I defined.
For ACLs, atomic get/set are provided as well as a check routine which
allows the caller to determine if the provided ACL is both syntactically
and semantically appropriate for the target object. As with stat() and
chmod(), relative changes are non-atomic without use of a locking protocol
from userland, or vnode locking within the kernel. The ACL type indicates
the type of ACL for the object: in POSIX.1e, this has to do with
ACL_TYPE_DEFAULT and ACL_TYPE_ACCESS. I also have types for AFS/Coda
ACLs, etc. The struct ACL has fixed syntax, but semantics that depend on
the type passed.
#
#% getacl	vp	L L L
#
vop_getacl {
	IN struct vnode *vp;
	IN acl_type_t type;
	OUT struct acl *aclp;
	IN struct ucred *cred;
	IN struct proc *p;
};

#
#% setacl	vp	L L L
#
vop_setacl {
	IN struct vnode *vp;
	IN acl_type_t type;
	IN struct acl *aclp;
	IN struct ucred *cred;
	IN struct proc *p;
};

#
#% aclcheck	vp	= = =
#
vop_aclcheck {
	IN struct vnode *vp;
	IN acl_type_t type;
	IN struct acl *aclp;
	IN struct ucred *cred;
	IN struct proc *p;
};
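
To give a feel for the kind of validation a check routine could perform
before a set, here is a sketch of a purely structural test on a
POSIX.1e-style entry list. The types and names are hypothetical and are
not the kernel's struct acl; the rules encoded are my reading of the
POSIX.1e draft (exactly one owner, owning-group, and other entry, and a
mask entry whenever named user or group entries are present). It does not
check for duplicate qualifiers or anything specific to the target object.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tags modelled on POSIX.1e ACL entry tag types. */
enum tag {
	TAG_USER_OBJ, TAG_USER, TAG_GROUP_OBJ, TAG_GROUP,
	TAG_MASK, TAG_OTHER, TAG_COUNT
};

struct entry {
	enum tag	tag;
	unsigned	id;	/* uid/gid for TAG_USER and TAG_GROUP */
	unsigned	perm;	/* rwx bits, 0 through 7 */
};

/* Returns 1 if the list satisfies the structural rules, otherwise 0. */
static int
acl_entries_ok(const struct entry *e, size_t n)
{
	size_t count[TAG_COUNT] = { 0 };

	assert(e != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if ((unsigned)e[i].tag >= TAG_COUNT || (e[i].perm & ~7u))
			return 0;
		count[e[i].tag]++;
	}
	if (count[TAG_USER_OBJ] != 1 || count[TAG_GROUP_OBJ] != 1 ||
	    count[TAG_OTHER] != 1)
		return 0;
	if (count[TAG_MASK] > 1)
		return 0;
	/* Named user or group entries require a mask entry. */
	if ((count[TAG_USER] > 0 || count[TAG_GROUP] > 0) &&
	    count[TAG_MASK] != 1)
		return 0;
	return 1;
}
```

A caller would run a check of this sort first and only attempt the set if
it passes; the two steps together are still not atomic without locking.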
For vnode operations, there are the atomic get and set described earlier,
with the same multi-operation transaction choices. For those not familiar
with it, struct uio provides a scatter/gather target for userland/kernel
transfers, which allows the consumer of the operation to be unaware of the
source (potentially it can be across the network, but the only two
commonly used today in FreeBSD are kernel and userspace targets). If the
struct uio is NULL for a set, the file system is requested to perform a
remove operation. To retrieve an attribute list, a NULL name is passed,
and a nul-terminated list of nul-terminated strings is returned.
#
#% getextattr	vp	L L L
#
vop_getextattr {
	IN struct vnode *vp;
	IN const char *name;
	INOUT struct uio *uio;
	IN struct ucred *cred;
	IN struct proc *p;
};

#
#% setextattr	vp	L L L
#
vop_setextattr {
	IN struct vnode *vp;
	IN const char *name;
	INOUT struct uio *uio;
	IN struct ucred *cred;
	IN struct proc *p;
};
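
A consumer of the attribute list (a nul-terminated list of nul-terminated
strings) might walk the returned buffer along these lines. This is a
sketch: ea_list_foreach() is a hypothetical helper, and it assumes that
attribute names are non-empty so that an empty string marks the end of
the list.

```c
#include <assert.h>
#include <string.h>

/*
 * Walk a buffer of the form "name1\0name2\0\0", calling fn (if not NULL)
 * once per name.  Returns the number of names, or -1 if the buffer is
 * not properly terminated.
 */
static int
ea_list_foreach(const char *buf, size_t len,
    void (*fn)(const char *name, void *arg), void *arg)
{
	size_t off = 0;
	int n = 0;

	assert(buf != NULL);
	while (off < len) {
		if (buf[off] == '\0')
			return n;	/* empty string ends the list */
		const char *end = memchr(buf + off, '\0', len - off);
		if (end == NULL)
			return -1;	/* name runs past the buffer */
		if (fn != NULL)
			fn(buf + off, arg);
		n++;
		off = (size_t)(end - buf) + 1;
	}
	return -1;			/* final terminator missing */
}
```

A backup tool could use a walk like this to enumerate the names and then
fetch each attribute individually with a get operation.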
Right now, code supporting this in FFS is provided in FreeBSD 5.0-CURRENT,
and will appear for the first time in a formal release in 5.0-RELEASE next
year. This means the interfaces are still, to some extent, changeable.
The current semantics satisfy most consumers, although the namespace may
still be expanded subject to further discussion. Given that this is a
fairly portable interface -- similar implementations exist in TRIX + Plan
G, XFS, Andreas's ACL/EA code, and in FreeBSD -- I think there is some
precedent for using an interface similar to this.
In the FreeBSD world, some file systems will start exporting
file-system-native attributes using EAs in the near future. For example,
HPFS will export native HPFS EAs using this interface, instead of using
ioctl()'s as it does today. Similarly, MSDOSFS will export its
attributes, which map poorly into file modes and BSD file flags. There
has also been idle talk of scrapping the VOP_{GET,SET}ATTR interface, as
it has poor semantics and lots of historical baggage, and simply replacing
all invocations with VOP_{GET,SET}EXTATTR.
Thanks for listening, and apologies if this was not appropriate
content for this list--I only subscribed this afternoon :-). I'd
encourage you to take a look at the POSIX1e mailing list: you can
subscribe by sending e-mail to [EMAIL PROTECTED], or read
archives, as mentioned above, on www.securityfocus.com. It tends to take
a portability-oriented approach to security feature discussion, using the
defunct POSIX.1e D17 as a starting point.
Robert N M Watson FreeBSD Core Team, TrustedBSD Project
[EMAIL PROTECTED] NAI Labs, Safeport Network Services