[Cluster-devel] [PATCH] Add flags option to get xattr method paired to __vfs_getxattr
Add a flag option to get xattr method that could have a bit flag of XATTR_NOSECURITY passed to it. XATTR_NOSECURITY is generally then set in the __vfs_getxattr path. This handles the case of a union filesystem driver that is being requested by the security layer to report back the xattr data. For the use case where access is to be blocked by the security layer. The path then could be security(dentry) -> __vfs_getxattr(dentry...XATTR_NOSECUIRTY) -> handler->get(dentry...XATTR_NOSECURITY) -> __vfs_getxattr(lower_dentry...XATTR_NOSECUIRTY) -> lower_handler->get(lower_dentry...XATTR_NOSECUIRTY) which would report back through the chain data and success as expected, the logging security layer at the top would have the data to determine the access permissions and report back the target context that was blocked. Without the get handler flag, the path on a union filesystem would be the errant security(dentry) -> __vfs_getxattr(dentry) -> handler->get(dentry) -> vfs_getxattr(lower_dentry) -> nested -> security(lower_dentry, log off) -> lower_handler->get(lower_dentry) which would report back through the chain no data, and -EACCES. For selinux for both cases, this would translate to a correctly determined blocked access. In the first case with this change a correct avc log would be reported, in the second legacy case an incorrect avc log would be reported against an uninitialized u:object_r:unlabeled:s0 context making the logs cosmetically useless for audit2allow. This patch series is inert and is the wide-spread addition of the flags option for xattr functions, and a replacement of _vfs_getxattr with __vfs_getxattr(...XATTR_NOSECURITY). Signed-off-by: Mark Salyzyn Cc: Stephen Smalley Cc: linux-ker...@vger.kernel.org Cc: kernel-t...@android.com Cc: linux-security-mod...@vger.kernel.org Cc: sta...@vger.kernel.org # 4.4, 4.9, 4.14 & 4.19 --- Removed from an overlayfs patch set, and made independent. Expect this to be the basis of some security improvements. --- fs/9p/acl.c | 3 ++- fs/9p/xattr.c | 3 ++- fs/afs/xattr.c| 6 +++--- fs/btrfs/xattr.c | 3 ++- fs/ceph/xattr.c | 3 ++- fs/cifs/xattr.c | 2 +- fs/ecryptfs/inode.c | 6 -- fs/ecryptfs/mmap.c| 2 +- fs/ext2/xattr_trusted.c | 2 +- fs/ext2/xattr_user.c | 2 +- fs/ext4/xattr_security.c | 2 +- fs/ext4/xattr_trusted.c | 2 +- fs/ext4/xattr_user.c | 2 +- fs/f2fs/xattr.c | 4 ++-- fs/fuse/xattr.c | 4 ++-- fs/gfs2/xattr.c | 3 ++- fs/hfs/attr.c | 2 +- fs/hfsplus/xattr.c| 3 ++- fs/hfsplus/xattr_trusted.c| 3 ++- fs/hfsplus/xattr_user.c | 3 ++- fs/jffs2/security.c | 3 ++- fs/jffs2/xattr_trusted.c | 3 ++- fs/jffs2/xattr_user.c | 3 ++- fs/jfs/xattr.c| 5 +++-- fs/kernfs/inode.c | 3 ++- fs/nfs/nfs4proc.c | 6 -- fs/ocfs2/xattr.c | 9 +--- fs/orangefs/xattr.c | 3 ++- fs/overlayfs/super.c | 8 --- fs/posix_acl.c| 2 +- fs/reiserfs/xattr_security.c | 3 ++- fs/reiserfs/xattr_trusted.c | 3 ++- fs/reiserfs/xattr_user.c | 3 ++- fs/squashfs/xattr.c | 2 +- fs/xattr.c| 36 +++ fs/xfs/xfs_xattr.c| 3 ++- include/linux/xattr.h | 9 include/uapi/linux/xattr.h| 5 +++-- mm/shmem.c| 3 ++- net/socket.c | 3 ++- security/commoncap.c | 6 -- security/integrity/evm/evm_main.c | 3 ++- security/selinux/hooks.c | 11 ++ security/smack/smack_lsm.c| 5 +++-- 44 files changed, 119 insertions(+), 81 deletions(-) diff --git a/fs/9p/acl.c b/fs/9p/acl.c index 6261719f6f2a..cb14e8b312bc 100644 --- a/fs/9p/acl.c +++ b/fs/9p/acl.c @@ -214,7 +214,8 @@ int v9fs_acl_mode(struct inode *dir, umode_t *modep, static int v9fs_xattr_get_acl(const struct xattr_handler *handler, struct dentry *dentry, struct inode *inode, - const char *name, void *buffer, size_t size) + const char *name, void *buffer, size_t size, + int flags) { struct v9fs_session_info *v9ses; struct posix_acl *acl; diff --git a/fs/9p/xattr.c b/fs/9p/xattr.c index ac8ff8ca4c11..5cfa772452fd 100644 --- a/fs/9p/xattr.c +++ b/fs/9p/xattr.c @@ -139,7 +139,8 @@ ssize_t v9fs_listxattr(struct dentry *dentry, char *buffer, size_t buffer_size) static int v9fs_xattr_handler_get(const struct xattr_handler *handler, struct dentry *dentry,
Re: [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c
Hi, On 12/08/2019 14:43, Bob Peterson wrote: - Original Message - The real problem came with renames, though. Function gfs2_rename(), which locked a series of inode glocks, did so in parent-child order due to that patch. But it was still possible to create circular lock dependencies just by doing the wrong combination of renames on different nodes. For example: Node a: mv /mnt/gfs2/sub /mnt/gfs2/tmp_name (rename sub to tmp_name) a1. Same directory, so rename glock is NOT held a2. /mnt/gfs2 is locked a3. Tries to lock sub for rename, but it is locked on node b Node b: mv /mnt/gfs2/sub /mnt/gfs2/dir1/ (move sub to dir1... mv /mnt/gfs2/dir1/sub /mnt/gfs2/ ...then move it back) b1. Different directory, so rename glock IS held b2. /mnt/gfs2 is locked b3. dir1 is locked b4. sub is moved to dir1 and everything is unlocked b5. Different directory, so rename glock IS held again b6. dir1 is locked b7. Lock for /mnt/gfs2 is requested, but cannot be granted because node 1 locked it in step a2. If the parents are being locked before the child, as per the correct locking order, then this cannot happen. The directory in which the child is located should always be locked first, before the child, so that is what protects the operation on a from whatever might be going on, on node b. When you get to step b7, sub is not locked (since it was unlocked in b4) and not locked again. Thus a3 can complete. So this doesn't look like it is the right explanation. Hi, I guess maybe my explanation is lacking. It's not so much a relationship between "parent" and "child" directories as it is "old" and "new" directories. The comments for function vfs_rename() explain the situations in which this can happen, and have been prevented on a single node through the use of s_vfs_rename_mutex. However, that mutex is not cluster-wide, which means the relationship of which inode is the "old" and which inode is the "new" can change indiscriminately without notice and without cluster-wide locking. The whole point of the "a" and "b" scenarios was to illustrate that one node can lock "old", then "new", but the other node can reverse the roles of those same inodes (which is the "old" and which is the "new") and therefore reverse the lock order without notice. Since the old-new relationship itself is not protected, we need some other way to get the lock order correct. My first attempt to fix this was to extend the "rename" glock to have a rename-wide reach so it affected both types of renames rather than today's code which only locks old and new if they're different. I implemented this with a new i_op called by vfs (vfs_rename) to make the rename glock serve as a kind of cluster-wide version vfs's s_vfs_rename_mutex. However, this ended up having a huge performance penalty for my test. My second attempt (the patch I posted) was to lock the inodes in block-number-sort order, because the block number relationships will never change, regardless of which is old and which is new. It made no sense to me to reinvent the wheel wrt locking them in sorted order, so I used gfs2_glock_nq_m which already does that. Regards, Bob Peterson We are doing our best to get rid of the _m glock functions. Sorting things in block number order is a bit of a hack and it would be better to spend the time reducing the number of glocks involved in each operation overall. I have wondered about the performance issues on the rename glock. Simply using that for everything is the obvious easy fix, but perhaps not surprising that you've seen some performance issues with that approach. I wonder if we can come up with a solution to break up the single rename glock into separate glocks using a hashing scheme, or some similar system. That way we might get the advantages of both improved speed and to retain the parent/child locking. Either way, changing the lock ordering of lots of other bits of code is a non-starter, since then it will be incompatible with the way gfs2 has worked since it was created, and also incompatible with the vfs's own locking order that is used for local locks too. Lets see if we can figure out a solution that will just address this particular issue on its own, Steve.
Re: [Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c
Hi, On 09/08/2019 19:58, Bob Peterson wrote: Hi, This patch fixes problems caused by regressions from patch "GFS2: rm on multiple nodes causes panic" from 2008, 72dbf4790fc6736f9cb54424245114acf0b0038c, which was an earlier attempt to fix very similar problems. The original problem for which it was written had to do with simultaneous link, unlink, rmdir and rename operations on multiple nodes that interfered with one another, due to the lock ordering. The problem was that the lock ordering was not consistent between the operations. The defective patch put in place to solve it (and hey, it worked for more than 10 years) changed the lock ordering so that the parent directory glock was always locked before the child. This almost always worked. Almost. The rmdir version was still wrong because the rgrp glock was added to the holder array, which was sorted, and the locks were acquired in sorted order. That is counter to the locking requirements documented in: Documentation/filesystems/gfs2-glocks.txt which states the rgrp glock glock must always be locked after the inode glocks. Yes, that does need fixing, however it also doesn't entirely make sense, because the parent in that case is locked, but is not being removed, so it's rgrp would not need to be added to the transaction. anyway. The real problem came with renames, though. Function gfs2_rename(), which locked a series of inode glocks, did so in parent-child order due to that patch. But it was still possible to create circular lock dependencies just by doing the wrong combination of renames on different nodes. For example: Node a: mv /mnt/gfs2/sub /mnt/gfs2/tmp_name (rename sub to tmp_name) a1. Same directory, so rename glock is NOT held a2. /mnt/gfs2 is locked a3. Tries to lock sub for rename, but it is locked on node b Node b: mv /mnt/gfs2/sub /mnt/gfs2/dir1/ (move sub to dir1... mv /mnt/gfs2/dir1/sub /mnt/gfs2/ ...then move it back) b1. Different directory, so rename glock IS held b2. /mnt/gfs2 is locked b3. dir1 is locked b4. sub is moved to dir1 and everything is unlocked b5. Different directory, so rename glock IS held again b6. dir1 is locked b7. Lock for /mnt/gfs2 is requested, but cannot be granted because node 1 locked it in step a2. If the parents are being locked before the child, as per the correct locking order, then this cannot happen. The directory in which the child is located should always be locked first, before the child, so that is what protects the operation on a from whatever might be going on, on node b. When you get to step b7, sub is not locked (since it was unlocked in b4) and not locked again. Thus a3 can complete. So this doesn't look like it is the right explanation. (Note that the nodes must be different, otherwise the vfs inode level locking prevents the problem on a single node). Thus, we get into a glock deadlock that looks like this: host-018: G: s:EX n:2/3347 f:DyIqob t:EX d:UN/2368172000 a:0 v:0 r:3 m:150