Re: [RFC] Add vfsmount to vfs helper functions.
Hello.

No printable comments, except for that: (e) why don't you guys move the Linus' Serious Mistake to _callers_ of vfs_mknod() and its ilk? Which obviously solves all problems with having vfsmount.

Excuse me. I didn't understand what "the Linus' Serious Mistake" is, or what moving it to the _callers_ of vfs_mknod() means. Could you give me some URLs or hints?

Regards.

- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello.

Indan Zupancic wrote:

It seems to me that the alternatives you are proposing include modification of userland applications. But my assumption is "Don't require modification of userland applications".

If you want a secure system it isn't that unreasonable to expect applications to not do brain dead things, so not requiring any modifications or config changes seems a bit optimistic to me.

It depends. Some users have to continue using brain dead legacy applications without modification because

  * the application's source code is not available.
  * the distributor no longer supports the application.
  * the application is too difficult/complicated to reconstruct.

For cases where you can expect that the application won't do brain dead things and/or we can reconstruct the application, your approach is OK.

In other words, I want to implement without asking applications to use /dev/dynamic/ or something. This filesystem is intended to provide support for legacy applications. (In fact, this filesystem in TOMOYO Linux is for kernel 2.4.30/2.6.11 and later.)

Legacy applications should cope with a static /dev/. What is the advantage of your filesystem compared to a static /dev/?

I assume a static /dev/ means a /dev/ directory in 2.4 kernels. This filesystem's advantages:

(1) Can guarantee filename/attribute pairs.

A process with root privilege can do

  mv /dev/hda1 /dev/hda1.tmp; mv /dev/hda2 /dev/hda1; mv /dev/hda1.tmp /dev/hda2

if /dev is in the / partition or is a devfs partition, whereas a process with root privilege cannot do the same if /dev is this filesystem, unless granted by the configuration file. So, you can guarantee that /dev/hda1 is block-3-1 and /dev/hda2 is block-3-2. (e.g. mount /dev/hda1 /home won't mount the block-3-2 partition on /home .)

(2) Can keep nodes that need not be deleted/modified read-only.

A process with root privilege can delete /dev/null on the / partition or on a devfs partition, whereas a process with root privilege cannot delete /dev/null on this filesystem unless granted by the configuration file. So, you can guarantee that a node which need not be deleted/modified won't be deleted/modified. (e.g. /dev/null is always there with the char-1-3 attribute.)

(3) Can hide unwanted device nodes.

A process with root privilege can create new nodes on the / partition or on devfs, whereas a process with root privilege cannot create new nodes on this filesystem that are not specified by the configuration file. So, you can expose specific nodes selectively. (e.g. Allow accessing /dev/hda1 , but forbid accessing /dev/hda2 .)

Use of a tiny daemon that communicates with udev is not sufficient. The udev is not the only application that modifies /dev files.

Oh, it isn't? Which other applications do modify /dev files? I'd like to hear about a few, no matter how obscure or proprietary. And please tell how many of those will stop working with a static /dev with all nodes they might create already existing.

I don't know. I'm not using rare software.

At least, the tiny daemon should communicate with the kernel so that all requests are checked by the tiny daemon.

No, why should the kernel be involved? The tiny daemon would be the only one allowed to modify /dev/, so all mknod commands will be done by it. Of course it means that you might need to modify the two or three apps wanting to create device nodes, or you can make an LD_PRELOAD lib that intercepts mknod commands and sends them to the daemon.

No. The kernel must be involved. Suppose the tiny daemon is the only one allowed to modify /dev/ .

  * foo requests mknod /dev/null from a chroot() environment.
  * bar requests mknod /dev/null from a clone(CLONE_FS) + mount() environment.

How can the daemon know where to create the node?

How can the daemon determine whether the requested pathname is in the /dev directory or not? The process that requests mknod and the process that performs mknod are not always using the same / directory. The daemon must not forbid creation of /dev/null if the realpath() is /tmp/dev/null (i.e. mknod /dev/null after chroot /tmp), because the daemon is not asked to manage the /tmp/dev directory.

Who can guarantee that the daemon can access all namespaces? The process that requests mknod and the process that performs mknod are not always using the same namespace.

If foo or bar is a statically linked or suid-root application (where LD_PRELOAD is ignored), it would attempt to create device nodes directly (i.e. call sys_mknod() instead of communicating with the daemon) and abort due to failure. This applies not only to applications that want to create device nodes in /dev/ , but also to all applications that want to modify entries in /dev/ .

From the beginning, the kernel is deeply involved because in-kernel MAC is essential
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello.

Indan Zupancic wrote:

That only the tiny daemon can modify /dev/ is done with MAC rules, the ones that should be the default for all applications except udev by default already. For the kernel nothing changes.

OK. You assume use of a MAC with sufficiently fine-grained access control.

Wrong. All nodes are created and thus there's never a need to create new nodes. So /dev/ can't be modified by anyone. This works because all nodes that anyone might want to create already exist.

"Already exist" is not enough. These nodes have to be deletable if requested by an appropriate process. These nodes have to be protected by MAC from direct calls to mknod()/rename()/unlink()/link()/mount() etc.

This is true on a theoretical level. But practically I think you can either run multiple daemons, one for each namespace where you want to control /dev/,

What if the daemon does not exist in that namespace?

or if you really want one daemon you can pass the directory fd to it where the node should be created and use mknodat(). I believe that crosses namespaces correctly.

The fd passed to mknodat() is used for starting from the specified directory instead of the current directory. The object obtained by resolving the rest of the pathname depends on the / of the calling process. If /var/jail/dev/dyndev/link is a symlink to /dev , a process in chroot(/var/jail/) + chdir(/) will get /var/jail/dev/node and a process not in chroot(/var/jail/) + chdir(/) will get /dev/node by resolving mknodat(fd_for_/var/jail/, dev/dyndev/link/node) . If the process is in the chroot() but the daemon is not in the chroot() , the daemon will create nodes in the wrong location.

So, you let the LD_PRELOAD library resolve all directory components before passing the fd to the daemon over a UNIX domain socket, so that the daemon won't create nodes in the wrong location. OK. It looks like it works, although I'm not taking race conditions into account.

But I think that the chance that any process needs to create device nodes in a chroot is at the level of fairy existence.

Not only mknod() but also rename()/unlink()/link()/mount(bind) etc. may cause a filename/attribute mismatch.

How can the daemon know whether the request is trying to manipulate nodes in the /dev directory or not? If mount --bind /dev/ /var/dir/ is used, the daemon must check the filename/attribute pair when mknod(/var/dir/null) is requested, because permitting the request will modify the /dev state. If mount --bind /dev/ /var/dir/ is not used, the daemon must not check the filename/attribute pair when mknod(/var/dir/null) is requested, because permitting the request will not modify the /dev state.

What does the daemon do? It receives requests from the LD_PRELOAD library over a UNIX domain socket, checks the filename/attribute pair, and issues mknodat()/renameat()/unlinkat()/linkat() etc. when the combination is appropriate?

What does the LD_PRELOAD library do? It intercepts all pathname related syscalls (except open()), resolves the directory components, determines whether the request is trying to manipulate nodes in the /dev directory, and forwards the request to the daemon over a UNIX domain socket?

Make the daemon and the LD_PRELOAD library bug-and-race free and develop the MAC policy for the daemon and the LD_PRELOAD library

and

Make this filesystem bug-and-race free.

Which one is easier?

Regards.
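The question "is this request trying to manipulate nodes in /dev?" can be illustrated with the kind of purely lexical check a userland library might be tempted to use. This is a minimal sketch, assuming a hypothetical helper path_is_under() that receives an absolute, already normalized path; it is not taken from any posted code.

```c
#include <string.h>

/* Hypothetical helper: nonzero when `path` (assumed absolute and already
 * normalized) is `dir` itself or lies below it. `dir` is assumed to have
 * no trailing slash. */
int path_is_under(const char *path, const char *dir)
{
    size_t n = strlen(dir);

    if (strncmp(path, dir, n) != 0)
        return 0;
    /* Reject "/device" when dir is "/dev": the match must end at a
     * component boundary. */
    return path[n] == '\0' || path[n] == '/';
}
```

The check answers 0 for /var/dir/null regardless of whether mount --bind /dev/ /var/dir/ is in effect, and it knows nothing about the caller's root directory or namespace. That blind spot is exactly the objection raised in the message: the information needed to decide is held by the kernel, not by the pathname string.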
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello. Indan Zupancic wrote: Good point, but I assume they all have at least a directory granularity, and then /dev/ can be static and udev and other can have free reign in e.g. /dev/dynamic/. Just use subdirs for the dynamic stuff and this granularity problem is, with slight inconvenience, solved. It seems to me that the alternatives you are proposing include modification of userland applications. But my assumption is that Don't require modification of userland applications. In other words, I want to implement without asking applications to use /dev/dynamic/ or something. This filesystem is intended to provide support for legacy applications. (In fact, this filesystem in TOMOYO Linux is for kernel 2.4.30/2.6.11 and later.) Exploits are in code, and where that code is doesn't matter that much, either kernel or userspace, though if it's exploitable you'll rather not have it in the kernel. So I think it's more secure if the checking would be done by udev than in a special filesystem, even if that means that you're screwed if udev is exploited. Of course you fully trust your own code, naturally. I'm keeping the mechanism as simple as possible so that there is unlikely room (e.g. buffer overflow) for running exploits. A tiny daemon that communicates with udev and does the checking you have now, and if ok it creates the node is really not much more code than your fs, so as hard to exploit too. Then if udev is hacked you have the same guarantee as you have now. Use of a tiny daemon that communicates with udev is not sufficient. The udev is not the only application that modifies /dev files. At least, the tiny daemon should communicate with the kernel so that all requests are checked by the tiny daemon. But use of the tiny daemon (which is a process running in userland) causes a lot of troubles. See the block after the -- boundary -- of this posting. My assumption is that Don't require userland process's assistance, as written at Why not use FUSE?. 
Protecting certain files from being modified seems to me more generic than enforcing filename/attributes pairs on device nodes. OK. You are saying that from the point of view of what it can. I thought you were saying enforcing filename/attributes pairs from out-of-this-filesystem (e.g. MAC) is more flexible than this-filesystem. rm -f /dev/either-null-or-zero as said before, if this is possible then the MAC config used is wrong. Exactly the same as for your filesystem with mknod /dev/tmp1 c 1 X mount --bind /dev/tmp1 /dev/either-null-or-zero and you count on the MAC to prevent that. An administrator asks MAC to prevent processes (except specific processes who need to do rm -f /dev/either-null-or-zero) from doing rm -f /dev/either-null-or-zero. An administrator asks this filesystem to prevent processes from doing mknod /dev/tmp1 c 1 X. An administrator asks MAC to prevent processes from doing mount --bind /dev/tmp1 /dev/either-null-or-zero. And as for that app, if you trust it to create device nodes, why don't you trust it to make the right nodes too? If that app has a bug that triggers mknod /dev/either-null-or-zero 1$REPLY instead of mknod /dev/either-null-or-zero $REPLY under an unexpected circumstance, it will create unwanted nodes. Thus I don't trust the app. If an administrator wants something else than 3 or 5, you're breaking something. That's the fate of white-list based access control. Does this filesystem sound too strict to support dynamic device? May be this filesystem should be able to permit creation of device nodes that are not listed in the policy file. Can SELinux guarantee the same result as my filesystem even if udev or administrative programs have to be able to modify /dev ? More, because your filesystem doesn't guarantee anything at all on its own. But assuming the MAC is decent enough to protect your fs from being bypassed, I'm sure it can do what's needed fine without your fs. I can't answer for SELinux because I don't know it well. 
But I trust it can protect files and/or directories, and that's all that's needed to achieve the same end result. I don't know SELinux well, but as far as seeing an example (found by Googling selinux allow mknod) allow udev_t self:capability { chown dac_override dac_read_search fowner fsetid sys_admin sys_nice mknod net_raw net_admin sys_rawio }; I can't find a place to specify filename/attributes pairs in this syntax. So, if the process who is permitted to create device nodes misbehaves, it will generate unexpected filename/attribute pairs. I think SELinux can't guarantee the same result as my filesystem. You seem to assume that the in-kernel implementation is suddenly guaranteed bugfree. I keep the implementation as simple as possible. From your next posting: But I think doing more is getting ridiculous, because if a process can create a device node, it can also access it and do whatever harm could
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello. Indan Zupancic wrote: I want to use this filesystem in case where a process with root privilege was hijacked but the behavior of the hijacked process is still restricted by MAC. 1) If the behaviour can be controlled, why can't the process be disallowed to change anything badly in /dev? Like disallowing anything from modifying existing nodes that weren't created by that process. That would have practically the same effect as your filesystem, won't it? MAC system can prevent hijacked processes from changing anything badly in /dev . But MAC system can't prevent hijacked processes from doing mv /dev/hda1 /dev/hda1.tmp; mv /dev/hda2 /dev/hda1; mv /dev/hda1.tmp /dev/hda2 if permissions to rename device nodes in /dev are given to hijacked processes. This is because MAC implementation doesn't check filename/attribute pairs. But this filesystem can prevent hijacked processes from doing mv /dev/hda1 /dev/hda1.tmp; mv /dev/hda2 /dev/hda1; mv /dev/hda1.tmp /dev/hda2 even if permissions to rename device nodes in /dev are given to hijacked processes. This filesystem is not designed to forbid modifying nodes if that process needn't to modify nodes. This filesystem is designed to forbid breaking filename/attribute pairs of nodes even if that process need to (or permitted to) modify nodes. Or phrased differently, if the MAC system used can't protect /dev, it won't be able to protect other directories either, and if it can't protect e.g. my homedir, doesn't it make the whole MAC system ineffective? And if the MAC system used is ineffective, your filesystem is useless and you've bigger problems to fix. You can use nodev mount option to prevent attackers from opening device files. You can use MAC system to prevent attackers from mounting partitions (other than /dev partition) without nodev option. 2) The MAC system may not be able to guarantee certain combinations of device names and properties, but isn't that policy that shouldn't be in the kernel anyway? 
But if it is, shouldn't all device nodes be checked? That is, shouldn't it be a global check instead of a filesystem specific one? I think the reason why MAC system doesn't handle filename/attributes pairs is that: Filename and its attributes pairs are conventionally considered as constant and reliable. It makes the MAC's policy syntax complicated to describe this attribute enforcement information in MAC's policy. Thus, this should be a global check. But usually device nodes are only in /dev . 3) Code efficiency. Thousand lines of code just to close one very specific attack, which can be done in lots of different other ways that all need to be prevented by the MAC system. (mounting over it, intercepting open calls, duping the fd, etc.) Is it worth it? This filesystem is doing what MAC system is not doing. So, please don't complain about inability of this filesystem to close all attacks. You can use MAC system to prevent attackers from mounting other filesystem over this filesystem. The filename/attribute pairs are something like system call entry tables. The application will go wrong if __NR_read is mapped to sys_write() and __NR_write is mapped to sys_read(). Userland applications access special functionalities (e.g. /dev/zero and /dev/random) by name (i.e. syscall numbers). Therefore, keeping the filename/attribute pairs tamper-proof is important. You recognize that there is a threat that device nodes may have irregular attribute (e.g. /dev/null existing as a regular file), do you? You don't deny implementing mechanisms somehow to avoid such threat, do you? OK. Then the matter is the comparison of code efficiency. This patch is less than 1100 lines in total. Large part of this patch is for parsing and managing policy file. If you try to extend every MAC implementation (SELinux, SMACK, AppArmor, TOMOYO) so that they can handle filename/attributes pairs (i.e. 
expand policy file's syntax and both in-kernel and userland data structures, manage strings with variant length and non-printable characters etc.), I think that modification exceeds this patch. I think guaranteeing filename/attribute pairs in filesystem layer can keep MAC system implementation simple and compact. http://www.mail-archive.com/linux-fsdevel@vger.kernel.org/msg10653.html Thank you.
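The difference between "permission to rename" and "preserving filename/attribute pairs" can be sketched in C as follows. This is only an illustration under stated assumptions: the struct pair layout, the two-entry policy table and the helpers name_accepts() and may_rename() are hypothetical, and the real patch keeps its policy in kernel data structures loaded at mount time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical record: a name and the device it must refer to. */
struct pair {
    const char *name;
    char type;              /* 'b' = block, 'c' = character */
    unsigned int maj, min;
};

/* Hypothetical policy for the example used in this thread. */
static const struct pair policy[] = {
    { "hda1", 'b', 3, 1 },
    { "hda2", 'b', 3, 2 },
};

/* Nonzero when a node with the given attributes may carry `name`. */
int name_accepts(const char *name, char type,
                 unsigned int maj, unsigned int min)
{
    size_t i;

    for (i = 0; i < sizeof(policy) / sizeof(policy[0]); i++) {
        if (strcmp(policy[i].name, name) == 0 &&
            policy[i].type == type &&
            policy[i].maj == maj && policy[i].min == min)
            return 1;
    }
    return 0;
}

/* A rename keeps the node's attributes, so the check is whether the
 * destination name accepts the attributes of the node being moved. */
int may_rename(const struct pair *node, const char *new_name)
{
    return name_accepts(new_name, node->type, node->maj, node->min);
}
```

With this table, the first step of the swap (moving hda1 to hda1.tmp) already fails because hda1.tmp is not a listed name, and moving the block-3-2 node onto the name hda1 fails because the pair does not match.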
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello.

[EMAIL PROTECTED] wrote:

Ouch. The .c files should generally be built into their own .o files and then the Makefile should do something like

  obj-$(CONFIG_SYAORAN) += syaoran.o

unless there's *really* good reasons for including .c files (such as an otherwise-messy variable-namespace issue or similar).

Yes. The final implementation will become so. This is a temporary hack to keep all functions and variables static.

Also, has this been double-checked to Do The Right Thing if you have *two* instances of ramfs mounted, one with Syaoran and one without?

Yes. The memory for the superblock is allocated for each instance. Thus, mounting one as syaoran and the other as tmpfs won't cause problems.

(incidentally, all of these should probably be abstracted into a helper function that's 'static inline' so we have just one #ifdef in the definition in a .h file, and none in open .c code).

Oh, good idea.

Similarly for other places you have #ifdef CONFIG_ in ramfs .c code - see if you can abstract it out.

This patch replaces the previous patch and this patch modifies only tmpfs (mm/shmem*) files. I'm no longer modifying ramfs (fs/ramfs/*) files.

+/*
+ * Original tmpfs doesn't set ramfs_dir_inode_operations.setattr field.
+ * Now I'm setting the field to share tmpfs/rootfs/syaoran code.

Question for the audience: *should* ramfs set that field so setattr works on ramfs (even if it's just a stub similar to the SELinux fscontext= mount stuff)?

Question for Tetsuo: What happens to this code if somebody actually does the above change?

Please forget this question. I'm no longer setting the ramfs_dir_inode_operations.setattr field.

+ Applications using well-known device locations under /dev
+ get the device they want (e.g. an application that accesses
+ /dev/null can always get a character special device
+ with major=1 and minor=3).

This should say "will always get", not "can always", as this code will mandate, rather than just make possible.

OK.
+ The list of possible combinations of filename and its attributes
+ that can exist on this filesystem is defined at mount time
+ using a configuration file.

The format of this file needs to be documented.

Yes. It is a line-by-line processable format defined as:

  filename permission owner group flags type [ symlink_data | major minor ]

where flags is a bitwise combination of

  *  1: Allow creation of the file.
  *  2: Allow deletion of the file.
  *  4: Allow changing permissions of the file.
  *  8: Allow changing owner or group of the file.
  * 16: For internal use. Remembers whether this file is opened or not.
  * 32: Don't create this file at mount time.

and here are some example entries:

  pts      755 0 0  0 d
  shm      755 0 0  0 d
  fd       777 0 0  0 l /proc/self/fd
  stdin    777 0 0  0 l /proc/self/fd/0
  stdout   777 0 0  0 l /proc/self/fd/1
  stderr   777 0 0  0 l /proc/self/fd/2
  null     666 0 0  0 c 1 3
  zero     666 0 0  0 c 1 5
  random   644 0 0  0 c 1 8
  urandom  644 0 0  0 c 1 9
  tty      666 0 0  0 c 5 0
  tty0     600 0 0 12 c 4 0
  cdrom    777 0 0  3 l /dev/scd0
  console  600 0 0  1 c 5 1
  hda      660 0 6  0 b 3 0
  hda1     660 0 6  0 b 3 1
  initctl  600 0 0  3 p
  log      666 0 0 15 s
  rtc      644 0 0  0 c 10 135
  ptmx     666 0 0  0 c 5 2
  ram      777 0 0  3 l /dev/ram0
  ram0     660 0 6  0 b 1 0
  ram1     660 0 6  0 b 1 1
  sda      660 0 6  0 b 8 0
  initrd   660 0 6  1 b 1 250

Full documentation of this filesystem is at http://tomoyo.sourceforge.jp/en/1.5.x/policy-syaoran.html

I'm not terribly thrilled by the idea of passing a file to be read by the kernel, but I also understand that if it isn't done before mount, you have a race condition between the mount and the load.

What race condition is possible? Are you worrying that the file gets modified while reading?

Perhaps write some configfs code so that you can 'mount /configfs; cat config.file > /configfs/syaoran; mount -t syaoran'?

If you worry that the file gets modified while reading in kernel space, you will also worry that the file gets modified while doing cat config.file > /configfs/syaoran.
To use configfs (or whatever approach that is done before mount syscall), some tag for
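The configuration format described in this message is simple enough to parse with sscanf(). The following is a minimal userspace sketch in C, not the parser from the patch: struct entry, the FLAG_* names and parse_entry() are hypothetical, filenames are assumed to contain no whitespace or escaped characters, and only the type letters that appear in the example entries (d, l, c, b, p, s) are recognized.

```c
#include <stdio.h>
#include <string.h>

/* Flag bits as listed in the message; the constant names are made up. */
enum {
    FLAG_MAY_CREATE         = 1,
    FLAG_MAY_DELETE         = 2,
    FLAG_MAY_CHMOD          = 4,
    FLAG_MAY_CHOWN          = 8,
    FLAG_INTERNAL_OPENED    = 16,
    FLAG_NO_CREATE_AT_MOUNT = 32
};

/* Hypothetical in-memory form of one configuration line. */
struct entry {
    char name[64];
    unsigned int perm;          /* parsed as octal, e.g. 666 */
    unsigned int uid, gid, flags;
    char type;                  /* d, l, c, b, p or s */
    char symlink[128];          /* only for type 'l' */
    unsigned int maj, min;      /* only for types 'c' and 'b' */
};

/* Parse "filename permission owner group flags type [...]".
 * Returns 0 on success, -1 on a malformed line. */
int parse_entry(const char *line, struct entry *e)
{
    int used = 0;

    memset(e, 0, sizeof(*e));
    if (sscanf(line, "%63s %o %u %u %u %c%n", e->name, &e->perm,
               &e->uid, &e->gid, &e->flags, &e->type, &used) != 6)
        return -1;

    switch (e->type) {
    case 'c':
    case 'b':   /* device nodes carry "major minor" */
        return sscanf(line + used, "%u %u", &e->maj, &e->min) == 2 ? 0 : -1;
    case 'l':   /* symlinks carry their target */
        return sscanf(line + used, "%127s", e->symlink) == 1 ? 0 : -1;
    case 'd':
    case 'p':
    case 's':   /* directory, FIFO, socket: no extra fields */
        return 0;
    default:
        return -1;
    }
}
```

Read this way, the flags value 12 on the tty0 entry is 4 + 8 (permissions and ownership may change), and 3 on cdrom is 1 + 2 (the symlink may be created and deleted).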
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello. Indan Zupancic wrote: I think you focus too much on your way of enforcing filename/attributes pairs. So? The same can be achieved by creating the device nodes with expected attributes, and preventing processes from changing those files. The device nodes have to be deletable if some process (including udev) needs to delete. Thus, you cannot unconditionally prevent processes from changing those files. This because expected combinations are known beforehand. Yes. And once those files are present, the MAC system used doesn't have to have special device nodes attributes support. Protecting those files is enough to guarantee filename/attributes pairs. If MAC system needn't to support this filesystem's functionality, who creates those files with warrantee of expected attributes? The udev does? If udev is exploited, who can guarantee? No, this is because rename permission was given for files that it shouldn't had. Do you think all MAC implementation have the same granularity and functionalities? I don't think so. Not all MAC implementation can control with such granularity. This filesystem is designed to be combined with any MAC, although the MAC used with this filesystem should be able to restrict namespace manipulation requests so that this filesystem can remain /dev and visible to userland applications. Either you want a process to manage device names and attributes, and then you give it permission to do that, or you want to enforce certain filename/attribute pairs and then you just do it yourself. If I modify udev to enforce certain filename/attribute pairs and the modified udev was exploited, who can guarantee? Don't trust userland application is the basis of restricting access in kernel space. If you can trust userland application, you don't need in-kernel access control. Will your filesystem prevent the trivial case of rm /dev/hda1 ln -s /dev/hda2 /dev/hda1 Of course. To permit the above operation, the following permissions are needed. 

  hda1 660 0 6  2 b 3 1
  hda1 777 0 0 33 l .

Rename permission can be given for /dev in general, but prohibited for certain files in /dev, the ones you want to have specific attributes. It isn't all or nothing.

Do you think all MAC implementations can prohibit renaming for certain files in /dev ?

It's "forbid modifying certain nodes that process needn't to modify" versus "forbid breaking filename/attribute pairs of certain nodes". Both have the same effect, except that the first one is generic and can be done by existing MAC systems, while the second one needs a special filesystem and a handful of MAC rules to make it effective.

Do you think all MAC implementations can do that? I think the first one is implementation specific and the second one is generic.

It doesn't matter where they are, it's that a different fs than yours could be mounted over it. You say a MAC can prevent that from happening, but a MAC can also prevent all processes except for udev from modifying /dev.

But MAC cannot prevent udev from modifying /dev . And what if udev is exploited? Not all MACs can enforce access control over all processes with the granularity you are talking about. And what if a process that cannot be controlled with your boolean level granularity exists (e.g. an administrator running his/her administrative applications that require modification of /dev )?

A crazy example of an administrative application: (Please don't say "Don't use such a crazy application".)

  #! /bin/sh
  rm -f /dev/either-null-or-zero
  read
  mknod /dev/either-null-or-zero c 1 $REPLY
  echo Administrative task finished successfully. | mail root

This filesystem can guarantee that /dev/either-null-or-zero is either char-1-3 or char-1-5 by using a policy

  either-null-or-zero 666 0 0  3 c 1 3
  either-null-or-zero 666 0 0 35 c 1 5

The boolean level granularity (e.g. forbid all processes except for udev , and modify udev to perform name/attribute pair enforcement) is not generic.

Userland applications sometimes misbehave. I assume kernel code doesn't misbehave.
If you doubt my assumption, you have to doubt in-kernel MAC implementation too. I don't. What I complain about is that it's too specific and does it one chosen job badly. It lacks abstraction. As far as I can see any decent MAC can achieve the same end result as your filesystem, without directly enforcing name/attr pairs. Can SELinux guarantee the same result as my filesystem even if udev or administrative programs have to be able to modify /dev ? The thing is, all special device nodes that are expected to exist by applications are known beforehand. Yes. Thus they can be created statically and can be protected against any modifications with any MAC system. But sometimes some modifications needs to be permitted. Who can guarantee that there is no application (other than udev)
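The either-null-or-zero policy from this message can be read as a white-list lookup at mknod time. The following C sketch shows one way such a lookup could work; struct rule, the rules table and may_create() are hypothetical and are not the patch's syaoran_may_create_node(). The flag value 1 ("allow creation") is taken from the configuration format posted earlier in the thread.

```c
#include <stddef.h>
#include <string.h>

#define FLAG_MAY_CREATE 1   /* bit 1 in the posted format: allow creation */

/* Hypothetical in-memory form of a policy line. */
struct rule {
    const char *name;
    unsigned int flags;
    char type;
    unsigned int maj, min;
};

/* The two policy lines from the message, plus a fixed /dev/null entry
 * whose flags value 0 grants no creation right. */
static const struct rule rules[] = {
    { "either-null-or-zero",  3, 'c', 1, 3 },
    { "either-null-or-zero", 35, 'c', 1, 5 },
    { "null",                 0, 'c', 1, 3 },
};

/* Nonzero when some rule lists exactly this name/attribute pair and
 * grants creation. Anything not listed is refused (white-list). */
int may_create(const char *name, char type,
               unsigned int maj, unsigned int min)
{
    size_t i;

    for (i = 0; i < sizeof(rules) / sizeof(rules[0]); i++) {
        const struct rule *r = &rules[i];

        if (strcmp(r->name, name) == 0 && r->type == type &&
            r->maj == maj && r->min == min &&
            (r->flags & FLAG_MAY_CREATE))
            return 1;
    }
    return 0;
}
```

Under these rules the administrative script succeeds when the administrator types 3 or 5, and a buggy variant that ends up requesting minor 13 is refused.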
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello.

[EMAIL PROTECTED] wrote:

Good summary - probably should add that to the patch, drop it into Documentation/syaoran-config.txt or similar...

I see.

Modification while reading *is* an issue, but can probably be worked around with some clever locking. The race condition I was thinking of was if you had the mount and the policy load be 2 separate events, you could see:

(a) issue mount request
(b) do something malicious in /dev while..
(c) load the policy that would have prevented (b).

This is partly why SELinux has init load the policy *very* early on, before any other userspace have had a chance to run and do things that would have been prevented by policy.

So, you suggested loading the policy before the mount() request so that this filesystem can prevent attackers from doing something malicious, by minimizing the latency (i.e. implementing it as a non-blocking operation) between the userland process's call of mount() and the nodes becoming visible to userland processes. I didn't take such cases into account. My assumed usage of this filesystem is to run a script like

  #!/bin/sh
  mount -t syaoran -o accept=/etc/ccs/syaoran.conf none /dev
  exec /sbin/init "$@"

by passing init=/path/to/this/script on the kernel command line so that /sbin/init can create /dev/initctl on this filesystem. If you mount this filesystem after /sbin/init starts, it will shadow /dev/initctl opened by /sbin/init .

Which basically ends up meaning that anybody who can trick the mount into happening can reset the permitted list and create (for example) a mode 666 entry for a hard drive, and go scribbling around at will. Note that you don't seem to do any sanity checking on the path (for instance, that each component is owned by root, and not world-writable) - so anybody who finds a way to get the mount to happen can supply their own list in /home/joeuser/blat or /tmp/surprise-mount-list or wherever.

I assume that being able to reach this location means the caller of mount() is root.
But are the patches to allow mount() by non-root in progress? http://lkml.org/lkml/2008/1/8/131 Maybe I should add some sanity checking on the path.

Thank you.
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello.

Changes from previous posting:

(1) I rebased this patch using tmpfs. I didn't know I was making this patch using ramfs...

This patch is for 2.6.24-rc6-mm1.

Regards.
--
Subject: Simple tamper-proof device filesystem.

The goal of this filesystem is to guarantee that applications using well-known device locations under /dev get the device they want (e.g. an application that accesses /dev/null can always get a character special device with major=1 and minor=3).

This idea sounds silly? Indeed, if you think the root can do whatever he/she wants to do. But this filesystem makes sense when used with access control mechanisms like MAC (mandatory access control). I want to use this filesystem in case where a process with root privilege was hijacked but the behavior of the hijacked process is still restricted by MAC.

Why not use FUSE?

Because /dev has to be available through the lifetime of the kernel. It is not acceptable if /dev stops working due to SIGKILL or OOM-killer.

Why not use SELinux?

Because SELinux doesn't guarantee filename and its attribute. As far as I know, no MAC implementation can handle filename and its attribute. I guess this is because

  * Filename and its attributes pairs are conventionally considered as constant and reliable.
  * It makes the MAC's policy syntax complicated to describe this attribute enforcement information in MAC's policy.

I want to add functionality that the MACs are missing. Instead of adding this functionality per MAC, I propose to add it as ground work, to be combined with any MAC.

Why not drop CAP_MKNOD?

Dropping CAP_MKNOD is not enough for emulating this filesystem because a process can still rename()/unlink() to break filename and its attributes handling (e.g. mv /dev/sda1 /dev/sda1.tmp; mv /dev/sda2 /dev/sda1; mv /dev/sda1.tmp /dev/sda2 or unlink /dev/null; touch /dev/null ).

This time, I'm implementing this filesystem as an extension to tmpfs because all this filesystem does is check filename and its attributes in addition to what tmpfs does.

Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]
---
 fs/Kconfig               |   18 +
 include/linux/shmem_fs.h |    5
 mm/shmem.c               |  124 +++
 mm/shmem_mac.h           |   57 +
 mm/shmem_mac_debug.c     |  183 +
 mm/shmem_mac_init.c      |  486 +++
 mm/shmem_mac_main.c      |  205 +++
 7 files changed, 1077 insertions(+), 1 deletion(-)

--- linux-2.6-mm.orig/mm/shmem.c
+++ linux-2.6-mm/mm/shmem.c
@@ -736,11 +736,39 @@ static void shmem_truncate(struct inode
 	shmem_truncate_range(inode, inode->i_size, (loff_t)-1);
 }
 
+#ifdef CONFIG_SYAORAN
+#include "shmem_mac.h"
+#include "shmem_mac_init.c"
+#include "shmem_mac_main.c"
+#include "shmem_mac_debug.c"
+
+static bool with_mac(struct super_block *sb)
+{
+	return sb->s_type == &syaoran_fs_type;
+}
+#else
+static inline bool with_mac(struct super_block *sb)
+{
+	return 0;
+}
+#endif
+
 static int shmem_notify_change(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
 	struct page *page = NULL;
 	int error;
+#ifdef CONFIG_SYAORAN
+	if (with_mac(inode->i_sb)) {
+		unsigned int flags = 0;
+		if (attr->ia_valid & (ATTR_UID | ATTR_GID))
+			flags |= MAY_CHOWN;
+		if (attr->ia_valid & ATTR_MODE)
+			flags |= MAY_CHMOD;
+		if (syaoran_may_modify_node(dentry, flags))
+			return -EPERM;
+	}
+#endif
 
 	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
 		if (attr->ia_size < inode->i_size) {
@@ -1515,6 +1543,10 @@ shmem_get_inode(struct super_block *sb,
 		default:
 			inode->i_op = &shmem_special_inode_operations;
 			init_special_inode(inode, mode, dev);
+#ifdef CONFIG_SYAORAN
+			if (with_mac(sb))
+				init_syaoran_inode(inode, mode);
+#endif
 			break;
 		case S_IFREG:
 			inode->i_op = &shmem_inode_operations;
@@ -1739,8 +1771,15 @@ static int shmem_statfs(struct dentry *d
 static int
 shmem_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
 {
-	struct inode *inode = shmem_get_inode(dir->i_sb, mode, dev);
+	struct inode *inode;
 	int error = -ENOSPC;
 
+#ifdef CONFIG_SYAORAN
+	if (with_mac(dir->i_sb)) {
+		if (syaoran_may_create_node(dentry, mode, dev) < 0)
+			return -EPERM;
+	}
+#endif
+	inode = shmem_get_inode(dir->i_sb, mode, dev);
 	if (inode) {
 		error = security_inode_init_security(inode, dir, NULL, NULL,
@@ -1792,6 +1831,13 @@ static int shmem_link(struct dentry *old
 {
 	struct inode *inode = old_dentry->d_inode;
 	int ret;
+#ifdef
[PATCH][RFC] Simple tamper-proof device filesystem.
Hello.

Changes from the previous posting:

 (1) Added a kernel config option so that users can choose whether to
     compile this filesystem or not.

     I didn't receive any ACK/NACK on whether I'm permitted to implement
     this filesystem as an extension to tmpfs. So, I continued
     implementing it as an extension to tmpfs.

 (2) Removed indirect grabbing of blkdev_open() and chrdev_open().

     The previous posting used an indirect approach to call blkdev_open()
     and chrdev_open() so that users could compile this filesystem as a
     module without exporting blkdev_open() from fs/block_dev.c and
     chrdev_open() from fs/char_dev.c . But since tmpfs cannot be
     compiled as a module, I changed it to call them directly.

 (3) Split the single file into three files.

     syaoran_init.c:  initialization part
     syaoran_main.c:  access control part
     syaoran_debug.c: taking snapshot part

This patch is for 2.6.24-rc6-mm1.

Regards.
--
Subject: Simple tamper-proof device filesystem.

The goal of this filesystem is to guarantee that applications using
well-known device locations under /dev get the device they want
(e.g. an application that accesses /dev/null can always get a character
special device with major=1 and minor=3).

Does this idea sound silly? Indeed it does, if you think that root can do
whatever he/she wants to do. But this filesystem makes sense when used
with access control mechanisms like MAC (mandatory access control).
I want to use this filesystem in the case where a process with root
privilege was hijacked but the behavior of the hijacked process is still
restricted by MAC.

Why not use FUSE?

Because /dev has to be available through the lifetime of the kernel.
It is not acceptable if /dev stops working due to SIGKILL or the
OOM-killer.

Why not use SELinux?

Because SELinux doesn't guarantee a filename and its attributes.
As far as I know, no MAC implementation can handle a filename and its
attributes. I guess this is because filename/attribute pairs are
conventionally considered constant and reliable. Describing this
attribute enforcement information in a MAC's policy would make the
policy syntax complicated.

I want to add functionality that the MACs are missing. Instead of adding
this functionality per MAC, I propose to add it as groundwork, to be
combined with any MAC.

Why not drop CAP_MKNOD?

Dropping CAP_MKNOD is not enough to emulate this filesystem, because a
process can still use rename()/unlink() to break the filename/attribute
binding (e.g.
"mv /dev/sda1 /dev/sda1.tmp; mv /dev/sda2 /dev/sda1; mv /dev/sda1.tmp /dev/sda2"
or "unlink /dev/null; touch /dev/null").

This time, I'm implementing this filesystem as an extension to tmpfs,
because all it does is check the filename and its attributes in addition
to what tmpfs does.

Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]
---
 fs/Kconfig               |   18 +
 fs/ramfs/inode.c         |  177 ++
 fs/ramfs/syaoran.h       |   75 ++
 fs/ramfs/syaoran_debug.c |  183 +++
 fs/ramfs/syaoran_init.c  |  568 +++
 fs/ramfs/syaoran_main.c  |  207 +
 6 files changed, 1222 insertions(+), 6 deletions(-)

--- linux-2.6-mm.orig/fs/ramfs/inode.c
+++ linux-2.6-mm/fs/ramfs/inode.c
@@ -36,6 +36,20 @@
 #include <asm/uaccess.h>
 #include "internal.h"
 
+static struct inode *__ramfs_get_inode(struct super_block *sb, int mode,
+				       dev_t dev, bool tmpfs_with_mac);
+
+#define TMPFS_WITH_MAC    1
+#define TMPFS_WITHOUT_MAC 0
+#include <linux/quotaops.h>
+
+#ifdef CONFIG_SYAORAN
+#include "syaoran.h"
+#include "syaoran_init.c"
+#include "syaoran_main.c"
+#include "syaoran_debug.c"
+#endif
+
 /* some random number */
 #define RAMFS_MAGIC	0x858458f6
 
@@ -51,6 +65,12 @@ static struct backing_dev_info ramfs_bac
 
 struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev)
 {
+	return __ramfs_get_inode(sb, mode, dev, TMPFS_WITHOUT_MAC);
+}
+
+static struct inode *__ramfs_get_inode(struct super_block *sb, int mode,
+				       dev_t dev, const bool tmpfs_with_mac)
+{
 	struct inode * inode = new_inode(sb);
 
 	if (inode) {
@@ -65,10 +85,18 @@ struct inode *ramfs_get_inode(struct sup
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
+#ifdef CONFIG_SYAORAN
+			if (tmpfs_with_mac)
+				init_syaoran_inode(inode, mode);
+#endif
 			break;
 		case S_IFREG:
 			inode->i_op = &ramfs_file_inode_operations;
 			inode->i_fop = &ramfs_file_operations;
+#ifdef CONFIG_SYAORAN
+			if (tmpfs_with_mac)
+				init_syaoran_inode(inode, mode);
+#endif
 			break
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello.

Willy Tarreau wrote:
> Your patch is very confusing. In your description, as well as in the
> comments you talk about tmpfs, but your patch does not touch even one
> line of tmpfs and only changes ramfs. Even your variables and arguments
> refer to tmpfs. The Kconfig entry indicates that the feature depends on
> TMPFS too. Judging from the following comment :
>
>   * Original tmpfs doesn't set ramfs_dir_inode_operations.setattr field.
>
> I suspect that you confuse both filesystems.
>
>   - ramfs is in fs/ramfs and is always compiled in, you cannot disable it
>   - tmpfs is in mm/shmem.c and is optional. It also supports options that
>     ramfs does not (eg: size) and data may be swapped.
>
> Please understand that I'm not discussing the usefulness of your patch,
> I'm just trying to avoid a huge confusion.

Oh, I thought that the filesystem mounted by "mount -t tmpfs none /tmp"
was tmpfs and that the source code of tmpfs was located in the fs/ramfs
directory.

So, I should describe this as an extension to ramfs rather than an
extension to tmpfs. I'll fix it in the next posting.

Thank you.
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

Serge E. Hallyn wrote:
> I apologize if I'm commiting a faux pas by asking this, but any chance
> of renaming this to something like strictdev or sdev, or at least with
> 'dev' in it somewhere?

You are not committing a faux pas. But this name is my personal
preference. ;-)
You can see the origin at http://I-love.SAKURA.ne.jp/tomoyo/index-en.html .

Happy Holidays!
[PATCH][RFC] Simple tamper-proof device filesystem.
Hello.

Thank you for joining the discussion of the previous posting
(starting from http://lkml.org/lkml/2007/12/16/23 ).

The previous posting was a feasibility test to find out whether this kind
of trivial filesystem is acceptable for mainline. Now it seems that there
is some chance of acceptance. Therefore I rebased the patch on the -mm
tree.

Regards.
--
Subject: Simple tamper-proof device filesystem.

The goal of this filesystem is to guarantee that applications using
well-known device locations under /dev get the device they want
(e.g. an application that accesses /dev/null can always get a character
special device with major=1 and minor=3).

Does this idea sound silly? Indeed it does, if you think that root can do
whatever he/she wants to do. But this filesystem makes sense when used
with access control mechanisms like MAC (mandatory access control).
I want to use this filesystem in the case where a process with root
privilege was hijacked but the behavior of the hijacked process is still
restricted by MAC.

Why not use FUSE?

Because /dev has to be available through the lifetime of the kernel.
It is not acceptable if /dev stops working due to SIGKILL or the
OOM-killer.

Why not use SELinux?

Because SELinux doesn't guarantee a filename and its attributes.
As far as I know, no MAC implementation can handle a filename and its
attributes. I guess this is because filename/attribute pairs are
conventionally considered constant and reliable. Describing this
attribute enforcement information in a MAC's policy would make the
policy syntax complicated.

I want to add functionality that the MACs are missing. Instead of adding
this functionality per MAC, I propose to add it as groundwork, to be
combined with any MAC.

Why not drop CAP_MKNOD?

Dropping CAP_MKNOD is not enough to emulate this filesystem, because a
process can still use rename()/unlink() to break the filename/attribute
binding (e.g.
"mv /dev/sda1 /dev/sda1.tmp; mv /dev/sda2 /dev/sda1; mv /dev/sda1.tmp /dev/sda2"
or "unlink /dev/null; touch /dev/null").

This time, I'm implementing this filesystem as an extension to tmpfs,
because all it does is check the filename and its attributes in addition
to what tmpfs does.

Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]
---
 fs/ramfs/inode.c   |  101 -
 fs/ramfs/syaoran.h | 1066 +
 2 files changed, 1160 insertions(+), 7 deletions(-)

--- linux-2.6-mm.orig/fs/ramfs/inode.c
+++ linux-2.6-mm/fs/ramfs/inode.c
@@ -35,6 +35,7 @@
 #include <linux/sched.h>
 #include <asm/uaccess.h>
 #include "internal.h"
+#include "syaoran.h"
 
 /* some random number */
 #define RAMFS_MAGIC	0x858458f6
@@ -49,7 +50,8 @@ static struct backing_dev_info ramfs_bac
 			  BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP,
 };
 
-struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev)
+struct inode *__ramfs_get_inode(struct super_block *sb, int mode, dev_t dev,
+				const int mac)
 {
 	struct inode * inode = new_inode(sb);
 
@@ -65,10 +67,19 @@ struct inode *ramfs_get_inode(struct sup
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
+			if (mac) {
+				if (S_ISBLK(mode))
+					inode->i_fop = &wrapped_def_blk_fops;
+				else if (S_ISCHR(mode))
+					inode->i_fop = &wrapped_def_chr_fops;
+				inode->i_op = &syaoran_file_inode_operations;
+			}
 			break;
 		case S_IFREG:
 			inode->i_op = &ramfs_file_inode_operations;
 			inode->i_fop = &ramfs_file_operations;
+			if (mac)
+				inode->i_op = &syaoran_file_inode_operations;
 			break;
 		case S_IFDIR:
 			inode->i_op = &ramfs_dir_inode_operations;
@@ -79,12 +90,19 @@ struct inode *ramfs_get_inode(struct sup
 			break;
 		case S_IFLNK:
 			inode->i_op = &page_symlink_inode_operations;
+			if (mac)
+				inode->i_op = &syaoran_symlink_inode_operations;
 			break;
 		}
 	}
 	return inode;
 }
 
+struct inode *ramfs_get_inode(struct super_block *sb, int mode, dev_t dev)
+{
+	return __ramfs_get_inode(sb, mode, dev, 0);
+}
+
 /*
  * File creation. Allocate an inode, and we're done..
  */
@@ -92,9 +110,17 @@ struct inode *ramfs_get_inode(struct sup
 static int
 ramfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
 {
-	struct inode * inode = ramfs_get_inode(dir->i_sb, mode, dev);
+	struct inode *inode;
 	int
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

Radoslaw Szkodzinski (AstralStorm) wrote:
> Actually, who needs to create device nodes? Just prohibit everyone from
> creating them, except installer and udev personality.
> This means removing CAP_MKNOD on a global scale.

What happens if root tampers with udev's configuration file? udev will
then create inappropriate device nodes (i.e. filenames with unexpected
attributes), won't it?

Also, creating device nodes is not the only threat. Root can rename or
unlink device nodes, for example:

  # mv /dev/sda1 /dev/tmp; mv /dev/sda2 /dev/sda1; mv /dev/tmp /dev/sda2

So revoking CAP_MKNOD is not enough to guarantee a filename and its
attributes.

This filesystem is designed to guarantee a filename and its attributes,
and it also has additional access control capability.
You can forbid mknod/unlink of /dev/null if you want nobody to do so.
You can forbid chmod/chown of /dev/null if you want nobody to do so.

Well... it is not fair to mention only udev's configuration file. If the
configuration file of this filesystem is tampered with, this filesystem
will create inappropriate device nodes too. So some access control
mechanism for protecting configuration files is recommended for both
udev and this filesystem.

Regards.
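The rename sequence in this message can be modelled in userland to show why
no mknod() is needed for the attack. The sketch below is only an
illustration: the table, `lookup()`, `toy_rename()` and `swap_sda1_sda2()`
are hypothetical names, not part of the patch, and `toy_rename()` is
simplified so that it refuses to overwrite an existing name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A toy /dev: each slot binds a name to block device numbers. */
struct dev_entry {
	char name[32];
	unsigned int major;
	unsigned int minor;
};

/* Find an entry by name; returns NULL when absent. */
static struct dev_entry *lookup(struct dev_entry *tbl, size_t n,
				const char *name)
{
	size_t i;

	assert(tbl != NULL && name != NULL);
	for (i = 0; i < n; i++)
		if (strcmp(tbl[i].name, name) == 0)
			return &tbl[i];
	return NULL;
}

/*
 * Unrestricted rename, as on an ordinary writable /dev.
 * Simplification (assumption): refuses to overwrite an existing name.
 * Returns 0 on success, -1 on failure.
 */
static int toy_rename(struct dev_entry *tbl, size_t n,
		      const char *from, const char *to)
{
	struct dev_entry *e = lookup(tbl, n, from);

	if (e == NULL || lookup(tbl, n, to) != NULL ||
	    strlen(to) >= sizeof(e->name))
		return -1;
	strcpy(e->name, to);
	return 0;
}

/* The three-step swap from the message; only renames are used. */
static int swap_sda1_sda2(struct dev_entry *tbl, size_t n)
{
	if (toy_rename(tbl, n, "/dev/sda1", "/dev/tmp") != 0)
		return -1;
	if (toy_rename(tbl, n, "/dev/sda2", "/dev/sda1") != 0)
		return -1;
	return toy_rename(tbl, n, "/dev/tmp", "/dev/sda2");
}
```

Starting from a table where /dev/sda1 is 8:1 and /dev/sda2 is 8:2, calling
`swap_sda1_sda2()` leaves the name /dev/sda1 bound to 8:2, although no node
was created.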
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

Indan Zupancic wrote:
> If MAC can avoid all that, then why can't it also avoid tampering with /dev?

If a MAC implementation handled filename/attribute pairs, this filesystem
would not be needed. But I don't know of any MAC implementation that
handles this pair.

SELinux's granularity is "allow foo_t to create block device file in
dev_t directory". TOMOYO's granularity is "allow foo to create block
device file named /dev/sda1". Neither enforces the filename/attribute
pair, so an attacker with root privilege can create fake device files if
he/she is permitted to create device files by the MAC's policy.

It would be possible to handle this pair within a MAC's policy by
expanding the policy syntax, but offloading this handling to the
filesystem keeps the MAC's policy syntax simple, because
filename/attribute pairs are conventionally constant.

You won't let foo_t create /dev/sda1 with block-8-1 attributes and let
bar_t create /dev/sda1 with block-8-2 attributes, will you?
You don't want to attach attribute information to every entry in a MAC's
policy, do you?
It is redundant to describe this attribute enforcement information in a
MAC's policy unless you want to break the conventional
filename/attribute pairs.

> What security does your filesystem add at all, if it's useless without a MAC
> doing all the hard work?

It allows the / partition to be mounted read-only.
It allows filename/attribute pairs to be enforced on the /dev partition,
to avoid /dev/null spoofing (creating /dev/null as a regular file for
eavesdropping purposes).

This filesystem adds filename/attribute enforcement, but the enforcement
can be overridden if this filesystem is used without MAC. When this
filesystem is used with MAC, the enforcement cannot be overridden.

Regards.
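A minimal sketch of the filename/attribute whitelist described in this
message, written as userland C. It is not the code from the patch: the
`node_rule` table, the `node_kind` enum and `may_create_node()` are
hypothetical names chosen for illustration, and the policy entries simply
reuse the examples from this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node kinds; the real policy format is not shown here. */
enum node_kind { NODE_CHR, NODE_BLK, NODE_REG, NODE_DIR, NODE_LNK };

/* One whitelisted (pathname, attribute) pair. */
struct node_rule {
	const char *path;
	enum node_kind kind;
	unsigned int major;
	unsigned int minor;
};

/* Illustrative policy only; entries follow examples used in this thread. */
static const struct node_rule policy[] = {
	{ "/dev/null", NODE_CHR, 1, 3 },
	{ "/dev/sda1", NODE_BLK, 8, 1 },
	{ "/dev/sda2", NODE_BLK, 8, 2 },
};

/* Return 1 if creating `path` with these attributes is whitelisted, else 0. */
static int may_create_node(const char *path, enum node_kind kind,
			   unsigned int major, unsigned int minor)
{
	size_t i;

	assert(path != NULL);
	for (i = 0; i < sizeof(policy) / sizeof(policy[0]); i++) {
		const struct node_rule *r = &policy[i];

		if (strcmp(r->path, path) != 0)
			continue;
		/* The name is known: the attributes must match exactly. */
		return r->kind == kind && r->major == major &&
		       r->minor == minor;
	}
	return 0;	/* whitelisting: names that are not listed are denied */
}
```

With such a table, a MAC policy only needs to say who may create
/dev/sda1; the check above decides what /dev/sda1 is allowed to be, so a
request for /dev/sda1 as block-8-2 is refused no matter who asks.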
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

Al Boldi wrote:
> I think the answer is obvious: Tetsuo wants to add functionality that the
> MACs are missing. So, instead of adding this functionality per MAC, he
> proposes to add it as ground work, to be combined with any MAC.

Yes, that's right. This filesystem is designed to be used with TOMOYO
Linux, but it can be used with other MAC implementations too.

Thank you.
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
( This is a reply to http://lkml.org/lkml/2007/12/17/27 .)

Hello.

David Wagner wrote:
> But the point is that it's not enough just to prevent attackers from
> mounting other filesystems over this filesystem. I can think of all
> sorts of ways that an admin-level attacker might be able to prevent other
> administrators from logging in. If your defense strategy involves trying
> to enumerate all of those possible ways and then shut them down one by one,
> you're relying upon a defense strategy known as blacklisting. Blacklisting
> has a terrible track record in the security field, because it's too easy
> to overlook one pathway.

Of course, I assume whitelisting. SELinux, TOMOYO Linux and many other
MAC implementations use a whitelisting approach, and this filesystem
uses a whitelisting approach too.

This filesystem handles what MAC implementations don't handle. In other
words, what MAC doesn't handle is a remaining hole.

What I'm proposing is this: don't you think it is dangerous to assume
that files in the /dev directory have the appropriate filename/attribute
binding? MAC can restrict which processes can create files in the /dev
directory, but MAC doesn't enforce the filename/attribute binding. So,
how about enforcing the binding in the filesystem layer?

Regards.

To David Wagner:
Could you please Cc: me so that I can reply to your message?
I can't reply to your message since I'm reading this ml in daily digest
mode.
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

Serge E. Hallyn wrote:
> CAP_MKNOD will be removed from its capability

I think that is not enough, because root can still rename/unlink device
files
(mv /dev/sda1 /dev/tmp; mv /dev/sda2 /dev/sda1; mv /dev/tmp /dev/sda2).

> To use your approach, i guess we would have to use selinux (or tomoyo)
> to enforce that devices may only be created under /dev?

Everyone can use this filesystem alone. But using it with MAC (or any
other access control mechanism that prevents attackers from
unmounting/overlaying this filesystem) is recommended.
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

Serge E. Hallyn wrote:
> But your requirements are to ensure that an application accessing a
> device at a well-known location get what it expect.

Yes. That's the purpose of this filesystem.

> So then the main quesiton is still the one I think Al had asked - what
> keeps a rogue CAP_SYS_MOUNT process from doing
> mount --bind /dev/hda1 /dev/null ?

Excuse me, but I guess you meant "mount --bind /dev/ /root/" or
something, because a mount operation requires directories.
MAC can prevent a rogue CAP_SYS_MOUNT process from doing
"mount --bind /dev/ /root/". For example, with TOMOYO Linux, you need to
grant the "allow_mount /dev/ /root/ --bind 0" permission to permit a
"mount --bind /dev/ /root/" request.

Did you mean "ln -s /dev/hda1 /dev/null" or "ln /dev/hda1 /dev/null"?
No problem. MAC can prevent such requests too.

Regards.
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

Serge E. Hallyn wrote:
> Nope, try
>
> 	touch /root/hda1
> 	ls -l /root/hda1
> 	mount --bind /dev/hda1 /root/hda1
> 	ls -l /root/hda1

[EMAIL PROTECTED] ~]# touch /root/hda1
[EMAIL PROTECTED] ~]# ls -l /root/hda1
-rw-r--r-- 1 root root 0 Dec 18 12:04 /root/hda1
[EMAIL PROTECTED] ~]# mount --bind /dev/hda1 /root/hda1
[EMAIL PROTECTED] ~]# ls -l /root/hda1
brw-r- 1 root disk 3, 1 Dec 18 2007 /root/hda1

Oh, surprising. I didn't know that mount() accepts a non-directory as a
mount point.
But I think this is not a mount operation, because I can't see the
contents of /dev/hda1 through /root/hda1 .
Can I see the contents of /dev/hda1 through /root/hda1 ?

> Then it sounds like this filesystem is something Tomoyo can use.

In 2003, I mounted the / partition read-only so that the admin can't do
'mknod /root/hda1 b 3 1', and I named it "Security Advancement Know-how
Upon Readonly Approach for Linux", or SAKURA Linux.
This filesystem (SYAORAN) was developed to make /dev writable and
tamper-proof when the / partition is read-only or protected by MAC.
TOMOYO is a pathname-based MAC implementation, and SAKURA and SYAORAN
were merged into TOMOYO Linux. ;-)

Regards.
[patch 0/2] [RFC] Simple tamper-proof device filesystem.
Hello.

I proposed this filesystem a few years ago. I'm now proposing it once
again for inclusion into mainline. I'll update it for the -mm tree if
this filesystem is likely to be accepted.

Regards.

(This is a resent message of [00/02] since it seems to have been dropped.)
[patch 1/2] [RFC] Simple tamper-proof device filesystem.
A brief description about SYAORAN:

 SYAORAN stands for "Simple Yet All-important Object Realizing Abiding
 Nexus". SYAORAN is a filesystem for /dev with Mandatory Access Control.

 /dev needs to be writable, but this means that files on /dev might be
 tampered with. SYAORAN can restrict combinations of (pathname, attribute)
 that the system can create. The attribute is one of directory, regular
 file, FIFO, UNIX domain socket, symbolic link, character or block device
 file with major/minor device numbers.

 SYAORAN can ensure /dev/null is a character device file with major=1
 minor=3.

 Policy specifications for this filesystem are at
 http://tomoyo.sourceforge.jp/en/1.5.x/policy-syaoran.html

Why not use FUSE?

 Because /dev has to be available through the lifetime of the kernel.
 It is not acceptable if /dev stops working due to SIGKILL or OOM-killer.

Why not use SELinux?

 Because SELinux doesn't guarantee filename and its attribute.
 The purpose of this filesystem is to ensure filename and its attribute
 (e.g. /dev/null is guaranteed to be a character device file
 with major=1 and minor=3).

Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]
---
 fs/syaoran/syaoran.c |  338 +
 fs/syaoran/syaoran.h |  964 +++
 2 files changed, 1302 insertions(+)

--- /dev/null
+++ linux-2.6.24-rc5/fs/syaoran/syaoran.c
@@ -0,0 +1,338 @@
+/*
+ * fs/syaoran/syaoran.c
+ *
+ * Implementation of the Tamper-Proof Device Filesystem.
+ *
+ * Portions Copyright (C) 2005-2007 NTT DATA CORPORATION
+ *
+ * Version: 1.5.3-pre 2007/12/16
+ *
+ * This filesystem is developed using the ramfs implementation.
+ *
+ */
+/*
+ * Resizable simple ram filesystem for Linux.
+ *
+ * Copyright (C) 2000 Linus Torvalds.
+ *               2000 Transmeta Corp.
+ *
+ * Usage limits added by David Gibson, Linuxcare Australia.
+ * This file is released under the GPL.
+ */
+
+/*
+ * NOTE! This filesystem is probably most useful
+ * not as a real filesystem, but as an example of
+ * how virtual filesystems can be written.
+ *
+ * It doesn't get much simpler than this. Consider
+ * that this file implements the full semantics of
+ * a POSIX-compliant read-write filesystem.
+ *
+ * Note in particular how the filesystem does not
+ * need to implement any data structures of its own
+ * to keep track of the virtual data: using the VFS
+ * caches is sufficient.
+ */
+
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/highmem.h>
+#include <linux/time.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/backing-dev.h>
+#include <linux/sched.h>
+#include <linux/uaccess.h>
+
+static struct super_operations syaoran_ops;
+static struct address_space_operations syaoran_aops;
+static struct inode_operations syaoran_file_inode_operations;
+static struct inode_operations syaoran_dir_inode_operations;
+static struct inode_operations syaoran_symlink_inode_operations;
+static struct file_operations syaoran_file_operations;
+
+static struct backing_dev_info syaoran_backing_dev_info = {
+	.ra_pages       = 0,    /* No readahead */
+	.capabilities   = BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_WRITEBACK |
+			  BDI_CAP_MAP_DIRECT | BDI_CAP_MAP_COPY |
+			  BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP,
+};
+
+#include "syaoran.h"
+
+static struct inode *syaoran_get_inode(struct super_block *sb, int mode,
+				       dev_t dev)
+{
+	struct inode *inode = new_inode(sb);
+
+	if (inode) {
+		struct timespec now = CURRENT_TIME;
+		inode->i_mode = mode;
+		inode->i_uid = current->fsuid;
+		inode->i_gid = current->fsgid;
+		inode->i_blocks = 0;
+		inode->i_mapping->a_ops = &syaoran_aops;
+		inode->i_mapping->backing_dev_info = &syaoran_backing_dev_info;
+		inode->i_atime = now;
+		inode->i_mtime = now;
+		inode->i_ctime = now;
+		switch (mode & S_IFMT) {
+		default:
+			init_special_inode(inode, mode, dev);
+			if (S_ISBLK(mode))
+				inode->i_fop = &wrapped_def_blk_fops;
+			else if (S_ISCHR(mode))
+				inode->i_fop = &wrapped_def_chr_fops;
+			inode->i_op = &syaoran_file_inode_operations;
+			break;
+		case S_IFREG:
+			inode->i_op = &syaoran_file_inode_operations;
+			inode->i_fop = &syaoran_file_operations;
+			break;
+		case S_IFDIR:
+			inode->i_op = &syaoran_dir_inode_operations;
+			inode->i_fop = &simple_dir_operations;
+			/*
+			 * directory inodes start off with i_nlink == 2
+			 * (for "." entry
[patch 2/2] [RFC] Simple tamper-proof device filesystem.
Signed-off-by: Tetsuo Handa [EMAIL PROTECTED]
---
 fs/Kconfig  |   21 +
 fs/Makefile |    1 +
 2 files changed, 22 insertions(+)

--- linux-2.6.24-rc5.orig/fs/Kconfig
+++ linux-2.6.24-rc5/fs/Kconfig
@@ -1555,6 +1555,27 @@ config UFS_DEBUG
 	  Y here.  This will result in _many_ additional debugging messages to be
 	  written to the system log.
 
+config SYAORAN_FS
+	tristate "SYAORAN (Tamper-Proof Device Filesystem) support"
+	help
+	  Say Y or M here to support the Tamper-Proof Device Filesystem.
+
+	  SYAORAN stands for
+	  Simple Yet All-important Object Realizing Abiding Nexus.
+	  SYAORAN is a filesystem for /dev with Mandatory Access Control.
+
+	  The system can't work if /dev is read-only.
+	  Therefore you need to mount a writable filesystem (such as tmpfs)
+	  for /dev if root fs is read-only.
+
+	  But the writable /dev means that files on /dev might be tampered.
+	  For example, if /dev/null is deleted and re-created as a symbolic
+	  link to /dev/hda by an attacker, the contents of the IDE HDD
+	  will be destroyed at a blow.
+
+	  SYAORAN can ensure /dev/null is a character device file
+	  with major=1 minor=3.
+
 endmenu
 
 menuconfig NETWORK_FILESYSTEMS
--- linux-2.6.24-rc5.orig/fs/Makefile
+++ linux-2.6.24-rc5/fs/Makefile
@@ -118,3 +118,4 @@ obj-$(CONFIG_HPPFS)		+= hppfs/
 obj-$(CONFIG_DEBUG_FS)		+= debugfs/
 obj-$(CONFIG_OCFS2_FS)		+= ocfs2/
 obj-$(CONFIG_GFS2_FS)		+= gfs2/
+obj-$(CONFIG_SYAORAN_FS)	+= syaoran/syaoran.o
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

David Newall wrote:
> Tetsuo Handa wrote:
> > /dev needs to be writable, but this means that files on /dev might be
> > tampered with.
>
> I infer that you mean /dev needs to be writable by anyone, not by just
> its owner or owner and group (conventionally root/root.) This goes
> against conventional wisdom, which is that /dev must be writable only by
> the administrator. Why do you say otherwise?

I didn't mean that /dev is writable by everybody. I meant that /dev must
be mounted read-write (even if one wants to mount / read-only).

Regards.
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

> > I meant that /dev must be mounted for read-write mode
>
> Again, why?

You can mount the / partition read-only if you wish to do so. But you
cannot make the /dev directory read-only: you won't be able to log in to
the system, because /sbin/mingetty fails to chown/chmod /dev/tty* if /dev
is mounted read-only.
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
But use of this filesystem is still valid when it is used with
policy-based mandatory access control (such as SELinux or TOMOYO Linux),
because this filesystem guarantees what policy-based mandatory access
control can't guarantee (i.e. a filename and its attributes).

Policy-based mandatory access control can guarantee that "only Bob can
create a block device file named sda1 in the /dev directory". But it
can't guarantee that /dev/sda1 will have the block-8-1 attribute. If Bob
is malicious and creates /dev/sda1 with the block-8-2 attribute, other
applications that depend on the attributes of /dev/sda1 go wrong.
So, this filesystem guarantees that /dev/sda1 has the block-8-1
attribute.
Re: [patch 1/2] [RFC] Simple tamper-proof device filesystem.
Hello.

Indan Zupancic wrote:
> What prevents them from mounting tmpfs on top of /dev, bypassing your fs?

Mandatory access control (MAC) prevents them from mounting tmpfs on top
of /dev . MAC mediates namespace manipulation requests such as
mount()/umount().

> Also, if they have root there are plenty of ways to prevent an administrator
> from logging in, e.g. using iptables or changing the password.

MAC mediates execution of /sbin/iptables or /usr/bin/passwd .

So, using this filesystem alone is meaningless, because attackers with
root privileges can do what you are saying. But using this filesystem
with MAC is still valid, because MAC can prevent attackers with root
privileges from doing what you are saying.

Regards.
Re: Problem with accessing namespace_sem from LSM.
Hello.

Christoph Hellwig wrote:
> > Isn't security_inode_create() a part of VFS internals?
>
> It's not. security_inode_create is part of the LSM infrastructure, and
> the actual methods are part of security modules and definitively not
> VFS internals.

The reason why I want to access namespace_sem inside
security_inode_create() is that it doesn't receive a struct vfsmount
parameter. If a struct vfsmount *were* passed to security_inode_create(),
I would have no need to access namespace_sem.

And now, since calling down_read(&namespace_sem) causes a deadlock, I'm
looking for a solution.

What you said ("I'd start looking for design bugs in whatever code you
have using it first.") sounds like "never try to implement pathname-based
access control at security_inode_create()", which would make AppArmor
(for OpenSuSE 10.1/10.2) and TOMOYO unable to apply access control.

At first, I thought that this lockdep warning was a false positive,
since struct inode is allocated/freed dynamically. But the warning still
appears even after I disabled freeing memory in destroy_inode() in
fs/inode.c (so that the address of a locking object in struct inode is
never reused), so it is likely genuine.

Regards.
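As far as the lockdep report in the first message of this thread shows, the
mount path takes inode->i_mutex in graft_tree() under do_add_mount(), while
the LSM path takes namespace_sem in m_start() with inode->i_mutex already
held from open_namei(). The sketch below is a toy lock-order recorder that
flags such a direct inversion. It is only an illustration of the idea, not
how lockdep is implemented: `lock_graph` and `record_acquire()` are
hypothetical names, and only direct two-lock inversions are detected.

```c
#include <assert.h>

/* Two lock classes named after the locks in this thread. */
enum lock_class { NAMESPACE_SEM, I_MUTEX, NR_LOCK_CLASSES };

/* seen[a][b] != 0 means "b was acquired while a was held" at least once. */
struct lock_graph {
	unsigned char seen[NR_LOCK_CLASSES][NR_LOCK_CLASSES];
};

/*
 * Record that `want` is being acquired while `held` is held.
 * Returns 1 if the opposite order was recorded earlier (a possible
 * ABBA deadlock), 0 otherwise.
 */
static int record_acquire(struct lock_graph *g, enum lock_class held,
			  enum lock_class want)
{
	assert(g != 0);
	assert(held < NR_LOCK_CLASSES && want < NR_LOCK_CLASSES);
	g->seen[held][want] = 1;
	return g->seen[want][held] != 0;
}
```

Recording (NAMESPACE_SEM held, I_MUTEX wanted) for the mount path and then
(I_MUTEX held, NAMESPACE_SEM wanted) for the hook makes the second call
return 1, which corresponds to the circular dependency that the warning
describes.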
Re: Problem with accessing namespace_sem from LSM.
Hello.

Christoph Hellwig wrote:
> Same argument as with the AA folks: it does not have any business looking
> at the vfsmount. If you create a file it can and in many setups will
> show up in multiple vfsmounts, so making decisions based on the particular
> one this creat happens through is wrong and actually dangerous.

That is why TOMOYO 1.x doesn't use LSM hooks, and why AppArmor for
OpenSuSE 10.3 added a struct vfsmount parameter to VFS helper functions
and LSM hooks.

Not all systems use bind mounts. There is likely to be only one vfsmount
that corresponds to a given dentry.

What does "dangerous" mean? Does it cause a crash?

Regards.
Re: Problem with accessing namespace_sem from LSM.
Hello.

Christoph Hellwig wrote:
> Any code except VFS internals has no business using it at all and doesn't
> do that in mainline either. I'd start looking for design bugs in whatever
> code you have using it first.

Isn't security_inode_create() a part of VFS internals?
I think it is, because it is called from vfs_create().

Regards.
Is it illegal to take namespace_sem while an inode's mutex is held?
Hello.

I'm running my LSM module on kernel 2.6.23 / Debian Sarge, and I encountered the following warning message. It seems that calling down_read(&namespace_sem) is not permitted while mutex_lock(&inode->i_mutex) is in effect, but I'm not sure. Is it illegal to take namespace_sem while an inode's mutex is held?

===
[ INFO: possible circular locking dependency detected ]
2.6.23-tomoyo2.1 #27
---
rcS/1093 is trying to acquire lock:
 (namespace_sem){}, at: [c017ca7b] m_start+0x11/0x20

but task is already holding lock:
 (&inode->i_mutex){--..}, at: [c0171e79] open_namei+0xf2/0x522

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&inode->i_mutex){--..}:
       [c017d35b] graft_tree+0x62/0xca
       [c013ab37] check_prev_add+0xc4/0x1bc
       [c017d35b] graft_tree+0x62/0xca
       [c013ac85] check_prevs_add+0x56/0xcb
       [c013af9c] validate_chain+0x2a2/0x31f
       [c01312ec] __kernel_text_address+0x18/0x23
       [c0104b1b] dump_trace+0x6f/0x87
       [c013cc54] __lock_acquire+0x6f2/0x762
       [c013af6f] validate_chain+0x275/0x31f
       [c013d25e] lock_acquire+0x79/0x93
       [c017d35b] graft_tree+0x62/0xca
       [c0331518] __mutex_lock_slowpath+0xea/0x280
       [c017d35b] graft_tree+0x62/0xca
       [c017d35b] graft_tree+0x62/0xca
       [c017d8f1] do_add_mount+0x8a/0xe7
       [c017de52] do_mount+0x1a9/0x1c0
       [c0152d76] __alloc_pages+0x64/0x2b6
       [c017dc5f] copy_mount_options+0x4d/0x97
       [c017e0b5] sys_mount+0x79/0xb5
       [c01012f4] name_to_dev_t+0x4d/0x25d
       [c0331258] schedule_timeout+0x79/0x8d
       [c019b741] create_proc_entry+0x73/0x86
       [c012a023] process_timeout+0x0/0x5
       [c04648ff] kernel_init+0x0/0xa3
       [c0464e93] prepare_namespace+0x86/0x18e
       [c0168eb4] sys_access+0x1f/0x23
       [c0464998] kernel_init+0x99/0xa3
       [c0104aa3] kernel_thread_helper+0x7/0x10
       [] 0x

-> #0 (namespace_sem){}:
       [c013aa9a] check_prev_add+0x27/0x1bc
       [c013ac85] check_prevs_add+0x56/0xcb
       [c013af9c] validate_chain+0x2a2/0x31f
       [c013cc54] __lock_acquire+0x6f2/0x762
       [c0179712] __d_lookup+0xda/0xfa
       [c013d25e] lock_acquire+0x79/0x93
       [c017ca7b] m_start+0x11/0x20
       [c013590f] down_read+0x3b/0x71
       [c017ca7b] m_start+0x11/0x20
       [c017ca7b] m_start+0x11/0x20
       [c01d5abe] tmy_do_single_write_perm+0x7e/0xda
       [c0171a74] vfs_create+0x83/0x105
       [c0171d44] open_namei_create+0x47/0x8a
       [c0171ee3] open_namei+0x15c/0x522
       [c01694d3] do_filp_open+0x25/0x39
       [c0332cd2] _spin_unlock+0x14/0x1c
       [c0169691] get_unused_fd_flags+0xb0/0xba
       [c0169764] do_sys_open+0x44/0xc5
       [c01697ff] sys_open+0x1a/0x1c
       [c0103e6a] syscall_call+0x7/0xb
       [] 0x

other info that might help us debug this:

1 lock held by rcS/1093:
 #0: (&inode->i_mutex){--..}, at: [c0171e79] open_namei+0xf2/0x522

stack backtrace:
 [c013a37f] print_circular_bug_tail+0x5f/0x67
 [c013aa9a] check_prev_add+0x27/0x1bc
 [c013ac85] check_prevs_add+0x56/0xcb
 [c013af9c] validate_chain+0x2a2/0x31f
 [c013cc54] __lock_acquire+0x6f2/0x762
 [c0179712] __d_lookup+0xda/0xfa
 [c013d25e] lock_acquire+0x79/0x93
 [c017ca7b] m_start+0x11/0x20
 [c013590f] down_read+0x3b/0x71
 [c017ca7b] m_start+0x11/0x20
 [c017ca7b] m_start+0x11/0x20
 [c01d5abe] tmy_do_single_write_perm+0x7e/0xda
 [c0171a74] vfs_create+0x83/0x105
 [c0171d44] open_namei_create+0x47/0x8a
 [c0171ee3] open_namei+0x15c/0x522
 [c01694d3] do_filp_open+0x25/0x39
 [c0332cd2] _spin_unlock+0x14/0x1c
 [c0169691] get_unused_fd_flags+0xb0/0xba
 [c0169764] do_sys_open+0x44/0xc5
 [c01697ff] sys_open+0x1a/0x1c
 [c0103e6a] syscall_call+0x7/0xb
===

The location is tmy_do_single_write_perm() (whose call trace is open_namei() -> open_namei_create() -> security_inode_create()) in the following file:
http://svn.sourceforge.jp/cgi-bin/viewcvs.cgi/trunk/2.1.x/tomoyo-lsm/patches/tomoyo-hooks.diff?rev=653&root=tomoyo&view=markup

Regards.
Problem with accessing namespace_sem from LSM.
Hello.

I found that accessing namespace_sem from security_inode_create() causes a lockdep warning when the kernel is compiled with CONFIG_PROVE_LOCKING=y.

===
[ INFO: possible circular locking dependency detected ]
---
klogd/1798 is trying to acquire lock:
 (namespace_sem){}, at: [e0f133c7] _aa_perm_dentry+0x80/0x184 [apparmor]

but task is already holding lock:
 (&inode->i_mutex){--..}, at: [c02a883e] mutex_lock+0x12/0x15

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&inode->i_mutex){--..}:
       [c0137c89] lock_acquire+0x4b/0x6a
       [c02a86e6] __mutex_lock_slowpath+0xb0/0x1f6
       [c02a883e] mutex_lock+0x12/0x15
       [c0180b02] graft_tree+0x5c/0xd4
       [c0180e98] do_add_mount+0x84/0x100
       [c0181b5f] do_mount+0x602/0x659
       [c0181c1a] sys_mount+0x64/0x9b
       [c0103d9d] sysenter_past_esp+0x56/0x99

-> #0 (namespace_sem){}:
       [c0137c89] lock_acquire+0x4b/0x6a
       [c0134e34] down_read+0x1e/0x31
       [e0f133c7] _aa_perm_dentry+0x80/0x184 [apparmor]
       [e0f14849] aa_perm_dentry+0x62/0xa4 [apparmor]
       [e0f167c7] apparmor_inode_create+0x40/0x63 [apparmor]
       [c01749e5] vfs_create+0x84/0x13e
       [c01774ec] open_namei+0x169/0x635
       [c0166f15] do_filp_open+0x20/0x36
       [c0166f6b] do_sys_open+0x40/0xbb
       [c0167012] sys_open+0x16/0x18
       [c0103d9d] sysenter_past_esp+0x56/0x99

other info that might help us debug this:

1 lock held by klogd/1798:
 #0: (&inode->i_mutex){--..}, at: [c02a883e] mutex_lock+0x12/0x15

stack backtrace:
 [c010555d] show_trace+0xd/0x10
 [c0105a99] dump_stack+0x19/0x1b
 [c0136dc8] print_circular_bug_tail+0x59/0x64
 [c01375bd] __lock_acquire+0x7ea/0x973
 [c0137c89] lock_acquire+0x4b/0x6a
 [c0134e34] down_read+0x1e/0x31
 [e0f133c7] _aa_perm_dentry+0x80/0x184 [apparmor]
 [e0f14849] aa_perm_dentry+0x62/0xa4 [apparmor]
 [e0f167c7] apparmor_inode_create+0x40/0x63 [apparmor]
 [c01749e5] vfs_create+0x84/0x13e
 [c01774ec] open_namei+0x169/0x635
 [c0166f15] do_filp_open+0x20/0x36
 [c0166f6b] do_sys_open+0x40/0xbb
 [c0167012] sys_open+0x16/0x18
 [c0103d9d] sysenter_past_esp+0x56/0x99

If this warning is genuine, AppArmor shipped with OpenSuSE 10.1 and 10.2 is affected.

- Kernel 2.6.16.53-0.16 for OpenSuSE 10.1 -

do_add_mount() { /* in fs/namespace.c */
    down_write(&namespace_sem);
    graft_tree() {
        mutex_lock(&nd->dentry->d_inode->i_mutex);
        ...
        mutex_unlock(&nd->dentry->d_inode->i_mutex);
    }
    up_write(&namespace_sem);
}

open_namei() { /* in fs/namei.c */
    mutex_lock(&dir->d_inode->i_mutex);
    vfs_create() {
        security_inode_create() {
            subdomain_inode_create() { /* in security/apparmor/lsm.c */
                sd_perm_dentry() { /* in security/apparmor/main.c */
                    _sd_perm_dentry() {
                        sd_path_begin() { /* in security/apparmor/inline.h */
                            sd_path_begin2() {
                                down_read(&namespace_sem);
                            }
                        }
                        ...
                        sd_path_end() {
                            up_read(&namespace_sem);
                        }
                    }
                }
            }
        }
    }
    mutex_unlock(&dir->d_inode->i_mutex);
}

- Kernel 2.6.18.8-0.7 for OpenSuSE 10.2 -

do_add_mount() { /* in fs/namespace.c */
    down_write(&namespace_sem);
    graft_tree() {
        mutex_lock(&nd->dentry->d_inode->i_mutex);
        ...
        mutex_unlock(&nd->dentry->d_inode->i_mutex);
    }
    up_write(&namespace_sem);
}

open_namei() { /* in fs/namei.c */
    mutex_lock(&dir->d_inode->i_mutex);
    vfs_create() {
        security_inode_create() {
            apparmor_inode_create() { /* in security/apparmor/lsm.c */
                aa_perm_dentry() { /* in security/apparmor/lsm.c */
                    _aa_perm_dentry() {
                        aa_path_begin() { /* in security/apparmor/inline.h */
                            aa_path_begin2() {
                                down_read(&namespace_sem);
                            }
                        }
                        ...
                        aa_path_end() {
                            up_read(&namespace_sem);
                        }
                    }
                }
            }
        }
    }
    mutex_unlock(&dir->d_inode->i_mutex);
}

In both kernels, do_add_mount() takes namespace_sem and then i_mutex, while open_namei() takes i_mutex and then namespace_sem.

AppArmor shipped with OpenSuSE 10.3 and Ubuntu 7.10 will not be affected, since the kernel was modified to pass a vfsmount parameter to the VFS helper functions and LSM hooks. TOMOYO Linux 2.x (which is implemented using LSM) is also affected, and I'm looking for a solution.
http://lkml.org/lkml/2007/11/5/55

A possible solution would be to pass a vfsmount parameter to the VFS helper functions and LSM hooks in all kernels. I do hope that the "Pass struct vfsmount to ..." patches are merged into the mainline kernel.

Regards.
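The two call paths in this report take the same pair of locks in opposite order, which is the pattern lockdep complains about. As a rough userspace sketch only (this is not kernel code, lockdep's real implementation is far more elaborate, and the lock_graph type and the record_acquire() helper are names made up for illustration), the check amounts to recording "B was acquired while holding A" edges and refusing any edge that would close a cycle:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical lock classes; only the two lock names come from the report. */
enum lock_class { NAMESPACE_SEM, I_MUTEX, NR_LOCK_CLASSES };

/* held_before[a][b] is true once some path acquired b while holding a. */
struct lock_graph {
    bool held_before[NR_LOCK_CLASSES][NR_LOCK_CLASSES];
};

static void graph_init(struct lock_graph *g)
{
    memset(g, 0, sizeof(*g));
}

/* Depth-first search: is `to` reachable from `from` along recorded edges? */
static bool reachable(const struct lock_graph *g, int from, int to, bool *seen)
{
    if (from == to)
        return true;
    seen[from] = true;
    for (int next = 0; next < NR_LOCK_CLASSES; next++) {
        if (g->held_before[from][next] && !seen[next] &&
            reachable(g, next, to, seen))
            return true;
    }
    return false;
}

/*
 * Record "acquire `wanted` while holding `held`".
 * Returns false (and records nothing) if an existing chain already leads
 * from `wanted` back to `held`, i.e. the new ordering would be an inversion.
 */
static bool record_acquire(struct lock_graph *g,
                           enum lock_class held, enum lock_class wanted)
{
    bool seen[NR_LOCK_CLASSES] = { false };

    assert(held != wanted);
    if (reachable(g, wanted, held, seen))
        return false;
    g->held_before[held][wanted] = true;
    return true;
}
```

Feeding it the mount path first (NAMESPACE_SEM held, I_MUTEX wanted) succeeds; feeding it the create path afterwards (I_MUTEX held, NAMESPACE_SEM wanted) is rejected, which mirrors the order of the two chains in the warning.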
Re: [PATCH 2.6.24-rc1]EXPORT_SYMBOL(__set_page_dirty_no_writeback);
Hello.

Arjan van de Ven wrote:
> when will you post this filesystem for inclusion into kernel.org kernel? (and please really consider posting the patch together with that patch)
> (also, if you can give a pointer to the source code of this filesystem you might even get early code review)

I have proposed this filesystem at http://lkml.org/lkml/2004/11/1/48 .

In short, the filesystem I'm developing is a trivial device filesystem that provides a protection mechanism against tampering. The reasons I don't use devfs/udev, fuse, or LSM for /dev are:

devfs/udev don't provide a protection mechanism against tampering. I don't know of an implementation that can enforce the pairing of a filename and its attributes. Label based access control like SELinux doesn't distinguish /dev/sda1 from /dev/sda2, does it? If a process that is permitted to unlink and create /dev/sda1 and /dev/sda2 is cracked, who can ensure that /dev/sda1 is block-8-1 and /dev/sda2 is block-8-2? A situation where /dev/sda1 is block-8-2 and /dev/sda2 is block-8-1 can happen.

/dev has to be valid throughout the lifetime of the system (i.e. from /sbin/init until power failure). Filesystems using fuse will freeze when the shutdown script runs /usr/bin/killall, which is too early for the /dev partition to stop working.

LSM is used by SELinux, so there is little chance that my module will be called to validate a device file's filename and its attributes.

The latest snapshot (which does not follow the coding style yet) is at
http://svn.sourceforge.jp/cgi-bin/viewcvs.cgi/*checkout*/trunk/1.5.x/ccs-patch/include/linux/syaoran.h?content-type=text%2Fplain&rev=588&root=tomoyo
http://svn.sourceforge.jp/cgi-bin/viewcvs.cgi/*checkout*/trunk/1.5.x/ccs-patch/fs/syaoran_2.6.c?content-type=text%2Fplain&rev=614&root=tomoyo

If there is a chance for inclusion into the kernel.org kernel, I'm willing to fix the coding style and submit immediately.

Thank you.
Re: Does 32.1% non-contiguous mean severely fragmented?
Hello.

> What filesystem are you using? ext3? ext4? xfs? And are you using any non-standard patches, such as some of the delayed allocation patches that have been floating around? If you're using ext3, that shouldn't be happening.

I'm using ext3. I'm running it on kernel 2.6.18-8.1.14.el5 (CentOS 5) for x86_64. I don't know whether any of the delayed allocation patches are applied to the 2.6.18-8.1.14.el5 kernel.

> Are you sure the file isn't getting written by some background tasks that you weren't aware of? This seems very strange; what virtualization software are you using? VMware, Xen, KVM?

I'm using VMware Workstation 6.0.0 build 45731 for x86_64. It seems that there were some background tasks that delay writing. I tried the following sequence; sync made no difference in the first two rounds.

[EMAIL PROTECTED] Ubuntu7.10]# service vmware stop
[EMAIL PROTECTED] Ubuntu7.10]# sleep 30
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 9280 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# sync
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 9280 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# service vmware start
[EMAIL PROTECTED] Ubuntu7.10]# vmware
[EMAIL PROTECTED] Ubuntu7.10]# service vmware stop
[EMAIL PROTECTED] Ubuntu7.10]# sleep 30
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 9748 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# sync
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 9748 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# service vmware start
[EMAIL PROTECTED] Ubuntu7.10]# vmware
[EMAIL PROTECTED] Ubuntu7.10]# service vmware stop
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 9749 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# sync
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 9755 extents found, perfection would be 5 extents

Thank you.
Re: Does 32.1% non-contiguous mean severely fragmented?
Hello.

Theodore Tso wrote:
> Secondly, what results do you get when you run the command hdparm -tT /dev/sda (or /dev/hda if you are using an IDE disk)?

[EMAIL PROTECTED] Ubuntu7.10]# hdparm -tT /dev/hda1
/dev/hda1:
 Timing cached reads: 10384 MB in 2.00 seconds = 5196.44 MB/sec
 Timing buffered disk reads: 116 MB in 3.02 seconds = 38.36 MB/sec
[EMAIL PROTECTED] Ubuntu7.10]# hdparm -tT /dev/hda1
/dev/hda1:
 Timing cached reads: 10572 MB in 2.00 seconds = 5291.32 MB/sec
 Timing buffered disk reads: 118 MB in 3.04 seconds = 38.83 MB/sec

The BIOS setting says it uses AHCI mode.

> First of all, what does the filefrag program (shipped as part of e2fsprogs, not included in some distributions) say if you run it as root on your VM data file?

Here is the result of filefrag. The *-f???*.vmdk files are split into 2 GB pieces.

[EMAIL PROTECTED] Ubuntu7.10]# filefrag *
Ubuntu7.10-0: 1 extent found
Ubuntu7.10-f001.vmdk: 151 extents found, perfection would be 18 extents
Ubuntu7.10-f002.vmdk: 36 extents found, perfection would be 18 extents
Ubuntu7.10-f003.vmdk: 5 extents found, perfection would be 1 extent
Ubuntu7.10.nvram: 1 extent found
Ubuntu7.10.vmdk: 1 extent found
Ubuntu7.10.vmsd: 1 extent found
Ubuntu7.10.vmx: 1 extent found
Ubuntu7.10.vmxf: 1 extent found
Ubuntu7.10.vmx.lck: Not a regular file
Ubuntu7-f001.10-0: 167 extents found, perfection would be 18 extents
Ubuntu7-f002.10-0: 68 extents found, perfection would be 18 extents
Ubuntu7-f003.10-0: 20 extents found, perfection would be 18 extents
Ubuntu7-f004.10-0: 93 extents found, perfection would be 18 extents
Ubuntu7-f005.10-0: 316 extents found, perfection would be 18 extents
Ubuntu7-f006.10-0: 27 extents found, perfection would be 18 extents
Ubuntu7-f007.10-0: 21 extents found, perfection would be 18 extents
Ubuntu7-f008.10-0: 20 extents found, perfection would be 18 extents
Ubuntu7-f009.10-0: 78 extents found, perfection would be 18 extents
Ubuntu7-f010.10-0: 22 extents found, perfection would be 18 extents
Ubuntu7-f011.10-0: 47 extents found, perfection would be 1 extent
vmware-0.log: 4 extents found, perfection would be 1 extent
vmware-1.log: 3 extents found, perfection would be 1 extent
vmware-2.log: 15 extents found, perfection would be 1 extent
vmware.log: 3 extents found, perfection would be 1 extent

Yes, some files are discontiguous, but the ratio is not so high considering their size.

The 512 MB suspend image has a much higher ratio of discontiguous extents, as shown below. When I just power on and suspend at grub, the extent count is smaller than "perfection". The image is probably sparse (memory is allocated but not all of it is accessed). But when I do some operations after login, the file becomes more discontiguous.

--- Start VM ---
--- Suspend VM ---
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 1 extent found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 14 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 14 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# sync
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 17 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 17 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# sync
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 17 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 17 extents found, perfection would be 5 extents
--- Resume and poweroff VM ---

--- Start VM ---
--- Suspend VM ---
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 751 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# sync
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 3281 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 3281 extents found, perfection would be 5 extents
--- Resume and poweroff VM ---

What? Does sync make the file more discontiguous?

--- Start VM ---
--- Suspend VM ---
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 10 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# sync
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 482 extents found, perfection would be 5 extents
--- Resume and poweroff VM ---

--- Start VM ---
--- Suspend VM ---
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 8 extents found, perfection would be 5 extents
[EMAIL PROTECTED] Ubuntu7.10]# sync
[EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
Ubuntu7.10.vmem: 19 extents found, perfection would be 5 extents
--- Resume and poweroff VM ---

--- Start VM ---
--- Suspend VM ---
[EMAIL PROTECTED] Ubuntu7.10]#
Re: Does 32.1% non-contiguous mean severely fragmented?
Hello.

Theodore Tso wrote:
> beginning of every single block group. You have a small number of files on your system (349) occupying an average of 348 megabytes. So it's not at all surprising that the contiguous percentage is 32%.

I see, thank you. Yes, there are many files split into 2 GB pieces. But what surprises me is that I have to wait more than five minutes to save or restore the virtual machine's 512 MB RAM image (usually it takes less than five seconds). hdparm reports that DMA is on and e2fsck reports no errors, so I thought the filesystem was severely fragmented. Maybe I should back up all of the virtual machines' data, format the partition, and restore the data.
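The average file size quoted in this reply can be roughly reproduced from the e2fsck summary line in the original report (349 files, 31019203 blocks in use). A minimal sketch, assuming a 4 KiB block size (the thread does not state it) and ignoring that some of the used blocks hold filesystem metadata rather than file data; avg_file_mib() is a hypothetical helper, not part of e2fsprogs:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Average space per file in whole MiB, given the number of blocks in use,
 * the block size in bytes, and the number of files.
 */
static uint64_t avg_file_mib(uint64_t used_blocks, uint64_t block_size,
                             uint64_t files)
{
    assert(files > 0);
    return used_blocks * block_size / files / (1024 * 1024);
}
```

With the numbers from the report, avg_file_mib(31019203, 4096, 349) evaluates to 347, which is in line with the figure quoted above. Files of that size cannot fit inside a single block group, so e2fsck necessarily counts them as non-contiguous.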
Does "32.1% non-contiguous" mean severely fragmented?
Hello.

I ran e2fsck and it reported the following.

[EMAIL PROTECTED] ~]# e2fsck -f /dev/hda1
e2fsck 1.39 (29-May-2006)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/data/VMware: 349/19546112 files (32.1% non-contiguous), 31019203/39072080 blocks

Does "non-contiguous" mean fragmented? If so, where is ext3defrag?

Regards.
Re: Pass struct vfsmount to the inode_create LSM hook
Hello.

Andreas Gruenbacher wrote:
> > exec { "/usr/bin/gunzip" } "gzip", "-9", "some/file/to.gz";
> > The above Perl code executes /usr/bin/gunzip and sets argv[0] to gzip, so this confirms that the value of argv[0] is arbitrary.
> Well great, we already knew. AppArmor does not look at argv[0] for anything, and doing so would be insane. So please don't jump to the wrong conclusions.

I agree that argv[0] checking is different from pathname-based or label-based access control, but I want to say that argv[0] checking is still needed. If you don't check argv[0], an attacker can request anything, for example

exec { "/bin/ls" } "/sbin/busybox", "cat", "/etc/shadow";
exec { "/bin/ls" } "/sbin/busybox", "rm", "/etc/shadow";

if /bin/ls, /bin/cat and /bin/rm are hard links to /sbin/busybox (e.g. on embedded systems). Therefore, TOMOYO Linux checks the combination of filename and argv[0] passed to execve().

Thanks.
Re: Pass struct vfsmount to the inode_create LSM hook
Hello.

Andreas Gruenbacher wrote:
> > Therefore, TOMOYO Linux checks the combination of filename and argv[0] passed to execve().
> So you are indeed trying to control the value of argv[0]? Well, good luck with that, but it's totally insane. You are guaranteed to break some applications.

TOMOYO Linux restricts argv[0] using the allow_argv0 syntax.

allow_argv0 /bin/bash -bash
    allows passing /bin/bash as filename and -bash as argv[0].
allow_argv0 /bin/gzip gunzip
    allows passing /bin/gzip as filename and gunzip as argv[0].
allow_argv0 /sbin/busybox cat
    allows passing /sbin/busybox as filename and cat as argv[0].

There is no need to use the allow_argv0 syntax if the basename of filename and the basename of argv[0] are the same (i.e. allow_argv0 /bin/bash bash is not required).

TOMOYO Linux doesn't unconditionally forbid passing different values for filename and argv[0]. It allows different values only if they are permitted by an allow_argv0 entry. Could you please explain to me why this approach breaks applications?

> If /bin/cat and /bin/rm are binaries or hardlinks to the same busybox binary (rather than symlinks), different profiles could be used for each of them.

That is true if all processes are kept under control (e.g. the strict policy in SELinux). If there is a process that is not kept under control (e.g. the targeted policy in SELinux), you can't protect the application. For example, an administrator may wish to let users run /bin/ls without applying a profile because /bin/ls won't read or write the content of files. But a malicious user may pass /bin/ls as filename, rm as argv[0], and /etc/shadow as argv[1]. A malicious user may also pass /bin/ls as filename and /usr/sbin/httpd as argv[0], so that the program behaves as /usr/sbin/httpd without the profile for /usr/sbin/httpd being applied.

Thanks.
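The decision described in this message can be sketched in a few lines of userspace C. This is only an illustration of the rule as stated above, not TOMOYO Linux's actual code: the argv0_rule structure and the argv0_allowed() function are hypothetical names, and comparing the rule against the basename of argv[0] (rather than the full string) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One "allow_argv0 <filename> <argv0>" entry (hypothetical representation). */
struct argv0_rule {
    const char *filename;
    const char *argv0;
};

/* Return the part of `path` after the last '/', or `path` itself. */
static const char *base_name(const char *path)
{
    const char *slash = strrchr(path, '/');

    return slash ? slash + 1 : path;
}

/*
 * Allow the exec if the basenames of filename and argv[0] agree, or if an
 * explicit rule permits this (filename, argv[0] basename) combination.
 */
static bool argv0_allowed(const struct argv0_rule *rules, size_t nr_rules,
                          const char *filename, const char *argv0)
{
    const char *name = base_name(argv0);

    assert(filename && argv0);
    if (strcmp(base_name(filename), name) == 0)
        return true;
    for (size_t i = 0; i < nr_rules; i++) {
        if (strcmp(rules[i].filename, filename) == 0 &&
            strcmp(rules[i].argv0, name) == 0)
            return true;
    }
    return false;
}
```

With the three example entries loaded, /bin/bash with argv[0] "bash" or "-bash" is accepted, while /bin/ls with argv[0] "rm" is rejected because the basenames differ and no entry covers that pair.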
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
Hello.

Casey Schaufler wrote:
> Sorry, but I don't understand your objection. If AppArmor is configured to allow everyone access to /bin/gzip but only some people access to /bin/gunzip and (important detail) the single binary uses argv[0] as documented and (another important detail) there aren't other links named gunzip to the binary (ok, that's lots of if's) you should be fine.

argv[0] defines the default behavior of hard linked or symbolic linked programs, but that behavior can be overridden using command line options. If you want to allow access to /bin/gzip but deny access to /bin/gunzip, you also need to deny access to

/bin/gzip -d
/bin/gzip --decompress
/bin/gzip --uncompress

It is impossible to do so, because the options that override the default behavior depend on each program's design, and you can't know what programs and what options exist on the system. Even if you knew all programs and all options on the system, finding and rejecting the options that override the default behavior is too tough a job to do in kernel space.

> > Well, my point was exactly that App Armor doesn't (as far as I know) do anything to enforce the argv[0] convention,
> Sounds like an opportunity for improvement then.

There are (I think) three types of program invocation.

(1) Invocation of hard linked programs.

/bin/gzip, /bin/gunzip and /bin/zcat are hard links. There is no problem, because you can find out which pathname was requested by using d_namespace_path() with struct linux_binprm->file.

(2) Invocation of symbolic linked programs.

/sbin/pidof is a symbolic link to /sbin/killall. There is a problem, because you can't find out which pathname was requested by using d_namespace_path() with struct linux_binprm->file: the symbolic link was already dereferenced inside open_exec(). To find out which pathname was requested, you need to do a lookup on struct linux_binprm->filename without LOOKUP_FOLLOW and then use d_namespace_path(). There is a race condition, because the pathname that the symbolic link struct linux_binprm->filename points to may change, but this is inevitable: you can't get the dentry and vfsmount for the lookup without LOOKUP_FOLLOW and for the lookup with LOOKUP_FOLLOW at the same time.

(3) Invocation of dynamically created programs with random names.

/usr/sbin/logrotate creates files matching the pattern /tmp/logrotate.?? and executes these dynamically created files. To keep the execution of these files under control, you need to aggregate their pathnames. AppArmor can't define a profile if the pathname of the program is random, can it?

Usually argv[0] and struct linux_binprm->filename are the same, but if you want to do something with argv[0], you will need to handle case (2) to see whether argv[0] and struct linux_binprm->filename are the same.

Thanks.
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
Hello.

I think bind mounts were discussed when shared subtrees ( http://lwn.net/Articles/159092/ ) were introduced.

On systems that allow users to mount their CD/DVDs freely, bind mounts are used, and labeling files is a convenient way to deny access to somebody else's files. But on systems that don't allow users to mount their CD/DVDs freely, bind mounts need not be used, and using pathnames is a convenient way to deny access to somebody else's files.

A pathname based access control/auditing system works if the system doesn't use bind mounts. However, there are distributions (e.g. Debian Etch) that always use bind mounts. On such distributions, a pathname based access control/auditing system doesn't work. This is the fault of neither the distributions nor the pathname based access control/auditing system. It can be solved by passing vfsmount to the VFS and LSM functions.

SELinux users are having a lot of trouble because pathnames in audit logs are not always complete. AppArmor users are having a lot of trouble because the pathnames that a process requested are ambiguous when bind mounts are used. Being able to report the pathnames that a process requested is a natural expectation from the viewpoint of user friendliness. I believe passing vfsmount makes both groups of users happy.

Thanks.
Re: [d_path 3/7] Add d_namespace_path() to compute namespace relative pathnames
Hello.

I've just returned from ELC2007 and I haven't read all posts in this thread yet, but I want to comment on this function.

> In AppArmor, we are interested in pathnames relative to the namespace root. This is the same as d_path() except for the root where the search ends. Add a function for computing the namespace-relative path.

Yes. You came to the same conclusion as TOMOYO Linux did.
http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/fs/realpath.c#L39
TOMOYO Linux uses pathnames relative to the namespace root.

You do this in d_path()'s way, but some extensions are needed if you want to use d_namespace_path() for access control/auditing purposes.

In Linux, any character other than NUL can appear in a pathname. This means that you can't assume that whitespace characters are delimiters. For example, when you process entries in "Access . granted/rejected\n" format (where "." stands for a pathname and \n is a newline, like "Access /bin/ls granted\n"), an entry "Access /bin/ls granted\nAccess /bin/cat granted\n" can be produced if "." is "/bin/ls granted\nAccess /bin/cat". Processing such an entry will produce a wrong result.

Also, you want wildcards (usually *) when doing pathname comparison, but there are files whose names contain wildcard characters (for example, /usr/share/guile/1.6/ice-9/and-let*.scm in CentOS 4.4). You need escaping so that you can tell whether * indicates a literal * or a wildcard.

Also, in non-English regions, pathnames include characters outside the ASCII printable range (for example, files created via Samba from a Windows client). Some programs can't handle characters that have the MSB set, so you may want to represent all characters without using the MSB.

It may be OK to use d_namespace_path() for processing a userland configuration file, but it is not OK to use it for processing a kernel configuration file. The kernel has to be able to handle any character. So, you may want a customized version of d_namespace_path()?
Re: [RFC 0/28] Patches to pass vfsmount to LSM inode security hooks
Tony Jones wrote:
> The following are a set of patches the goal of which is to pass vfsmounts through select portions of the VFS layer sufficient to be visible to the LSM inode operation hooks.

I have been looking forward to these patches for so long.

Chris Wright wrote:
> This kind of change (or perhaps straight to struct path) is definitely needed from AA.

Not only AppArmor, but also TOMOYO Linux needs these patches. TOMOYO Linux is a pathname based access control patch like AppArmor.
http://lwn.net/Articles/165132/

I have been asked "Why not use LSM?" and the answer is always "I can't, because the VFS helper functions and LSM functions don't receive vfsmount." So I am manually patching the locations that call VFS helper functions. But if Tony's patches are accepted upstream, TOMOYO Linux would be able to use LSM.

I think these patches are also useful for auditing functions, because audit logs will be able to include the absolute pathname instead of a partial pathname. I think most people want access logs in the form of pathnames rather than security labels.