Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> > These patches add local caching for network filesystems such as NFS.
>
> Have you got before/after benchmark results?

I need to get a new hard drive for my test machine before I can produce some more up-to-date benchmark results. It does seem, however, that the I/O error handling capabilities of FS-Cache work properly. :-)

David
-
To unsubscribe from this list: send the line "unsubscribe linux-security-module" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> Have you got before/after benchmark results?

See attached.

These show a couple of things:

 (1) Dealing with lots of metadata slows things down a lot. Note the result of looking up and reading lots of small files with tar (the last result). The NFS client has to consult both the NFS server *and* the cache. Not only that, but any asynchronicity the cache might like to exploit is rendered ineffective by the fact that tar wants to read a file pretty much directly after opening it.

 (2) Getting metadata from the local disk fs is slower than pulling it across an unshared gigabit ethernet from a server that already has it in memory.

These points don't mean that FS-Cache is of no use, just that you have to consider carefully whether it's of use to *you* in your particular situation, and that depends on various factors.

Note that FS-Cache is currently disabled for individual NFS files opened for writing, as there's no way to handle the coherency problems thereby introduced.

David
---
=== FS-CACHE FOR NFS BENCHMARKS ===

 (*) The NFS client has a 1.86GHz Core2 Duo CPU and 1GB of RAM.

 (*) The NFS client has a Seagate ST380211AS 80GB 7200rpm SATA disk on an interface running in AHCI mode. The chipset is an Intel G965.

 (*) A partition of approx 4.5GB is committed to caching, and is formatted as Ext3 with a blocksize of 4096 and directory indices.

 (*) The NFS client is using SELinux.

 (*) The NFS server is running an in-kernel NFSd, and has a 2.66GHz Core2 Duo CPU and 6GB of RAM. The chipset is an Intel P965.

 (*) The NFS client is connected to the NFS server by Gigabit Ethernet.
 (*) The NFS mount is made with defaults for all options not relating to the cache:

	warthog:/warthog /warthog nfs rw,vers=3,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,retrans=2,sec=sys,fsc,addr=w.x.y.z 0 0

== FEW BIG FILES TEST ==

Where:

 (*) The NFS server has two files:

	[EMAIL PROTECTED] ~]# ls -l /warthog/bigfile
	-rw-rw-r-- 1 4043 4043 104857600 2006-11-30 09:39 /warthog/bigfile
	[EMAIL PROTECTED] ~]# ls -l /warthog/biggerfile
	-rw-rw-r-- 1 4043 4041 209715200 2006-03-21 13:56 /warthog/biggerfile

     Both of which are in memory on the server in all cases.

No patches, cold NFS cache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m1.909s
	user	0m0.000s
	sys	0m0.520s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m3.750s
	user	0m0.000s
	sys	0m0.904s

CONFIG_FSCACHE=n, cold NFS cache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.003s
	user	0m0.000s
	sys	0m0.124s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.100s
	user	0m0.004s
	sys	0m0.488s

Cold NFS cache, no disk cache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.084s
	user	0m0.000s
	sys	0m0.136s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.020s
	user	0m0.000s
	sys	0m0.720s

Completely cold caches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.412s
	user	0m0.000s
	sys	0m0.892s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.449s
	user	0m0.000s
	sys	0m2.300s

Warm NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.067s
	user	0m0.000s
	sys	0m0.064s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.133s
	user	0m0.000s
	sys	0m0.136s

Warm Ext3 pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.173s
	user	0m0.000s
	sys	0m0.172s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.316s
	user	0m0.000s
	sys	0m0.316s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m1.955s
	user	0m0.000s
	sys	0m0.244s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m3.596s
	user	0m0.000s
	sys	0m0.460s

=== MANY SMALL/MEDIUM FILE READING TEST ===

Where:

 (*) The NFS server has an old kernel tree:

	[EMAIL PROTECTED] ~]# du -s /warthog/aaa
	347340	/warthog/aaa
Re: [PATCH 00/37] Permit filesystem local caching
On Thu, Feb 21, 2008 at 9:55 AM, David Howells <[EMAIL PROTECTED]> wrote:
> Daniel Phillips <[EMAIL PROTECTED]> wrote:
>
> > Have you got before/after benchmark results?
>
> See attached.
>
> These show a couple of things:
>
>  (1) Dealing with lots of metadata slows things down a lot. Note the result
>      of looking up and reading lots of small files with tar (the last
>      result). The NFS client has to consult both the NFS server *and* the
>      cache. Not only that, but any asynchronicity the cache might like to
>      exploit is rendered ineffective by the fact that tar wants to read a
>      file pretty much directly after opening it.
>
>  (2) Getting metadata from the local disk fs is slower than pulling it
>      across an unshared gigabit ethernet from a server that already has it
>      in memory.

Hi David,

Your results remind me of this, in case you're interested...

http://www.citi.umich.edu/techreports/reports/citi-tr-92-3.pdf
Re: [PATCH 00/37] Permit filesystem local caching
Hi David,

I am trying to spot the numbers that show the sweet spot for this optimization, without much success so far. Who is supposed to win big? Is this mainly about reducing the load on the server, or is the client supposed to win even with a lightly loaded server?

When you say "Ext3 cache" vs "NFS cache", is the first on the server and the second on the client?

Regards,

Daniel
RE: [PATCH 00/37] Permit filesystem local caching
Well, the AFS paper that was referenced earlier was written around the time of 10BT and 100BT. Local disk caching worked well then. There should also be some papers at CITI about disk caching over slower connections, and about disconnected operation (which should still be applicable today). There are still winners from local disk caching, but their numbers have been reduced. Server load reduction should be a win. I'm not sure if it's worth it from a security/manageability standpoint, but I haven't looked that closely at David's code.

-Dan

-----Original Message-----
From: Daniel Phillips [mailto:[EMAIL PROTECTED]]
Sent: Thursday, February 21, 2008 2:44 PM
To: David Howells
Cc: Myklebust, Trond; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; linux-security-module@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PATCH 00/37] Permit filesystem local caching

Hi David,

I am trying to spot the numbers that show the sweet spot for this optimization, without much success so far. Who is supposed to win big? Is this mainly about reducing the load on the server, or is the client supposed to win even with a lightly loaded server?

When you say "Ext3 cache" vs "NFS cache", is the first on the server and the second on the client?

Regards,

Daniel

_______________________________________________
NFSv4 mailing list
[EMAIL PROTECTED]
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
Re: [PATCH 00/37] Permit filesystem local caching
David Howells <[EMAIL PROTECTED]> wrote:

> > Have you got before/after benchmark results?
>
> See attached.

Attached here are results using BTRFS (patched so that it'll work at all) rather than Ext3 on the client, on the partition backing the cache.

Note that I didn't bother redoing the tests that don't involve a cache, as the choice of filesystem backing the cache should have no bearing on the result.

Generally, completely cold caches shouldn't show much variation, as all the writing can be done completely asynchronously, provided the client doesn't fill its RAM.

The interesting case is where the disk cache is warm, but the pagecache is cold (ie: just after a reboot after filling the caches). Here, for the two-big-files case, BTRFS appears quite a bit better than Ext3, showing a 21% reduction in time for the smaller case and a 13% reduction for the larger case.

For the many-small/medium-files case, BTRFS performed significantly better (a 15% reduction in time) in the case where the caches were completely cold. I'm not sure why, though - perhaps because it doesn't execute a write_begin() stage during the write_one_page() call and thus doesn't go allocating disk blocks to back the data, but instead allocates them later.

More surprising is that BTRFS performed significantly worse (a 15% increase in time) in the case where the cache on disk was fully populated and then the machine had been rebooted to clear the pagecaches.

It's important to note that I've only run each test once apiece, so the numbers should be taken with a modicum of salt (bad statistics and all that).
David
---
=== FEW BIG FILES TEST ON BTRFS ===

Completely cold caches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m2.124s
	user	0m0.000s
	sys	0m1.260s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m4.538s
	user	0m0.000s
	sys	0m2.624s

Warm NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.061s
	user	0m0.000s
	sys	0m0.064s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.118s
	user	0m0.000s
	sys	0m0.116s

Warm BTRFS pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m0.189s
	user	0m0.000s
	sys	0m0.188s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m0.369s
	user	0m0.000s
	sys	0m0.368s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
	real	0m1.540s
	user	0m0.000s
	sys	0m1.440s
	[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
	real	0m3.132s
	user	0m0.000s
	sys	0m1.724s

=== MANY SMALL/MEDIUM FILE READING TEST ON BTRFS ===

Completely cold caches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m31.838s
	user	0m0.192s
	sys	0m6.076s

Warm NFS pagecache:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m14.841s
	user	0m0.148s
	sys	0m4.988s

Warm BTRFS pagecache, cold NFS pagecache:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	0m16.773s
	user	0m0.148s
	sys	0m5.512s

Warm on-disk cache, cold pagecaches:

	[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
	real	2m12.527s
	user	0m0.080s
	sys	0m2.908s
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> When you say "Ext3 cache" vs "NFS cache", is the first on the server and
> the second on the client?

The filesystem on the server is pretty much irrelevant as long as (a) it doesn't change, and (b) all the data is in memory on the server anyway.

The way the client works is like this:

	+---------+
	|         |
	|   NFS   |--+
	|         |  |
	+---------+  |
	             |   +----------+
	+---------+  |   |          |
	|         |  +-->|          |      +------------+      +------------+
	|   AFS   |----->| FS-Cache |      |            |      |    Ext3    |
	|         |  +-->|          |----->| CacheFiles |----->| /dev/sda6  |
	+---------+  |   |          |      | /var/cache |      |            |
	             |   +----------+      +------------+      +------------+
	+---------+  |
	|         |  |
	|  ISOFS  |--+
	|         |
	+---------+

 (1) NFS, say, asks FS-Cache to store/retrieve data for it;

 (2) FS-Cache asks the cache backend, in this case CacheFiles, to honour the operation;

 (3) CacheFiles 'opens' a file in a mounted filesystem, say Ext3, and does read and write operations of a sort on it;

 (4) Ext3 decides how the cache data is laid out on disk - CacheFiles just attempts to use one sparse file per netfs inode.

> I am trying to spot the numbers that show the sweet spot for this
> optimization, without much success so far.

What are you trying to do exactly? Are you actually playing with it, or just looking at the numbers I've produced?

> Who is supposed to win big? Is this mainly about reducing the load on
> the server, or is the client supposed to win even with a lightly loaded
> server?

These are difficult questions to answer. The obvious answer to both is "it depends", and the real answer to both is "it's a compromise".

Inserting a cache adds overhead: you have to look in the cache to see if your objects are mirrored there, and then you have to look in the cache to see if the data you want is stored there; and then you might have to go to the server anyway and then schedule a copy to be stored in the cache.

The characteristics of this type of cache depend on a number of things: the filesystem backing it being the most obvious variable, but also how fragmented it is and the properties of the disk drive or drives it is on.
Whether it's worth having a cache depends on the characteristics of the network versus the characteristics of the cache: the latency of the cache vs the latency of the network, for example. Network loading is another: having a cache on each of several clients sharing a server can reduce network traffic by avoiding the read requests to the server. NFS has a characteristic that it keeps spamming the server with file status requests, so even if you take the read requests out of the load, an NFS client still generates quite a lot of network traffic to the server - but the reduction is still useful.

The metadata problem is quite a tricky one, since it increases with the number of files you're dealing with. As things stand in my patches, when NFS, for example, wants to access a new inode, it first has to go to the server to look up the NFS file handle, and only then can it go to the cache to find out whether there's a matching object in the cache. Worse, the cache must then perform several synchronous disk-bound metadata operations before it becomes possible to read from the cache. Worse still, this means that a read on the network file cannot proceed until (a) we've been to the server *plus* (b) we've been to the disk.

The reason my client going to my server is so quick is that the server has the dcache and the pagecache preloaded, so that across-network lookup operations are really, really quick, as compared to the synchronous slogging of the local disk to find the cache object.

I can probably improve this a little by pre-loading the subindex directories (hash tables) that I use to reduce the directory size in the cache, but I don't know by how much.

Anyway, to answer your questions:

 (1) It may help with heavily loaded networks with lots of read-only traffic.

 (2) It may help with slow connections (like doing NFS between the UK and Australia).

 (3) It could be used to do offline/disconnected operation.
David
Re: [PATCH 00/37] Permit filesystem local caching
On Thursday 21 February 2008 16:07, David Howells wrote:
> The way the client works is like this:

Thanks for the excellent ascii art, that cleared up the confusion right away.

> What are you trying to do exactly? Are you actually playing with it, or
> just looking at the numbers I've produced?

Trying to see if you are offering enough of a win to justify testing it, and if that works out, then going shopping for a bin of rotten vegetables to throw at your design, which I hope you will perceive as useful. In short, I am looking for a reason to throw engineering effort at it.

From the numbers you have posted, I think you are missing some basic efficiencies that could take this design from the sorta-OK zone to "wow!". I think you may already be in the "wow" zone for taking load off a server, and I know of applications where an NFS server gets hammered so badly that having the client suck a little in the unloaded case is a price worth paying. But the whole idea would be much more attractive if the regressions were smaller.

> > Who is supposed to win big? Is this mainly about reducing the load on
> > the server, or is the client supposed to win even with a lightly
> > loaded server?
>
> These are difficult questions to answer. The obvious answer to both is
> "it depends", and the real answer to both is "it's a compromise".
>
> Inserting a cache adds overhead: you have to look in the cache to see
> if your objects are mirrored there, and then you have to look in the
> cache to see if the data you want is stored there; and then you might
> have to go to the server anyway and then schedule a copy to be stored
> in the cache.

But looking up the object in the cache should be nearly free - much less than a microsecond per block. If not, then there are design issues. I suspect that you are doing yourself a disservice by going all the way through the VFS to do this cache lookup, but this needs to be proved.
> The characteristics of this type of cache depend on a number of things:
> the filesystem backing it being the most obvious variable, but also how
> fragmented it is and the properties of the disk drive or drives it is
> on.

Double caching, and the VM's unawareness of it, has to hurt.

> The metadata problem is quite a tricky one since it increases with the
> number of files you're dealing with. As things stand in my patches,
> when NFS, for example, wants to access a new inode, it first has to go
> to the server to look up the NFS file handle, and only then can it go
> to the cache to find out if there's a matching object in the cache.

So without the persistent cache it can omit the LOOKUP and just send the filehandle as part of the READ?

> Worse, the cache must then perform several synchronous disk-bound
> metadata operations before it becomes possible to read from the cache.
> Worse still, this means that a read on the network file cannot proceed
> until (a) we've been to the server *plus* (b) we've been to the disk.
>
> The reason my client going to my server is so quick is that the server
> has the dcache and the pagecache preloaded, so that across-network
> lookup operations are really, really quick, as compared to the
> synchronous slogging of the local disk to find the cache object.

Doesn't that just mean you have to preload the lookup table for the persistent cache, so you can determine whether you are caching the data for a filehandle without going to disk?

> I can probably improve this a little by pre-loading the subindex
> directories (hash tables) that I use to reduce the directory size in
> the cache, but I don't know by how much.

Ah, I should have read ahead. I think the correct answer is "a lot".

Your big can't-get-there-from-here is the round trip to the server to determine whether you should read from the local cache. Got any ideas? And where is the Trond-meister in all of this?
Regards,

Daniel
[PATCH 3/3] exporting capability name/code pairs (final)
[3/3] A new example to use kobject/kobj_attribute

The attached patch provides a new example using kobject and attributes. The _show() and _store() methods can refer to the private data field of the kobj_attribute structure to know which entry is being accessed by users. This makes it easier to share a single _show()/_store() method between several entries.

KaiGai Kohei <[EMAIL PROTECTED]>
--
 samples/kobject/kobject-example.c |   32 ++++++++++++++++++++++++++++++++
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/samples/kobject/kobject-example.c b/samples/kobject/kobject-example.c
index 08d0d3f..f99d734 100644
--- a/samples/kobject/kobject-example.c
+++ b/samples/kobject/kobject-example.c
@@ -77,6 +77,35 @@ static struct kobj_attribute baz_attribute =
 static struct kobj_attribute bar_attribute =
 	__ATTR(bar, 0666, b_show, b_store);
 
+/*
+ * You can store private data in the 'data' field of kobj_attribute.
+ * This makes it possible to share a single _show() or _store() method
+ * between several entries.
+ */
+static ssize_t integer_show(struct kobject *kobj,
+			    struct kobj_attribute *attr,
+			    char *buf)
+{
+	return scnprintf(buf, PAGE_SIZE, "%d\n", (int) attr->data);
+}
+
+static ssize_t integer_store(struct kobject *kobj,
+			     struct kobj_attribute *attr,
+			     const char *buf, size_t count)
+{
+	int code;
+
+	sscanf(buf, "%du", &code);
+	attr->data = (void *) code;
+	return count;
+}
+
+static struct kobj_attribute hoge_attribute =
+	__ATTR_DATA(hoge, 0666, integer_show, integer_store, 123);
+static struct kobj_attribute piyo_attribute =
+	__ATTR_DATA(piyo, 0666, integer_show, integer_store, 456);
+static struct kobj_attribute fuga_attribute =
+	__ATTR_DATA(fuga, 0444, integer_show, NULL, 789);
+
 /*
  * Create a group of attributes so that we can create and destory them all
@@ -86,6 +115,9 @@ static struct attribute *attrs[] = {
 	&foo_attribute.attr,
 	&baz_attribute.attr,
 	&bar_attribute.attr,
+	&hoge_attribute.attr,
+	&piyo_attribute.attr,
+	&fuga_attribute.attr,
 	NULL,	/* need to NULL terminate the list of attributes */
 };
--
OSS
Platform Development Division, NEC
KaiGai Kohei <[EMAIL PROTECTED]>
[PATCH 1/3] exporting capability name/code pairs (final)
[1/3] Add a private data field to the kobj_attribute structure.

This patch adds a private data field, declared as void *, to the kobj_attribute structure. The _show() and _store() methods of sysfs attribute entries can refer to this information to identify which entry is being accessed. It makes it easier to share a single method implementation between several similar entries, like the ones exporting the list of capabilities the running kernel supports.

Signed-off-by: KaiGai Kohei <[EMAIL PROTECTED]>
--
 include/linux/kobject.h |    1 +
 include/linux/sysfs.h   |    7 +++++++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index caa3f41..57d5bf1 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -130,6 +130,7 @@ struct kobj_attribute {
 			char *buf);
 	ssize_t (*store)(struct kobject *kobj, struct kobj_attribute *attr,
 			 const char *buf, size_t count);
+	void *data;	/* a private field */
 };
 
 extern struct sysfs_ops kobj_sysfs_ops;
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 8027104..6f40ff9 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -50,6 +50,13 @@ struct attribute_group {
 	.store	= _store,					\
 }
 
+#define __ATTR_DATA(_name,_mode,_show,_store,_data) {		\
+	.attr	= { .name = __stringify(_name), .mode = _mode },\
+	.show	= _show,					\
+	.store	= _store,					\
+	.data	= (void *)(_data),				\
+}
+
 #define __ATTR_RO(_name) {					\
 	.attr	= { .name = __stringify(_name), .mode = 0444 },	\
 	.show	= _name##_show,					\
--
OSS Platform Development Division, NEC
KaiGai Kohei <[EMAIL PROTECTED]>
Re: [PATCH] capabilities: implement per-process securebits
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Serge E. Hallyn wrote:
| It all looks good to me.
|
| Since we've confirmed that wireshark uses capabilities it must be using
| prctl(PR_SET_KEEPCAPS), so running it might be a good way to verify that
| your changes to that codepath (with CONFIG_SECURITY_FILE_CAPABILITIES=n)
| are 100% correct, and might set minds at ease. Is that something you're
| set up to be able to do?

I guess I need someone to offer an existence proof that this particular wireshark code ever worked? (ldd dumpcap | grep libcap).

For reference, I'm looking at wireshark-0.99.7/dumpcap.c:302:

void relinquish_privs_except_capture(void)
{
	/* CAP_NET_ADMIN: Promiscuous mode and a truckload of other
	 *               stuff we don't need (and shouldn't have).
	 * CAP_NET_RAW:  Packet capture (raw sockets).
	 */
	cap_value_t cap_list[2] = { CAP_NET_ADMIN, CAP_NET_RAW };
	cap_t caps = cap_init();
	int cl_len = sizeof(cap_list) / sizeof(cap_value_t);

	if (started_with_special_privs()) {
		print_caps("Pre drop, pre set");
		if (prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) == -1) {
			perror("prctl()");
		}
		cap_set_flag(caps, CAP_PERMITTED, cl_len, cap_list, CAP_SET);
		cap_set_flag(caps, CAP_INHERITABLE, cl_len, cap_list, CAP_SET);
		/* [XXX:AGM since caps.pE is not a subset of caps.pP, this
		   next call should fail] */
		if (cap_set_proc(caps)) {
			perror("capset()");
		}
		print_caps("Pre drop, post set");
	}
	relinquish_special_privs_perm();
	print_caps("Post drop, pre set");
	cap_set_flag(caps, CAP_EFFECTIVE, cl_len, cap_list, CAP_SET);
	if (cap_set_proc(caps)) {
		perror("capset()");
	}
	print_caps("Post drop, post set");
	cap_free(caps);
}
#endif /* HAVE_LIBCAP */

My reading of the above code suggests that the application believes that it can raise/retain effective capabilities that are not in its permitted set.
Browsing back in my git tree all the way back to 'v2.6.12-rc2', the following code (cap_capset_check) correctly requires:

	 96	/* verify the _new_Effective_ is a subset of the _new_Permitted_ */
	 97	if (!cap_issubset (*effective, *permitted)) {
	 98		return -EPERM;
	 99	}

so my question is: why should one expect this wireshark code to work? It looks wrong to me.

Thanks

Andrew

| -serge
|
| Thanks
|
| Andrew

From 006ddf6903983dd596e360ab1ab8e537b29fab46 Mon Sep 17 00:00:00 2001
From: Andrew G. Morgan <[EMAIL PROTECTED]>
Date: Mon, 18 Feb 2008 15:23:28 -0800
Subject: [PATCH] Implement per-process securebits

[This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is enabled at configure time.]

Filesystem capability support makes it possible to do away with (set)uid-0 based privilege and use capabilities instead. That is, with filesystem support for capabilities, but without this present patch, it is (conceptually) possible to manage a system with capabilities alone and never need to obtain privilege via (set)uid-0.

Of course, "conceptually" isn't quite the same as "currently possible", since few user applications - certainly not enough to run a viable system - are currently prepared to leverage capabilities to exercise privilege. Further, many applications exist that may never get upgraded in this way, and the kernel will continue to want to support their setuid-0 based privilege needs.

Where pure-capability applications evolve and replace setuid-0 binaries, it is desirable that there be a mechanism by which they can contain their privilege. In addition to leveraging the per-process bounding and inheritable sets, this should include suppressing the privilege of the uid-0 superuser from the process' tree of children.

The feature added by this patch can be leveraged to suppress the privilege associated with (set)uid-0. This suppression requires CAP_SETPCAP to initiate, and only immediately affects the 'current' process (it is inherited through fork()/exec()). This reimplementation differs significantly from the historical support for securebits, which was system-wide, unwieldy, and which has ultimately withered to a dead relic in the source of the modern kernel.

With this patch applied, a process that is capable(CAP_SETPCAP) can now drop all legacy privilege (through uid=0) for itself and all subsequently fork()'d/exec()'d children with:

	prctl(PR_SET_SECUREBITS, 0x2f);

[2008/02/18: This version includes an int -> long argument fix from Serge.]

Signed-off-by: Andrew G. Morgan <[EMAIL PROTECTED]>
Acked-by: Serge Hallyn <[EMAIL PROTECTED]>
Reviewed-by: James Morris <[EMAIL PROTECTED]>
---
 include/linux/capability.h |    3 +-
 include/linux/init_task.h  |    3 +-
 include/linux/prctl.h      |    9 +++-
 include/linux/sched.h      |    3 +-
 include/linux/securebits.h |   25 +++++---
 include/linux/security.h   |   14 +++---
 kernel/sys.c               |   25 +++++++--
 security/capability.c      |    1 +
[PATCH 2/3] exporting capability name/code pairs (final)
[2/3] Exporting capability code/name pairs

This patch makes it possible to export the code/name pairs of the capabilities the running kernel supports.

A newer kernel sometimes adds new capabilities, like CAP_MAC_ADMIN in 2.6.25. However, we have had no interface to disclose what capabilities the running kernel supports, so the libcap version has had to be kept synchronized with the kernel. This patch enables libcap to collect the list of capabilities at run time and provide them to users. It helps to improve the portability of the library.

It exports this information as regular files under /sys/kernel/capability. The numeric nodes export their names, and the symbolic nodes export their codes.

Please consider putting this patch on the queue for 2.6.25.

Thanks,

---- BEGIN EXAMPLE ----
[EMAIL PROTECTED] ~]$ ls -R /sys/kernel/capability/
/sys/kernel/capability/:
codes  names  version

/sys/kernel/capability/codes:
0  2  4  6  8  10  12  14  16  18  20  22  24  26  28  30  32
1  3  5  7  9  11  13  15  17  19  21  23  25  27  29  31  33

/sys/kernel/capability/names:
cap_audit_control    cap_kill              cap_net_raw     cap_sys_nice
cap_audit_write      cap_lease             cap_setfcap     cap_sys_pacct
cap_chown            cap_linux_immutable   cap_setgid      cap_sys_ptrace
cap_dac_override     cap_mac_admin         cap_setpcap     cap_sys_rawio
cap_dac_read_search  cap_mac_override      cap_setuid      cap_sys_resource
cap_fowner           cap_mknod             cap_sys_admin   cap_sys_time
cap_fsetid           cap_net_admin         cap_sys_boot    cap_sys_tty_config
cap_ipc_lock         cap_net_bind_service  cap_sys_chroot
cap_ipc_owner        cap_net_broadcast     cap_sys_module

[EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/version
0x20071026
[EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/codes/30
cap_audit_control
[EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/names/cap_sys_pacct
20
[EMAIL PROTECTED] ~]$
---- END EXAMPLE ----

Signed-off-by: KaiGai Kohei <[EMAIL PROTECTED]>
--
 Documentation/ABI/testing/sysfs-kernel-capability |   23 +++++
 scripts/mkcapnames.sh                             |   44 ++++++++++
 security/Makefile                                 |    9 ++
 security/commoncap.c                              |   99 ++++++++++++++++++++
 4 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-capability b/Documentation/ABI/testing/sysfs-kernel-capability
index e69de29..402ef06 100644
--- a/Documentation/ABI/testing/sysfs-kernel-capability
+++ b/Documentation/ABI/testing/sysfs-kernel-capability
@@ -0,0 +1,23 @@
+What:		/sys/kernel/capability
+Date:		Feb 2008
+Contact:	KaiGai Kohei <[EMAIL PROTECTED]>
+Description:
+	The entries under /sys/kernel/capability are used to export
+	the list of capabilities the running kernel supports.
+
+	- /sys/kernel/capability/version
+	  returns the most preferable version number for the
+	  running kernel.
+	  e.g) $ cat /sys/kernel/capability/version
+	       0x20071026
+
+	- /sys/kernel/capability/codes/<numerical representation>
+	  returns its symbolic representation, on reading.
+	  e.g) $ cat /sys/kernel/capability/codes/30
+	       cap_audit_control
+
+	- /sys/kernel/capability/names/<symbolic representation>
+	  returns its numerical representation, on reading.
+	  e.g) $ cat /sys/kernel/capability/names/cap_sys_pacct
+	       20
+
diff --git a/scripts/mkcapnames.sh b/scripts/mkcapnames.sh
index e69de29..5d36d52 100644
--- a/scripts/mkcapnames.sh
+++ b/scripts/mkcapnames.sh
@@ -0,0 +1,44 @@
+#!/bin/sh
+
+#
+# generate a cap_names.h file from include/linux/capability.h
+#
+
+CAPHEAD=`dirname $0`/../include/linux/capability.h
+REGEXP='^#define CAP_[A-Z_]+[ 	]+[0-9]+$'
+NUMCAP=`cat $CAPHEAD | egrep -c "$REGEXP"`
+
+echo '#ifndef CAP_NAMES_H'
+echo '#define CAP_NAMES_H'
+echo
+echo '/*'
+echo ' * Do NOT edit this file directly.'
+echo ' * This file is generated from include/linux/capability.h automatically'
+echo ' */'
+echo
+echo '#if !defined(SYSFS_CAP_NAME_ENTRY) || !defined(SYSFS_CAP_CODE_ENTRY)'
+echo '#error cap_names.h should be included from security/capability.c'
+echo '#else'
+echo "#if $NUMCAP != CAP_LAST_CAP + 1"
+echo '#error mkcapnames.sh cannot collect capabilities correctly'
+echo '#else'
+cat $CAPHEAD | egrep "$REGEXP" \
+    | awk '{ printf("SYSFS_CAP_NAME_ENTRY(%s,%s);\n", tolower($2), $2); }'
+echo
+echo 'static struct attribute *capability_name_attrs[] = {'
+cat $CAPHEAD | egrep "$REGEXP" \
+    | awk '{ printf("\t&%s_name_attr.attr,\n", tolower($2)); } END { print "\tNULL," }'
+echo '};'
+echo
+cat $CAPHEAD | egrep "$REGEXP" \
+    | awk '{ printf("SYSFS_CAP_CODE_ENTRY(%s,%s);\n", tolower($2), $2); }'
+echo
+echo 'static struct attribute *capability_code_attrs[] = {'
+cat
Re: [PATCH 06/37] Security: Separate task security context from task_struct
--- David Howells [EMAIL PROTECTED] wrote:

> Separate the task security context from task_struct.  At this point, the
> security data is temporarily embedded in the task_struct with two pointers
> pointing to it.
> ...
> diff --git a/security/smack/smack_access.c b/security/smack/smack_access.c
> index f6b5f6e..722752f 100644
> --- a/security/smack/smack_access.c
> +++ b/security/smack/smack_access.c
> @@ -164,7 +164,7 @@ int smk_curacc(char *obj_label, u32 mode)
>  {
>  	int rc;
> 
> -	rc = smk_access(current->security, obj_label, mode);
> +	rc = smk_access(current->act_as->security, obj_label, mode);
>  	if (rc == 0)
>  		return 0;
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index 25cbfa3..a49d94f 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -102,7 +102,8 @@ static int smack_ptrace(struct task_struct *ptp, struct task_struct *ctp)
>  	if (rc != 0)
>  		return rc;
> 
> -	rc = smk_access(ptp->security, ctp->security, MAY_READWRITE);
> +	rc = smk_access(ptp->act_as->security, ctp->sec->security,
> +			MAY_READWRITE);
>  	if (rc != 0 && __capable(ptp, CAP_MAC_OVERRIDE))
>  		return 0;
> @@ -120,7 +121,7 @@ static int smack_ptrace(struct task_struct *ptp, struct task_struct *ctp)
>  static int smack_syslog(int type)
>  {
>  	int rc;
> -	char *sp = current->security;
> +	char *sp = current->act_as->security;
> 
>  	rc = cap_syslog(type);
>  	if (rc != 0)
> @@ -359,7 +360,7 @@ static int smack_sb_umount(struct vfsmount *mnt, int flags)
>   */
>  static int smack_inode_alloc_security(struct inode *inode)
>  {
> -	inode->i_security = new_inode_smack(current->security);
> +	inode->i_security = new_inode_smack(current->act_as->security);
> 
>  	if (inode->i_security == NULL)
>  		return -ENOMEM;
>  	return 0;
> @@ -777,7 +778,7 @@ static int smack_file_permission(struct file *file, int mask)
>   */
>  static int smack_file_alloc_security(struct file *file)
>  {
> -	file->f_security = current->security;
> +	file->f_security = current->act_as->security;
>  	return 0;
>  }
> @@ -875,7 +876,7 @@ static int smack_file_fcntl(struct file *file, unsigned int cmd,
>   */
>  static int smack_file_set_fowner(struct file *file)
>  {
> -	file->f_security = current->security;
> +	file->f_security = current->act_as->security;
>  	return 0;
>  }
> @@ -900,7 +901,7 @@ static int smack_file_send_sigiotask(struct task_struct *tsk,
>  	 * struct fown_struct is never outside the context of a struct file
>  	 */
>  	file = container_of(fown, struct file, f_owner);
> -	rc = smk_access(file->f_security, tsk->security, MAY_WRITE);
> +	rc = smk_access(file->f_security, tsk->sec->security, MAY_WRITE);
>  	if (rc != 0 && __capable(tsk, CAP_MAC_OVERRIDE))
>  		return 0;
>  	return rc;
> @@ -943,7 +944,7 @@ static int smack_file_receive(struct file *file)
>   */
>  static int smack_task_alloc_security(struct task_struct *tsk)
>  {
> -	tsk->security = current->security;
> +	tsk->sec->security = current->act_as->security;
>  	return 0;
>  }
> @@ -958,7 +959,7 @@ static int smack_task_alloc_security(struct task_struct *tsk)
>   */
>  static void smack_task_free_security(struct task_struct *task)
>  {
> -	task->security = NULL;
> +	task->sec->security = NULL;
>  }
> @@ -970,7 +971,7 @@ static void smack_task_free_security(struct task_struct *task)
>   */
>  static int smack_task_setpgid(struct task_struct *p, pid_t pgid)
>  {
> -	return smk_curacc(p->security, MAY_WRITE);
> +	return smk_curacc(p->sec->security, MAY_WRITE);
>  }
> @@ -981,7 +982,7 @@ static int smack_task_setpgid(struct task_struct *p, pid_t pgid)
>   */
>  static int smack_task_getpgid(struct task_struct *p)
>  {
> -	return smk_curacc(p->security, MAY_READ);
> +	return smk_curacc(p->sec->security, MAY_READ);
>  }
> @@ -992,7 +993,7 @@ static int smack_task_getpgid(struct task_struct *p)
>   */
>  static int smack_task_getsid(struct task_struct *p)
>  {
> -	return smk_curacc(p->security, MAY_READ);
> +	return smk_curacc(p->sec->security, MAY_READ);
>  }
> @@ -1004,7 +1005,7 @@ static int smack_task_getsid(struct task_struct *p)
>   */
>  static void smack_task_getsecid(struct task_struct *p, u32 *secid)
>  {
> -	*secid = smack_to_secid(p->security);
> +	*secid = smack_to_secid(p->sec->security);
>  }
> @@ -1016,7 +1017,7 @@ static void smack_task_getsecid(struct task_struct *p, u32 *secid)
>   */
>  static int smack_task_setnice(struct task_struct *p, int nice)
>  {
> -	return smk_curacc(p->security, MAY_WRITE);
> +	return smk_curacc(p->sec->security, MAY_WRITE);
>  }
> @@ -1028,7 +1029,7 @@ static int smack_task_setnice(struct task_struct *p, int nice)
>   */
>  static int smack_task_setioprio(struct task_struct *p, int ioprio)
>  {
> -	return smk_curacc(p->security, MAY_WRITE);
> +	return smk_curacc(p->sec->security, MAY_WRITE);
>  }
> @@ -1039,7 +1040,7 @@
Re: [PATCH 07/37] Security: De-embed task security record from task and use refcounting
--- David Howells [EMAIL PROTECTED] wrote:

> Remove the temporarily embedded task security record from task_struct.
> Instead it is made to dangle from the task_struct::sec and
> task_struct::act_as pointers with references counted for each.
> ...
> The LSM hooks for dealing with task security are modified to deal with the
> task security struct directly rather than going via the task_struct as
> appropriate.
> ...
> diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
> index a49d94f..dbce607 100644
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -957,9 +957,22 @@ static int smack_task_alloc_security(struct task_struct *tsk)
>   * points to an immutable list. The blobs never go away.
>   * There is no leak here.
>   */
> -static void smack_task_free_security(struct task_struct *task)
> +static void smack_task_free_security(struct task_security *sec)
>  {
> -	task->sec->security = NULL;
> +	sec->security = NULL;
> +}
> +
> +/**
> + * task_dup_security - Duplicate task security
> + * @p points to the task_security struct that has been copied
> + *
> + * Duplicate the security structure currently attached to the p->security
> + * field and attach back to p->security (the pointer itself was copied, so
> + * there's nothing to be done here).
> + */
> +static int smack_task_dup_security(struct task_security *sec)
> +{
> +	return 0;
>  }

Thank you for adding this hook. The comment is helpful.

>  /**
> @@ -2276,17 +2289,17 @@ static int smack_inet_conn_request(struct sock *sk, struct sk_buff *skb,
>  /**
>   * smack_key_alloc - Set the key security blob
>   * @key: object
> - * @tsk: the task associated with the key
> + * @context: the task security associated with the key
>   * @flags: unused
>   *
>   * No allocation required
>   *
>   * Returns 0
>   */
> -static int smack_key_alloc(struct key *key, struct task_struct *tsk,
> +static int smack_key_alloc(struct key *key, struct task_security *context,
>  			   unsigned long flags)
>  {
> -	key->security = tsk->act_as->security;
> +	key->security = context->security;
>  	return 0;
>  }
> @@ -2304,14 +2317,14 @@ static void smack_key_free(struct key *key)
>  /*
>   * smack_key_permission - Smack access on a key
>   * @key_ref: gets to the object
> - * @context: task involved
> + * @context: task security involved
>   * @perm: unused
>   *
>   * Return 0 if the task has read and write to the object,
>   * an error code otherwise
>   */
>  static int smack_key_permission(key_ref_t key_ref,
> -				struct task_struct *context, key_perm_t perm)
> +				struct task_security *context, key_perm_t perm)
>  {
>  	struct key *keyp;
> @@ -2327,10 +2340,10 @@ static int smack_key_permission(key_ref_t key_ref,
>  	/*
>  	 * This should not occur
>  	 */
> -	if (context->act_as->security == NULL)
> +	if (context->security == NULL)
>  		return -EACCES;
> 
> -	return smk_access(context->act_as->security, keyp->security,
> +	return smk_access(context->security, keyp->security,
>  			  MAY_READWRITE);
>  }
>  #endif /* CONFIG_KEYS */
> @@ -2430,6 +2443,7 @@ static struct security_operations smack_ops = {
>  	.task_alloc_security =		smack_task_alloc_security,
>  	.task_free_security =		smack_task_free_security,
> +	.task_dup_security =		smack_task_dup_security,
>  	.task_post_setuid =		cap_task_post_setuid,
>  	.task_setpgid =			smack_task_setpgid,
>  	.task_getpgid =			smack_task_getpgid,

No objections from the Smack side. Thank you.
Casey Schaufler
[EMAIL PROTECTED]
Re: [PATCH 09/37] Security: Allow kernel services to override LSM settings for task actions
--- David Howells [EMAIL PROTECTED] wrote:

> Allow kernel services to override LSM settings appropriate to the actions
> performed by a task by duplicating a security record, modifying it and then
> using task_struct::act_as to point to it when performing operations on
> behalf of a task.
>
> This is used, for example, by CacheFiles which has to transparently access
> the cache on behalf of a process that thinks it is doing, say, NFS accesses
> with a potentially inappropriate (with respect to accessing the cache) set
> of security data.
>
> This patch provides two LSM hooks for modifying a task security record:
>
>  (*) security_kernel_act_as() which allows modification of the security
>      datum with which a task acts on other objects (most notably files).
>
>  (*) security_create_files_as() which allows modification of the security
>      datum that is used to initialise the security data on a file that a
>      task creates.
> ...
> --- a/security/smack/smack_lsm.c
> +++ b/security/smack/smack_lsm.c
> @@ -976,6 +976,36 @@ static int smack_task_dup_security(struct task_security *sec)
>  }
> 
>  /**
> + * smack_task_kernel_act_as - Set the subjective context in a security record
> + * @p points to the task that nominated @secid.
> + * @sec points to the task security record to be modified.
> + * @secid specifies the security ID to be set
> + *
> + * Set the security data for a kernel service.
> + */
> +static int smack_task_kernel_act_as(struct task_struct *p,
> +				    struct task_security *sec, u32 secid)
> +{
> +	return -ENOTSUPP;
> +}
> +
> +/**
> + * smack_task_create_files_as - Set the file creation label in a security record
> + * @p points to the task that nominated @inode.
> + * @sec points to the task security record to be modified.
> + * @inode points to the inode to use as a reference.
> + *
> + * Set the file creation context in a security record to the same as the
> + * objective context of the specified inode
> + */
> +static int smack_task_create_files_as(struct task_struct *p,
> +				      struct task_security *sec,
> +				      struct inode *inode)
> +{
> +	return -ENOTSUPP;
> +}

Hum. ENOTSUPP is not very satisfying, is it? I will have to think on this
a bit.

> +
> +/**
>   * smack_task_setpgid - Smack check on setting pgid
>   * @p: the task object
>   * @pgid: unused
> @@ -2444,6 +2474,8 @@ static struct security_operations smack_ops = {
>  	.task_alloc_security =		smack_task_alloc_security,
>  	.task_free_security =		smack_task_free_security,
>  	.task_dup_security =		smack_task_dup_security,
> +	.task_kernel_act_as =		smack_task_kernel_act_as,
> +	.task_create_files_as =		smack_task_create_files_as,
>  	.task_post_setuid =		cap_task_post_setuid,
>  	.task_setpgid =			smack_task_setpgid,
>  	.task_getpgid =			smack_task_getpgid,

Except for the fact that the hooks don't do anything this looks fine. I'm
not sure that I would want these hooks to do anything; it requires
additional thought to determine if there is a good behavior for them.

Thank you.

Casey Schaufler
[EMAIL PROTECTED]
Re: [PATCH] capabilities: implement per-process securebits
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andrew G. Morgan wrote:
| Serge E. Hallyn wrote:
| | It all looks good to me.
| |
| | Since we've confirmed that wireshark uses capabilities it must be using
| | prctl(PR_SET_KEEPCAPS), so running it might be a good way to verify that
| | your changes to that codepath (with CONFIG_SECURITY_FILE_CAPABILITIES=n)
| | are 100% correct, and might set minds at ease. Is that something you're
| | set up to be able to do?
|
| I guess I need someone to offer an existence proof that this particular
| wireshark code ever worked? (ldd dumpcap | grep libcap). For reference,
| I'm looking at:

Doh! :*)

I had mistaken cap_init() for cap_get_proc() in the wireshark code... I
now see what wireshark is trying to do, and can *confirm* that with the
present patch I do maintain the legacy behavior. :-)

FWIW I've updated capsh in the libcap git tree to add a few more hooks,
and with the following sequence can now verify that keep-caps works as
before. As root:

  # rm -f tcapsh
  # cp capsh tcapsh
  # chown root.root tcapsh
  # chmod u+s tcapsh
  # ls -l tcapsh
  # ./capsh --uid=500 -- -c "./tcapsh --keep=1 \
        --caps=\"cap_net_raw,cap_net_admin=ip\" --uid=500 \
        --caps=\"cap_net_raw,cap_net_admin=pie\" --print"
  # echo $?
  0

The wireshark problem that you have been discussing (in the other
thread) can also be simulated as follows:

  # ./capsh --uid=500 -- -c "./tcapsh --keep=1 \
        --caps=\"cap_net_raw,cap_net_admin=ip\" \
        --uid=500 --forkfor=10 --caps= --print \
        --killit=9 --print"

You might like to re-post your fix for that problem as a stand-alone
patch; I suspect it may be lost in the noise at this point. (You might
also like to update the comment in that fix, since the old comment looks
very stale if you delete the euid==0 check. I think it is safe to simply
say /* legacy signal behavior requires that a user can kill any process
running with their uid */.)

Cheers

Andrew

| wireshark-0.99.7/dumpcap.c:302
|
| void
| relinquish_privs_except_capture(void)
| {
|     /* CAP_NET_ADMIN: Promiscuous mode and a truckload of other
|      *                stuff we don't need (and shouldn't have).
|      * CAP_NET_RAW:   Packet capture (raw sockets).
|      */
|     cap_value_t cap_list[2] = { CAP_NET_ADMIN, CAP_NET_RAW };
|     cap_t caps = cap_init();
|     int cl_len = sizeof(cap_list) / sizeof(cap_value_t);
|
|     if (started_with_special_privs()) {
|         print_caps("Pre drop, pre set");
|         if (prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) == -1) {
|             perror("prctl()");
|         }
|
|         cap_set_flag(caps, CAP_PERMITTED, cl_len, cap_list, CAP_SET);
|         cap_set_flag(caps, CAP_INHERITABLE, cl_len, cap_list, CAP_SET);
|
| [ XXX:AGM since (caps.pE caps.pP) this next line should fail ]
|         if (cap_set_proc(caps)) {
|             perror("capset()");
|         }
|         print_caps("Pre drop, post set");
|     }
|
|     relinquish_special_privs_perm();
|
|     print_caps("Post drop, pre set");
|     cap_set_flag(caps, CAP_EFFECTIVE, cl_len, cap_list, CAP_SET);
|     if (cap_set_proc(caps)) {
|         perror("capset()");
|     }
|     print_caps("Post drop, post set");
|     cap_free(caps);
| }
| #endif /* HAVE_LIBCAP */
|
| My reading of the above code suggests that the application believes that
| it can raise/retain effective capabilities that are not in its permitted
| set.
|
| Browsing back in my git tree all the way back to 'v2.6.12-rc2', the
| following code (cap_capset_check) correctly requires:
|
|  96         /* verify the _new_Effective_ is a subset of the _new_Permitted_ */
|  97         if (!cap_issubset (*effective, *permitted)) {
|  98                 return -EPERM;
|  99         }
|
| so my question is, why should one expect this wireshark code to work? It
| looks wrong to me.
|
| Thanks
|
| Andrew
|
| | -serge
| |
| | Thanks
| |
| | Andrew
|
| From 006ddf6903983dd596e360ab1ab8e537b29fab46 Mon Sep 17 00:00:00 2001
| From: Andrew G. Morgan [EMAIL PROTECTED]
| Date: Mon, 18 Feb 2008 15:23:28 -0800
| Subject: [PATCH] Implement per-process securebits
|
| [This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES
|  is enabled at configure time.]
|
| Filesystem capability support makes it possible to do away with
| (set)uid-0 based privilege and use capabilities instead. That is, with
| filesystem support for capabilities but without this present patch,
| it is (conceptually) possible to manage a system with capabilities
| alone and never need to obtain privilege via (set)uid-0.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFHvmeB+bHCR3gb8jsRAknmAKCMw0Qe7uDwtuRE+f3YVmnlE5pK4wCgsv0f
5E6+K9Z0Xp1P74iOlnt221o=
=+sLL
-----END PGP SIGNATURE-----