Re: [PATCH 2/3] exporting capability name/code pairs (final)
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

KaiGai,

I've just tried to build this with a separate obj tree:

    make O=/path.../

The build failed as follows:

      CC      security/dummy.o
      CC      security/inode.o
      CAPS    security/cap_names.h
    /bin/sh: security/../scripts/mkcapnames.sh: No such file or directory
    make[3]: *** [security/cap_names.h] Error 127
    make[2]: *** [security] Error 2
    make[1]: *** [sub-make] Error 2
    make: *** [all] Error 2

When I replace $(src)/../scripts/... with $(srctree)/scripts/... I get it
to compile, but (x86_64) I see this warning fly by:

      CC      security/commoncap.o
    /home/morgan/gits/linux-2.6/security/commoncap.c: In function `capability_name_show':
    /home/morgan/gits/linux-2.6/security/commoncap.c:652: warning: cast from pointer to integer of different size

Cheers

Andrew

Kohei KaiGai wrote:
| [2/3] Exporting capability code/name pairs
|
| This patch enables to export code/name pairs of capabilities the running
| kernel supported.
|
| A newer kernel sometimes adds new capabilities, like CAP_MAC_ADMIN
| at 2.6.25. However, we have no interface to disclose what capabilities
| are supported on the running kernel. Thus, we have to maintain libcap
| version in appropriate one synchronously.
|
| This patch enables libcap to collect the list of capabilities at run time,
| and provide them for users. It helps to improve portability of library.
|
| It exports these information as regular files under /sys/kernel/capability.
| The numeric node exports its name, the symbolic node exports its code.
|
| Please consider to put this patch on the queue of 2.6.25.
|
| Thanks,
|
| BEGIN EXAMPLE
| [EMAIL PROTECTED] ~]$ ls -R /sys/kernel/capability/
| /sys/kernel/capability/:
| codes  names  version
|
| /sys/kernel/capability/codes:
| 0   10  12  14  16  18  2   21  23  25  27  29  30  32  4  6  8
| 1   11  13  15  17  19  20  22  24  26  28  3   31  33  5  7  9
|
| /sys/kernel/capability/names:
| cap_audit_control    cap_kill              cap_net_raw     cap_sys_nice
| cap_audit_write      cap_lease             cap_setfcap     cap_sys_pacct
| cap_chown            cap_linux_immutable   cap_setgid      cap_sys_ptrace
| cap_dac_override     cap_mac_admin         cap_setpcap     cap_sys_rawio
| cap_dac_read_search  cap_mac_override      cap_setuid      cap_sys_resource
| cap_fowner           cap_mknod             cap_sys_admin   cap_sys_time
| cap_fsetid           cap_net_admin         cap_sys_boot    cap_sys_tty_config
| cap_ipc_lock         cap_net_bind_service  cap_sys_chroot
| cap_ipc_owner        cap_net_broadcast     cap_sys_module
| [EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/version
| 0x20071026
| [EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/codes/30
| cap_audit_control
| [EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/names/cap_sys_pacct
| 20
| [EMAIL PROTECTED] ~]$
| END EXAMPLE
| --
|
| Signed-off-by: KaiGai Kohei [EMAIL PROTECTED]
| --
|  Documentation/ABI/testing/sysfs-kernel-capability |   23 +
|  scripts/mkcapnames.sh                             |   44 +
|  security/Makefile                                 |    9 ++
|  security/commoncap.c                              |   99 +
|  4 files changed, 175 insertions(+), 0 deletions(-)
|
| diff --git a/Documentation/ABI/testing/sysfs-kernel-capability b/Documentation/ABI/testing/sysfs-kernel-capability
| index e69de29..402ef06 100644
| --- a/Documentation/ABI/testing/sysfs-kernel-capability
| +++ b/Documentation/ABI/testing/sysfs-kernel-capability
| @@ -0,0 +1,23 @@
| +What:		/sys/kernel/capability
| +Date:		Feb 2008
| +Contact:	KaiGai Kohei [EMAIL PROTECTED]
| +Description:
| +	The entries under /sys/kernel/capability are used to export
| +	the list of capabilities the running kernel supported.
| +
| +	- /sys/kernel/capability/version
| +	  returns the most preferable version number for the
| +	  running kernel.
| +	  e.g) $ cat /sys/kernel/capability/version
| +	       0x20071026
| +
| +	- /sys/kernel/capability/code/numerical representation
| +	  returns its symbolic representation, on reading.
| +	  e.g) $ cat /sys/kernel/capability/codes/30
| +	       cap_audit_control
| +
| +	- /sys/kernel/capability/name/symbolic representation
| +	  returns its numerical representation, on reading.
| +	  e.g) $ cat /sys/kernel/capability/names/cap_sys_pacct
| +	       20
| +
| diff --git a/scripts/mkcapnames.sh b/scripts/mkcapnames.sh
| index e69de29..5d36d52 100644
| --- a/scripts/mkcapnames.sh
| +++ b/scripts/mkcapnames.sh
| @@ -0,0 +1,44 @@
| +#!/bin/sh
| +
| +#
| +# generate a cap_names.h file from include/linux/capability.h
| +#
| +
| +CAPHEAD=`dirname $0`/../include/linux/capability.h
| +REGEXP='^#define CAP_[A-Z_]+[ \t]+[0-9]+$'
| +NUMCAP=`cat $CAPHEAD
Re: [PATCH 2/3] exporting capability name/code pairs (final)
Kohei KaiGai wrote:
> [2/3] Exporting capability code/name pairs
>
> This patch enables to export code/name pairs of capabilities the running
> kernel supported.

supported or supports ?

> [...]
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-capability b/Documentation/ABI/testing/sysfs-kernel-capability
> index e69de29..402ef06 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-capability
> +++ b/Documentation/ABI/testing/sysfs-kernel-capability
> @@ -0,0 +1,23 @@
> +What:		/sys/kernel/capability
> +Date:		Feb 2008
> +Contact:	KaiGai Kohei [EMAIL PROTECTED]
> +Description:
> +	The entries under /sys/kernel/capability are used to export
> +	the list of capabilities the running kernel supported.

ditto, supported or supports ?

> +
> +	- /sys/kernel/capability/version
> +	  returns the most preferable version number for the
> +	  running kernel.
> +	  e.g) $ cat /sys/kernel/capability/version
> +	       0x20071026
> +
> +	- /sys/kernel/capability/code/numerical representation
> +	  returns its symbolic representation, on reading.
> +	  e.g) $ cat /sys/kernel/capability/codes/30
> +	       cap_audit_control
> +
> +	- /sys/kernel/capability/name/symbolic representation
> +	  returns its numerical representation, on reading.
> +	  e.g) $ cat /sys/kernel/capability/names/cap_sys_pacct
> +	       20
> +
> diff --git a/scripts/mkcapnames.sh b/scripts/mkcapnames.sh
> index e69de29..5d36d52 100644
> --- a/scripts/mkcapnames.sh
> +++ b/scripts/mkcapnames.sh
> @@ -0,0 +1,44 @@
> +#!/bin/sh
> +
> +#
> +# generate a cap_names.h file from include/linux/capability.h
> +#
> +
> +CAPHEAD=`dirname $0`/../include/linux/capability.h
> +REGEXP='^#define CAP_[A-Z_]+[ \t]+[0-9]+$'
> +NUMCAP=`cat $CAPHEAD | egrep -c $REGEXP`
> +
> +echo '#ifndef CAP_NAMES_H'
> +echo '#define CAP_NAMES_H'
> +echo
> +echo '/*'
> +echo ' * Do NOT edit this file directly.'
> +echo ' * This file is generated from include/linux/capability.h automatically'
> +echo ' */'
> +echo
> +echo '#if !defined(SYSFS_CAP_NAME_ENTRY) || !defined(SYSFS_CAP_CODE_ENTRY)'
> +echo '#error cap_names.h should be included from security/capability.c'
> +echo '#else'
> +echo "#if $NUMCAP != CAP_LAST_CAP + 1"
> +echo '#error mkcapnames.sh cannot collect capabilities correctly'
> +echo '#else'
> +cat $CAPHEAD | egrep $REGEXP \
> +    | awk '{ printf("SYSFS_CAP_NAME_ENTRY(%s,%s);\n", tolower($2), $2); }'
> +echo
> +echo 'static struct attribute *capability_name_attrs[] = {'
> +cat $CAPHEAD | egrep $REGEXP \
> +    | awk '{ printf("\t&%s_name_attr.attr,\n", tolower($2)); } END { print "\tNULL," }'
> +echo '};'
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> > The way the client works is like this:
>
> Thanks for the excellent ascii art, that cleared up the confusion right
> away.

You know what they say about pictures... :-)

> > What are you trying to do exactly?  Are you actually playing with it, or
> > just looking at the numbers I've produced?
>
> Trying to see if you are offering enough of a win to justify testing it,
> and if that works out, then going shopping for a bin of rotten vegetables
> to throw at your design, which I hope you will perceive as useful.

One thing that you have to remember: my test setup is pretty much the
worst case for showing the need for caching to improve performance.
There's a single client and a single server, they've got GigE networking
between them that has very little other load, and the server has
sufficient memory to hold the entire test data set.

> From the numbers you have posted I think you are missing some basic
> efficiencies that could take this design from the sorta-ok zone to wow!

Not really; it's just that this lash-up could be considered designed to
show local caching in the worst light.

> But looking up the object in the cache should be nearly free - much less
> than a microsecond per block.

The problem is that you have to do a database lookup of some sort,
possibly involving several synchronous disk operations.

CacheFiles does a disk lookup by taking the key given to it by NFS,
turning it into a set of file and directory names, and doing a short
pathwalk to the target cache file.  Throwing in extra indices won't
necessarily help.  What matters is how quick the backing filesystem is
at doing lookups.  As it turns out, Ext3 is a fair bit better than BTRFS
when the disk cache is cold.

The metadata problem is quite a tricky one since it increases with the
number of files you're dealing with.
> > As things stand in my patches, when NFS, for example, wants to access a
> > new inode, it first has to go to the server to look up the NFS file
> > handle, and only then can it go to the cache to find out if there's a
> > matching object in the cache.
>
> So without the persistent cache it can omit the LOOKUP and just send the
> filehandle as part of the READ?

What 'it'?  Note that to get the filehandle, you have to do a LOOKUP op.
With the cache, we could actually cache the results of lookups that we've
done, but we don't know that the results are still valid without going to
the server :-/

AFS has a way around that - it versions its vnode (inode) IDs.

> > The reason my client going to my server is so quick is that the server
> > has the dcache and the pagecache preloaded, so that across-network
> > lookup operations are really, really quick, as compared to the
> > synchronous slogging of the local disk to find the cache object.
>
> Doesn't that just mean you have to preload the lookup table for the
> persistent cache so you can determine whether you are caching the data
> for a filehandle without going to disk?

Where lookup table == dcache, that would be good, yes.  cachefilesd
prescans all the files in the cache, which ought to do just that, but it
doesn't seem to be very effective.  I'm not sure why.

> > I can probably improve this a little by pre-loading the subindex
> > directories (hash tables) that I use to reduce the directory size in
> > the cache, but I don't know by how much.
>
> Ah, I should have read ahead.  I think the correct answer is "a lot".

Quite possibly.  It'll allow me to dispense with at least one fs lookup
call per cache object request call.

> Your big can't-get-there-from-here is the round trip to the server to
> determine whether you should read from the local cache.  Got any ideas?

I'm not sure what you mean.  Your statement should probably read "... to
determine _what_ you should read from the local cache".

> And where is the Trond-meister in all of this?

Keeping quiet as far as I can tell.
David
--
To unsubscribe from this list: send the line "unsubscribe linux-security-module"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 00/37] Permit filesystem local caching
On Thursday 21 February 2008, David Howells wrote:
> David Howells <[EMAIL PROTECTED]> wrote:
>
> > > Have you got before/after benchmark results?
> >
> > See attached.
>
> Attached here are results using BTRFS (patched so that it'll work at
> all) rather than Ext3 on the client on the partition backing the cache.

Thanks for trying this.  Of course, I'll ask you to try again with the
latest v0.13 code; it has a number of optimizations, especially for CPU
usage.

> Note that I didn't bother redoing the tests that didn't involve a cache,
> as the choice of filesystem backing the cache should have no bearing on
> the result.
>
> Generally, completely cold caches shouldn't show much variation as all
> the writing can be done completely asynchronously, provided the client
> doesn't fill its RAM.
>
> The interesting case is where the disk cache is warm, but the pagecache
> is cold (ie: just after a reboot after filling the caches).  Here, for
> the two big files case, BTRFS appears quite a bit better than Ext3,
> showing a 21% reduction in time for the smaller case and a 13% reduction
> for the larger case.

I'm afraid I don't have a good handle on the filesystem operations that
result from this workload.  Are we reading from the FS to fill the NFS
page cache?

> For the many small/medium files case, BTRFS performed significantly
> better (15% reduction in time) in the case where the caches were
> completely cold.  I'm not sure why, though - perhaps because it doesn't
> execute a write_begin() stage during the write_one_page() call and thus
> doesn't go allocating disk blocks to back the data, but instead
> allocates them later.

If your write_one_page call does parts of btrfs_file_write, you'll get
delayed allocation for anything bigger than 8k by default.  <= 8k will
get packed into the btree leaves.

> More surprising is that BTRFS performed significantly worse (15%
> increase in time) in the case where the cache on disk was fully
> populated and then the machine had been rebooted to clear the
> pagecaches.

Which FS operations are included here?
Finding all the files, or just an unmount?  Btrfs defrags metadata in the
background, and unmount has to wait for that defrag to finish.

Thanks again,
Chris
Re: [PATCH 2/3] exporting capability name/code pairs (final)
On Fri, Feb 22, 2008 at 06:45:32PM +0900, Kohei KaiGai wrote:
> I believe it is correct assumption that long type and pointers have same
> width in the linux kernel.
> Please tell me, if it is wrong.

That is correct, it is one of the assumptions that is safe to make.
But you should fix the compiler warning :)

thanks,

greg k-h
Re: [PATCH 00/37] Permit filesystem local caching
Chris Mason <[EMAIL PROTECTED]> wrote:

> I'm afraid I don't have a good handle on the filesystem operations that
> result from this workload.  Are we reading from the FS to fill the NFS
> page cache?

I'm not sure what you're asking.

When the cache is cold, we determine that we can't read from the cache
very quickly.  We then read data from the server and, in the background,
create the metadata in the cache and store the data to it (by copying
netfs pages to backingfs pages).

When the cache is warm, we read the data from the cache, copying the data
from the backingfs pages to the netfs pages.  We use bmap() to ascertain
that there is data to be read; otherwise we detect a hole and fall back
to reading from the server.

Looking up a cache object involves a sequence of lookup() ops and
getxattr() ops on the backingfs.  Should an object not exist, we defer
creation of that object to a background thread and do lookups(), mkdirs(),
setxattrs() and a create() to manufacture the object.

We read data from an object by calling readpages() on the backingfs to
bring the data into the pagecache.  We monitor the PG_lock bits to find
out when each page is read or has completed with an error.

Writing pages to the cache is done completely in the background.
PG_fscache_write is set on a page when it is handed to fscache for
storage, then at some point a background thread wakes up and calls
write_one_page() in the backingfs to write that page to the cache file.
At the moment, this copies the data into a backingfs page which is then
marked PG_dirty, and the VM writes it out in the usual way.
> > More surprising is that BTRFS performed significantly worse (15%
> > increase in time) in the case where the cache on disk was fully
> > populated and then the machine had been rebooted to clear the
> > pagecaches.
>
> Which FS operations are included here?  Finding all the files or just an
> unmount?  Btrfs defrags metadata in the background, and unmount has to
> wait for that defrag to finish.

BTRFS might not be doing any writing at all here - apart from local
atimes (used by cache culling), that is.  What it does have to do is lots
of lookups, reads and getxattrs, all of which are synchronous.

David
Re: [PATCH 00/37] Permit filesystem local caching
David Howells <[EMAIL PROTECTED]> wrote:

> > Have you got before/after benchmark results?
>
> See attached.
>
> Attached here are results using BTRFS (patched so that it'll work at
> all) rather than Ext3 on the client on the partition backing the cache.

And here are XFS results.  Tuning XFS makes a *really* big difference for
the lots-of-small/medium-files-being-tarred case.  However, in general,
BTRFS is much better.

David
---
=== FEW BIG FILES TEST ON XFS ===

Completely cold caches:

    [EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
    real    0m2.286s
    user    0m0.000s
    sys     0m1.828s
    [EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
    real    0m4.228s
    user    0m0.000s
    sys     0m1.360s

Warm NFS pagecache:

    [EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
    real    0m0.058s
    user    0m0.000s
    sys     0m0.060s
    [EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
    real    0m0.122s
    user    0m0.000s
    sys     0m0.120s

Warm XFS pagecache, cold NFS pagecache:

    [EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
    real    0m0.181s
    user    0m0.000s
    sys     0m0.180s
    [EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
    real    0m1.034s
    user    0m0.000s
    sys     0m0.404s

Warm on-disk cache, cold pagecaches:

    [EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
    real    0m1.540s
    user    0m0.000s
    sys     0m0.256s
    [EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
    real    0m3.003s
    user    0m0.000s
    sys     0m0.532s

=== MANY SMALL/MEDIUM FILE READING TEST ON XFS ===

Completely cold caches:

    [EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
    real    4m56.827s
    user    0m0.180s
    sys     0m6.668s

Warm NFS pagecache:

    [EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
    real    0m15.084s
    user    0m0.212s
    sys     0m5.008s

Warm XFS pagecache, cold NFS pagecache:

    [EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
    real    0m13.547s
    user    0m0.220s
    sys     0m5.652s

Warm on-disk cache, cold pagecaches:

    [EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
    real    4m36.316s
    user    0m0.148s
    sys     0m4.440s

=== MANY SMALL/MEDIUM FILE READING TEST ON AN OPTIMISED XFS ===

    mkfs.xfs -d agcount=4 -l size=128m,version=2 /dev/sda6

Completely cold caches:

    [EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
    real    3m44.033s
    user    0m0.248s
    sys     0m6.632s

Warm on-disk cache, cold pagecaches:

    [EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
    real    3m8.582s
    user    0m0.108s
    sys     0m3.420s
RE: [PATCH 00/37] Permit filesystem local caching
Well, the AFS paper that was referenced earlier was written around the
time of 10BASE-T and 100BASE-T.  Local disk caching worked well then.
There should also be some papers at CITI about disk caching over slower
connections, and about disconnected operation (which should still be
applicable today).  There are still winners from local disk caching, but
their numbers have been reduced.  Server load reduction should be a win.
I'm not sure if it's worth it from a security/manageability standpoint,
but I haven't looked that closely at David's code.

One area that you might want to look at is WAN performance.  When RPC
RTT goes up, ordinary NFS performance goes down.  This tends to get
overlooked by the machine-room folks.  (There are several tools out
there that can introduce delay in an IP packet stream and emulate WAN
RTTs.)

Just a thought,

rick
Re: [PATCH 00/37] Permit filesystem local caching
Chris Mason <[EMAIL PROTECTED]> wrote:

> Thanks for trying this, of course I'll ask you to try again with the
> latest v0.13 code, it has a number of optimizations especially for CPU
> usage.

Here you go.  The numbers are very similar.

David

=== FEW BIG FILES TEST ON BTRFS v0.13 ===

Completely cold caches:

    [EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
    real    0m2.202s
    user    0m0.000s
    sys     0m1.716s
    [EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
    real    0m4.212s
    user    0m0.000s
    sys     0m0.896s

Warm BTRFS pagecache, cold NFS pagecache:

    [EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
    real    0m0.197s
    user    0m0.000s
    sys     0m0.192s
    [EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
    real    0m0.376s
    user    0m0.000s
    sys     0m0.372s

Warm on-disk cache, cold pagecaches:

    [EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
    real    0m1.543s
    user    0m0.004s
    sys     0m1.448s
    [EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
    real    0m3.111s
    user    0m0.000s
    sys     0m2.856s

=== MANY SMALL/MEDIUM FILE READING TEST ON BTRFS v0.13 ===

Completely cold caches:

    [EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
    real    0m31.575s
    user    0m0.176s
    sys     0m6.316s

Warm BTRFS pagecache, cold NFS pagecache:

    [EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
    real    0m16.081s
    user    0m0.164s
    sys     0m5.528s

Warm on-disk cache, cold pagecaches:

    [EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
    real    2m15.245s
    user    0m0.064s
    sys     0m2.808s
Re: [PATCH 00/37] Permit filesystem local caching
Daniel Phillips <[EMAIL PROTECTED]> wrote:

> I am eventually going to suggest cutting the backing filesystem entirely
> out of the picture,

You still need a database to manage the cache.  A filesystem such as Ext3
makes a very handy database for four reasons:

 (1) It exists and works.

 (2) It has a well-defined interface within the kernel.

 (3) I can place my cache on, say, my root partition on my laptop.  I
     don't have to dedicate a partition to the cache.

 (4) Userspace cache management tools (such as cachefilesd) have an
     already existing interface to use: rmdir, unlink, open, getdents,
     etc.

I do have a cache-on-blockdev thing, but it's basically a wandering tree
filesystem inside.  It is, or was, much faster than ext3 on a clean
cache, but it degrades horribly over time because my free space
reclamation sucks - it gradually randomises the block allocation sequence
over time.

So, what would you suggest instead of a backing filesystem?

> I really do not like idea of force fitting this cache into a generic vfs
> model.  Sun was collectively smoking some serious crack when they cooked
> that one up.  But there is also the ageless principle "isness is more
> important than niceness".

What do you mean?  I'm not doing it like Sun.  The cache is a side path
from the netfs.  It should be transparent to the user, the VFS and the
server.

The only place it might not be transparent is that you might have to
instruct the netfs mount to use the cache.  I'd prefer to do it some
other way than passing parameters to mount, though, as (1) this causes
fun with NIS-distributed automounter maps, and (2) people are asking for
a finer grain of control than per-mountpoint.  Unfortunately, I can't
seem to find a way to do it that's acceptable to Al.

> Which would require a change to NFS, not an option because you hope to
> work with standard servers?  Of course with years to think about this,
> the required protocol changes were put into v4.  Not.

I don't think there's much I can do about NFS.
It requires the filesystem the NFS server is exporting to have inode
uniquifiers, which are then incorporated into the file handle.  I don't
think the NFS protocol itself needs to change to support this.

> Have you completely exhausted optimization ideas for the file handle
> lookup?

No, but there aren't many.  CacheFiles doesn't actually do very much, and
it's hard to reduce that "not very much".  The most obvious thing is to
prepopulate the dcache, but that's at the expense of memory usage.

Actually, if I cache the name => FH mapping I used last time, I can make
a start on looking up in the cache whilst simultaneously accessing the
server.  If what's on the server has changed, I can ditch the speculative
cache lookup I was making and start a new cache lookup.  However, storing
directory entries has penalties of its own, though it'll be necessary if
we want to do disconnected operation.

> > Where lookup table == dcache, that would be good, yes.  cachefilesd
> > prescans all the files in the cache, which ought to do just that, but
> > it doesn't seem to be very effective.  I'm not sure why.
>
> RCU?  Anyway, it is something to be tracked down and put right.

cachefilesd runs in userspace.  It's possible it isn't doing enough to
preload all the metadata.

> What I tried to say.  So still... got any ideas?  That extra synchronous
> network round trip is a killer.  Can it be made streaming/async to keep
> throughput healthy?

That's a per-netfs thing.  With the test rig I've got, it's going to the
on-disk cache that's the killer.  Going over the network is much faster.
See the results I posted.  For the tarball load, and using Ext3 to back
the cache:

	Cold NFS cache, no disk cache:		0m22.734s
	Warm on-disk cache, cold pagecaches:	1m54.350s

The problem is that reading using tar is a worst-case workload for this:
everything it does is pretty much completely synchronous.
One thing that might help is if things like tar and find could be made to
use fadvise() on directories to hint to the filesystem (NFS, AFS,
whatever) that they're going to access every file in those directories.

Certainly AFS could make use of that: the directory is read as a file,
and the netfs then parses the file to get a list of vnode IDs that that
directory points to.  It could then do bulk status fetch operations to
instantiate the inodes 50 at a time.

I don't know whether NFS could use it.  Someone like Trond or SteveD or
Chuck would have to answer that.

David