Re: [PATCH 2/3] exporting capability name/code pairs (final)

2008-02-22 Thread Andrew G. Morgan


KaiGai,

I've just tried to build this with a separate obj tree: make O=/path.../
~ the build failed as follows:

~  CC      security/dummy.o
~  CC      security/inode.o
~  CAPS    security/cap_names.h
/bin/sh: security/../scripts/mkcapnames.sh: No such file or directory
make[3]: *** [security/cap_names.h] Error 127
make[2]: *** [security] Error 2
make[1]: *** [sub-make] Error 2
make: *** [all] Error 2

when I replace $(src)/../scripts/... with $(srctree)/scripts/... I get
it to compile, but (x86_64) I see this warning fly by:

~  CC  security/commoncap.o
/home/morgan/gits/linux-2.6/security/commoncap.c: In function
`capability_name_show':
/home/morgan/gits/linux-2.6/security/commoncap.c:652: warning: cast from
pointer to integer of different size

Cheers

Andrew

Kohei KaiGai wrote:
| [2/3] Exporting capability code/name pairs
|
| This patch enables to export code/name pairs of capabilities the running
| kernel supported.
|
| A newer kernel sometimes adds new capabilities, like CAP_MAC_ADMIN
| at 2.6.25. However, we have no interface to disclose what capabilities
| are supported on the running kernel. Thus, we have to maintain libcap
| version in appropriate one synchronously.
|
| This patch enables libcap to collect the list of capabilities at run time,
| and provide them for users. It helps to improve portability of library.
|
| It exports these information as regular files under /sys/kernel/capability.
| The numeric node exports its name, the symbolic node exports its code.
|
| Please consider to put this patch on the queue of 2.6.25.
|
| Thanks,
|
|  BEGIN EXAMPLE 
| [EMAIL PROTECTED] ~]$ ls -R /sys/kernel/capability/
| /sys/kernel/capability/:
| codes  names  version
|
| /sys/kernel/capability/codes:
| 0  10  12  14  16  18  2   21  23  25  27  29  30  32  4  6  8
| 1  11  13  15  17  19  20  22  24  26  28  3   31  33  5  7  9
|
| /sys/kernel/capability/names:
| cap_audit_control    cap_kill              cap_net_raw       cap_sys_nice
| cap_audit_write      cap_lease             cap_setfcap       cap_sys_pacct
| cap_chown            cap_linux_immutable   cap_setgid        cap_sys_ptrace
| cap_dac_override     cap_mac_admin         cap_setpcap       cap_sys_rawio
| cap_dac_read_search  cap_mac_override      cap_setuid        cap_sys_resource
| cap_fowner           cap_mknod             cap_sys_admin     cap_sys_time
| cap_fsetid           cap_net_admin         cap_sys_boot      cap_sys_tty_config
| cap_ipc_lock         cap_net_bind_service  cap_sys_chroot
| cap_ipc_owner        cap_net_broadcast     cap_sys_module
| [EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/version
| 0x20071026
| [EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/codes/30
| cap_audit_control
| [EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/names/cap_sys_pacct
| 20
| [EMAIL PROTECTED] ~]$
|  END EXAMPLE --
|
| Signed-off-by: KaiGai Kohei [EMAIL PROTECTED]
| --
|  Documentation/ABI/testing/sysfs-kernel-capability |   23 +
|  scripts/mkcapnames.sh |   44 +
|  security/Makefile |9 ++
|  security/commoncap.c  |   99 +
|  4 files changed, 175 insertions(+), 0 deletions(-)
|
| diff --git a/Documentation/ABI/testing/sysfs-kernel-capability b/Documentation/ABI/testing/sysfs-kernel-capability
| index e69de29..402ef06 100644
| --- a/Documentation/ABI/testing/sysfs-kernel-capability
| +++ b/Documentation/ABI/testing/sysfs-kernel-capability
| @@ -0,0 +1,23 @@
| +What:       /sys/kernel/capability
| +Date:       Feb 2008
| +Contact:    KaiGai Kohei [EMAIL PROTECTED]
| +Description:
| + The entries under /sys/kernel/capability are used to export
| + the list of capabilities the running kernel supported.
| +
| + - /sys/kernel/capability/version
| +   returns the most preferable version number for the
| +   running kernel.
| +   e.g) $ cat /sys/kernel/capability/version
| +0x20071026
| +
| + - /sys/kernel/capability/code/<numerical representation>
| +   returns its symbolic representation, on reading.
| +   e.g) $ cat /sys/kernel/capability/codes/30
| +cap_audit_control
| +
| + - /sys/kernel/capability/name/<symbolic representation>
| +   returns its numerical representation, on reading.
| +   e.g) $ cat /sys/kernel/capability/names/cap_sys_pacct
| +20
| +
| diff --git a/scripts/mkcapnames.sh b/scripts/mkcapnames.sh
| index e69de29..5d36d52 100644
| --- a/scripts/mkcapnames.sh
| +++ b/scripts/mkcapnames.sh
| @@ -0,0 +1,44 @@
| +#!/bin/sh
| +
| +#
| +# generate a cap_names.h file from include/linux/capability.h
| +#
| +
| +CAPHEAD=`dirname $0`/../include/linux/capability.h
| +REGEXP='^#define CAP_[A-Z_]+[ \t]+[0-9]+$'
| +NUMCAP=`cat $CAPHEAD 

Re: [PATCH 2/3] exporting capability name/code pairs (final)

2008-02-22 Thread Li Zefan
Kohei KaiGai wrote:
 [2/3] Exporting capability code/name pairs
 
 This patch enables to export code/name pairs of capabilities the running
 kernel supported.
 

"supported" or "supports"?

 A newer kernel sometimes adds new capabilities, like CAP_MAC_ADMIN
 at 2.6.25. However, we have no interface to disclose what capabilities
 are supported on the running kernel. Thus, we have to maintain libcap
 version in appropriate one synchronously.
 
 This patch enables libcap to collect the list of capabilities at run time,
 and provide them for users. It helps to improve portability of library.
 
 It exports these information as regular files under /sys/kernel/capability.
 The numeric node exports its name, the symbolic node exports its code.
 
 Please consider to put this patch on the queue of 2.6.25.
 
 Thanks,
 
  BEGIN EXAMPLE 
 [EMAIL PROTECTED] ~]$ ls -R /sys/kernel/capability/
 /sys/kernel/capability/:
 codes  names  version
 
 /sys/kernel/capability/codes:
 0  10  12  14  16  18  2   21  23  25  27  29  30  32  4  6  8
 1  11  13  15  17  19  20  22  24  26  28  3   31  33  5  7  9
 
 /sys/kernel/capability/names:
 cap_audit_control    cap_kill              cap_net_raw       cap_sys_nice
 cap_audit_write      cap_lease             cap_setfcap       cap_sys_pacct
 cap_chown            cap_linux_immutable   cap_setgid        cap_sys_ptrace
 cap_dac_override     cap_mac_admin         cap_setpcap       cap_sys_rawio
 cap_dac_read_search  cap_mac_override      cap_setuid        cap_sys_resource
 cap_fowner           cap_mknod             cap_sys_admin     cap_sys_time
 cap_fsetid           cap_net_admin         cap_sys_boot      cap_sys_tty_config
 cap_ipc_lock         cap_net_bind_service  cap_sys_chroot
 cap_ipc_owner        cap_net_broadcast     cap_sys_module
 [EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/version
 0x20071026
 [EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/codes/30
 cap_audit_control
 [EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/names/cap_sys_pacct
 20
 [EMAIL PROTECTED] ~]$
  END EXAMPLE --
 
 Signed-off-by: KaiGai Kohei [EMAIL PROTECTED]
 --
  Documentation/ABI/testing/sysfs-kernel-capability |   23 +
  scripts/mkcapnames.sh |   44 +
  security/Makefile |9 ++
  security/commoncap.c  |   99 +
  4 files changed, 175 insertions(+), 0 deletions(-)
 
 diff --git a/Documentation/ABI/testing/sysfs-kernel-capability b/Documentation/ABI/testing/sysfs-kernel-capability
 index e69de29..402ef06 100644
 --- a/Documentation/ABI/testing/sysfs-kernel-capability
 +++ b/Documentation/ABI/testing/sysfs-kernel-capability
 @@ -0,0 +1,23 @@
 +What:       /sys/kernel/capability
 +Date:       Feb 2008
 +Contact:    KaiGai Kohei [EMAIL PROTECTED]
 +Description:
 + The entries under /sys/kernel/capability are used to export
 + the list of capabilities the running kernel supported.
 +

ditto, "supported" or "supports"?

 + - /sys/kernel/capability/version
 +   returns the most preferable version number for the
 +   running kernel.
 +   e.g) $ cat /sys/kernel/capability/version
 +0x20071026
 +
 + - /sys/kernel/capability/code/<numerical representation>
 +   returns its symbolic representation, on reading.
 +   e.g) $ cat /sys/kernel/capability/codes/30
 +cap_audit_control
 +
 + - /sys/kernel/capability/name/<symbolic representation>
 +   returns its numerical representation, on reading.
 +   e.g) $ cat /sys/kernel/capability/names/cap_sys_pacct
 +20
 +
 diff --git a/scripts/mkcapnames.sh b/scripts/mkcapnames.sh
 index e69de29..5d36d52 100644
 --- a/scripts/mkcapnames.sh
 +++ b/scripts/mkcapnames.sh
 @@ -0,0 +1,44 @@
 +#!/bin/sh
 +
 +#
 +# generate a cap_names.h file from include/linux/capability.h
 +#
 +
 +CAPHEAD=`dirname $0`/../include/linux/capability.h
 +REGEXP='^#define CAP_[A-Z_]+[ \t]+[0-9]+$'
 +NUMCAP=`cat $CAPHEAD | egrep -c "$REGEXP"`
 +
 +echo '#ifndef CAP_NAMES_H'
 +echo '#define CAP_NAMES_H'
 +echo
 +echo '/*'
 +echo ' * Do NOT edit this file directly.'
 +echo ' * This file is generated from include/linux/capability.h automatically'
 +echo ' */'
 +echo
 +echo '#if !defined(SYSFS_CAP_NAME_ENTRY) || !defined(SYSFS_CAP_CODE_ENTRY)'
 +echo '#error cap_names.h should be included from security/capability.c'
 +echo '#else'
 +echo "#if $NUMCAP != CAP_LAST_CAP + 1"
 +echo '#error mkcapnames.sh cannot collect capabilities correctly'
 +echo '#else'
 +cat $CAPHEAD | egrep "$REGEXP" \
 +| awk '{ printf("SYSFS_CAP_NAME_ENTRY(%s,%s);\n", tolower($2), $2); }'
 +echo
 +echo 'static struct attribute *capability_name_attrs[] = {'
 +cat $CAPHEAD | egrep "$REGEXP" \
 +| awk '{ printf("\t&%s_name_attr.attr,\n", tolower($2)); } END { print "\tNULL," }'
 +echo '};'
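
For context, the cap_names.h this script generates would look roughly like the
sketch below for the first couple of capabilities.  This is illustrative output
reconstructed from the echo/awk lines above, not text taken from the patch; with
CAP_LAST_CAP == CAP_MAC_ADMIN == 33, the NUMCAP count comes out as 34.

#if !defined(SYSFS_CAP_NAME_ENTRY) || !defined(SYSFS_CAP_CODE_ENTRY)
#error cap_names.h should be included from security/capability.c
#else
#if 34 != CAP_LAST_CAP + 1
#error mkcapnames.sh cannot collect capabilities correctly
#else
SYSFS_CAP_NAME_ENTRY(cap_chown,CAP_CHOWN);
SYSFS_CAP_NAME_ENTRY(cap_dac_override,CAP_DAC_OVERRIDE);
/* ... one SYSFS_CAP_NAME_ENTRY() per #define CAP_* in capability.h ... */

static struct attribute *capability_name_attrs[] = {
        &cap_chown_name_attr.attr,
        &cap_dac_override_name_attr.attr,
        /* ... */
        NULL,
};
/* (the truncated remainder of the script presumably emits the matching
 *  SYSFS_CAP_CODE_ENTRY table and the closing #endif lines) */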
 

Re: [PATCH 00/37] Permit filesystem local caching

2008-02-22 Thread David Howells
Daniel Phillips [EMAIL PROTECTED] wrote:

  The way the client works is like this:
 
 Thanks for the excellent ascii art, that cleared up the confusion right
 away.

You know what they say about pictures... :-)

  What are you trying to do exactly?  Are you actually playing with it, or
  just looking at the numbers I've produced?
 
 Trying to see if you are offering enough of a win to justify testing it,
 and if that works out, then going shopping for a bin of rotten vegetables
 to throw at your design, which I hope you will perceive as useful.

One thing that you have to remember: my test setup is pretty much a worst case
for showing the benefit of caching.  There's a single client and a single
server, they've got GigE networking between them that has very little other
load, and the server has sufficient memory to hold the entire test data set.

 From the numbers you have posted I think you are missing some basic
 efficiencies that could take this design from the sorta-ok zone to wow!

Not really, it's just that this lashup could be considered designed to show
local caching in the worst light.

 But looking up the object in the cache should be nearly free - much less
 than a microsecond per block.

The problem is that you have to do a database lookup of some sort, possibly
involving several synchronous disk operations.

CacheFiles does a disk lookup by taking the key given to it by NFS, turning it
into a set of file or directory names, and doing a short pathwalk to the target
cache file.  Throwing in extra indices won't necessarily help.  What matters is
how quick the backing filesystem is at doing lookups.  As it turns out, Ext3 is
a fair bit better than BTRFS when the disk cache is cold.
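
To picture the lookup: the key is encoded and leading fragments of the encoding
become subdirectory names, so finding the cache file is a short, fixed-depth
pathwalk.  A rough userspace illustration of that idea follows; the encoding and
names are made up for illustration, not CacheFiles' actual scheme.

#include <stdio.h>
#include <stddef.h>

/* Illustrative only: map an opaque cache key to a short path such as
 * "de/ad/deadbeef", so the lookup is a two-level pathwalk. */
static void key_to_path(const unsigned char *key, size_t len,
                        char *out, size_t outlen)
{
        char hex[2 * 64 + 1];
        size_t i;

        if (len > 64)
                len = 64;
        for (i = 0; i < len; i++)
                sprintf(hex + 2 * i, "%02x", key[i]);

        /* the first two hex pairs pick subdirectories (the "hash table"
         * levels); the full hex string names the cache file itself */
        snprintf(out, outlen, "%.2s/%.2s/%s", hex, hex + 2, hex);
}

int main(void)
{
        const unsigned char key[] = { 0xde, 0xad, 0xbe, 0xef };
        char path[256];

        key_to_path(key, sizeof(key), path, sizeof(path));
        printf("%s\n", path);           /* prints: de/ad/deadbeef */
        return 0;
}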

  The metadata problem is quite a tricky one since it increases with the
  number of files you're dealing with.  As things stand in my patches, when
  NFS, for example, wants to access a new inode, it first has to go to the
  server to lookup the NFS file handle, and only then can it go to the cache
  to find out if there's a matching object in the cache.
 
 So without the persistent cache it can omit the LOOKUP and just send the
 filehandle as part of the READ?

What 'it'?  Note that to get the filehandle, you have to do a LOOKUP op.  With
the cache, we could actually cache the results of lookups that we've done,
however, we don't know that the results are still valid without going to the
server :-/

AFS has a way around that - it versions its vnode (inode) IDs.

  The reason my client going to my server is so quick is that the server has
  the dcache and the pagecache preloaded, so that across-network lookup
  operations are really, really quick, as compared to the synchronous
  slogging of the local disk to find the cache object.
 
 Doesn't that just mean you have to preload the lookup table for the
 persistent cache so you can determine whether you are caching the data
 for a filehandle without going to disk?

Where "lookup table" == dcache.  That would be good yes.  cachefilesd
prescans all the files in the cache, which ought to do just that, but it
doesn't seem to be very effective.  I'm not sure why.
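
The prescan amounts to walking the whole cache tree and statting everything so
that the dentries and inodes end up in the kernel's caches.  A minimal userspace
sketch of that idea (hypothetical code and cache path, not cachefilesd's actual
implementation):

#define _XOPEN_SOURCE 500
#include <ftw.h>
#include <stdio.h>
#include <sys/stat.h>

/* Visit every object under the cache root so its dentry/inode is pulled into
 * core; the stat data itself is discarded. */
static int touch_entry(const char *path, const struct stat *sb,
                       int type, struct FTW *ftwbuf)
{
        (void)path; (void)sb; (void)type; (void)ftwbuf;
        return 0;                       /* keep walking */
}

int main(void)
{
        /* the cache root here is hypothetical; cachefilesd's is configurable */
        if (nftw("/var/fscache", touch_entry, 64, FTW_PHYS) == -1) {
                perror("nftw");
                return 1;
        }
        return 0;
}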

  I can probably improve this a little by pre-loading the subindex
  directories (hash tables) that I use to reduce the directory size in the
  cache, but I don't know by how much.
 
 Ah I should have read ahead.  I think the correct answer is "a lot".

Quite possibly.  It'll allow me to dispense with at least one fs lookup call
per cache object request call.

 Your big can't-get-there-from-here is the round trip to the server to
 determine whether you should read from the local cache.  Got any ideas?

I'm not sure what you mean.  Your statement should probably read "... to
determine _what_ you should read from the local cache".

 And where is the Trond-meister in all of this?

Keeping quiet as far as I can tell.

David


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-22 Thread Chris Mason
On Thursday 21 February 2008, David Howells wrote:
 David Howells [EMAIL PROTECTED] wrote:
   Have you got before/after benchmark results?
 
  See attached.

 Attached here are results using BTRFS (patched so that it'll work at all)
 rather than Ext3 on the client on the partition backing the cache.

Thanks for trying this, of course I'll ask you to try again with the latest 
v0.13 code, it has a number of optimizations especially for CPU usage.


 Note that I didn't bother redoing the tests that didn't involve a cache as
 the choice of filesystem backing the cache should have no bearing on the
 result.

 Generally, completely cold caches shouldn't show much variation as all the
 writing can be done completely asynchronously, provided the client doesn't
 fill its RAM.

 The interesting case is where the disk cache is warm, but the pagecache is
 cold (ie: just after a reboot after filling the caches).  Here, for the two
 big files case, BTRFS appears quite a bit better than Ext3, showing a 21%
 reduction in time for the smaller case and a 13% reduction for the larger
 case.

I'm afraid I don't have a good handle on the filesystem operations that result 
from this workload.  Are we reading from the FS to fill the NFS page cache?


 For the many small/medium files case, BTRFS performed significantly better
 (15% reduction in time) in the case where the caches were completely cold.
 I'm not sure why, though - perhaps because it doesn't execute a
 write_begin() stage during the write_one_page() call and thus doesn't go
 allocating disk blocks to back the data, but instead allocates them later.

If your write_one_page call does parts of btrfs_file_write, you'll get delayed
allocation for anything bigger than 8k by default.  <= 8k will get packed
into the btree leaves.


 More surprising is that BTRFS performed significantly worse (15% increase
 in time) in the case where the cache on disk was fully populated and then
 the machine had been rebooted to clear the pagecaches.

Which FS operations are included here?  Finding all the files or just an 
unmount?  Btrfs defrags metadata in the background, and unmount has to wait 
for that defrag to finish.

Thanks again,
Chris


Re: [PATCH 2/3] exporting capability name/code pairs (final)

2008-02-22 Thread Greg KH
On Fri, Feb 22, 2008 at 06:45:32PM +0900, Kohei KaiGai wrote:
 I believe it is correct assumption that long type and pointers have
 same width in the linux kernel. Please tell me, if it is wrong.

That is correct, it is one of the assumptions that is safe to make.  But
you should fix the compiler warning :)
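
The usual way to keep that assumption while keeping gcc quiet is to cast through
unsigned long so the narrowing is explicit.  A hedged sketch with made-up helper
names; this is not the actual commoncap.c code:

/* On x86_64 a pointer is 64 bits and int is 32, so the direct cast draws
 * the warning Andrew reported: */
static int cap_code_warns(const void *priv)
{
        return (int)priv;               /* cast from pointer to integer of
                                         * different size */
}

/* Going via unsigned long (pointer-sized in the kernel, per the assumption
 * above) makes the truncation explicit and silences the warning: */
static int cap_code_quiet(const void *priv)
{
        return (int)(unsigned long)priv;
}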

thanks,

greg k-h


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-22 Thread David Howells
Chris Mason [EMAIL PROTECTED] wrote:

  The interesting case is where the disk cache is warm, but the pagecache is
  cold (ie: just after a reboot after filling the caches).  Here, for the two
  big files case, BTRFS appears quite a bit better than Ext3, showing a 21%
  reduction in time for the smaller case and a 13% reduction for the larger
  case.
 
 I'm afraid I don't have a good handle on the filesystem operations that
 result from this workload.  Are we reading from the FS to fill the NFS page
 cache?

I'm not sure what you're asking.

When the cache is cold, we determine that we can't read from the cache very
quickly.  We then read data from the server and, in the background, create the
metadata in the cache and store the data to it (by copying netfs pages to
backingfs pages).

When the cache is warm, we read the data from the cache, copying the data from
the backingfs pages to the netfs pages.  We use bmap() to ascertain that there
is data to be read, otherwise we detect a hole and fall back to reading from
the server.
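
A sketch of that hole check against the 2.6-era in-kernel helper
sector_t bmap(struct inode *, sector_t); illustrative only, not the actual
CacheFiles code:

#include <linux/fs.h>
#include <linux/mm.h>

/* Returns non-zero if the backing file has a block mapped at the position of
 * page 'index'; zero means a hole, i.e. nothing cached there, so the caller
 * falls back to reading from the server. */
static int cache_block_present(struct inode *backing_inode, pgoff_t index)
{
        sector_t block;

        block = (sector_t)index << (PAGE_SHIFT - backing_inode->i_blkbits);
        return bmap(backing_inode, block) != 0;
}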

Looking up a cache object involves a sequence of lookup() ops and getxattr() ops
on the backingfs.  Should an object not exist, we defer creation of that
object to a background thread and do lookups(), mkdirs() and setxattrs() and a
create() to manufacture the object.

We read data from an object by calling readpages() on the backingfs to bring
the data into the pagecache.  We monitor the PG_lock bits to find out when
each page is read or has completed with an error.

Writing pages to the cache is done completely in the background.
PG_fscache_write is set on a page when it is handed to fscache for storage,
then at some point a background thread wakes up and calls write_one_page() in
the backingfs to write that page to the cache file.  At the moment, this
copies the data into a backingfs page which is then marked PG_dirty, and the
VM writes it out in the usual way.

  More surprising is that BTRFS performed significantly worse (15% increase
  in time) in the case where the cache on disk was fully populated and then
  the machine had been rebooted to clear the pagecaches.
 
 Which FS operations are included here?  Finding all the files or just an 
 unmount?  Btrfs defrags metadata in the background, and unmount has to wait 
 for that defrag to finish.

BTRFS might not be doing any writing at all here - apart from local atimes
(used by cache culling), that is.

What it does have to do is lots of lookups, reads and getxattrs, all of which
are synchronous.

David


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-22 Thread David Howells
David Howells [EMAIL PROTECTED] wrote:

   Have you got before/after benchmark results?
  
  See attached.
 
 Attached here are results using BTRFS (patched so that it'll work at all)
 rather than Ext3 on the client on the partition backing the cache.

And here are XFS results.

Tuning XFS makes a *really* big difference for the lots of small/medium files
being tarred case.  However, in general BTRFS is much better.

David
---


=
FEW BIG FILES TEST ON XFS
=

Completely cold caches:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
real    0m2.286s
user    0m0.000s
sys     0m1.828s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
real    0m4.228s
user    0m0.000s
sys     0m1.360s

Warm NFS pagecache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
real    0m0.058s
user    0m0.000s
sys     0m0.060s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
real    0m0.122s
user    0m0.000s
sys     0m0.120s

Warm XFS pagecache, cold NFS pagecache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
real    0m0.181s
user    0m0.000s
sys     0m0.180s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
real    0m1.034s
user    0m0.000s
sys     0m0.404s

Warm on-disk cache, cold pagecaches:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
real    0m1.540s
user    0m0.000s
sys     0m0.256s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
real    0m3.003s
user    0m0.000s
sys     0m0.532s


==
MANY SMALL/MEDIUM FILE READING TEST ON XFS
==

Completely cold caches:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
real    4m56.827s
user    0m0.180s
sys     0m6.668s

Warm NFS pagecache:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
real    0m15.084s
user    0m0.212s
sys     0m5.008s

Warm XFS pagecache, cold NFS pagecache:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
real    0m13.547s
user    0m0.220s
sys     0m5.652s

Warm on-disk cache, cold pagecaches:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
real    4m36.316s
user    0m0.148s
sys     0m4.440s


===
MANY SMALL/MEDIUM FILE READING TEST ON AN OPTIMISED XFS
===

mkfs.xfs -d agcount=4 -l size=128m,version=2 /dev/sda6


Completely cold caches:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
real    3m44.033s
user    0m0.248s
sys     0m6.632s

Warm on-disk cache, cold pagecaches:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
real    3m8.582s
user    0m0.108s
sys     0m3.420s


RE: [PATCH 00/37] Permit filesystem local caching

2008-02-22 Thread Rick Macklem
 Well, the AFS paper that was referenced earlier was written around the
 time of 10bt and 100bt.  Local disk caching worked well then.  There
 should also be some papers at CITI about disk caching over slower
 connections, and disconnected operation (which should still be
 applicable today).  There are still winners from local disk caching, but
 their numbers have been reduced.  Server load reduction should be a win.
 I'm not sure if it's worth it from a security/manageability standpoint,
 but I haven't looked that closely at David's code.

One area that you might want to look at is WAN performance. When RPC RTT
goes up, ordinary NFS performance goes down. This tends to get overlooked
by the machine room folks. (There are several tools out there that can
introduce delay in an IP packet stream and emulate WAN RTTs.)

Just a thought, rick


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-22 Thread David Howells
Chris Mason [EMAIL PROTECTED] wrote:

 Thanks for trying this, of course I'll ask you to try again with the latest 
 v0.13 code, it has a number of optimizations especially for CPU usage.

Here you go.  The numbers are very similar.

David

=
FEW BIG FILES TEST ON BTRFS v0.13
=

Completely cold caches:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
real    0m2.202s
user    0m0.000s
sys     0m1.716s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
real    0m4.212s
user    0m0.000s
sys     0m0.896s

Warm BTRFS pagecache, cold NFS pagecache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
real    0m0.197s
user    0m0.000s
sys     0m0.192s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
real    0m0.376s
user    0m0.000s
sys     0m0.372s

Warm on-disk cache, cold pagecaches:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile >/dev/null
real    0m1.543s
user    0m0.004s
sys     0m1.448s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile >/dev/null
real    0m3.111s
user    0m0.000s
sys     0m2.856s


==
MANY SMALL/MEDIUM FILE READING TEST ON BTRFS v0.13
==

Completely cold caches:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
real    0m31.575s
user    0m0.176s
sys     0m6.316s

Warm BTRFS pagecache, cold NFS pagecache:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
real    0m16.081s
user    0m0.164s
sys     0m5.528s

Warm on-disk cache, cold pagecaches:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa >/dev/zero
real    2m15.245s
user    0m0.064s
sys     0m2.808s



Re: [PATCH 00/37] Permit filesystem local caching

2008-02-22 Thread David Howells
Daniel Phillips [EMAIL PROTECTED] wrote:

 I am eventually going to suggest cutting the backing filesystem entirely out
 of the picture,

You still need a database to manage the cache.  A filesystem such as Ext3
makes a very handy database for four reasons:

 (1) It exists and works.

 (2) It has a well defined interface within the kernel.

 (3) I can place my cache on, say, my root partition on my laptop.  I don't
 have to dedicate a partition to the cache.

 (4) Userspace cache management tools (such as cachefilesd) have an already
 existing interface to use: rmdir, unlink, open, getdents, etc..

I do have a cache-on-blockdev thing, but it's basically a wandering tree
filesystem inside.  It is, or was, much faster than ext3 on a clean cache, but
it degrades horribly over time because my free space reclamation sucks - it
gradually randomises the block allocation sequence over time.

So, what would you suggest instead of a backing filesystem?

 I really do not like idea of force fitting this cache into a generic
 vfs model.  Sun was collectively smoking some serious crack when they
 cooked that one up.  But there is also the ageless principle "isness is
 more important than niceness".

What do you mean?  I'm not doing it like Sun.  The cache is a side path from
the netfs.  It should be transparent to the user, the VFS and the server.

The only place it might not be transparent is that you might have to
instruct the netfs mount to use the cache.  I'd prefer to do it some other way
than passing parameters to mount, though, as (1) this causes fun with NIS
distributed automounter maps, and (2) people are asking for a finer grain of
control than per-mountpoint.  Unfortunately, I can't seem to find a way to do
it that's acceptable to Al.

 Which would require a change to NFS, not an option because you hope to
 work with standard servers?  Of course with years to think about this,
 the required protocol changes were put into v4.  Not.

I don't think there's much I can do about NFS.  It requires the filesystem
that the NFS server is exporting to have inode uniquifiers, which are then
incorporated into the file handle.  I don't think the NFS protocol itself
needs to change to support this.

 Have you completely exhausted optimization ideas for the file handle
 lookup?

No, but there aren't many.  CacheFiles doesn't actually do very much, and it's
hard to reduce that not very much.  The most obvious thing is to prepopulate
the dcache, but that's at the expense of memory usage.

Actually, if I cache the name => FH mapping I used last time, I can make a
start on looking up in the cache whilst simultaneously accessing the server.
If what's on the server has changed, I can ditch the speculative cache lookup
I was making and start a new cache lookup.

However, storing directory entries has penalties of its own, though it'll be
necessary if we want to do disconnected operation.

  Where "lookup table" == dcache.  That would be good yes.  cachefilesd
  prescans all the files in the cache, which ought to do just that, but it
  doesn't seem to be very effective.  I'm not sure why.
 
 RCU?  Anyway, it is something to be tracked down and put right.

cachefilesd runs in userspace.  It's possible it isn't doing enough to preload
all the metadata.

 What I tried to say.  So still... got any ideas?  That extra synchronous
 network round trip is a killer.  Can it be made streaming/async to keep
 throughput healthy?

That's a per-netfs thing.  With the test rig I've got, it's going to the
on-disk cache that's the killer.  Going over the network is much faster.

See the results I posted.  For the tarball load, and using Ext3 to back the
cache:

Cold NFS cache, no disk cache:          0m22.734s
Warm on-disk cache, cold pagecaches:    1m54.350s

The problem is that reading using tar is a worst-case workload for this.
Everything it does is pretty much completely synchronous.

One thing that might help is if things like tar and find can be made to use
fadvise() on directories to hint to the filesystem (NFS, AFS, whatever) that
it's going to access every file in those directories.
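
A sketch of the kind of hint being suggested, from the tool's side; whether NFS
or AFS would actually act on the advice is exactly the open question
(hypothetical code, not from any posted patch):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Tell the filesystem we are about to touch everything under 'path'.
 * posix_fadvise() returns an error number directly rather than via errno. */
static int hint_whole_directory(const char *path)
{
        int fd = open(path, O_RDONLY | O_DIRECTORY);
        int rc;

        if (fd < 0) {
                perror("open");
                return -1;
        }
        rc = posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED); /* len 0 = to EOF */
        if (rc != 0)
                fprintf(stderr, "posix_fadvise: %d\n", rc);
        close(fd);
        return rc;
}

int main(int argc, char **argv)
{
        return argc > 1 ? hint_whole_directory(argv[1]) : 0;
}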

Certainly AFS could make use of that: the directory is read as a file, and the
netfs then parses the file to get a list of vnode IDs that that directory
points to.  It could then do bulk status fetch operations to instantiate the
inodes 50 at a time.

I don't know whether NFS could use it.  Someone like Trond or SteveD or Chuck
would have to answer that.

David