Re: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread David Howells
Daniel Phillips [EMAIL PROTECTED] wrote:

  These patches add local caching for network filesystems such as NFS.
 
 Have you got before/after benchmark results?

I need to get a new hard drive for my test machine before I can go and get
some more up to date benchmark results.  It does seem, however, that the I/O
error handling capabilities of FS-Cache work properly:-)

David
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread David Howells
Daniel Phillips [EMAIL PROTECTED] wrote:

 Have you got before/after benchmark results?

See attached.

These show a couple of things:

 (1) Dealing with lots of metadata slows things down a lot.  Note the result of
 looking and reading lots of small files with tar (the last result).  The
 NFS client has to both consult the NFS server *and* the cache.  Not only
 that, but any asynchronicity the cache may like to do is rendered
 ineffective by the fact tar wants to do a read on a file pretty much
 directly after opening it.

 (2) Getting metadata from the local disk fs is slower than pulling it across
 an unshared gigabit ethernet from a server that already has it in memory.

These points don't mean that fscache is no use, just that you have to consider
carefully whether it's of use to *you* given your particular situation, and
that depends on various factors.

Note that currently FS-Caching is disabled for individual NFS files opened for
writing as there's no way to handle the coherency problems thereby introduced.

David
---

  ===
  FS-CACHE FOR NFS BENCHMARKS
  ===

 (*) The NFS client has a 1.86GHz Core2 Duo CPU and 1GB of RAM.

 (*) The NFS client has a Seagate ST380211AS 80GB 7200rpm SATA disk on an
 interface running in AHCI mode.  The chipset is an Intel G965.

 (*) A partition of approx 4.5GB is committed to caching, and is formatted as
 Ext3 with a blocksize of 4096 and directory indices.

 (*) The NFS client is using SELinux.

 (*) The NFS server is running an in-kernel NFSd, and has a 2.66GHz Core2 Duo
 CPU and 6GB of RAM.  The chipset is an Intel P965.

 (*) The NFS client is connected to the NFS server by Gigabit Ethernet.

 (*) The NFS mount is made with defaults for all options not relating to the
 cache:

warthog:/warthog /warthog nfs
rw,vers=3,rsize=1048576,wsize=1048576,hard,proto=tcp,timeo=600,
retrans=2,sec=sys,fsc,addr=w.x.y.z 0 0


==
FEW BIG FILES TEST
==

Where:

 (*) The NFS server has two files:

[EMAIL PROTECTED] ~]# ls -l /warthog/bigfile
-rw-rw-r-- 1 4043 4043 104857600 2006-11-30 09:39 /warthog/bigfile
[EMAIL PROTECTED] ~]# ls -l /warthog/biggerfile 
-rw-rw-r-- 1 4043 4041 209715200 2006-03-21 13:56 /warthog/biggerfile

 Both of which are in memory on the server in all cases.


No patches, cold NFS cache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m1.909s
user0m0.000s
sys 0m0.520s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m3.750s
user0m0.000s
sys 0m0.904s

CONFIG_FSCACHE=n, cold NFS cache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m2.003s
user0m0.000s
sys 0m0.124s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m4.100s
user0m0.004s
sys 0m0.488s

Cold NFS cache, no disk cache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m2.084s
user0m0.000s
sys 0m0.136s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m4.020s
user0m0.000s
sys 0m0.720s

Completely cold caches:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m2.412s
user0m0.000s
sys 0m0.892s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m4.449s
user0m0.000s
sys 0m2.300s

Warm NFS pagecache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m0.067s
user0m0.000s
sys 0m0.064s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m0.133s
user0m0.000s
sys 0m0.136s

Warm Ext3 pagecache, cold NFS pagecache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m0.173s
user0m0.000s
sys 0m0.172s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m0.316s
user0m0.000s
sys 0m0.316s

Warm on-disk cache, cold pagecaches:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m1.955s
user0m0.000s
sys 0m0.244s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m3.596s
user0m0.000s
sys 0m0.460s


===
MANY SMALL/MEDIUM FILE READING TEST
===

Where:

 (*) The NFS server has an old kernel tree:

[EMAIL PROTECTED] ~]# du -s /warthog/aaa
347340  /warthog/aaa
[EMAIL 

Re: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread Kevin Coffman
On Thu, Feb 21, 2008 at 9:55 AM, David Howells [EMAIL PROTECTED] wrote:
 Daniel Phillips [EMAIL PROTECTED] wrote:


  Have you got before/after benchmark results?

  See attached.

  These show a couple of things:

   (1) Dealing with lots of metadata slows things down a lot.  Note the result 
 of
  looking and reading lots of small files with tar (the last result).  The
  NFS client has to both consult the NFS server *and* the cache.  Not only
  that, but any asynchronicity the cache may like to do is rendered
  ineffective by the fact tar wants to do a read on a file pretty much
  directly after opening it.

   (2) Getting metadata from the local disk fs is slower than pulling it across
  an unshared gigabit ethernet from a server that already has it in memory.

Hi David,

Your results remind me of this in case you're interested...

http://www.citi.umich.edu/techreports/reports/citi-tr-92-3.pdf
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread Daniel Phillips
Hi David,

I am trying to spot the numbers that show the sweet spot for this 
optimization, without much success so far.

Who is supposed to win big?  Is this mainly about reducing the load on 
the server, or is the client supposed to win even with a lightly loaded 
server?

When you say Ext3 cache vs NFS cache is the first on the server and the 
second on the client?

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread Muntz, Daniel
Well, the AFS paper that was referenced earlier was written around the
time of 10bt and 100bt.  Local disk caching worked well then.  There
should also be some papers at CITI about disk caching over slower
connections, and disconnected operation (which should still be
applicable today).  There are still winners from local disk caching, but
their numbers have been reduced.  Server load reduction should be a win.
I'm not sure if it's worth it from a security/manageability standpoint,
but I haven't looked that closely at David's code.

  -Dan

-Original Message-
From: Daniel Phillips [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 21, 2008 2:44 PM
To: David Howells
Cc: Myklebust, Trond; [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]; linux-security-module@vger.kernel.org;
[EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PATCH 00/37] Permit filesystem local caching

Hi David,

I am trying to spot the numbers that show the sweet spot for this
optimization, without much success so far.

Who is supposed to win big?  Is this mainly about reducing the load on
the server, or is the client supposed to win even with a lightly loaded
server?

When you say Ext3 cache vs NFS cache is the first on the server and the
second on the client?

Regards,

Daniel
___
NFSv4 mailing list
[EMAIL PROTECTED]
http://linux-nfs.org/cgi-bin/mailman/listinfo/nfsv4
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread David Howells
David Howells [EMAIL PROTECTED] wrote:

  Have you got before/after benchmark results?
 
 See attached.

Attached here are results using BTRFS (patched so that it'll work at all)
rather than Ext3 on the client on the partition backing the cache.

Note that I didn't bother redoing the tests that didn't involve a cache as the
choice of filesystem backing the cache should have no bearing on the result.

Generally, completely cold caches shouldn't show much variation as all the
writing can be done completely asynchronously, provided the client doesn't
fill its RAM.

The interesting case is where the disk cache is warm, but the pagecache is
cold (ie: just after a reboot after filling the caches).  Here, for the two
big files case, BTRFS appears quite a bit better than Ext3, showing a 21%
reduction in time for the smaller case and a 13% reduction for the larger
case.

For the many small/medium files case, BTRFS performed significantly better
(15% reduction in time) in the case where the caches were completely cold.
I'm not sure why, though - perhaps because it doesn't execute a write_begin()
stage during the write_one_page() call and thus doesn't go allocating disk
blocks to back the data, but instead allocates them later.

More surprising is that BTRFS performed significantly worse (15% increase in
time) in the case where the cache on disk was fully populated and then the
machine had been rebooted to clear the pagecaches.

It's important to note that I've only run each test once apiece, so the
numbers should be taken with a modicum of salt (bad statistics and all that).

David
---
===
FEW BIG FILES TEST ON BTRFS
===

Completely cold caches:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m2.124s
user0m0.000s
sys 0m1.260s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m4.538s
user0m0.000s
sys 0m2.624s

Warm NFS pagecache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m0.061s
user0m0.000s
sys 0m0.064s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m0.118s
user0m0.000s
sys 0m0.116s

Warm BTRFS pagecache, cold NFS pagecache:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m0.189s
user0m0.000s
sys 0m0.188s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m0.369s
user0m0.000s
sys 0m0.368s

Warm on-disk cache, cold pagecaches:

[EMAIL PROTECTED] ~]# time cat /warthog/bigfile /dev/null
real0m1.540s
user0m0.000s
sys 0m1.440s
[EMAIL PROTECTED] ~]# time cat /warthog/biggerfile /dev/null
real0m3.132s
user0m0.000s
sys 0m1.724s



MANY SMALL/MEDIUM FILE READING TEST ON BTRFS


Completely cold caches:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa /dev/zero
real0m31.838s
user0m0.192s
sys 0m6.076s

Warm NFS pagecache:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa /dev/zero
real0m14.841s
user0m0.148s
sys 0m4.988s

Warm BTRFS pagecache, cold NFS pagecache:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa /dev/zero
real0m16.773s
user0m0.148s
sys 0m5.512s

Warm on-disk cache, cold pagecaches:

[EMAIL PROTECTED] ~]# time tar cf - /warthog/aaa /dev/zero
real2m12.527s
user0m0.080s
sys 0m2.908s

-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread David Howells
Daniel Phillips [EMAIL PROTECTED] wrote:

 When you say Ext3 cache vs NFS cache is the first on the server and the 
 second on the client?

The filesystem on the server is pretty much irrelevant as long as (a) it
doesn't change, and (b) all the data is in memory on the server anyway.

The way the client works is like this:

+-+
| |   
|   NFS   |--+
| |  |
+-+  |   +--+ 
 |   |  | 
+-+  +--|  | 
| |  |  |
|   AFS   |-| FS-Cache |
| |  |  |--+
+-+  +--|  |  |
 |   |  |  |   +--+   +--+
+-+  |   +--+  |   |  |   |  |
| |  | +--|  CacheFiles  |--|  Ext3|
|  ISOFS  |--+ |  /var/cache  |   |  /dev/sda6   |
| |+--+   +--+
+-+


 (1) NFS, say, asks FS-Cache to store/retrieve data for it;

 (2) FS-Cache asks the cache backend, in this case CacheFiles to honour the
 operation;

 (3) CacheFiles 'opens' a file in a mounted filesystem, say Ext3, and does read
 and write operations of a sort on it;

 (4) Ext3 decides how the cache data is laid out on disk - CacheFiles just
 attempts to use one sparse file per netfs inode.

 I am trying to spot the numbers that show the sweet spot for this 
 optimization, without much success so far.

What are you trying to do exactly?  Are you actually playing with it, or just
looking at the numbers I've produced?

 Who is supposed to win big?  Is this mainly about reducing the load on 
 the server, or is the client supposed to win even with a lightly loaded 
 server?

These are difficult questions to answer.  The obvious answer to both is it
depends, and the real answer to both is it's a compromise.

Inserting a cache adds overhead: you have to look in the cache to see if your
objects are mirrored there, and then you have to look in the cache to see if
the data you want is stored there; and then you might have to go to the server
anyway and then schedule a copy to be stored in the cache.

The characteristics of this type of cache depend on a number of things: the
filesystem backing it being the most obvious variable, but also how fragmented
it is and the properties of the disk drive or drives it is on.

Whether it's worth having a cache depend on the characteristics of the network
versus the characteristics of the cache.  Latency of the cache vs latency of
the network, for example.  Network loading is another: having a cache on each
of several clients sharing a server can reduce network traffic by avoiding the
read requests to the server.  NFS has a characteristic that it keeps spamming
the server with file status requests, so even if you take the read requests out
of the load, an NFS client still generates quite a lot of network traffic to
the server - but the reduction is still useful.

The metadata problem is quite a tricky one since it increases with the number
of files you're dealing with.  As things stand in my patches, when NFS, for
example, wants to access a new inode, it first has to go to the server to
lookup the NFS file handle, and only then can it go to the cache to find out if
there's a matching object in the case.  Worse, the cache must then perform
several synchronous disk bound metadata operations before it can be possible to
read from the cache.  Worse still, this means that a read on the network file
cannot proceed until (a) we've been to the server *plus* (b) we've been to the
disk.

The reason my client going to my server is so quick is that the server has the
dcache and the pagecache preloaded, so that across-network lookup operations
are really, really quick, as compared to the synchronous slogging of the local
disk to find the cache object.

I can probably improve this a little by pre-loading the subindex directories
(hash tables) that I use to reduce the directory size in the cache, but I don't
know by how much.


Anyway, to answer your questions:

 (1) It may help with heavily loaded networks with lots of read-only traffic.

 (2) It may help with slow connections (like doing NFS between the UK and
 Australia).

 (3) It could be used to do offline/disconnected operation.

David
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 00/37] Permit filesystem local caching

2008-02-21 Thread Daniel Phillips
On Thursday 21 February 2008 16:07, David Howells wrote:
 The way the client works is like this:

Thanks for the excellent ascii art, that cleared up the confusion right
away.

 What are you trying to do exactly?  Are you actually playing with it, or just
 looking at the numbers I've produced?

Trying to see if you are offering enough of a win to justify testing it,
and if that works out, then going shopping for a bin of rotten vegetables
to throw at your design, which I hope you will perceive as useful.

In short I am looking for a reason to throw engineering effort at it.
From the numbers you have posted I think you are missing some basic
efficiencies that could take this design from the sorta-ok zone to wow!

I think you may already be in the wow zone for taking load off a server
and I know of applications where an NFS server gets hammered so badly
that having the client suck a little in the unloaded case is a price
worth paying.  But the whole idea would be much more attractive if the
regressions were smaller.

  Who is supposed to win big?  Is this mainly about reducing the load on 
  the server, or is the client supposed to win even with a lightly loaded 
  server?
 
 These are difficult questions to answer.  The obvious answer to both is it
 depends, and the real answer to both is it's a compromise.
 
 Inserting a cache adds overhead: you have to look in the cache to see if your
 objects are mirrored there, and then you have to look in the cache to see if
 the data you want is stored there; and then you might have to go to the server
 anyway and then schedule a copy to be stored in the cache.

But looking up the object in the cache should be nearly free - much less
than a microsecond per block.  If not then there are design issues.  I
suspect that you are doing yourself a disservice by going all the way
through the vfs to do this cache lookup, but this needs to be proved.

 The characteristics of this type of cache depend on a number of things: the
 filesystem backing it being the most obvious variable, but also how fragmented
 it is and the properties of the disk drive or drives it is on.

Double caching and vm unawareness of that has to hurt.

 The metadata problem is quite a tricky one since it increases with the number
 of files you're dealing with.  As things stand in my patches, when NFS, for
 example, wants to access a new inode, it first has to go to the server to
 lookup the NFS file handle, and only then can it go to the cache to find out 
 if
 there's a matching object in the case.

So without the persistent cache it can omit the LOOKUP and just send the
filehandle as part of the READ?

 Worse, the cache must then perform 
 several synchronous disk bound metadata operations before it can be possible 
 to
 read from the cache.  Worse still, this means that a read on the network file
 cannot proceed until (a) we've been to the server *plus* (b) we've been to the
 disk.
 
 The reason my client going to my server is so quick is that the server has 
 the 
 dcache and the pagecache preloaded, so that across-network lookup operations
 are really, really quick, as compared to the synchronous slogging of the local
 disk to find the cache object.

Doesn't that just mean you have to preload the lookup table for the
persistent cache so you can determine whether you are caching the data
for a filehandle without going to disk?

 I can probably improve this a little by pre-loading the subindex directories
 (hash tables) that I use to reduce the directory size in the cache, but I 
 don't
 know by how much.

Ah I should have read ahead.  I think the correct answer is a lot.
Your big can-t-get-there-from-here is the round trip to the server to
determine whether you should read from the local cache.  Got any ideas?

And where is the Trond-meister in all of this?

Regards,

Daniel
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] exporting capability name/code pairs (final)

2008-02-21 Thread Kohei KaiGai
[3/3] A new example to use kobject/kobj_attribute

The attached patch can provide a new exmple to use kobject and attribute.
The _show() and _store() method can refer/store the private data field of
kobj_attribute structure to know what entries are accessed by users.
It will make easier to share a single _show()/_store() method with several
entries.

KaiGai Kohei [EMAIL PROTECTED]
--
 samples/kobject/kobject-example.c |   32 
 1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/samples/kobject/kobject-example.c 
b/samples/kobject/kobject-example.c
index 08d0d3f..f99d734 100644
--- a/samples/kobject/kobject-example.c
+++ b/samples/kobject/kobject-example.c
@@ -77,6 +77,35 @@ static struct kobj_attribute baz_attribute =
 static struct kobj_attribute bar_attribute =
__ATTR(bar, 0666, b_show, b_store);

+/*
+ * You can store a private data within 'data' field of kobj_attribute.
+ * It enables to share a single _show() or _store() method with several
+ * entries.
+ */
+static ssize_t integer_show(struct kobject *kobj,
+   struct kobj_attribute *attr,
+   char *buf)
+{
+   return scnprintf(buf, PAGE_SIZE, %d\n, (int) attr-data);
+}
+
+static ssize_t integer_store(struct kobject *kobj,
+struct kobj_attribute *attr,
+const char *buf, size_t count)
+{
+   int code;
+
+   sscanf(buf, %du, code);
+   attr-data = (void *) code;
+   return count;
+}
+
+static struct kobj_attribute hoge_attribute =
+   __ATTR_DATA(hoge, 0666, integer_show, integer_store, 123);
+static struct kobj_attribute piyo_attribute =
+   __ATTR_DATA(piyo, 0666, integer_show, integer_store, 456);
+static struct kobj_attribute fuga_attribute =
+   __ATTR_DATA(fuga, 0444, integer_show, NULL, 789);

 /*
  * Create a group of attributes so that we can create and destory them all
@@ -86,6 +115,9 @@ static struct attribute *attrs[] = {
foo_attribute.attr,
baz_attribute.attr,
bar_attribute.attr,
+   hoge_attribute.attr,
+   piyo_attribute.attr,
+   fuga_attribute.attr,
NULL,   /* need to NULL terminate the list of attributes */
 };


-- 
OSS Platform Development Division, NEC
KaiGai Kohei [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] exporting capability name/code pairs (final)

2008-02-21 Thread Kohei KaiGai
[1/3] Add a private data field within kobj_attribute structure.

This patch add a private data field, declared as void *, within kobj_attribute
structure. The _show() and _store() method in the sysfs attribute entries can
refer this information to identify what entry is accessed.
It makes easier to share a single method implementation with several similar
entries, like ones to export the list of capabilities the running kernel
supported.

Signed-off-by: KaiGai Kohei [EMAIL PROTECTED]
--
 include/linux/kobject.h |1 +
 include/linux/sysfs.h   |7 +++
 2 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index caa3f41..57d5bf1 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -130,6 +130,7 @@ struct kobj_attribute {
char *buf);
ssize_t (*store)(struct kobject *kobj, struct kobj_attribute *attr,
 const char *buf, size_t count);
+   void *data; /* a private field */
 };

 extern struct sysfs_ops kobj_sysfs_ops;
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index 8027104..6f40ff9 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -50,6 +50,13 @@ struct attribute_group {
.store  = _store,   \
 }

+#define __ATTR_DATA(_name,_mode,_show,_store,_data) {  \
+   .attr = {.name = __stringify(_name), .mode = _mode },   \
+   .show   = _show,\
+   .store  = _store,   \
+   .data   = (void *)(_data),  \
+}
+   
 #define __ATTR_RO(_name) { \
.attr   = { .name = __stringify(_name), .mode = 0444 }, \
.show   = _name##_show, \

-- 
OSS Platform Development Division, NEC
KaiGai Kohei [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] capabilities: implement per-process securebits

2008-02-21 Thread Andrew G. Morgan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Serge E. Hallyn wrote:
| It all looks good to me.
|
| Since we've confirmed that wireshark uses capabilities it must be using
| prctl(PR_SET_KEEPCAPS), so running it might be a good way to verify that
| your changes to that codepath (with CONFIG_SECURITY_FILE_CAPABILITIES=n)
| are 100% correct, and might set minds at ease.  Is that something you're
| set up to be able to do?

I guess I need someone to offer an existence proof that this particular
wireshark code ever worked? (ldd dumpcap|grep libcap). For reference,
I'm looking at:

wireshark-0.99.7/dumpcap.c:302

void
relinquish_privs_except_capture(void)
{
~/* CAP_NET_ADMIN: Promiscuous mode and a truckload of other
~ *stuff we don't need (and shouldn't have).
~ * CAP_NET_RAW:   Packet capture (raw sockets).
~ */
~cap_value_t cap_list[2] = { CAP_NET_ADMIN, CAP_NET_RAW };
~cap_t caps = cap_init();
~int cl_len = sizeof(cap_list) / sizeof(cap_value_t);

~if (started_with_special_privs()) {
~print_caps(Pre drop, pre set);
~if (prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) == -1) {
~perror(prctl());
~}

~cap_set_flag(caps, CAP_PERMITTED,   cl_len, cap_list, CAP_SET);
~cap_set_flag(caps, CAP_INHERITABLE, cl_len, cap_list, CAP_SET);

[ XXX:AGM since (caps.pE  caps.pP) this next line should fail ]
~if (cap_set_proc(caps)) {
~perror(capset());
~}
~print_caps(Pre drop, post set);
~}

~relinquish_special_privs_perm();

~print_caps(Post drop, pre set);
~cap_set_flag(caps, CAP_EFFECTIVE,   cl_len, cap_list, CAP_SET);
~if (cap_set_proc(caps)) {
~perror(capset());
~}
~print_caps(Post drop, post set);
~cap_free(caps);
}
#endif /* HAVE_LIBCAP */

My reading of the above code suggests that the application believes that
it can raise/retain effective capabilities that are not in its permitted
set.

Browsing back in my git tree all the way back to 'v2.6.12-rc2', the
following code (cap_capset_check) correctly requires:

~   96  /* verify the _new_Effective_ is a subset of the _new_Permitted_ */
~   97  if (!cap_issubset (*effective, *permitted)) {
~   98   return -EPERM;
~   99  }

so my question is, why should one expect this wireshark code to work? It
looks wrong to me.

Thanks

Andrew

|
| -serge
|
| Thanks
|
| Andrew

~From 006ddf6903983dd596e360ab1ab8e537b29fab46 Mon Sep 17 00:00:00 2001
From: Andrew G. Morgan [EMAIL PROTECTED]
Date: Mon, 18 Feb 2008 15:23:28 -0800
Subject: [PATCH] Implement per-process securebits
|
[This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES
~ is enabled at configure time.]
|
Filesystem capability support makes it possible to do away with
(set)uid-0 based privilege and use capabilities instead. That is, with
filesystem support for capabilities but without this present patch,
it is (conceptually) possible to manage a system with capabilities
alone and never need to obtain privilege via (set)uid-0.
|
Of course, conceptually isn't quite the same as currently possible
since few user applications, certainly not enough to run a viable
system, are currently prepared to leverage capabilities to exercise
privilege. Further, many applications exist that may never get
upgraded in this way, and the kernel will continue to want to support
their setuid-0 base privilege needs.
|
Where pure-capability applications evolve and replace setuid-0
binaries, it is desirable that there be a mechanisms by which they
can contain their privilege. In addition to leveraging the per-process
bounding and inheritable sets, this should include suppressing the
privilege of the uid-0 superuser from the process' tree of children.
|
The feature added by this patch can be leveraged to suppress the
privilege associated with (set)uid-0. This suppression requires
CAP_SETPCAP to initiate, and only immediately affects the 'current'
process (it is inherited through fork()/exec()). This
reimplementation differs significantly from the historical support for
securebits which was system-wide, unwieldy and which has ultimately
withered to a dead relic in the source of the modern kernel.
|
With this patch applied a process, that is capable(CAP_SETPCAP), can
now drop all legacy privilege (through uid=0) for itself and all
subsequently fork()'d/exec()'d children with:
|
~  prctl(PR_SET_SECUREBITS, 0x2f);
|
[2008/02/18: This version includes an int - long argument fix from Serge.]
|
Signed-off-by: Andrew G. Morgan [EMAIL PROTECTED]
Acked-by: Serge Hallyn [EMAIL PROTECTED]
Reviewed-by: James Morris [EMAIL PROTECTED]
- ---
~ include/linux/capability.h |3 +-
~ include/linux/init_task.h  |3 +-
~ include/linux/prctl.h  |9 +++-
~ include/linux/sched.h  |3 +-
~ include/linux/securebits.h |   25 ---
~ include/linux/security.h   |   14 +++---
~ kernel/sys.c   |   25 +--
~ security/capability.c  |1 +
~ 

[PATCH 2/3] exporting capability name/code pairs (final)

2008-02-21 Thread Kohei KaiGai
[2/3] Exporting capability code/name pairs

This patch enables to export code/name pairs of capabilities the running
kernel supported.

A newer kernel sometimes adds new capabilities, like CAP_MAC_ADMIN
at 2.6.25. However, we have no interface to disclose what capabilities
are supported on the running kernel. Thus, we have to maintain libcap
version in appropriate one synchronously.

This patch enables libcap to collect the list of capabilities at run time,
and provide them for users. It helps to improve portability of library.

It exports these information as regular files under /sys/kernel/capability.
The numeric node exports its name, the symbolic node exports its code.

Please consider to put this patch on the queue of 2.6.25.

Thanks,

 BEGIN EXAMPLE 
[EMAIL PROTECTED] ~]$ ls -R /sys/kernel/capability/
/sys/kernel/capability/:
codes  names  version

/sys/kernel/capability/codes:
0  10  12  14  16  18  2   21  23  25  27  29  30  32  4  6  8
1  11  13  15  17  19  20  22  24  26  28  3   31  33  5  7  9

/sys/kernel/capability/names:
cap_audit_controlcap_kill  cap_net_raw cap_sys_nice
cap_audit_write  cap_lease cap_setfcap cap_sys_pacct
cap_chowncap_linux_immutable   cap_setgid  cap_sys_ptrace
cap_dac_override cap_mac_admin cap_setpcap cap_sys_rawio
cap_dac_read_search  cap_mac_override  cap_setuid  cap_sys_resource
cap_fowner   cap_mknod cap_sys_admin   cap_sys_time
cap_fsetid   cap_net_admin cap_sys_bootcap_sys_tty_config
cap_ipc_lock cap_net_bind_service  cap_sys_chroot
cap_ipc_ownercap_net_broadcast cap_sys_module
[EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/version
0x20071026
[EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/codes/30
cap_audit_control
[EMAIL PROTECTED] ~]$ cat /sys/kernel/capability/names/cap_sys_pacct
20
[EMAIL PROTECTED] ~]$
 END EXAMPLE --

Signed-off-by: KaiGai Kohei [EMAIL PROTECTED]
--
 Documentation/ABI/testing/sysfs-kernel-capability |   23 +
 scripts/mkcapnames.sh |   44 +
 security/Makefile |9 ++
 security/commoncap.c  |   99 +
 4 files changed, 175 insertions(+), 0 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-capability 
b/Documentation/ABI/testing/sysfs-kernel-capability
index e69de29..402ef06 100644
--- a/Documentation/ABI/testing/sysfs-kernel-capability
+++ b/Documentation/ABI/testing/sysfs-kernel-capability
@@ -0,0 +1,23 @@
+What:  /sys/kernel/capability
+Date:  Feb 2008
+Contact:   KaiGai Kohei [EMAIL PROTECTED]
+Description:
+   The entries under /sys/kernel/capability are used to export
+   the list of capabilities the running kernel supported.
+
+   - /sys/kernel/capability/version
+ returns the most preferable version number for the
+ running kernel.
+ e.g) $ cat /sys/kernel/capability/version
+  0x20071026
+
+   - /sys/kernel/capability/code/numerical representation
+ returns its symbolic representation, on reading.
+ e.g) $ cat /sys/kernel/capability/codes/30
+  cap_audit_control
+
+   - /sys/kernel/capability/name/symbolic representation
+ returns its numerical representation, on reading.
+ e.g) $ cat /sys/kernel/capability/names/cap_sys_pacct
+  20
+
diff --git a/scripts/mkcapnames.sh b/scripts/mkcapnames.sh
index e69de29..5d36d52 100644
--- a/scripts/mkcapnames.sh
+++ b/scripts/mkcapnames.sh
@@ -0,0 +1,44 @@
+#!/bin/sh
+
+#
+# generate a cap_names.h file from include/linux/capability.h
+#
+
+CAPHEAD=`dirname $0`/../include/linux/capability.h
+REGEXP='^#define CAP_[A-Z_]+[  ]+[0-9]+$'
+NUMCAP=`cat $CAPHEAD | egrep -c $REGEXP`
+
+echo '#ifndef CAP_NAMES_H'
+echo '#define CAP_NAMES_H'
+echo
+echo '/*'
+echo ' * Do NOT edit this file directly.'
+echo ' * This file is generated from include/linux/capability.h automatically'
+echo ' */'
+echo
+echo '#if !defined(SYSFS_CAP_NAME_ENTRY) || !defined(SYSFS_CAP_CODE_ENTRY)'
+echo '#error cap_names.h should be included from security/capability.c'
+echo '#else'
+echo #if $NUMCAP != CAP_LAST_CAP + 1
+echo '#error mkcapnames.sh cannot collect capabilities correctly'
+echo '#else'
+cat $CAPHEAD | egrep $REGEXP \
+| awk '{ printf(SYSFS_CAP_NAME_ENTRY(%s,%s);\n, tolower($2), $2); }'
+echo
+echo 'static struct attribute *capability_name_attrs[] = {'
+cat $CAPHEAD | egrep $REGEXP \
+| awk '{ printf(\t%s_name_attr.attr,\n, tolower($2)); } END { print 
\tNULL, }'
+echo '};'
+
+echo
+cat $CAPHEAD | egrep $REGEXP \
+| awk '{ printf(SYSFS_CAP_CODE_ENTRY(%s,%s);\n, tolower($2), $2); }'
+echo
+echo 'static struct attribute *capability_code_attrs[] = {'
+cat 

Re: [PATCH 06/37] Security: Separate task security context from task_struct

2008-02-21 Thread Casey Schaufler

--- David Howells [EMAIL PROTECTED] wrote:

 Separate the task security context from task_struct.  At this point, the
 security data is temporarily embedded in the task_struct with two pointers
 pointing to it.

 ...

 diff --git a/security/smack/smack_access.c b/security/smack/smack_access.c
 index f6b5f6e..722752f 100644
 --- a/security/smack/smack_access.c
 +++ b/security/smack/smack_access.c
 @@ -164,7 +164,7 @@ int smk_curacc(char *obj_label, u32 mode)
  {
   int rc;
  
 - rc = smk_access(current-security, obj_label, mode);
 + rc = smk_access(current-act_as-security, obj_label, mode);
   if (rc == 0)
   return 0;
  
 diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
 index 25cbfa3..a49d94f 100644
 --- a/security/smack/smack_lsm.c
 +++ b/security/smack/smack_lsm.c
 @@ -102,7 +102,8 @@ static int smack_ptrace(struct task_struct *ptp, struct
 task_struct *ctp)
   if (rc != 0)
   return rc;
  
 - rc = smk_access(ptp-security, ctp-security, MAY_READWRITE);
 + rc = smk_access(ptp-act_as-security, ctp-sec-security,
 + MAY_READWRITE);
   if (rc != 0  __capable(ptp, CAP_MAC_OVERRIDE))
   return 0;
  
 @@ -120,7 +121,7 @@ static int smack_ptrace(struct task_struct *ptp, struct
 task_struct *ctp)
  static int smack_syslog(int type)
  {
   int rc;
 - char *sp = current-security;
 + char *sp = current-act_as-security;
  
   rc = cap_syslog(type);
   if (rc != 0)
 @@ -359,7 +360,7 @@ static int smack_sb_umount(struct vfsmount *mnt, int
 flags)
   */
  static int smack_inode_alloc_security(struct inode *inode)
  {
 - inode-i_security = new_inode_smack(current-security);
 + inode-i_security = new_inode_smack(current-act_as-security);
   if (inode-i_security == NULL)
   return -ENOMEM;
   return 0;
 @@ -777,7 +778,7 @@ static int smack_file_permission(struct file *file, int
 mask)
   */
  static int smack_file_alloc_security(struct file *file)
  {
 - file-f_security = current-security;
 + file-f_security = current-act_as-security;
   return 0;
  }
  
 @@ -875,7 +876,7 @@ static int smack_file_fcntl(struct file *file, unsigned
 int cmd,
   */
  static int smack_file_set_fowner(struct file *file)
  {
 - file-f_security = current-security;
 + file-f_security = current-act_as-security;
   return 0;
  }
  
 @@ -900,7 +901,7 @@ static int smack_file_send_sigiotask(struct task_struct
 *tsk,
* struct fown_struct is never outside the context of a struct file
*/
   file = container_of(fown, struct file, f_owner);
 - rc = smk_access(file-f_security, tsk-security, MAY_WRITE);
 + rc = smk_access(file-f_security, tsk-sec-security, MAY_WRITE);
   if (rc != 0  __capable(tsk, CAP_MAC_OVERRIDE))
   return 0;
   return rc;
 @@ -943,7 +944,7 @@ static int smack_file_receive(struct file *file)
   */
  static int smack_task_alloc_security(struct task_struct *tsk)
  {
 - tsk-security = current-security;
 + tsk-sec-security = current-act_as-security;
  
   return 0;
  }
 @@ -958,7 +959,7 @@ static int smack_task_alloc_security(struct task_struct
 *tsk)
   */
  static void smack_task_free_security(struct task_struct *task)
  {
 - task-security = NULL;
 + task-sec-security = NULL;
  }
  
  /**
 @@ -970,7 +971,7 @@ static void smack_task_free_security(struct task_struct
 *task)
   */
  static int smack_task_setpgid(struct task_struct *p, pid_t pgid)
  {
 - return smk_curacc(p-security, MAY_WRITE);
 + return smk_curacc(p-sec-security, MAY_WRITE);
  }
  
  /**
 @@ -981,7 +982,7 @@ static int smack_task_setpgid(struct task_struct *p,
 pid_t pgid)
   */
  static int smack_task_getpgid(struct task_struct *p)
  {
 - return smk_curacc(p-security, MAY_READ);
 + return smk_curacc(p-sec-security, MAY_READ);
  }
  
  /**
 @@ -992,7 +993,7 @@ static int smack_task_getpgid(struct task_struct *p)
   */
  static int smack_task_getsid(struct task_struct *p)
  {
 - return smk_curacc(p-security, MAY_READ);
 + return smk_curacc(p-sec-security, MAY_READ);
  }
  
  /**
 @@ -1004,7 +1005,7 @@ static int smack_task_getsid(struct task_struct *p)
   */
  static void smack_task_getsecid(struct task_struct *p, u32 *secid)
  {
 - *secid = smack_to_secid(p-security);
 + *secid = smack_to_secid(p-sec-security);
  }
  
  /**
 @@ -1016,7 +1017,7 @@ static void smack_task_getsecid(struct task_struct *p,
 u32 *secid)
   */
  static int smack_task_setnice(struct task_struct *p, int nice)
  {
 - return smk_curacc(p-security, MAY_WRITE);
 + return smk_curacc(p-sec-security, MAY_WRITE);
  }
  
  /**
 @@ -1028,7 +1029,7 @@ static int smack_task_setnice(struct task_struct *p,
 int nice)
   */
  static int smack_task_setioprio(struct task_struct *p, int ioprio)
  {
 - return smk_curacc(p-security, MAY_WRITE);
 + return smk_curacc(p-sec-security, MAY_WRITE);
  }
  
  /**
 @@ -1039,7 +1040,7 @@ 

Re: [PATCH 07/37] Security: De-embed task security record from task and use refcounting

2008-02-21 Thread Casey Schaufler

--- David Howells [EMAIL PROTECTED] wrote:

 Remove the temporarily embedded task security record from task_struct. 
 Instead
 it is made to dangle from the task_struct::sec and task_struct::act_as
 pointers
 with references counted for each.
 
 ...
 
 The LSM hooks for dealing with task security are modified to deal with the
 task
 security struct directly rather than going via the task_struct as appopriate.
 
 ...

 diff --git a/security/smack/smack_lsm.c b/security/smack/smack_lsm.c
 index a49d94f..dbce607 100644
 --- a/security/smack/smack_lsm.c
 +++ b/security/smack/smack_lsm.c
 @@ -957,9 +957,22 @@ static int smack_task_alloc_security(struct task_struct
 *tsk)
   * points to an immutable list. The blobs never go away.
   * There is no leak here.
   */
 -static void smack_task_free_security(struct task_struct *task)
 +static void smack_task_free_security(struct task_security *sec)
  {
 - task-sec-security = NULL;
 + sec-security = NULL;
 +}
 +
 +/**
 + * task_dup_security - Duplicate task security
 + * @p points to the task_security struct that has been copied
 + *
 + * Duplicate the security structure currently attached to the p-security
 field
 + * and attach back to p-security (the pointer itself was copied, so there's
 + * nothing to be done here).
 + */
 +static int smack_task_dup_security(struct task_security *sec)
 +{
 + return 0;
  }

Thank you for adding this hook. The comment is helpful.
  
  /**
 @@ -2276,17 +2289,17 @@ static int smack_inet_conn_request(struct sock *sk,
 struct sk_buff *skb,
  /**
   * smack_key_alloc - Set the key security blob
   * @key: object
 - * @tsk: the task associated with the key
 + * @context: the task security associated with the key
   * @flags: unused
   *
   * No allocation required
   *
   * Returns 0
   */
 -static int smack_key_alloc(struct key *key, struct task_struct *tsk,
 +static int smack_key_alloc(struct key *key, struct task_security *context,
  unsigned long flags)
  {
 - key-security = tsk-act_as-security;
 + key-security = context-security;
   return 0;
  }
  
 @@ -2304,14 +2317,14 @@ static void smack_key_free(struct key *key)
  /*
   * smack_key_permission - Smack access on a key
   * @key_ref: gets to the object
 - * @context: task involved
 + * @context: task security involved
   * @perm: unused
   *
   * Return 0 if the task has read and write to the object,
   * an error code otherwise
   */
  static int smack_key_permission(key_ref_t key_ref,
 - struct task_struct *context, key_perm_t perm)
 + struct task_security *context, key_perm_t perm)
  {
   struct key *keyp;
  
 @@ -2327,10 +2340,10 @@ static int smack_key_permission(key_ref_t key_ref,
   /*
* This should not occur
*/
 - if (context-act_as-security == NULL)
 + if (context-security == NULL)
   return -EACCES;
  
 - return smk_access(context-act_as-security, keyp-security,
 + return smk_access(context-security, keyp-security,
 MAY_READWRITE);
  }
  #endif /* CONFIG_KEYS */
 @@ -2430,6 +2443,7 @@ static struct security_operations smack_ops = {
  
   .task_alloc_security =  smack_task_alloc_security,
   .task_free_security =   smack_task_free_security,
 + .task_dup_security =smack_task_dup_security,
   .task_post_setuid = cap_task_post_setuid,
   .task_setpgid = smack_task_setpgid,
   .task_getpgid = smack_task_getpgid,

No objections from the Smack side. Thank you.


Casey Schaufler
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 09/37] Security: Allow kernel services to override LSM settings for task actions

2008-02-21 Thread Casey Schaufler

--- David Howells [EMAIL PROTECTED] wrote:

 Allow kernel services to override LSM settings appropriate to the actions
 performed by a task by duplicating a security record, modifying it and then
 using task_struct::act_as to point to it when performing operations on behalf
 of a task.
 
 This is used, for example, by CacheFiles which has to transparently access
 the
 cache on behalf of a process that thinks it is doing, say, NFS accesses with
 a
 potentially inappropriate (with respect to accessing the cache) set of
 security data.
 
 This patch provides two LSM hooks for modifying a task security record:
 
  (*) security_kernel_act_as() which allows modification of the security datum
  with which a task acts on other objects (most notably files).
 
  (*) security_create_files_as() which allows modification of the security
  datum that is used to initialise the security data on a file that a task
  creates.
 
 ...

 --- a/security/smack/smack_lsm.c
 +++ b/security/smack/smack_lsm.c
 @@ -976,6 +976,36 @@ static int smack_task_dup_security(struct task_security
 *sec)
  }
  
  /**
 + * smack_task_kernel_act_as - Set the subjective context in a security
 record
 + * @p points to the task that nominated @secid.
 + * @sec points to the task security record to be modified.
 + * @secid specifies the security ID to be set
 + *
 + * Set the security data for a kernel service.
 + */
 +static int smack_task_kernel_act_as(struct task_struct *p,
 + struct task_security *sec, u32 secid)
 +{
 + return -ENOTSUPP;
 +}
 +
 +/**
 + * smack_task_create_files_as - Set the file creation label in a security
 record
 + * @p points to the task that nominated @inode.
 + * @sec points to the task security record to be modified.
 + * @inode points to the inode to use as a reference.
 + *
 + * Set the file creation context in a security record to the same as the
 + * objective context of the specified inode
 + */
 +static int smack_task_create_files_as(struct task_struct *p,
 +   struct task_security *sec,
 +   struct inode *inode)
 +{
 + return -ENOTSUPP;
 +}

Hum. ENOTSUPP is not not very satisfying, is it? I will have to
think on this a bit.

 +
 +/**
   * smack_task_setpgid - Smack check on setting pgid
   * @p: the task object
   * @pgid: unused
 @@ -2444,6 +2474,8 @@ static struct security_operations smack_ops = {
   .task_alloc_security =  smack_task_alloc_security,
   .task_free_security =   smack_task_free_security,
   .task_dup_security =smack_task_dup_security,
 + .task_kernel_act_as =   smack_task_kernel_act_as,
 + .task_create_files_as = smack_task_create_files_as,
   .task_post_setuid = cap_task_post_setuid,
   .task_setpgid = smack_task_setpgid,
   .task_getpgid = smack_task_getpgid,

Except for the fact that the hooks don't do anything this
looks fine. I'm not sure that I would want these hooks to
do anything, it requires additional thought to determine if
there is a good behavior for them.

Thank you.


Casey Schaufler
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] capabilities: implement per-process securebits

2008-02-21 Thread Andrew G. Morgan

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1


Andrew G. Morgan wrote:
| Serge E. Hallyn wrote:
| | It all looks good to me.
| |
| | Since we've confirmed that wireshark uses capabilities it must be using
| | prctl(PR_SET_KEEPCAPS), so running it might be a good way to verify
that
| | your changes to that codepath (with
CONFIG_SECURITY_FILE_CAPABILITIES=n)
| | are 100% correct, and might set minds at ease.  Is that something
you're
| | set up to be able to do?
|
| I guess I need someone to offer an existence proof that this particular
| wireshark code ever worked? (ldd dumpcap|grep libcap). For reference,
| I'm looking at:

Doh! :*)

I had mistaken cap_init() for cap_get_proc() in the wireshark code...

I now see what wireshark is trying to do, and can *confirm* that with
the present patch I do maintain the legacy behavior. :-)

FWIW I've updated capsh in the libcap git tree to add a few more hooks
and with the following sequence can now verify that the keep-caps works
as before:

As root:
# rm -f tcapsh
# cp capsh tcapsh
# chown root.root tcapsh
# chmod u+s tcapsh
# ls -l tcapsh
# ./capsh --uid=500 -- -c ./tcapsh --keep=1 \
~  --caps=\cap_net_raw,cap_net_admin=ip\ --uid=500 \
~  --caps=\cap_net_raw,cap_net_admin=pie\ --print
# echo $?
0

The wireshark problem, that you have been discussing (in the other
thread), can also be simulated as follows:

# ./capsh --uid=500 -- -c ./tcapsh --keep=1 \
~  --caps=\cap_net_raw,cap_net_admin=ip\ \
~  --uid=500 --forkfor=10 --caps= --print \
~  --killit=9 --print

You might like to re-post your fix for that problem as a stand alone
patch; I suspect it may be lost in the noise at this point. (You might
also like to update the comment in that fix since the old comment looks
very stale if you delete the -euid==0 check. I think it is safe to
simply say /* legacy signal behavior requires that a user can kill any
process running with their uid */)

Cheers

Andrew

|
| wireshark-0.99.7/dumpcap.c:302
|
| void
| relinquish_privs_except_capture(void)
| {
| ~/* CAP_NET_ADMIN: Promiscuous mode and a truckload of other
| ~ *stuff we don't need (and shouldn't have).
| ~ * CAP_NET_RAW:   Packet capture (raw sockets).
| ~ */
| ~cap_value_t cap_list[2] = { CAP_NET_ADMIN, CAP_NET_RAW };
| ~cap_t caps = cap_init();
| ~int cl_len = sizeof(cap_list) / sizeof(cap_value_t);
|
| ~if (started_with_special_privs()) {
| ~print_caps(Pre drop, pre set);
| ~if (prctl(PR_SET_KEEPCAPS, 1, 0, 0, 0) == -1) {
| ~perror(prctl());
| ~}
|
| ~cap_set_flag(caps, CAP_PERMITTED,   cl_len, cap_list, CAP_SET);
| ~cap_set_flag(caps, CAP_INHERITABLE, cl_len, cap_list, CAP_SET);
|
| [ XXX:AGM since (caps.pE  caps.pP) this next line should fail ]
| ~if (cap_set_proc(caps)) {
| ~perror(capset());
| ~}
| ~print_caps(Pre drop, post set);
| ~}
|
| ~relinquish_special_privs_perm();
|
| ~print_caps(Post drop, pre set);
| ~cap_set_flag(caps, CAP_EFFECTIVE,   cl_len, cap_list, CAP_SET);
| ~if (cap_set_proc(caps)) {
| ~perror(capset());
| ~}
| ~print_caps(Post drop, post set);
| ~cap_free(caps);
| }
| #endif /* HAVE_LIBCAP */
|
| My reading of the above code suggests that the application believes that
| it can raise/retain effective capabilities that are not in its permitted
| set.
|
| Browsing back in my git tree all the way back to 'v2.6.12-rc2', the
| following code (cap_capset_check) correctly requires:
|
| ~   96  /* verify the _new_Effective_ is a subset of the
_new_Permitted_ */
| ~   97  if (!cap_issubset (*effective, *permitted)) {
| ~   98   return -EPERM;
| ~   99  }
|
| so my question is, why should one expect this wireshark code to work? It
| looks wrong to me.
|
| Thanks
|
| Andrew
|
| |
| | -serge
| |
| | Thanks
| |
| | Andrew
|
| ~From 006ddf6903983dd596e360ab1ab8e537b29fab46 Mon Sep 17 00:00:00 2001
| From: Andrew G. Morgan [EMAIL PROTECTED]
| Date: Mon, 18 Feb 2008 15:23:28 -0800
| Subject: [PATCH] Implement per-process securebits
| |
| [This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES
| ~ is enabled at configure time.]
| |
| Filesystem capability support makes it possible to do away with
| (set)uid-0 based privilege and use capabilities instead. That is, with
| filesystem support for capabilities but without this present patch,
| it is (conceptually) possible to manage a system with capabilities
| alone and never need to obtain privilege via (set)uid-0.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.2.6 (GNU/Linux)

iD8DBQFHvmeB+bHCR3gb8jsRAknmAKCMw0Qe7uDwtuRE+f3YVmnlE5pK4wCgsv0f
5E6+K9Z0Xp1P74iOlnt221o=
=+sLL
-END PGP SIGNATURE-
-
To unsubscribe from this list: send the line unsubscribe 
linux-security-module in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html