[Devel] Re: [PATCH 5/5][NFS] Cleanup explicit check for mandatory locks
Trond Myklebust wrote: On Mon, 2007-09-17 at 11:57 +0400, Pavel Emelyanov wrote: The __mandatory_lock(inode) macro makes the same check, but makes the code more readable. Could we please avoid using underscores in macros. Also, why are we breaking the usual convention of capitalising macro names? Sorry, I forgot to update the changelog - this is not a macro, but a static inline function. The underscores are there because the mandatory_lock() one already exists and additionally checks for if (IS_MANDLOCK(inode)) Thanks, Pavel Cheers Trond Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Cc: Trond Myklebust [EMAIL PROTECTED] --- fs/nfs/file.c | 3 +-- 1 files changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/nfs/file.c b/fs/nfs/file.c index 73ddd2e..7a07be1 100644 --- a/fs/nfs/file.c +++ b/fs/nfs/file.c @@ -605,8 +605,7 @@ static int nfs_lock(struct file *filp, i nfs_inc_stats(inode, NFSIOS_VFSLOCK); /* No mandatory locks over NFS */ -if ((inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID && fl->fl_type != F_UNLCK) +if (__mandatory_lock(inode) && fl->fl_type != F_UNLCK) return -ENOLCK; if (IS_GETLK(cmd)) ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 4/4] Account for the slub objects
Dave Hansen wrote: On Mon, 2007-09-17 at 16:35 +0400, Pavel Emelyanov wrote: + + rcu_read_lock(); + cnt = task_kmem_container(current); + if (res_counter_charge(&cnt->res, s->size)) + goto err_locked; + + css_get(&cnt->css); + rcu_read_unlock(); + obj_container[slab_index(obj, s, page_address(pg))] = cnt; You made some effort in the description, but could we get some big fat comments here about what RCU is doing? No big fat comment here - this all is containers API. -- Dave ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers
[Devel] Re: [PATCH 4/4] Account for the slub objects
Dave Hansen wrote: On Mon, 2007-09-17 at 16:35 +0400, Pavel Emelyanov wrote: The struct page gets an extra pointer (just like it has with the RSS controller) and this pointer points to the array of the kmem_container pointers - one for each object stored on that page itself. Can't these at least be unioned so we don't make it any _worse_ when both are turned on? Your comment makes me suspect that you don't look at the patches at all :( They ARE unioned. -- Dave
[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod
J. Bruce Fields wrote: On Mon, Sep 17, 2007 at 10:37:56AM +0400, Pavel Emelyanov wrote: J. Bruce Fields wrote: Is there a small chance that a lock may be applied after this check: + mandatory = (inode->i_flock && MANDATORY_LOCK(inode)); + but early enough that someone can still block on the lock while the file is still marked for mandatory locking? (And is the inode->i_flock check there really necessary?) There is, but as you have noticed: OK, but why not just remove the inode->i_flock check there? I can't see how it helps anyway. Well, there are probably worse races in the mandatory locking code. ...there are. The inode->i_flock list is protected with lock_kernel() only and is not in sync with any other checks for inodes. This is sad :( but good locking for locks is yet to be done... I would also prefer a locking scheme that didn't rely on the BKL. That said, except for this race: I would as well :) But I don't know the locking code well enough to start fixing it. Besides, even if I send a patch series that handles this, I don't think that anyone will accept it, due to "this changes too much code", "can you prove you fixed all the places" and so on... (For example, my impression is that a mandatory lock can be applied just after the locks_mandatory_area() checks but before the io actually completes.) ... I'm not aware of other races in the existing file-locking code. It sounds like you might be. Could you give specific examples? Well, there's a long-standing BUG in the leases code - when we have made all the checks on inserting a lease, we call locks_alloc_lock() and may fall asleep. But after the wakeup nobody re-checks whether things have changed. I suspect there are other bad places. --b.
[Devel] Re: [PATCH 2/4] Switch caches notification dynamically
Christoph Lameter wrote: On Mon, 17 Sep 2007, Pavel Emelyanov wrote: If we turn accounting on on some cache and this cache is merged with some other, this other will be notified as well. We can solve this by disabling of cache merging, but maybe we can do it some other way. You could write a 1 to slub_nomerge during bootup if containers are to be supported? Once they are merged it is going to be difficult to separate them again. Yes. I also thought that disabling the merge is the only way to handle this case. Thanks.
[Devel] Re: [PATCH 2/4] Switch caches notification dynamically
Christoph Lameter wrote: On Mon, 17 Sep 2007, Pavel Emelyanov wrote: struct kmem_cache kmalloc_caches[PAGE_SHIFT] __cacheline_aligned; EXPORT_SYMBOL(kmalloc_caches); +static inline int is_kmalloc_cache(struct kmem_cache *s) +{ +int km_idx; + +km_idx = s - kmalloc_caches; +return km_idx >= 0 && km_idx < ARRAY_SIZE(kmalloc_caches); +} Could be as simple as return s >= kmalloc_caches && s < kmalloc_caches + ARRAY_SIZE(kmalloc_caches); +if (buf[0] == '0') { +if (any_slab_objects(s)) +/* + * we cannot turn this off because the + * full slabs cannot be found in this case + */ +return -EBUSY; The full slabs can be checked by subtracting the partial slabs from the allocated slabs in the per node structure. No no! This is not what I meant here. This is just like turning redzoning on/off dynamically. I meant that we cannot find the pages that are full of objects to notify others that these ones are no longer tracked. I know that we can do it by tracking these pages with some performance penalty, but is it worth having the ability to turn notifications off at the cost of the performance degradation? Thanks, Pavel
[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)
Trond Myklebust wrote: On Mon, 2007-09-17 at 18:16 +0400, Pavel Emelyanov wrote: Trond Myklebust wrote: On Mon, 2007-09-17 at 12:13 +0400, Pavel Emelyanov wrote: When the process is blocked on a mandatory lock and someone changes the inode's permissions, so that the lock is no longer mandatory, nobody wakes up the blocked process, but probably should. Please explain in more detail why we need this patch. From a "this fixes an OOPs/deadlock/leak" POV we do not. This is just an attempt to make the locking code more consistent and clean. Why do you think we get a deadlock or leak? AFAICS if the user turns off I didn't say that. mandatory locks on the file, then the existing locks default back into advisory locks which use the same notification mechanism as the mandatory locks. True. IOW: the process that is waiting in locks_mandatory_area() will be released as soon as the advisory lock is dropped. If that theory is broken in practice, then that is the bug that we need to fix. We neither want to add a load of locking crap to notify_change(), nor should we need to. We have this for inotify already. Adding a wakeup for mandatory locks is not that bad. Anyway - I noticed that the system state can become inconsistent and proposed a way to fix it. If this inconsistency is not a big deal, and nobody cares, then I'm fine with forgetting this patch, since I have no other arguments to defend it, but it is just not very nice without this patch. Cheers Trond Thanks, Pavel
[Devel] Re: [PATCH 0/4] Kernel memory accounting container (v3)
Christoph Lameter wrote: On Mon, 17 Sep 2007, Pavel Emelyanov wrote: As I have already said, kmalloc caches cannot be accounted easily, so turning the accounting on for them will fail with -EINVAL. Turning the accounting off is possible only if the cache has no objects. This is done so because turning accounting off implies unaccounting of all the objects in the cache, but since full pages in slub are not stored in any lists (usually) this is impossible to do; however, I'm open for discussion of how to make this work. Where can I find more information on why it would not be possible to account kmalloc caches? It is possible, but the problem is that we want to account only allocations explicitly caused by the user request. E.g. the vfsmount name is kmalloc-ed by explicit user request, but such things as buffer heads are not. So we have to explicitly specify which calls to kmalloc() we want to be accounted and which we do not by passing an additional flag (GFP_ACCT?) to kmalloc, but this is another patch. Yet again - as soon as we agree on the implementation of kmem caches accounting, I will proceed with working on kmalloc, vmalloc and the buddy allocator. Thanks, Pavel
[Devel] Re: [PATCH 0/4] Kernel memory accounting container (v3)
Christoph Lameter wrote: On Tue, 18 Sep 2007, Balbir Singh wrote: I've wondered the same thing and asked the question. Pavel wrote back to me saying The pages that are full of objects are not linked in any list in kmem_cache so we just cannot find them. That is true for any type of slab cache and not restricted to kmalloc slabs. SLUB can be switched into a mode where it provides these lists (again at a performance penalty). But I thought we generate the counters at alloc and free time? So why do we need to traverse the object lists? When we do echo 0 > /sys/slab/xxx/cache_notify we want all the objects to be unaccounted back immediately. Even __free_slab() won't catch this because the SLAB_NOTIFY flag will be turned off for this cache. So we have to walk all the objects and unaccount them.
[Devel] Re: [PATCH 1/4] Add notification about some major slab events
Christoph Lameter wrote: On Mon, 17 Sep 2007, Pavel Emelyanov wrote: @@ -1036,7 +1121,10 @@ static struct page *allocate_slab(struct page = alloc_pages_node(node, flags, s->order); if (!page) -return NULL; +goto out; + +if (slub_newpage_notify(s, page, flags) < 0) +goto out_free; mod_zone_page_state(page_zone(page), (s->flags & SLAB_RECLAIM_ACCOUNT) ? @@ -1044,6 +1132,11 @@ static struct page *allocate_slab(struct pages); return page; + +out_free: +__free_pages(page, s->order); +out: +return NULL; } Ok that looks sane. static void setup_object(struct kmem_cache *s, struct page *page, @@ -1136,6 +1229,8 @@ static void rcu_free_slab(struct rcu_hea static void free_slab(struct kmem_cache *s, struct page *page) { +slub_freepage_notify(s, page); + if (unlikely(s->flags & SLAB_DESTROY_BY_RCU)) { /* * RCU free overloads the RCU head over the LRU Ditto. @@ -1555,6 +1650,11 @@ static void __always_inline *slab_alloc( } local_irq_restore(flags); +if (object && slub_alloc_notify(s, object, gfpflags) < 0) { +kmem_cache_free(s, object); +return NULL; +} + if (unlikely((gfpflags & __GFP_ZERO) && object)) memset(object, 0, c->objsize); Please stay completely out of the fast path. No modifications to slab_alloc or slab_free please. It is possible to force all allocations of a particular slab of interest to use the slow path in __slab_alloc (maybe as a result of the slab page allocation hook returning a certain result code). See how the SLAB_DEBUG handling does it. You can adapt that and then do the object checks in __slab_alloc. That's true, but: 1. we perform only a flag check on the fast path 2. currently we cannot force the freeing of an object to go _always_ through the slow __slab_free(), and thus the following situation is possible: a. container A allocates an object and this object is accounted to it b. the object is freed and gets into the lockless freelist (but stays accounted to A) c. container C allocates this object from the freelist and thus gets an unaccounted amount of memory; this discrepancy can grow infinitely. Sure, we can mark some caches to go through the slow path even on freeing the objects, but isn't it the same as checking for SLAB_NOTIFY on the fast paths? Maybe it's worth having the notifiers under a config option, so that those who don't need this won't suffer at all? Thanks, Pavel
[Devel] Re: [PATCH] Hookup group-scheduler with task container infrastructure
On Mon, 10 Sep 2007 22:40:49 +0530 Srivatsa Vaddagiri [EMAIL PROTECTED] wrote: + tg->cfs_rq = kzalloc(sizeof(cfs_rq) * num_possible_cpus(), GFP_KERNEL); + if (!tg->cfs_rq) + goto err; + tg->se = kzalloc(sizeof(se) * num_possible_cpus(), GFP_KERNEL); + if (!tg->se) + goto err; Sorry for the very lazy response.. num_possible_cpus() just returns the # of possible cpus. It will not return the max cpu id to be used. I think just using NR_CPUS here is an easy way. Thanks, -Kame
[Devel] [PATCH][NETNS] Cleanup list walking in setup_net and cleanup_net
I proposed introducing a list_for_each_entry_continue_reverse macro to be used in setup_net() when unrolling after a failed ->init callback. Here is the macro and some more cleanup in setup_net() itself to remove one variable from the stack :) The same is done for cleanup_net() - the existing list_for_each_entry_reverse() is used. Minor, but the code looks nicer. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] Acked-by: Eric W. Biederman [EMAIL PROTECTED] --- diff --git a/include/linux/list.h b/include/linux/list.h index f29fc9c..ad9dcb9 100644 --- a/include/linux/list.h +++ b/include/linux/list.h @@ -525,6 +525,20 @@ static inline void list_splice_init_rcu( pos = list_entry(pos->member.next, typeof(*pos), member)) /** + * list_for_each_entry_continue_reverse - iterate backwards from the given point + * @pos: the type * to use as a loop cursor. + * @head: the head for your list. + * @member: the name of the list_struct within the struct. + * + * Start to iterate over list of given type backwards, continuing after + * the current position. + */ +#define list_for_each_entry_continue_reverse(pos, head, member) \ + for (pos = list_entry(pos->member.prev, typeof(*pos), member); \ + prefetch(pos->member.prev), &pos->member != (head); \ + pos = list_entry(pos->member.prev, typeof(*pos), member)) + +/** * list_for_each_entry_from - iterate over list of given type from the current point * @pos: the type * to use as a loop cursor. * @head: the head for your list.
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c index 1fc513c..0e6cb02 100644 --- a/net/core/net_namespace.c +++ b/net/core/net_namespace.c @@ -56,7 +56,6 @@ static void net_free(struct net *net) static void cleanup_net(struct work_struct *work) { struct pernet_operations *ops; - struct list_head *ptr; struct net *net; net = container_of(work, struct net, work); @@ -69,8 +68,7 @@ static void cleanup_net(struct work_stru net_unlock(); /* Run all of the network namespace exit methods */ - list_for_each_prev(ptr, &pernet_list) { - ops = list_entry(ptr, struct pernet_operations, list); + list_for_each_entry_reverse(ops, &pernet_list, list) { if (ops->exit) ops->exit(net); } @@ -102,7 +100,6 @@ static int setup_net(struct net *net) { /* Must be called with net_mutex held */ struct pernet_operations *ops; - struct list_head *ptr; int error; memset(net, 0, sizeof(struct net)); @@ -110,8 +107,7 @@ static int setup_net(struct net *net) atomic_set(&net->use_count, 0); error = 0; - list_for_each(ptr, &pernet_list) { - ops = list_entry(ptr, struct pernet_operations, list); + list_for_each_entry(ops, &pernet_list, list) { if (ops->init) { error = ops->init(net); if (error < 0) @@ -120,12 +116,12 @@ static int setup_net(struct net *net) } out: return error; + out_undo: /* Walk through the list backwards calling the exit functions * for the pernet modules whose init functions did not fail. */ - for (ptr = ptr->prev; ptr != &pernet_list; ptr = ptr->prev) { - ops = list_entry(ptr, struct pernet_operations, list); + list_for_each_entry_continue_reverse(ops, &pernet_list, list) { if (ops->exit) ops->exit(net); }
[Devel] Re: [PATCH] Hookup group-scheduler with task container infrastructure
On Tue, Sep 18, 2007 at 05:19:45PM +0900, KAMEZAWA Hiroyuki wrote: Srivatsa Vaddagiri [EMAIL PROTECTED] wrote: + tg->cfs_rq = kzalloc(sizeof(cfs_rq) * num_possible_cpus(), GFP_KERNEL); + if (!tg->cfs_rq) + goto err; + tg->se = kzalloc(sizeof(se) * num_possible_cpus(), GFP_KERNEL); + if (!tg->se) + goto err; Sorry for the very lazy response.. num_possible_cpus() just returns the # of possible cpus. It will not return the max cpu id to be used. I think just using NR_CPUS here is an easy way. Good catch! Yes, you are right, the pointer array needs to be NR_CPUS in size. On closer examination, I think I can simply use alloc_percpu() here. Will send a patch after some testing. -- Regards, vatsa
[Devel] Re: [net-2.6.24][patch 2/2] Dynamically allocate the loopback device
This would be a good opportunity to remove the single-allocated queue struct in netdevice (at the bottom) that we had to put in to accommodate the static loopback. Now we can set it back to a zero element list, and have alloc_netdev_mq() just allocate the number of queues requested, not num_queues - 1. I'll put a patch together based on this patchset. Thanks, -PJ Waskiewicz
[Devel] [PATCH] Consolidate sleeping routines in file locking code
This is the next step in the fs/locks.c cleanup before turning it into using struct pid *. This time I found that there are some places that do a similar thing - they try to apply a lock on a file and go to sleep on error till the blocker exits. All these places can be easily consolidated, saving 28 lines of code and more than 600 bytes from the .text, but there is one minor note. The locks_mandatory_area() code becomes a bit different after this patch - it no longer checks for the inode's permissions change. Nevertheless, this check is useless without another patch of mine that wakes the waiter up in notify_change(), which is not considered to be useful for now. Later, if we do need the fix with the wakeup, this can be easily merged with this patch. Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED] --- fs/locks.c | 122 +++-- 1 files changed, 47 insertions(+), 75 deletions(-) diff --git a/fs/locks.c b/fs/locks.c index cb1c977..8e849ed 100644 --- a/fs/locks.c +++ b/fs/locks.c @@ -659,6 +659,29 @@ static int locks_block_on_timeout(struct return result; } +typedef int (*lock_wait_fn)(struct file *, struct file_lock *, void *); + +static int do_lock_file_wait(struct file *filp, struct file_lock *fl, + lock_wait_fn lockfn, void *arg) +{ + int error; + + might_sleep(); + while (1) { + error = lockfn(filp, fl, arg); + if ((error != -EAGAIN) || !(fl->fl_flags & FL_SLEEP)) + break; + + error = wait_event_interruptible(fl->fl_wait, !fl->fl_next); + if (error) { + locks_delete_block(fl); + break; + } + } + + return error; +} + void posix_test_lock(struct file *filp, struct file_lock *fl) { @@ -720,7 +743,8 @@ next_task: * whether or not a lock was successfully freed by testing the return * value for -ENOENT.
*/ -static int flock_lock_file(struct file *filp, struct file_lock *request) +static +int flock_lock_file(struct file *filp, struct file_lock *request, void *x) { struct file_lock *new_fl = NULL; struct file_lock **before; @@ -1029,20 +1053,7 @@ EXPORT_SYMBOL(posix_lock_file); */ int posix_lock_file_wait(struct file *filp, struct file_lock *fl) { - int error; - might_sleep (); - for (;;) { - error = posix_lock_file(filp, fl, NULL); - if ((error != -EAGAIN) || !(fl->fl_flags & FL_SLEEP)) - break; - error = wait_event_interruptible(fl->fl_wait, !fl->fl_next); - if (!error) - continue; - - locks_delete_block(fl); - break; - } - return error; + return do_lock_file_wait(filp, fl, (lock_wait_fn)posix_lock_file, NULL); } EXPORT_SYMBOL(posix_lock_file_wait); @@ -1085,12 +1096,17 @@ int locks_mandatory_locked(struct inode * This function is called from rw_verify_area() and * locks_verify_truncate(). */ + +static int lock_mandatory_fn(struct file *filp, struct file_lock *fl, void *arg) +{ + return __posix_lock_file((struct inode *)arg, fl, NULL); +} + int locks_mandatory_area(int read_write, struct inode *inode, struct file *filp, loff_t offset, size_t count) { struct file_lock fl; - int error; locks_init_lock(&fl); fl.fl_owner = current->files; @@ -1103,27 +1119,12 @@ int locks_mandatory_area, fl.fl_start = offset; fl.fl_end = offset + count - 1; - for (;;) { - error = __posix_lock_file(inode, &fl, NULL); - if (error != -EAGAIN) - break; - if (!(fl.fl_flags & FL_SLEEP)) - break; - error = wait_event_interruptible(fl.fl_wait, !fl.fl_next); - if (!error) { - /* - * If we've been sleeping someone might have - * changed the permissions behind our back. - */ - if (__mandatory_lock(inode)) - continue; - } - - locks_delete_block(&fl); - break; - } - - return error; + /* + * If we've been sleeping someone might have changed the permissions + * behind our back. However, nobody wakes us up, so go on spinning
+ * here till the blocker dies.
+ */ + return do_lock_file_wait(filp, &fl, lock_mandatory_fn, inode); } EXPORT_SYMBOL(locks_mandatory_area); @@ -1517,20 +1518,7 @@ out_unlock: */ int flock_lock_file_wait(struct file *filp, struct file_lock *fl) { - int error; - might_sleep(); - for (;;) { - error = flock_lock_file(filp, fl); - if ((error != -EAGAIN) ||
[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)
On Tue, Sep 18, 2007 at 10:33:26AM +0400, Pavel Emelyanov wrote: Trond Myklebust wrote: IOW: the process that is waiting in locks_mandatory_area() will be released as soon as the advisory lock is dropped. If that theory is broken in practice, then that is the bug that we need to fix. We neither want to add a load of locking crap to notify_change(), nor should we need to. We have this for inotify already. Adding a wakeup for mandatory locks is not that bad. Anyway - I noticed that the system state can become inconsistent and proposed a way to fix it. If this inconsistency is not a big deal, and nobody cares, then I'm fine with forgetting this patch, since I have no other arguments to defend it, but it is just not very nice without this patch. Maybe this should be documented, e.g. in fcntl(2). I'm not sure exactly what we'd say--we probably don't want to commit to the current behavior. Maybe something like "behavior is undefined when setting or clearing mandatory locking on a file while it is locked". --b.
[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)
On Tue, Sep 18, 2007 at 12:14:55PM -0400, Trond Myklebust wrote: Note also that strictly speaking, we're not even compliant with the System V behaviour on read() and write(). See: http://www.unix.org.ua/orelly/networking_2ndEd/nfs/ch11_01.htm and http://docs.sun.com/app/docs/doc/801-6736/6i13fom0a?l=en&a=view&q=mandatory+lock According to these docs, we should be wrapping each and every read() and write() syscall with a mandatory lock. The fact that we're not, and yet still not seeing any complaints just goes to show how few people are actually using and relying on this... So currently there's nothing to prevent this: - write passes locks_mandatory_area() checks - get mandatory lock - read old data - write updates file data - read new data You can see the data change even while you hold a mandatory lock that should exclude writes. Similarly you might think that an application could prevent anyone from seeing the intermediate state of a file while it performs a series of writes under an exclusive mandatory lock, but actually there's nothing to stop a read in progress from racing with acquisition of the lock. Unless I'm missing something, that makes our mandatory lock implementation pretty pointless. I wish we could either fix it or just ditch it, but I suppose either option would be unpopular. --b.
[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)
On Tue, 2007-09-18 at 12:52 -0400, J. Bruce Fields wrote: On Tue, Sep 18, 2007 at 12:14:55PM -0400, Trond Myklebust wrote: Note also that strictly speaking, we're not even compliant with the System V behaviour on read() and write(). See: http://www.unix.org.ua/orelly/networking_2ndEd/nfs/ch11_01.htm and http://docs.sun.com/app/docs/doc/801-6736/6i13fom0a?l=en&a=view&q=mandatory+lock According to these docs, we should be wrapping each and every read() and write() syscall with a mandatory lock. The fact that we're not, and yet still not seeing any complaints just goes to show how few people are actually using and relying on this... So currently there's nothing to prevent this: - write passes locks_mandatory_area() checks - get mandatory lock - read old data - write updates file data - read new data You can see the data change even while you hold a mandatory lock that should exclude writes. Similarly you might think that an application could prevent anyone from seeing the intermediate state of a file while it performs a series of writes under an exclusive mandatory lock, but actually there's nothing to stop a read in progress from racing with acquisition of the lock. Unless I'm missing something, that makes our mandatory lock implementation pretty pointless. I wish we could either fix it or just ditch it, but I suppose either option would be unpopular. It gets even better when you throw mmap() into the mix :-) Trond
[Devel] Re: [PATCH 20/33] memory controller add documentation
On Mon, 17 Sep 2007 14:03:27 -0700 Paul Menage wrote: From: Balbir Singh [EMAIL PROTECTED] (container-cgroup renaming by Paul Menage [EMAIL PROTECTED]) Signed-off-by: Balbir Singh [EMAIL PROTECTED] Signed-off-by: Paul Menage [EMAIL PROTECTED] --- Documentation/controllers/memory.txt | 259 + 1 files changed, 259 insertions(+) diff -puN /dev/null Documentation/controllers/memory.txt --- /dev/null +++ a/Documentation/controllers/memory.txt @@ -0,0 +1,259 @@ +Benefits and Purpose of the memory controller + +The memory controller isolates the memory behaviour of a group of tasks +from the rest of the system. The article on LWN [12] mentions some probable +uses of the memory controller. The memory controller can be used to + +a. Isolate an application or a group of applications + Memory hungry applications can be isolated and limited to a smaller + amount of memory. +b. Create a cgroup with limited amount of memory, this can be used , makes run-on sentence. Please use ; or 2 sentences. + as a good alternative to booting with mem=. +c. Virtualization solutions can control the amount of memory they want + to assign to a virtual machine instance. +d. A CD/DVD burner could control the amount of memory used by the + rest of the system to ensure that burning does not fail due to lack + of available memory. +e. There are several other use cases, find one or use the controller just Ditto. + for fun (to learn and hack on the VM subsystem). [snip] + +2.1. Design + +The core of the design is a counter called the res_counter. The res_counter +tracks the current memory usage and limit of the group of processes associated +with the controller. Each cgroup has a memory controller specific data +structure (mem_cgroup) associated with it. + +2.2. Accounting + [snip] + +The accounting is done as follows: mem_cgroup_charge() is invoked to setup +the necessary data structures and check if the cgroup that is being charged +is over its limit. If it is then reclaim is invoked on the cgroup. 
+More details can be found in the reclaim section of this document. +If everything goes well, a page meta-data-structure called page_cgroup is ^drop second hyphen (use space) +allocated and associated with the page. This routine also adds the page to +the per cgroup LRU. + +2.2.1 Accounting details + +All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted. +RSS pages are accounted at the time of page_add_*_rmap() unless they've already +been accounted for earlier. A file page will be accounted for as Page Cache; +it's mapped into the page tables of a process, duplicate accounting is carefully , makes run-on sentence... +avoided. Page Cache pages are accounted at the time of add_to_page_cache(). +The corresponding routines that remove a page from the page tables or removes s/removes/remove/ +a page from Page Cache is used to decrement the accounting counters of the +cgroup. + +2.3 Shared Page Accounting + +Shared pages are accounted on the basis of the first touch approach. The +cgroup that first touches a page is accounted for the page. The principle +behind this approach is that a cgroup that aggressively uses a shared +page will eventually get charged for it (once it is uncharged from +the cgroup that brought it in -- this will happen on memory pressure). + +2.4 Reclaim + +Each cgroup maintains a per cgroup LRU that consists of an active +and inactive list. When a cgroup goes over its limit, we first try +to reclaim memory from the cgroup so as to make space for the new +pages that the cgroup has touched. If the reclaim is unsuccessful, +an OOM routine is invoked to select and kill the bulkiest task in the +cgroup. + +The reclaim algorithm has not been modified for cgroups, except that +pages that are selected for reclaiming come from the per cgroup LRU +list. + +2. Locking 2. ?? Section numbering is off. +The memory controller uses the following hierarchy + +1. zone->lru_lock is used for selecting pages to be isolated +2. 
mem->lru_lock protects the per cgroup LRU +3. lock_page_cgroup() is used to protect page->page_cgroup + +3. User Interface + +0. Configuration 0. Kernel build configuration + +a. Enable CONFIG_CGROUPS +b. Enable CONFIG_RESOURCE_COUNTERS +c. Enable CONFIG_CGROUP_MEM_CONT [snip] --- ~Randy
[Devel] [patch 1/1] fix bad netif_carrier_off place
From: Daniel Lezcano [EMAIL PROTECTED] If netif_carrier_off is called before register_netdev, it will use and generate an event for a non-initialized network device, and that leads to an Oops. I moved the netif_carrier_off from the setup function to after each register_netdev call. Signed-off-by: Daniel Lezcano [EMAIL PROTECTED] --- drivers/net/veth.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) Index: net-2.6.24/drivers/net/veth.c === --- net-2.6.24.orig/drivers/net/veth.c +++ net-2.6.24/drivers/net/veth.c @@ -286,7 +286,6 @@ static void veth_setup(struct net_device dev->features |= NETIF_F_LLTX; dev->init = veth_dev_init; dev->destructor = veth_dev_free; - netif_carrier_off(dev); } /* @@ -357,6 +356,8 @@ static int veth_newlink(struct net_devic if (err < 0) goto err_register_peer; + netif_carrier_off(peer); + /* * register dev last * @@ -382,6 +383,8 @@ static int veth_newlink(struct net_devic if (err < 0) goto err_register_dev; + netif_carrier_off(dev); + /* * tie the devices together */ --
[Devel] [patch 0/1] fix veth netif_carrier_off
When I tried the veth driver, I fell into a kernel oops.

qemu login: Oops:  [#1]
Modules linked in:
CPU:    0
EIP:    0060:[c0265c9e]    Not tainted VLI
EFLAGS: 0202   (2.6.23-rc6-g754f885d-dirty #33)
EIP is at __linkwatch_run_queue+0x6a/0x175
eax: c7fc9550   ebx: 6b6b6b6b   ecx: c3360c80   edx: 0246
esi: 0001   edi: 6b6b6b6b   ebp: c7fd9f7c   esp: c7fd9f5c
ds: 007b   es: 007b   fs:   gs:   ss: 0068
Process events/0 (pid: 5, ti=c7fd8000 task=c7fc9550 task.ti=c7fd8000)
Stack: c7fee5a8 c0387680 c7fd9f74 c02e1aaa 4f732564 c0387684 c7fee5a8 c0387680
       c7fd9f84 c0265dc9 c7fd9fac c011fb3c c7fd9f94 c02e277e c7fd9fac c02e1166
       c0265da9 c7fee5a8 c0120203 c7fd9fc8 c7fd9fd0 c01202ba c7fc9550
Call Trace:
 [c0102c69] show_trace_log_lvl+0x1a/0x2f
 [c0102d1b] show_stack_log_lvl+0x9d/0xa5
 [c0102ee1] show_registers+0x1be/0x28f
 [c010309a] die+0xe8/0x208
 [c010d5a1] do_page_fault+0x4ba/0x595
 [c02e2842] error_code+0x6a/0x70
 [c0265dc9] linkwatch_event+0x20/0x27
 [c011fb3c] run_workqueue+0x7c/0x102
 [c01202ba] worker_thread+0xb7/0xc5
 [c012270c] kthread+0x39/0x61
 [c0102913] kernel_thread_helper+0x7/0x10
 ===
Code: b8 60 76 38 c0 e8 e3 ca 07 00 b8 60 76 38 c0 8b 1d 78 a7 3d c0 c7 05 78
a7 3d c0 00 00 00 00 e8 df ca 07 00 e9 ed 00 00 00 85 f6 8b bb f4 01 00 00 74
17 89 d8 e8 73 fe ff ff 85 c0 75 0c 89 d8
EIP: [c0265c9e] __linkwatch_run_queue+0x6a/0x175 SS:ESP 0068:c7fd9f5c

Slab corruption: size-2048 start=c473eac8, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [c025be72](free_netdev+0x1f/0x41)
200: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b c0 e2 73 c4
Prev obj: start=c473e2b0, len=2048
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [c025bed0](alloc_netdev_mq+0x3c/0xa1)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 76 65 74 68 30 00 00 00 00 00 00 00 00 00 00 00
Next obj: start=c473f2e0, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [c0260e69](neigh_sysctl_unregister+0x2b/0x2e)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

That happens when trying to add the veth driver using the ip command:

	ip link add veth0

which fails. It appears that netif_carrier_off() is placed in the setup
function, and this one is called before register_netdevice(). The
register_netdevice() function does a lot of initialization of the netdev,
and if netif_carrier_off() is called before register_netdev(), it will use,
and trigger an event for, an uninitialized netdev.
[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)
On Tue, Sep 18, 2007 at 12:54:56PM -0400, Trond Myklebust wrote:

On Tue, 2007-09-18 at 12:52 -0400, J. Bruce Fields wrote:

So currently there's nothing to prevent this:

	- write passes locks_mandatory_area() checks
	- get mandatory lock
	- read old data
	- write updates file data
	- read new data

You can see the data change even while you hold a mandatory lock that should
exclude writes. Similarly, you might think that an application could prevent
anyone from seeing the intermediate state of a file while it performs a series
of writes under an exclusive mandatory lock, but actually there's nothing to
stop a read in progress from racing with acquisition of the lock.

Unless I'm missing something, that makes our mandatory lock implementation
pretty pointless. I wish we could either fix it or just ditch it, but I
suppose either option would be unpopular.

It gets even better when you throw mmap() into the mix :-)

Hm. Documentation/mandatory.txt claims that mandatory locks and mmap() with
MAP_SHARED exclude each other, but I can't see where that's enforced. That
file doesn't make any mention of the above race.

So for now I think someone should update that file and fcntl(2) to mention
these problems and to recommend rather strongly against using mandatory
locking.

--b.
[Devel] Re: [PATCH 1/3] Signal semantics for /sbin/init
Oleg Nesterov [EMAIL PROTECTED] wrote:
| On 09/13, [EMAIL PROTECTED] wrote:
| |
| | Oleg Nesterov [EMAIL PROTECTED] wrote:
| |
| | Notes:
| |
| |	- Blocked signals are never ignored, so init still can receive
| |	  a pending blocked signal after sigprocmask(SIG_UNBLOCK).
| |	  Easy to fix, but probably we can ignore this issue.
| |
| | I was wrong. This should be fixed right now. I _think_ this is easy,
| | and I was going to finish this patch yesterday, but - sorry! - I just
| | can't switch to kernel mode these days, I am fighting with some urgent
| | tasks on my paid job.
| |
| | To respect the current init semantic,
| |
| | The current init semantic is broken in many ways ;)
| |
| | shouldn't we discard any unblockable
| | signal (STOP and KILL) sent by a process to its pid namespace init process ?
|
| Yes. And Patch 1/3 (Oleg's patch) in the set I sent, handles this already
| (since STOP and KILL are never in the task->blocked list)
| |
| | Then, all other signals should be handled appropriately by the pid namespace
| | init.
| |
| | Yes, I think you are probably right, this should be enough in practice. After all,
| | only root can send the signal to /sbin/init.
|
| I agree - the assumption that the container-init will handle these
| other signals, simplifies the kernel implementation for now.
| |
| | On my machine, /proc/1/status shows that init doesn't have a handler for
| | non-ignored SIGUNUSED == 31, though.
| |
| | But who knows? The kernel promises some guarantees, it is not good to break them.
| | Perhaps some strange non-standard environment may suffer.
| |
| | We are assuming that the pid namespace init is not doing anything silly and
| | I guess it's OK if the consequences are only on its pid namespace and
| | not the whole system.
| |
| | The sub-namespace case is very easy afaics, we only need the signal comes from
| | the parent namespace check, not a problem if we make the decision on the sender's
| | path, like this patch does.
| | Yes, patches 2 and 3 of the set already do the ancestor-ns check. no ?
|
| Yes, I think patches 2-3 are good. But this patch is not. I thought that we
| can ignore the Blocked signals are never ignored problem, now I am not sure.
| It is possible that init temporarily blocks a signal which it is not going to
| handle.
|
| Perhaps we can do something like the patch below, but I don't like it. With
| this patch, we check the signal handler even if /sbin/init blocks the signal.
| This makes the semantics a bit strange for /sbin/init. Hopefully not a problem
| in practice, but still not good.

I think this is one step ahead of what we were discussing last week. A
container-init that does not have a handler for a fatal signal would survive
even if the signal is posted when it is blocked.

|
| Unfortunately, I don't know how to make it better. The problem with blocked
| signals is that we don't know who is the sender of the signal at the time
| when the signal is unblocked.

One solution I was thinking of was to possibly queue pending blocked signals
to a container-init separately and then requeue them on the normal queue when
signals are unblocked. It's definitely not an easier solution, but might be
less intrusive than the "signal from parent ns" flag solution.

i.e. suppose we have:

	struct pid_namespace {
		...
		struct sigpending cinit_blocked_pending;
		struct sigpending cinit_blocked_shared_pending;
	}

Signals from ancestor ns are queued as usual on task->pending and
task->signal->shared_pending. They don't need any special handling. Only
signals posted to a container-init from within its namespace need special
handling (as in: ignore unhandled fatal signals from same namespace).

If the container-init has say SIGUSR1 blocked, and a descendant of
container-init posts SIGUSR1 to container-init, queue the SIGUSR1 in
pid_namespace->cinit_blocked_pending.
When container-init unblocks SIGUSR1, check if there was a pending SIGUSR1
from the same namespace (i.e. check the ->cinit_blocked_pending list). If
there was, and container-init has a handler for SIGUSR1, post SIGUSR1 on the
task->pending queue and let the container-init handle SIGUSR1. If there was
a SIGUSR1 posted to container-init and there is no handler for SIGUSR1, then
just ignore the SIGUSR1 (since it would be fatal otherwise).

I chose 'struct pid_namespace' for the temporary queue, since we need the
temporary queues only for container-inits (not for all processes). And having
it allocated ahead of time ensures we can queue the signal even under
low-memory conditions.

Just an idea at this point.

|
| What do you think? Can we live with this oddity? Otherwise, we have to add
| something like the "the signal is from the parent namespace" flag, and I bet
| this is not trivial to implement correctly.

I think it's reasonable to place
[Devel] Re: [PATCH 2/4] Switch caches notification dynamically
On Tue, 18 Sep 2007, Pavel Emelyanov wrote:

 I meant that we cannot find the pages that are full of objects to notify
 others that these ones are no longer tracked. I know that we can do it by
 tracking these pages with some performance penalty, but is it worth having
 the ability to turn notifications off at the cost of that performance
 degradation?

Not sure. That may be your call.
[Devel] Re: [PATCH][NETNS] Cleanup list walking in setup_net and cleanup_net
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Tue, 18 Sep 2007 12:06:53 +0400

 I proposed introducing a list_for_each_entry_continue_reverse macro to be
 used in setup_net() when unrolling the failed ->init callback. Here is the
 macro and some more cleanup in setup_net() itself to remove one variable
 from the stack :)

 The same is done for cleanup_net() - the existing
 list_for_each_entry_reverse() is used. Minor, but the code looks nicer.

 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
 Acked-by: Eric W. Biederman [EMAIL PROTECTED]

Applied, thanks Pavel.