[Devel] Re: [PATCH 5/5][NFS] Cleanup explicit check for mandatory locks

2007-09-18 Thread Pavel Emelyanov
Trond Myklebust wrote:
 On Mon, 2007-09-17 at 11:57 +0400, Pavel Emelyanov wrote:
 The __mandatory_lock(inode) macro makes the same check, but
 makes the code more readable.
 
 Could we please avoid using underscores in macros. Also, why are we
 breaking the usual convention of capitalising macro names?

Sorry, I forgot to update the whole changelog - this is not a macro,
but a static inline function. The underscores are there because
mandatory_lock() already exists and additionally checks
for IS_MANDLOCK(inode).
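A minimal sketch of the pair, assuming the usual fs.h definitions:

	static inline int __mandatory_lock(struct inode *ino)
	{
		/* setgid set, group-execute clear: the mandatory-locking marker */
		return (ino->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID;
	}

	static inline int mandatory_lock(struct inode *ino)
	{
		/* the full check also requires the mount to allow mandatory locks */
		return IS_MANDLOCK(ino) && __mandatory_lock(ino);
	}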

Thanks,
Pavel

 Cheers
   Trond
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
 Cc: Trond Myklebust [EMAIL PROTECTED]

 ---

  fs/nfs/file.c |3 +--
  1 files changed, 1 insertion(+), 2 deletions(-)

 diff --git a/fs/nfs/file.c b/fs/nfs/file.c
 index 73ddd2e..7a07be1 100644
 --- a/fs/nfs/file.c
 +++ b/fs/nfs/file.c
 @@ -605,8 +605,7 @@ static int nfs_lock(struct file *filp, i
 	nfs_inc_stats(inode, NFSIOS_VFSLOCK);
 
 	/* No mandatory locks over NFS */
 -	if ((inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID &&
 -	    fl->fl_type != F_UNLCK)
 +	if (__mandatory_lock(inode) && fl->fl_type != F_UNLCK)
 		return -ENOLCK;
 
 	if (IS_GETLK(cmd))

 
 



[Devel] Re: [PATCH 4/4] Account for the slub objects

2007-09-18 Thread Pavel Emelyanov
Dave Hansen wrote:
 On Mon, 2007-09-17 at 16:35 +0400, Pavel Emelyanov wrote:
 +
 +	rcu_read_lock();
 +	cnt = task_kmem_container(current);
 +	if (res_counter_charge(&cnt->res, s->size))
 +		goto err_locked;
 +
 +	css_get(&cnt->css);
 +	rcu_read_unlock();
 +	obj_container[slab_index(obj, s, page_address(pg))] = cnt;
 
 You made some effort in the description, but could we get some big fat
 comments here about what RCU is doing?

No big fat comment here - this is all the containers API.

 -- Dave
 
 



[Devel] Re: [PATCH 4/4] Account for the slub objects

2007-09-18 Thread Pavel Emelyanov
Dave Hansen wrote:
 On Mon, 2007-09-17 at 16:35 +0400, Pavel Emelyanov wrote:
 The struct page gets an extra pointer (just like it has with
 the RSS controller) and this pointer points to the array of
 the kmem_container pointers - one for each object stored on
 that page itself.
 
 Can't these at least be unioned so we don't make it any _worse_ when
 both are turned on?

Your comment makes me suspect that you didn't look at the patches
at all :(

  They ARE unioned.
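For reference, a sketch of the union as I read it from the patches - the
field names here are illustrative, not necessarily the real ones:

	struct page {
		/* ... existing fields ... */
		union {
			struct page_container *rss_container;	/* RSS controller */
			struct kmem_container **cnts;		/* kmem controller,
								 * one per object */
		};
	};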

 -- Dave
 
 



[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod

2007-09-18 Thread Pavel Emelyanov
J. Bruce Fields wrote:
 On Mon, Sep 17, 2007 at 10:37:56AM +0400, Pavel Emelyanov wrote:
 J. Bruce Fields wrote:
 Is there a small chance that a lock may be applied after this check:

 +	mandatory = (inode->i_flock && MANDATORY_LOCK(inode));
 +
 but early enough that someone can still block on the lock while the file
 is still marked for mandatory locking?  (And is the inode-i_flock check
 there really necessary?)
 There is, but as you have noticed:
 
 OK, but why not just remove the inode->i_flock check there?  I can't see
 how it helps anyway.
 
 Well, there are probably worse races in the mandatory locking code.
 ...there are. The inode->i_flock list is protected with lock_kernel() only
 and is not in sync with any other checks on inodes. This is sad :(
 but proper locking for file locks is yet to be done...
 
 I would also prefer a locking scheme that didn't rely on the BKL.  That
 said, except for this race:

I would as well :) But I don't know the locking code well enough to
start fixing it. Besides, even if I send a patch series that handles this,
I don't think anyone will accept it - it changes too much
code, can you prove you fixed all the places, and so on...

 (For example, my impression is that a mandatory lock can be applied just
 after the locks_mandatory_area() checks but before the io actually
 completes.)
 
 ... I'm not aware of other races in the existing file-locking code.  It
 sounds like you might be.  Could you give specific examples?

Well, there's a long-standing BUG in the leases code - after we have made
all the checks when inserting a lease, we call locks_alloc_lock() and may
fall asleep. But after the wakeup nobody re-checks whether things have changed.

I suspect there are other bad places.

 --b.
 



[Devel] Re: [PATCH 2/4] Switch caches notification dynamically

2007-09-18 Thread Pavel Emelyanov
Christoph Lameter wrote:
 On Mon, 17 Sep 2007, Pavel Emelyanov wrote:
 
 If we turn accounting on for some cache and this cache
 is merged with some other, that other cache will be notified
 as well. We can solve this by disabling cache merging,
 but maybe we can do it some other way.
 
 You could write a 1 to slub_nomerge during bootup if containers are to 
 be supported? Once they are merged it is going to be difficult to separate 
 them again.

Yes. I also thought that disabling the merge is the only way to
handle this case. Thanks.



[Devel] Re: [PATCH 2/4] Switch caches notification dynamically

2007-09-18 Thread Pavel Emelyanov
Christoph Lameter wrote:
 On Mon, 17 Sep 2007, Pavel Emelyanov wrote:
 
  struct kmem_cache kmalloc_caches[PAGE_SHIFT] __cacheline_aligned;
  EXPORT_SYMBOL(kmalloc_caches);
  
 +static inline int is_kmalloc_cache(struct kmem_cache *s)
 +{
 +	int km_idx;
 +
 +	km_idx = s - kmalloc_caches;
 +	return km_idx >= 0 && km_idx < ARRAY_SIZE(kmalloc_caches);
 +}
 
 
 Could be as simple as
 
 	return s >= kmalloc_caches &&
 	       s < kmalloc_caches + ARRAY_SIZE(kmalloc_caches);
 
 
 +	if (buf[0] == '0') {
 +		if (any_slab_objects(s))
 +			/*
 +			 * we cannot turn this off because the
 +			 * full slabs cannot be found in this case
 +			 */
 +			return -EBUSY;
 
 The full slabs can be checked by subtracting the partial slabs from the 
 allocated slabs in the per node structure.

No no! That is not what I meant here. This is just like turning
the redzoning on/off dynamically.

I meant that we cannot find the pages that are full of objects, to notify
others that these ones are no longer tracked. I know that we can do
it by tracking these pages with some performance penalty, but is the
ability to turn notifications off worth the cost of the performance
degradation?

Thanks,
Pavel



[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)

2007-09-18 Thread Pavel Emelyanov
Trond Myklebust wrote:
 On Mon, 2007-09-17 at 18:16 +0400, Pavel Emelyanov wrote:
 Trond Myklebust wrote:
 On Mon, 2007-09-17 at 12:13 +0400, Pavel Emelyanov wrote:
 When a process is blocked on a mandatory lock and someone changes
 the inode's permissions, so that the lock is no longer mandatory,
 nobody wakes up the blocked process, though probably somebody should.
 Please explain in more detail why we need this patch.
 From a "this fixes an OOPs/deadlock/leak" POV we do not. This is
 just an attempt to make the locking code more consistent and
 clean.
 
 Why do you think we get a deadlock or leak? AFAICS if the user turns off

I didn't say that.

 mandatory locks on the file, then the existing locks default back into
 advisory locks which use the same notification mechanism as the
 mandatory locks.

True.

 IOW: the process that is waiting in locks_mandatory_area() will be
 released as soon as the advisory lock is dropped. If that theory is
 broken in practice, then that is the bug that we need to fix. We neither
 want to add a load of locking crap to notify_change(), nor should we
 need to.

We have this for inotify already. Adding wakeup for mandatory lock
is not that bad.

Anyway - I noticed that the system state can become inconsistent
and proposed a way to fix it. If this inconsistency is not a big
deal and nobody cares, then I'm fine with dropping this patch,
since I have no other arguments to defend it; it's just not
very nice without this patch.

 Cheers
Trond
 
 

Thanks,
Pavel



[Devel] Re: [PATCH 0/4] Kernel memory accounting container (v3)

2007-09-18 Thread Pavel Emelyanov
Christoph Lameter wrote:
 On Mon, 17 Sep 2007, Pavel Emelyanov wrote:
 
 As I have already said, kmalloc caches cannot be accounted easily,
 so turning the accounting on for them will fail with -EINVAL.
 Turning the accounting off is possible only if the cache has
 no objects. This is done so because turning accounting off
 implies unaccounting of all the objects in the cache, but since
 full pages in slub are not stored in any lists (usually),
 this is impossible to do; however, I'm open for discussion
 of how to make this work.
 
 Where can I find more information on why it would not be possible to
 account kmalloc caches?

It is possible, but the problem is that we want to account only
allocations explicitly caused by a user request. E.g. the
vfsmount name is kmalloc-ed by explicit user request, but such
things as buffer heads are not.

So we have to explicitly specify which calls to kmalloc() we
want to be accounted and which we do not, by passing an additional
flag (GFP_ACCT?) to kmalloc(), but this is another patch.
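A purely hypothetical sketch of what I mean - no GFP_ACCT exists yet,
and the flag value and call sites below are illustrative only:

	#define __GFP_ACCT	((gfp_t)0x800000u)	/* hypothetical gfp flag */

	/* explicit user request (e.g. the vfsmount name): account it */
	name = kmalloc(len + 1, GFP_KERNEL | __GFP_ACCT);

	/* kernel-internal allocation: leave it unaccounted */
	buf = kmalloc(size, GFP_KERNEL);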

Yet again - as soon as we agree on the implementation of kmem
cache accounting, I will proceed with working on kmalloc, vmalloc
and the buddy allocator.

 

Thanks,
Pavel



[Devel] Re: [PATCH 0/4] Kernel memory accounting container (v3)

2007-09-18 Thread Pavel Emelyanov
Christoph Lameter wrote:
 On Tue, 18 Sep 2007, Balbir Singh wrote:
 
 I've wondered the same thing and asked the question. Pavel wrote
 back to me saying

 The pages that are full of objects are not linked in any list
 in kmem_cache so we just cannot find them.
 
 That is true for any type of slab cache and is not restricted to kmalloc 
 slabs. SLUB can be switched into a mode where it provides these lists 
 (again at a performance penalty).
 
 But I thought we generate the counters at alloc and free time? So why do 
 we need to traverse the object lists?

When we do echo 0 > /sys/slab/xxx/cache_notify, we want all the objects
to be unaccounted back immediately. Even __free_slab() won't catch this,
because the SLAB_NOTIFY flag will already be turned off for this cache. So we
have to walk all the objects and unaccount them.
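For illustration, the walk itself would be simple if only we could reach
every slab page of the cache; a hypothetical sketch (for_each_object is
the existing slub helper, slub_free_notify() stands for the per-object
unaccounting hook and is illustrative):

	static void unaccount_slab_page(struct kmem_cache *s, struct page *page)
	{
		void *addr = page_address(page);
		void *p;

		/* drop the charge of every object on this slab */
		for_each_object(p, s, addr)
			slub_free_notify(s, p);
	}

The problem is precisely that the full slabs are not on any list, so
there is nothing to iterate over to call this for each page.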



[Devel] Re: [PATCH 1/4] Add notification about some major slab events

2007-09-18 Thread Pavel Emelyanov
Christoph Lameter wrote:
 On Mon, 17 Sep 2007, Pavel Emelyanov wrote:
 
 @@ -1036,7 +1121,10 @@ static struct page *allocate_slab(struct
 	page = alloc_pages_node(node, flags, s->order);
 
 	if (!page)
 -		return NULL;
 +		goto out;
 +
 +	if (slub_newpage_notify(s, page, flags) < 0)
 +		goto out_free;
 
 	mod_zone_page_state(page_zone(page),
 			(s->flags & SLAB_RECLAIM_ACCOUNT) ?
 @@ -1044,6 +1132,11 @@ static struct page *allocate_slab(struct
 			pages);
 
 	return page;
 +
 +out_free:
 +	__free_pages(page, s->order);
 +out:
 +	return NULL;
  }
 
 Ok that looks sane.
 
  static void setup_object(struct kmem_cache *s, struct page *page,
 @@ -1136,6 +1229,8 @@ static void rcu_free_slab(struct rcu_hea
 
  static void free_slab(struct kmem_cache *s, struct page *page)
  {
 +	slub_freepage_notify(s, page);
 +
 	if (unlikely(s->flags & SLAB_DESTROY_BY_RCU)) {
 		/*
 		 * RCU free overloads the RCU head over the LRU
 
 Ditto.
 
 @@ -1555,6 +1650,11 @@ static void __always_inline *slab_alloc(
 	}
 	local_irq_restore(flags);
 
 +	if (object && slub_alloc_notify(s, object, gfpflags) < 0) {
 +		kmem_cache_free(s, object);
 +		return NULL;
 +	}
 +
 	if (unlikely((gfpflags & __GFP_ZERO) && object))
 		memset(object, 0, c->objsize);
  
 
 Please stay completely out of the fast path. No modifications to 
 slab_alloc or slab_free please. It is possible to force all allocations of 
 a particular slab of interest to use the slow path in __slab_alloc (maybe 
 as a result of the slab page allocation hook returning a certain result 
 code). See how the SLAB_DEBUG handling does it. You can adapt that and then
 do the object checks in __slab_alloc.

That's true, but:
1. we only perform a flag check on the fast path;
2. currently we cannot force the freeing of an object to _always_ go
   through the slow __slab_free(), and thus the following situation is
   possible:
	a. container A allocates an object, and this object is
	   accounted to it;
	b. the object is freed and gets into the lockless freelist
	   (but stays accounted to A);
	c. container C allocates this object from the freelist and
	   thus gets an unaccounted amount of memory.
   This discrepancy can grow infinitely (see the sketch below). Sure,
   we can mark some caches to go through the slow path even on freeing
   the objects, but isn't that the same as checking for SLAB_NOTIFY on
   the fast paths?
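To make the race concrete, here is a sketch against the current slub
fast paths (c is the per-cpu kmem_cache_cpu; the charging itself is done
by the notifiers from this patch set):

	/* a. container A allocates: the alloc notifier charges A */
	object = c->freelist;
	c->freelist = object[c->offset];

	/* b. fast-path free pushes the object back onto the per-cpu
	 *    lockless freelist; no notifier runs here, so the charge
	 *    stays with A */
	object[c->offset] = c->freelist;
	c->freelist = object;

	/* c. a later fast-path alloc in container C pops the very same
	 *    object; if the notifier ran only on the slow path, C would
	 *    get memory that is still charged to A */
	object = c->freelist;
	c->freelist = object[c->offset];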

Maybe it's worth putting the notifiers under a config option, so that
those who don't need this won't suffer at all?

Thanks,
Pavel



[Devel] Re: [PATCH] Hookup group-scheduler with task container infrastructure

2007-09-18 Thread KAMEZAWA Hiroyuki
On Mon, 10 Sep 2007 22:40:49 +0530
Srivatsa Vaddagiri [EMAIL PROTECTED] wrote:
 +	tg->cfs_rq = kzalloc(sizeof(cfs_rq) * num_possible_cpus(), GFP_KERNEL);
 +	if (!tg->cfs_rq)
 +		goto err;
 +	tg->se = kzalloc(sizeof(se) * num_possible_cpus(), GFP_KERNEL);
 +	if (!tg->se)
 +		goto err;
Sorry for the very lazy response...

num_possible_cpus() just returns the number of possible cpus; it does not
return the maximum cpu id that will be used. I think just using NR_CPUS here
is the easy way.

Thanks,
-Kame



[Devel] [PATCH][NETNS] Cleanup list walking in setup_net and cleanup_net

2007-09-18 Thread Pavel Emelyanov
I proposed introducing a list_for_each_entry_continue_reverse macro
to be used in setup_net() when unrolling the failed ->init callback.

Here is the macro and some more cleanup in setup_net() itself
to remove one variable from the stack :) The same thing is done for
cleanup_net() - the existing list_for_each_entry_reverse() is used.

Minor, but the code looks nicer.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
Acked-by: Eric W. Biederman [EMAIL PROTECTED]

---

diff --git a/include/linux/list.h b/include/linux/list.h
index f29fc9c..ad9dcb9 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -525,6 +525,20 @@ static inline void list_splice_init_rcu(
 		pos = list_entry(pos->member.next, typeof(*pos), member))
 
 /**
+ * list_for_each_entry_continue_reverse - iterate backwards from the given point
+ * @pos:	the type * to use as a loop cursor.
+ * @head:	the head for your list.
+ * @member:	the name of the list_struct within the struct.
+ *
+ * Start to iterate over list of given type backwards, continuing after
+ * the current position.
+ */
+#define list_for_each_entry_continue_reverse(pos, head, member)	\
+	for (pos = list_entry(pos->member.prev, typeof(*pos), member);	\
+	     prefetch(pos->member.prev), &pos->member != (head);	\
+	     pos = list_entry(pos->member.prev, typeof(*pos), member))
+
+/**
  * list_for_each_entry_from - iterate over list of given type from the current point
  * @pos:	the type * to use as a loop cursor.
  * @head:	the head for your list.
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 1fc513c..0e6cb02 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -56,7 +56,6 @@ static void net_free(struct net *net)
 static void cleanup_net(struct work_struct *work)
 {
 	struct pernet_operations *ops;
-	struct list_head *ptr;
 	struct net *net;
 
 	net = container_of(work, struct net, work);
@@ -69,8 +68,7 @@ static void cleanup_net(struct work_stru
 	net_unlock();
 
 	/* Run all of the network namespace exit methods */
-	list_for_each_prev(ptr, &pernet_list) {
-		ops = list_entry(ptr, struct pernet_operations, list);
+	list_for_each_entry_reverse(ops, &pernet_list, list) {
 		if (ops->exit)
 			ops->exit(net);
 	}
@@ -102,7 +100,6 @@ static int setup_net(struct net *net)
 {
 	/* Must be called with net_mutex held */
 	struct pernet_operations *ops;
-	struct list_head *ptr;
 	int error;
 
 	memset(net, 0, sizeof(struct net));
@@ -110,8 +107,7 @@ static int setup_net(struct net *net)
 	atomic_set(&net->use_count, 0);
 
 	error = 0;
-	list_for_each(ptr, &pernet_list) {
-		ops = list_entry(ptr, struct pernet_operations, list);
+	list_for_each_entry(ops, &pernet_list, list) {
 		if (ops->init) {
 			error = ops->init(net);
 			if (error < 0)
@@ -120,12 +116,12 @@ static int setup_net(struct net *net)
 	}
 out:
 	return error;
+
 out_undo:
 	/* Walk through the list backwards calling the exit functions
 	 * for the pernet modules whose init functions did not fail.
 	 */
-	for (ptr = ptr->prev; ptr != &pernet_list; ptr = ptr->prev) {
-		ops = list_entry(ptr, struct pernet_operations, list);
+	list_for_each_entry_continue_reverse(ops, &pernet_list, list) {
 		if (ops->exit)
 			ops->exit(net);
 	}



[Devel] Re: [PATCH] Hookup group-scheduler with task container infrastructure

2007-09-18 Thread Srivatsa Vaddagiri
On Tue, Sep 18, 2007 at 05:19:45PM +0900, KAMEZAWA Hiroyuki wrote:
 Srivatsa Vaddagiri [EMAIL PROTECTED] wrote:
  +	tg->cfs_rq = kzalloc(sizeof(cfs_rq) * num_possible_cpus(), GFP_KERNEL);
  +	if (!tg->cfs_rq)
  +		goto err;
  +	tg->se = kzalloc(sizeof(se) * num_possible_cpus(), GFP_KERNEL);
  +	if (!tg->se)
  +		goto err;
 Sorry for the very lazy response...
 
 num_possible_cpus() just returns the number of possible cpus; it does not
 return the maximum cpu id that will be used. I think just using NR_CPUS
 here is the easy way.

Good catch! Yes, you are right - the pointer array needs to be NR_CPUS
in size.

On closer examination, I think I can simply use alloc_percpu() here.
Will send a patch after some testing.
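A rough, untested sketch of what I mean - the assumption being that
tg->cfs_rq and tg->se become per-cpu pointers instead of plain arrays:

	tg->cfs_rq = alloc_percpu(struct cfs_rq *);
	if (!tg->cfs_rq)
		goto err;
	tg->se = alloc_percpu(struct sched_entity *);
	if (!tg->se)
		goto err;

	/* accesses then go through per_cpu_ptr(), which is valid for any
	 * possible cpu id, no matter how sparse the cpu map is */
	for_each_possible_cpu(i)
		*per_cpu_ptr(tg->cfs_rq, i) = NULL;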

-- 
Regards,
vatsa


[Devel] Re: [net-2.6.24][patch 2/2] Dynamically allocate the loopback device

2007-09-18 Thread Peter Waskiewicz
This would be a good opportunity to remove the single-allocated queue struct
in netdevice (at the bottom) that we had to put in to accommodate the static
loopback.  Now we can set it back to a zero-element list, and have
alloc_netdev_mq() just allocate the number of queues requested, not
num_queues - 1.

I'll put a patch together based on this patchset.

Thanks,
-PJ Waskiewicz


[Devel] [PATCH] Consolidate sleeping routines in file locking code

2007-09-18 Thread Pavel Emelyanov
This is the next step in the fs/locks.c cleanup before converting
it to use struct pid *.

This time I found that there are some places that do a
similar thing - they try to apply a lock on a file and go
to sleep on error till the blocker exits.

All these places can be easily consolidated, saving 28
lines of code and more than 600 bytes of .text,
but there is one minor note.

The locks_mandatory_area() code becomes a bit different
after this patch - it no longer re-checks whether the inode's
permissions have changed. Nevertheless, this check is useless
without my other patch that wakes the waiter up in
notify_change(), which is not considered to be useful for now.

Later, if we do need the fix with the wakeup, it can be
easily merged with this patch.

Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]

---

 fs/locks.c |  122 +++--
 1 files changed, 47 insertions(+), 75 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index cb1c977..8e849ed 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -659,6 +659,29 @@ static int locks_block_on_timeout(struct
 	return result;
 }
 
+typedef int (*lock_wait_fn)(struct file *, struct file_lock *, void *);
+
+static int do_lock_file_wait(struct file *filp, struct file_lock *fl,
+		lock_wait_fn lockfn, void *arg)
+{
+	int error;
+
+	might_sleep();
+	while (1) {
+		error = lockfn(filp, fl, arg);
+		if ((error != -EAGAIN) || !(fl->fl_flags & FL_SLEEP))
+			break;
+
+		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
+		if (error) {
+			locks_delete_block(fl);
+			break;
+		}
+	}
+
+	return error;
+}
+
 void
 posix_test_lock(struct file *filp, struct file_lock *fl)
 {
@@ -720,7 +743,8 @@ next_task:
  * whether or not a lock was successfully freed by testing the return
  * value for -ENOENT.
  */
-static int flock_lock_file(struct file *filp, struct file_lock *request)
+static
+int flock_lock_file(struct file *filp, struct file_lock *request, void *x)
 {
 	struct file_lock *new_fl = NULL;
 	struct file_lock **before;
@@ -1029,20 +1053,7 @@ EXPORT_SYMBOL(posix_lock_file);
  */
 int posix_lock_file_wait(struct file *filp, struct file_lock *fl)
 {
-	int error;
-	might_sleep ();
-	for (;;) {
-		error = posix_lock_file(filp, fl, NULL);
-		if ((error != -EAGAIN) || !(fl->fl_flags & FL_SLEEP))
-			break;
-		error = wait_event_interruptible(fl->fl_wait, !fl->fl_next);
-		if (!error)
-			continue;
-
-		locks_delete_block(fl);
-		break;
-	}
-	return error;
+	return do_lock_file_wait(filp, fl, (lock_wait_fn)posix_lock_file, NULL);
 }
 EXPORT_SYMBOL(posix_lock_file_wait);
 
@@ -1085,12 +1096,17 @@ int locks_mandatory_locked(struct inode 
  * This function is called from rw_verify_area() and
  * locks_verify_truncate().
  */
+
+static int lock_mandatory_fn(struct file *filp, struct file_lock *fl, void *arg)
+{
+	return __posix_lock_file((struct inode *)arg, fl, NULL);
+}
+
 int locks_mandatory_area(int read_write, struct inode *inode,
 		struct file *filp, loff_t offset,
 		size_t count)
 {
 	struct file_lock fl;
-	int error;
 
 	locks_init_lock(&fl);
 	fl.fl_owner = current->files;
@@ -1103,27 +1119,12 @@ int locks_mandatory_area(int read_write,
 	fl.fl_start = offset;
 	fl.fl_end = offset + count - 1;
 
-	for (;;) {
-		error = __posix_lock_file(inode, &fl, NULL);
-		if (error != -EAGAIN)
-			break;
-		if (!(fl.fl_flags & FL_SLEEP))
-			break;
-		error = wait_event_interruptible(fl.fl_wait, !fl.fl_next);
-		if (!error) {
-			/*
-			 * If we've been sleeping someone might have
-			 * changed the permissions behind our back.
-			 */
-			if (__mandatory_lock(inode))
-				continue;
-		}
-
-		locks_delete_block(&fl);
-		break;
-	}
-
-	return error;
+	/*
+	 * If we've been sleeping someone might have changed the permissions
+	 * behind our back. However, nobody wakes us up, so go on spinning
+	 * here till the blocker dies.
+	 */
+	return do_lock_file_wait(filp, &fl, lock_mandatory_fn, inode);
 }
 
 EXPORT_SYMBOL(locks_mandatory_area);
@@ -1517,20 +1518,7 @@ out_unlock:
  */
 int flock_lock_file_wait(struct file *filp, struct file_lock *fl)
 {
-	int error;
-	might_sleep();
-	for (;;) {
-		error = flock_lock_file(filp, fl);
-		if ((error != -EAGAIN) || 

[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)

2007-09-18 Thread J. Bruce Fields
On Tue, Sep 18, 2007 at 10:33:26AM +0400, Pavel Emelyanov wrote:
 Trond Myklebust wrote:
  IOW: the process that is waiting in locks_mandatory_area() will be
  released as soon as the advisory lock is dropped. If that theory is
  broken in practice, then that is the bug that we need to fix. We neither
  want to add a load of locking crap to notify_change(), nor should we
  need to.
 
 We have this for inotify already. Adding wakeup for mandatory lock
 is not that bad.
 
 Anyway - I noticed that the system state can become inconsistent
 and proposed a way to fix it. If this inconsistency is not a big
 deal and nobody cares, then I'm fine with dropping this patch,
 since I have no other arguments to defend it; it's just not
 very nice without this patch.

Maybe this should be documented, e.g. in fcntl(2).  I'm not sure exactly
what we'd say--we probably don't want to commit to the current behavior.
Maybe something like "behavior is undefined when setting or clearing
mandatory locking on a file while it is locked".

--b.



[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)

2007-09-18 Thread J. Bruce Fields
On Tue, Sep 18, 2007 at 12:14:55PM -0400, Trond Myklebust wrote:
 Note also that strictly speaking, we're not even compliant with the
 System V behaviour on read() and write(). See:
 
   http://www.unix.org.ua/orelly/networking_2ndEd/nfs/ch11_01.htm
 and
   
  http://docs.sun.com/app/docs/doc/801-6736/6i13fom0a?l=en&a=view&q=mandatory+lock
 
 According to these docs, we should be wrapping each and every read() and
 write() syscall with a mandatory lock. The fact that we're not, and yet
 still not seeing any complaints just goes to show how few people are
 actually using and relying on this...

So currently there's nothing to prevent this:

- write passes locks_mandatory_area() checks
- get mandatory lock
- read old data
- write updates file data
- read new data

You can see the data change even while you hold a mandatory lock that
should exclude writes.

Similarly you might think that an application could prevent anyone from
seeing the intermediate state of a file while it performs a series of
writes under an exclusive mandatory lock, but actually there's nothing
to stop a read in progress from racing with acquisition of the lock.

Unless I'm missing something, that makes our mandatory lock
implementation pretty pointless.  I wish we could either fix it or just
ditch it, but I suppose either option would be unpopular.

--b.



[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)

2007-09-18 Thread Trond Myklebust
On Tue, 2007-09-18 at 12:52 -0400, J. Bruce Fields wrote:
 On Tue, Sep 18, 2007 at 12:14:55PM -0400, Trond Myklebust wrote:
  Note also that strictly speaking, we're not even compliant with the
  System V behaviour on read() and write(). See:
  
http://www.unix.org.ua/orelly/networking_2ndEd/nfs/ch11_01.htm
  and

  http://docs.sun.com/app/docs/doc/801-6736/6i13fom0a?l=en&a=view&q=mandatory+lock
  
  According to these docs, we should be wrapping each and every read() and
  write() syscall with a mandatory lock. The fact that we're not, and yet
  still not seeing any complaints just goes to show how few people are
  actually using and relying on this...
 
 So currently there's nothing to prevent this:
 
   - write passes locks_mandatory_area() checks
   - get mandatory lock
   - read old data
   - write updates file data
   - read new data
 
 You can see the data change even while you hold a mandatory lock that
 should exclude writes.
 
 Similarly you might think that an application could prevent anyone from
 seeing the intermediate state of a file while it performs a series of
 writes under an exclusive mandatory lock, but actually there's nothing
 to stop a read in progress from racing with acquisition of the lock.
 
 Unless I'm missing something, that makes our mandatory lock
 implementation pretty pointless.  I wish we could either fix it or just
 ditch it, but I suppose either option would be unpopular.

It gets even better when you throw mmap() into the mix :-)

Trond



[Devel] Re: [PATCH 20/33] memory controller add documentation

2007-09-18 Thread Randy Dunlap
On Mon, 17 Sep 2007 14:03:27 -0700 Paul Menage wrote:

 From: Balbir Singh [EMAIL PROTECTED]
 (container-cgroup renaming by Paul Menage [EMAIL PROTECTED])
 
 Signed-off-by: Balbir Singh [EMAIL PROTECTED]
 Signed-off-by: Paul Menage [EMAIL PROTECTED]
 
 ---
 
  Documentation/controllers/memory.txt |  259 +
  1 files changed, 259 insertions(+)
 
 diff -puN /dev/null Documentation/controllers/memory.txt
 --- /dev/null
 +++ a/Documentation/controllers/memory.txt
 @@ -0,0 +1,259 @@

 +Benefits and Purpose of the memory controller
 +
 +The memory controller isolates the memory behaviour of a group of tasks
 +from the rest of the system. The article on LWN [12] mentions some probable
 +uses of the memory controller. The memory controller can be used to
 +
 +a. Isolate an application or a group of applications
 +   Memory hungry applications can be isolated and limited to a smaller
 +   amount of memory.
 +b. Create a cgroup with limited amount of memory, this can be used

, makes run-on sentence.  Please use ; or 2 sentences.

 +   as a good alternative to booting with mem=.
 +c. Virtualization solutions can control the amount of memory they want
 +   to assign to a virtual machine instance.
 +d. A CD/DVD burner could control the amount of memory used by the
 +   rest of the system to ensure that burning does not fail due to lack
 +   of available memory.
 +e. There are several other use cases, find one or use the controller just

Ditto.

 +   for fun (to learn and hack on the VM subsystem).

[snip]

 +
 +2.1. Design
 +
 +The core of the design is a counter called the res_counter. The res_counter
 +tracks the current memory usage and limit of the group of processes associated
 +with the controller. Each cgroup has a memory controller specific data
 +structure (mem_cgroup) associated with it.
 +
 +2.2. Accounting
 +
[snip]
 +
 +The accounting is done as follows: mem_cgroup_charge() is invoked to setup
 +the necessary data structures and check if the cgroup that is being charged
 +is over its limit. If it is then reclaim is invoked on the cgroup.
 +More details can be found in the reclaim section of this document.
 +If everything goes well, a page meta-data-structure called page_cgroup is

   ^drop second hyphen (use space)

 +allocated and associated with the page.  This routine also adds the page to
 +the per cgroup LRU.
 +
 +2.2.1 Accounting details
 +
 +All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted.
 +RSS pages are accounted at the time of page_add_*_rmap() unless they've already
 +been accounted for earlier. A file page will be accounted for as Page Cache;
 +it's mapped into the page tables of a process, duplicate accounting is carefully

, makes run-on sentence...

 +avoided. Page Cache pages are accounted at the time of add_to_page_cache().
 +The corresponding routines that remove a page from the page tables or removes

s/removes/remove/

 +a page from Page Cache is used to decrement the accounting counters of the
 +cgroup.
 +
 +2.3 Shared Page Accounting
 +
 +Shared pages are accounted on the basis of the first touch approach. The
 +cgroup that first touches a page is accounted for the page. The principle
 +behind this approach is that a cgroup that aggressively uses a shared
 +page will eventually get charged for it (once it is uncharged from
 +the cgroup that brought it in -- this will happen on memory pressure).
 +
 +2.4 Reclaim
 +
 +Each cgroup maintains a per cgroup LRU that consists of an active
 +and inactive list. When a cgroup goes over its limit, we first try
 +to reclaim memory from the cgroup so as to make space for the new
 +pages that the cgroup has touched. If the reclaim is unsuccessful,
 +an OOM routine is invoked to select and kill the bulkiest task in the
 +cgroup.
 +
 +The reclaim algorithm has not been modified for cgroups, except that
 +pages that are selected for reclaiming come from the per cgroup LRU
 +list.
 +
 +2. Locking

2. ??  Section numbering is off.

 +The memory controller uses the following hierarchy
 +
 +1. zone->lru_lock is used for selecting pages to be isolated
 +2. mem->lru_lock protects the per cgroup LRU
 +3. lock_page_cgroup() is used to protect page->page_cgroup
 +
 +3. User Interface
 +
 +0. Configuration

   0. Kernel build configuration
 +
 +a. Enable CONFIG_CGROUPS
 +b. Enable CONFIG_RESOURCE_COUNTERS
 +c. Enable CONFIG_CGROUP_MEM_CONT
[snip]


---
~Randy


[Devel] [patch 1/1] fix bad netif_carrier_off place

2007-09-18 Thread dlezcano
From: Daniel Lezcano [EMAIL PROTECTED]

If netif_carrier_off is called before register_netdev,
it will use and generate an event for a non-initialized network
device, and that leads to an Oops.

I moved the netif_carrier_off calls from the setup function to after each
register_netdev call.

Signed-off-by: Daniel Lezcano [EMAIL PROTECTED]
---
 drivers/net/veth.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: net-2.6.24/drivers/net/veth.c
===
--- net-2.6.24.orig/drivers/net/veth.c
+++ net-2.6.24/drivers/net/veth.c
@@ -286,7 +286,6 @@ static void veth_setup(struct net_device
 	dev->features |= NETIF_F_LLTX;
 	dev->init = veth_dev_init;
 	dev->destructor = veth_dev_free;
-	netif_carrier_off(dev);
 }
 
 /*
@@ -357,6 +356,8 @@ static int veth_newlink(struct net_devic
 	if (err < 0)
 		goto err_register_peer;
 
+	netif_carrier_off(peer);
+
 	/*
 	 * register dev last
 	 *
@@ -382,6 +383,8 @@ static int veth_newlink(struct net_devic
 	if (err < 0)
 		goto err_register_dev;
 
+	netif_carrier_off(dev);
+
 	/*
 	 * tie the devices together
 	 */

-- 


[Devel] [patch 0/1] fix veth netif_carrier_off

2007-09-18 Thread dlezcano
When I tried the veth driver, I ran into a kernel oops.

qemu login: Oops:  [#1]
Modules linked in:
CPU:0
EIP:0060:[c0265c9e]Not tainted VLI
EFLAGS: 0202   (2.6.23-rc6-g754f885d-dirty #33)
EIP is at __linkwatch_run_queue+0x6a/0x175
eax: c7fc9550   ebx: 6b6b6b6b   ecx: c3360c80   edx: 0246
esi: 0001   edi: 6b6b6b6b   ebp: c7fd9f7c   esp: c7fd9f5c
ds: 007b   es: 007b   fs:   gs:   ss: 0068
Process events/0 (pid: 5, ti=c7fd8000 task=c7fc9550 task.ti=c7fd8000)
Stack: c7fee5a8 c0387680 c7fd9f74 c02e1aaa 4f732564 c0387684 c7fee5a8 c0387680 
   c7fd9f84 c0265dc9 c7fd9fac c011fb3c c7fd9f94 c02e277e c7fd9fac c02e1166 
   c0265da9 c7fee5a8 c0120203 c7fd9fc8 c7fd9fd0 c01202ba  c7fc9550 
Call Trace:
 [c0102c69] show_trace_log_lvl+0x1a/0x2f
 [c0102d1b] show_stack_log_lvl+0x9d/0xa5
 [c0102ee1] show_registers+0x1be/0x28f
 [c010309a] die+0xe8/0x208
 [c010d5a1] do_page_fault+0x4ba/0x595
 [c02e2842] error_code+0x6a/0x70
 [c0265dc9] linkwatch_event+0x20/0x27
 [c011fb3c] run_workqueue+0x7c/0x102
 [c01202ba] worker_thread+0xb7/0xc5
 [c012270c] kthread+0x39/0x61
 [c0102913] kernel_thread_helper+0x7/0x10
 ===
Code: b8 60 76 38 c0 e8 e3 ca 07 00 b8 60 76 38 c0 8b 1d 78 a7 3d c0 c7 05 78 
a7 3d c0 00 00 00 00 e8 df ca 07 00 e9 ed 00 00 00 85 f6 8b bb f4 01 00 00 74 
17 89 d8 e8 73 fe ff ff 85 c0 75 0c 89 d8 
EIP: [c0265c9e] __linkwatch_run_queue+0x6a/0x175 SS:ESP 0068:c7fd9f5c
Slab corruption: size-2048 start=c473eac8, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [c025be72](free_netdev+0x1f/0x41)
200: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b c0 e2 73 c4
Prev obj: start=c473e2b0, len=2048
Redzone: 0xd84156c5635688c0/0xd84156c5635688c0.
Last user: [c025bed0](alloc_netdev_mq+0x3c/0xa1)
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 76 65 74 68 30 00 00 00 00 00 00 00 00 00 00 00
Next obj: start=c473f2e0, len=2048
Redzone: 0x9f911029d74e35b/0x9f911029d74e35b.
Last user: [c0260e69](neigh_sysctl_unregister+0x2b/0x2e)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b

That happens when trying to add a veth device using the ip command:

ip link add veth0

which fails.

It appears that netif_carrier_off is placed in the setup function,
and this one is called before register_netdevice.

The register_netdevice function does a lot of initialization of the netdev,
and if netif_carrier_off is called before register_netdev,
it will use and trigger an event for an uninitialized netdev.

-- 


[Devel] Re: [PATCH] Wake up mandatory locks waiter on chmod (v2)

2007-09-18 Thread J. Bruce Fields
On Tue, Sep 18, 2007 at 12:54:56PM -0400, Trond Myklebust wrote:
 On Tue, 2007-09-18 at 12:52 -0400, J. Bruce Fields wrote:
  So currently there's nothing to prevent this:
  
  - write passes locks_mandatory_area() checks
  - get mandatory lock
  - read old data
  - write updates file data
  - read new data
  
  You can see the data change even while you hold a mandatory lock that
  should exclude writes.
  
  Similarly you might think that an application could prevent anyone from
  seeing the intermediate state of a file while it performs a series of
  writes under an exclusive mandatory lock, but actually there's nothing
  to stop a read in progress from racing with acquisition of the lock.
  
  Unless I'm missing something, that makes our mandatory lock
  implementation pretty pointless.  I wish we could either fix it or just
  ditch it, but I suppose either option would be unpopular.
 
 It gets even better when you throw mmap() into the mix :-)

Hm.  Documentation/mandatory.txt claims that mandatory locks and
mmap() with MAP_SHARED exclude each other, but I can't see where that's
enforced.  That file doesn't make any mention of the above race.

So for now I think someone should update that file and fcntl(2) to
mention these problems and to recommend rather strongly against using
mandatory locking.

--b.



[Devel] Re: [PATCH 1/3] Signal semantics for /sbin/init

2007-09-18 Thread sukadev
Oleg Nesterov [EMAIL PROTECTED] wrote:
| On 09/13, [EMAIL PROTECTED] wrote:
| 
|  Oleg Nesterov [EMAIL PROTECTED] wrote:
|  |   
|  |   Notes:
|  |  
|  |- Blocked signals are never ignored, so init still can receive
|  |  a pending blocked signal after sigprocmask(SIG_UNBLOCK).
|  |  Easy to fix, but probably we can ignore this issue.
|  |   
|  |   I was wrong. This should be fixed right now. I _think_ this is easy,
|  |   and I was going to finish this patch yesterday, but - sorry! - I just
|  |   can't switch to kernel mode these days, I am fighting with some
|  |   urgent tasks on my paid job.
|  |   
|  |  To respect the current init semantic,
|  | 
|  | The current init semantic is broken in many ways ;)
|  | 
|  |  shouldn't we discard any unblockable 
|  |  signal (STOP and KILL) sent by a process to its pid namespace init
|  |  process ?
|  
|  Yes. And Patch 1/3 (Oleg's patch) in the set I sent, handles this already
|  (since STOP and KILL are never in the task->blocked list)
|  
|  
|  |  Then, all other signals should be handled appropriately by the pid
|  |  namespace init.
|  
|  | 
|  | Yes, I think you are probably right, this should be enough in practice.
|  | After all, only root can send the signal to /sbin/init.
|  
|  I agree - the assumption that the container-init will handle these
|  other signals, simplifies the kernel implementation for now.
|  
|  
|  | On my machine, /proc/1/status shows that init doesn't have a handler for
|  | non-ignored SIGUNUSED == 31, though.
|  | 
|  | But who knows? The kernel promises some guarantees, it is not good to
|  | break them.
|  | Perhaps some strange non-standard environment may suffer.
|  | 
|  |  We are assuming that the pid namespace init is not doing anything silly
|  |  and I guess it's OK if the consequences are only on its pid namespace
|  |  and not the whole system.
|  | 
|  | The sub-namespace case is very easy afaics, we only need the "signal
|  | comes from the parent namespace" check, not a problem if we make the
|  | decision on the sender's path, like this patch does.
|  
|  Yes, patches 2 and 3 of the set already do the ancestor-ns check. no ?
| 
| Yes, I think patches 2-3 are good. But this patch is not. I thought that we
| can ignore the "Blocked signals are never ignored" problem, now I am not sure.
| It is possible that init temporary blocks a signal which it is not going to
| handle.
| 
| Perhaps we can do something like the patch below, but I don't like it. With
| this patch, we check the signal handler even if /sbin/init blocks the signal.
| This makes the semantics a bit strange for /sbin/init. Hopefully not a problem
| in practice, but still not good.

I think this is one step ahead of what we were discussing last week.
A container-init that does not have a handler for a fatal signal would
survive even if the signal is posted when it is blocked.

| 
| Unfortunately, I don't know how to make it better. The problem with blocked
| signals is that we don't know who is the sender of the signal at the time
| when the signal is unblocked.

One solution I was thinking of was to possibly queue pending blocked
signals to a container init separately and then requeue them on the
normal queue when the signals are unblocked. It's definitely not an easier
solution, but it might be less intrusive than the "signal from parent ns"
flag solution.

i.e suppose we have:

struct pid_namespace {
	...
	struct sigpending cinit_blocked_pending;
	struct sigpending cinit_blocked_shared_pending;
};

Signals from ancestor ns are queued as usual on task->pending and
task->signal->shared_pending. They don't need any special handling.

Only signals posted to a container-init from within its namespace
need special handling (as in: ignore unhandled fatal signals from
same namespace).

If the container-init has, say, SIGUSR1 blocked, and a descendant of
the container-init posts SIGUSR1 to it, queue the SIGUSR1
in pid_namespace->cinit_blocked_pending.

When the container-init unblocks SIGUSR1, check if there was a pending
SIGUSR1 from the same namespace (i.e. check the ->cinit_blocked_pending list).
If there was, and the container-init has a handler for SIGUSR1, post SIGUSR1
on the task->pending queue and let the container-init handle SIGUSR1.

If there was a SIGUSR1 posted to the container-init and there is no handler
for SIGUSR1, then just ignore the SIGUSR1 (since it would be fatal
otherwise).

I chose 'struct pid_namespace' for the temporary queue, since we need
the temporary queues only for container-inits (not for all processes).
And having it allocated ahead of time ensures we can queue the signal
even under low-memory conditions.

Just an idea at this point.
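To make it concrete, a very rough, hypothetical sketch of the
unblock-time step: cinit_blocked_pending is the proposed field above,
and sig_handled() stands for "the init task has a handler installed for
this signal" - neither exists in the kernel today.

	static void cinit_flush_unblocked(struct task_struct *init,
					  struct pid_namespace *ns,
					  sigset_t *unblocked)
	{
		struct sigqueue *q, *t;

		list_for_each_entry_safe(q, t,
				&ns->cinit_blocked_pending.list, list) {
			int sig = q->info.si_signo;

			if (!sigismember(unblocked, sig))
				continue;

			list_del_init(&q->list);
			if (sig_handled(init, sig)) {
				/* handled: requeue for normal delivery */
				list_add_tail(&q->list, &init->pending.list);
				sigaddset(&init->pending.signal, sig);
			} else {
				/* unhandled in-namespace signal: it would
				 * have been fatal, so just drop it */
				__sigqueue_free(q);
			}
		}
	}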

| 
| What do you think? Can we live with this oddity? Otherwise, we have to add
| something like the "signal is from the parent namespace" flag, and I bet
| this is not trivial to implement correctly.

I think it's reasonable to place 

[Devel] Re: [PATCH 2/4] Switch caches notification dynamically

2007-09-18 Thread Christoph Lameter
On Tue, 18 Sep 2007, Pavel Emelyanov wrote:

 I meant that we cannot find the pages that are full of objects, to notify
 others that these ones are no longer tracked. I know that we can do
 it by tracking these pages with some performance penalty, but is the
 ability to turn notifications off worth the cost of the performance
 degradation?

Not sure. That may be your call.



[Devel] Re: [PATCH][NETNS] Cleanup list walking in setup_net and cleanup_net

2007-09-18 Thread David Miller
From: Pavel Emelyanov [EMAIL PROTECTED]
Date: Tue, 18 Sep 2007 12:06:53 +0400

 I proposed introducing a list_for_each_entry_continue_reverse macro
 to be used in setup_net() when unrolling the failed ->init callback.
 
 Here is the macro and some more cleanup in setup_net() itself
 to remove one variable from the stack :) The same thing is done for
 cleanup_net() - the existing list_for_each_entry_reverse() is used.
 
 Minor, but the code looks nicer.
 
 Signed-off-by: Pavel Emelyanov [EMAIL PROTECTED]
 Acked-by: Eric W. Biederman [EMAIL PROTECTED]

Applied, thanks Pavel.
