date:20080729

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki

On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
Hugh Dickins [EMAIL PROTECTED] wrote:

 IIRC Rik expressed the same by pointing out that a cgroup at its
 swap limit would then be forced to grow in mem (until it hits its
 mem limit): so controlling the less precious resource would increase
 pressure on the more precious resource.  (Actually, that probably
 bears little relation to what he said - sorry, Rik!)  I don't recall
 what answer he got, perhaps I'd be persuaded if I heard it again.
 
Added Nishimura to CC.

IMHO, from the user's point of view, there is not much difference between
 - having 2 controls, a mem controller + a swap controller, and
 - a single mem+swap controller.
Users will use whichever they like.

From the memory controller's point of view, handling mem+swap in the same
controller makes sense. Because the memory controller can check whether we can
use more swap or not, we can avoid hopeless scanning of Anon when swap is short.
(With split-lru, I think we can do this avoidance.)
 
Another-Topic?

In recent servers, memory is big and swap is (relatively) small.
And under the memory resource controller, the whole swap can easily be occupied
by one group. I want to avoid that.

For users, swap is not precious because it's not fast.
But for memory reclaim, swap is a precious resource for paging out
anonymous/shmem/tmpfs memory. I think the usual system admin considers swap
some emergency spare of memory. I'd like to allow this emergency spare to each
cgroup.
(For example, swap is used even if vm.swappiness==0. This is to avoid the
OOM-Killer in some situations; this behavior was added by Rik.)


== following is another use case I explained to Rik at 23/May/08 ==

IIRC, someone explained his motivation for controlling swap at an OLS2007 BOF as
follows. Consider the following system (with no swap controller):
Memory 4G. Swap 1G. 2 cgroups A, B.

state 1) swap is not used.
  A...memory limit 1G, no swap usage, memory_usage=0M
  B...memory limit 1G, no swap usage, memory_usage=0M

state 2) Run a big program on A.
  A...memory limit 1G, tries to use 1.7G, so uses 700MBytes of swap.
      memory_usage=1G swap_usage=700M
  B...memory_usage=0M

state 3) Some of the programs in A end.
  A...memory_usage=500M swap_usage=700M
  B...memory_usage=0M

state 4) Run a big program on B.
  A...memory_usage=500M swap_usage=700M
  B...memory_usage=1G   swap_usage=300M

Group B can only use 1.3G because of the unfair swap use of group A.
But users wonder why A uses 700M of swap while it has 500M of free memory.
==
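
The arithmetic of state 4 can be checked with a small sketch. This is only an
illustration of the example above, not kernel code: the struct and function
names are hypothetical, sizes are in MB, and 1G is taken as 1000M so the
numbers match the ones in the mail.

```c
#include <assert.h>

/* Hypothetical per-cgroup state for the example; all sizes in MB. */
struct cgroup_state {
    unsigned mem_limit;   /* memory limit of the group */
    unsigned swap_usage;  /* swap currently held by the group */
};

/*
 * Largest footprint (memory + swap) a group can reach when swap is a
 * shared pool with no per-group control: its own memory limit plus
 * whatever swap the other group has left over.
 */
static unsigned max_footprint(const struct cgroup_state *self,
                              const struct cgroup_state *other,
                              unsigned total_swap)
{
    assert(other->swap_usage <= total_swap);
    return self->mem_limit + (total_swap - other->swap_usage);
}
```

With A holding 700M of the 1000M swap, B's ceiling is 1000 + 300 = 1300M,
which is the 1.3G quoted above; had A released its swap, B could reach 2000M.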



Thanks,
-Kame

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel

[Devel] Re: [PATCH 2/6] user namespaces: move user_ns from nsproxy into user struct

2008-07-29 Thread Serge E. Hallyn

Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Serge E. Hallyn [EMAIL PROTECTED] writes:

 From ec5f54faf5afd16cb6cef40ebaaf3da25989d185 Mon Sep 17 00:00:00 2001
  From: Serge Hallyn [EMAIL PROTECTED]
  Date: Thu, 24 Jul 2008 17:52:41 -0500
  Subject: [PATCH 2/6] user namespaces: move user_ns from nsproxy into user struct

  When we get the sysfs support needed to support fair user scheduling
  along with user namespaces, then we will need to be able to get the
  user namespace from the user struct.

  So we need the user_ns to be a part of struct user.  Once we can
  access it from tsk->user, we no longer have a use for
  tsk->nsproxy->user_ns.

 Is this true?  Even in the general case of supporting setuid and setgid
 and everything else that potentially is in the user namespace?

Sure.  At any time we can get tsk->user->user_ns, and from that we can
get tsk->user->user_ns->creator.

 I certainly support the cleanups you have made for the reasons you describe.
 I think however that there are no technical reasons not to have
 nsproxy->user_ns after the changes have been made.

Well I ended up tossing it to clarify my thinking about the refcounting.
With that done, I think we could safely not have nsproxy pin the
user_ns, trusting the tsk->user to pin it instead.

But keeping it in two places just seems needlessly complicated.

 I also agree that
 there are no technical reasons for keeping nsproxy->user_ns at the moment.

What is your preference?

  When a user_namespace is created, the user which created it is
  marked as its 'creator'.  The user_namespace pins the creator.
  Each userid in a user_ns pins the user_ns.  This keeps refcounting
  nice and simple.

 Eric

[Devel] Re: [PATCH 4/6] user namespaces: add user_ns to super block

2008-07-29 Thread Serge E. Hallyn

Quoting Matt Helsley ([EMAIL PROTECTED]):

 On Mon, 2008-07-28 at 14:53 -0700, Eric W. Biederman wrote:
  Serge E. Hallyn [EMAIL PROTECTED] writes:

  From 420d6e81ce29d7a6fe3ab7b43c1171e105f8b697 Mon Sep 17 00:00:00 2001
   From: Serge Hallyn [EMAIL PROTECTED]
   Date: Thu, 24 Jul 2008 18:00:54 -0500
   Subject: [PATCH 4/6] user namespaces: add user_ns to super block

   Add a user_ns to the super_block, and set it to the user_ns of
   the process which mounted the fs.

   In generic_permission() compare the current user_ns to that
   of the user_ns which mounted the inode's filesystem.

  I don't think this is the right approach.

  When we had the conversation earlier this was conceptually rejected
  as it prevents nfs superblock unification.

  We really want to store this in the vfsmount and pass the user namespace down
  from there to where we are going to use it if at all possible.

  The vfsmount also appears necessary if we are ever going to support multiple
  user namespaces per filesystem as the filesystem still needs to know which
  user namespace to interpret its data in.

The filesystem can figure that out based on current's context, no?

With the per-sb user_ns, the default behavior is indeed very limited,
but since you want to move all the user_ns functionality into the
filesystem, the fs can tag vfsmounts based on the new remount you
had talked about before.

   Would this require passing the vfsmount to the filesystems themselves,
 or would they be within the VFS code only? If not wholly within the VFS
 I wonder if Al Viro would object to this. He's resisted past attempts to
 pass the vfsmount structs into more filesystem code paths and I'm
 guessing that could affect whether or not this approach can be
 implemented.

Right, that's the main reason we might want to pursue the per-sb
approach.  Otherwise I would prefer the per-vfsmount approach.

Eric, if you think the per-vfsmount fight is worth fighting, then by all
means let's do it and see what happens.  So in that case ignore patches
3-5 from this set :)

-serge

[Devel] Re: [PATCH 4/6] user namespaces: add user_ns to super block

2008-07-29 Thread Serge E. Hallyn

Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Matt Helsley [EMAIL PROTECTED] writes:
 
  Would this require passing the vfsmount to the filesystems themselves,
  or would they be within the VFS code only? 
 
 The interesting bit is the user_namespace contained in the vfsmount.  We
 can pass that down.  I think semantically it makes sense for a filesystem
 mount to only operate in a single mount namespace.
 
  If not wholly within the VFS
  I wonder if Al Viro would object to this. He's resisted past attempts to
  pass the vfsmount structs into more filesystem code paths and I'm
  guessing that could affect whether or not this approach can be
  implemented.
 
 Dave Hansen raised that concern when we were talking about it earlier.  Since
 we just care about a property of the mount it isn't a big deal.
 
 Actually thinking about this a little farther it may be simplest to have the
 mnt_namespace capture the user_namespace, although that doesn't seem to map
 semantically very well with cloning of the filesystem.

Interesting idea.  I'm going to pursue that.

So at a do_new_mount(), mnt->user_ns = current->user_ns.  At
do_loopback(), we ask the fs whether the new_mnt->user_ns can be set to
current->user_ns.  If not, it keeps the original, meaning that current
will always receive user nobody access to the fs.  Otherwise, the
fs is saying that it knows how to properly convert userids from
current->user->user_ns to ones which make sense in the
original_mnt->user_ns.
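
A rough user-space sketch of that rule follows. It is not the proposed kernel
code: every type and function name here is hypothetical, and placing the
"can the fs translate userids?" callback on the mount is an assumption made
only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's user namespace and vfsmount. */
struct user_ns_sketch { int id; };

struct mount_sketch {
    const struct user_ns_sketch *user_ns;
    /* Assumed fs hook: nonzero if userids in 'from' can be mapped to 'to'.
     * NULL means the filesystem cannot translate at all. */
    int (*can_map_ns)(const struct user_ns_sketch *from,
                      const struct user_ns_sketch *to);
};

/* New mount: the mount simply takes the mounter's user namespace. */
static void sketch_new_mount(struct mount_sketch *mnt,
                             const struct user_ns_sketch *current_ns)
{
    assert(mnt != NULL && current_ns != NULL);
    mnt->user_ns = current_ns;
}

/* Bind mount: switch to the caller's namespace only if the fs agrees,
 * otherwise keep the original one. */
static void sketch_loopback(struct mount_sketch *new_mnt,
                            const struct mount_sketch *orig,
                            const struct user_ns_sketch *current_ns)
{
    *new_mnt = *orig;
    if (orig->can_map_ns && orig->can_map_ns(current_ns, orig->user_ns))
        new_mnt->user_ns = current_ns;
}

/* Example hook for a filesystem that claims it can map any namespace. */
static int sketch_map_any(const struct user_ns_sketch *from,
                          const struct user_ns_sketch *to)
{
    (void)from;
    (void)to;
    return 1;
}
```

In the "keeps the original" case the caller's namespace differs from the
mount's, which is the situation where current would be treated as user nobody.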

 This is very much a question of how we map the uids/gids stored in the
 filesystem into the uids/gids in the kernel.  Which user namespace do they
 belong in?
 
 Especially in the case of read-only mounts we can safely share a filesystem
 between user_namespaces with no changes to the filesystem.  Which I suspect is
 the first case we want to allow, as that is a tremendous savings in space if you
 have lots of instances of the same distro, and people have been doing it with
 /usr for years.
 
 Eric

[Devel] Re: [PATCH 0/2] sys_restore prototype

2008-07-29 Thread Serge E. Hallyn

Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Serge E. Hallyn [EMAIL PROTECTED] writes:
 
  We were talking this morning about what trivial patchset to begin
  with to get a start on checkpoint and restart.  We thought that
  rather than start with checkpoint, maybe we should start with
  something that reads a checkpoint file and restarts a single
  task.  In this case, restart means it sets the process id and
  executes the file which are found in the checkpoint file.
 
  So here's what we whipped up for a half hour this morning,
  and during some of Mark's talk this afternoon.
 
  It refuses to run if it isn't the container init, so you must
  unshare your pidns before calling sys_restore().
 
 A reasonable approximation. 
 
 Dave Hansen made a good point when he asked how do we graft a restored
 checkpoint into the rest of the system.  Requiring us to unshare
 everything we intend to unshare before restore achieves this easily,
 and you are following in that model.
 
 That leads to an interesting implication.  We don't need to set the pid
 of the first process.  At most we can verify that the pid is the same.
 If we have unshared the pid namespace the pid will be 1 and the needed
 pid of the first process will be 1.
 
 More later.

Good point.  Sounds like our trivial prototype was still way
over-featureful :)

-serge

Re: [Devel] [PATCH 2/2] sys_restore: set the pid number

2008-07-29 Thread Serge E. Hallyn

Quoting Pavel Emelyanov ([EMAIL PROTECTED]):
 Serge E. Hallyn wrote:
  Set the pid number for a restored task.  This is purely a toy, as it
  only sets the pidnr in the lowest level pid namespace.
  
  Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
  ---
   kernel/fork.c |5 +
   kernel/pid.c  |   19 +++
   2 files changed, 24 insertions(+), 0 deletions(-)
  
  diff --git a/kernel/fork.c b/kernel/fork.c
  index 0e43f69..41c46d2 100644
  --- a/kernel/fork.c
  +++ b/kernel/fork.c
  @@ -1685,6 +1685,8 @@ int unshare_files(struct files_struct **displaced)
   char buf[CKPT_SIZE];
   char exe_filename[PAGE_SIZE];
   
  +extern int choose_pidmap(struct pid *pid, int new);
  +
   /*
* userspace will already have made us a new pidns
*/
  @@ -1717,6 +1719,9 @@ long do_restore(unsigned int fd,
  if (nr_scanned != 2)
  return -EINVAL;
   
  +   ret = choose_pidmap(task_pid(current), pid);
  +   if (!ret)
  +   return -EAGAIN;
   
  ret = do_execve(exe_filename, argv, envp, regs);
   out:
  diff --git a/kernel/pid.c b/kernel/pid.c
  index 30bd5d4..88a5e2a 100644
  --- a/kernel/pid.c
  +++ b/kernel/pid.c
  @@ -122,6 +122,25 @@ static void free_pidmap(struct upid *upid)
  atomic_inc(&map->nr_free);
   }
   
  +int choose_pidmap(struct pid *pid, int new)
  +{
  +   struct pidmap *map;
  +   int level = pid->level;
  +   struct upid *upid = &pid->numbers[level];
  +   struct pid_namespace *pid_ns = upid->ns;
  +   int old = upid->nr;
  +
  +   map = &pid_ns->pidmap[new/BITS_PER_PAGE];
  +   if (!test_and_set_bit(new, map->page)) {
 
 This and...
 
  +   map = &pid_ns->pidmap[old/BITS_PER_PAGE];
  +   clear_bit(old, map->page);
 
 this are both a bit buggy, since the bit number on the page
 is not the same as the pid id itself ;)

Oops.  Yes, I see.

thanks,
-serge

  +   upid->nr = new;
  +   return 1;
  +   }
  +
  +   return 0;
  +}
  +
   static int alloc_pidmap(struct pid_namespace *pid_ns)
   {
  int i, offset, max_scan, pid, last = pid_ns->last_pid;
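
To illustrate Pavel's remark: the pid bitmap is split into page-sized chunks,
so a pid selects a chunk by division and a bit inside that chunk by masking.
The sketch below is user-space C, not the kernel code; the SKETCH_ names and
the 4096-byte page size are assumptions for the example.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE          4096u  /* assumed page size in bytes */
#define SKETCH_BITS_PER_PAGE      (SKETCH_PAGE_SIZE * 8)
#define SKETCH_BITS_PER_PAGE_MASK (SKETCH_BITS_PER_PAGE - 1)

/* Which bitmap page holds the bit for this pid. */
static unsigned pidmap_index(unsigned pid)
{
    return pid / SKETCH_BITS_PER_PAGE;
}

/* Bit position inside that page; this, not the raw pid, is what a
 * set/clear-bit operation on the page should be given. */
static unsigned pidmap_offset(unsigned pid)
{
    unsigned offset = pid & SKETCH_BITS_PER_PAGE_MASK;

    assert(offset < SKETCH_BITS_PER_PAGE);
    return offset;
}
```

For pids below 32768 the offset equals the pid, which is why the bug would
only show up once a pid falls into the second bitmap page.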

[Devel] Re: [PATCH 4/6] user namespaces: add user_ns to super block

2008-07-29 Thread Eric W. Biederman

Serge E. Hallyn [EMAIL PROTECTED] writes:


 The filesystem can figure that out based on current's context, no?

 With the per-sb user_ns, the default behavior is indeed very limited,
 but since you want to move all the user_ns functionality into the
 filesystem, the fs can tag vfsmounts based on the new remount you
 had talked about before.

I guess I want the filesystem to coordinate.

  Would this require passing the vfsmount to the filesystems themselves,
 or would they be within the VFS code only? If not wholly within the VFS
 I wonder if Al Viro would object to this. He's resisted past attempts to
 pass the vfsmount structs into more filesystem code paths and I'm
 guessing that could affect whether or not this approach can be
 implemented.

 Right, that's the main reason we might want to pursue the per-sb
 approach.  Otherwise I would prefer the per-vfsmount approach.

 Eric, if you think the per-vfsmount fight is worth fighting, then by all
 means let's do it and see what happens.  So in that case ignore patches
 3-5 from this set :)

My intuitive sense is that handling different user namespaces in the
same filesystem is a trivial case of the superblock merging that nfs
performs.  And that we will preserve existing semantics much better if
the user namespace is stored in the vfsmount.  This allows mount
propagation and friends to work without surprises.

The practical limitation I see of storing things outside of the
vfsmount is when do you setup the mapping to handle a new user
namespace?

So yes.  I think it is worth the discussion.  Let's not
move the vfsmount down, and just move the user namespace pointer
down as that is fundamentally what we care about.

Eric

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Hugh Dickins

On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
 On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  IIRC Rik expressed the same by pointing out that a cgroup at its
  swap limit would then be forced to grow in mem (until it hits its
  mem limit): so controlling the less precious resource would increase
  pressure on the more precious resource.  (Actually, that probably
  bears little relation to what he said - sorry, Rik!)  I don't recall
  what answer he got, perhaps I'd be persuaded if I heard it again.
  
 Added Nishimura to CC.
 
 IMHO, from user point of view, both of
  - having 2 controls as mem controller + swap controller
  - mem + swap controller
 doesn't have much difference. The users will use as they like.

I'm not suggesting either one of those alternatives.

I'm suggesting we have a mem controller (the thing we already have)
and a mem+swap controller (which we don't yet have: a controller
for the total mem+swap of a cgroup); the mem+swap controller likely
making use of much that is in the mem controller, as Paul has said.

(Unfortunately I don't have a good name for this mem+swap.)

I happen to believe that the mem+swap controller would actually be
a lot more useful than the current mem controller, and would expect
many to run with mem+swap controller enabled but mem controller
disabled or unlimited.  How much is mem and how much is swap being
left to global reclaim to decide, not imposed by any cgroup policy.

What I don't like the sound of at all is a swap controller.  Do you
think that a mem controller (limit 1G) and a mem+swap controller
(limit 2G) is equivalent to a mem controller (limit 1G) and a
swap controller (limit 1G)?  No: imagine memory pressure from
outside the cgroup - with the mem+swap controller it can push as
much as suits of the 2G out to swap; whereas with the swap controller,
once 1G is out, it has to stop pushing any more of that cgroup out.
I think that's absurd - but perhaps I just haven't looked, and
I've totally misinterpreted the talk of a swap controller.
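
The difference between the two configurations can be put into numbers with a
small sketch. It is an illustration only: the helper names are hypothetical,
sizes are in MB with 1G taken as 1000M, and it assumes global swap space
itself is not the bottleneck.

```c
#include <assert.h>

static unsigned min_u(unsigned a, unsigned b)
{
    return a < b ? a : b;
}

/*
 * mem+swap controller: only the total is limited, so outside memory
 * pressure may push as much of the group's footprint to swap as suits.
 */
static unsigned swappable_memsw(unsigned footprint, unsigned memsw_limit)
{
    assert(footprint <= memsw_limit);
    return footprint;
}

/*
 * Separate swap controller: once the group's swap limit is reached,
 * nothing more of that group can be pushed out.
 */
static unsigned swappable_swap_ctl(unsigned footprint, unsigned swap_limit)
{
    return min_u(footprint, swap_limit);
}
```

For a group with a 2000M footprint, the mem (1G) + mem+swap (2G) setup lets
global reclaim move up to 2000M out, while the mem (1G) + swap (1G) setup
stops at 1000M.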

 
 From memory controller's point of view, treating mem+swap by the same
 controller makes sense. Because memory controller can check whether we can use
 more swap or not, we can avoid hopeless-scanning of Anon at swap-shortage.
 (By split-lru, I think we can do this avoidance.)

That's a detail I'm not concerned with on this level.

  
 Another-Topic?
 
 In recent servers, memory is big, swap is (relatively) small.

You'll know much more about those common proportions than I do.
I'd wonder why such big memory servers have any swap at all:
to cope with VM management defects we should be fixing?

 And under memory resource controller, the whole swap is easily occupied
 by a group. I want to avoid it.

Why?  I presume because you're thinking it a precious resource.
I don't think its relative smallness makes it more precious.

 
 For users, swap is not precious because it's not fast. 

Yes, and that's my view.

 But for memory reclaiming, swap is precious resource to page out
 anonymous/shmem/tmpfs memory.

I see that makes swap a useful resource, I don't see that it makes
it a precious resource.  We page out to it precisely because it's
less precious than the memory; both users and kernel would much
prefer to keep all the data in memory, but sometimes there isn't
enough memory so we go to swap.

There is just one way in which I see swap as precious, and that
is to get around some VM management stupidity.  If, for example,
on i386 there's a shortage of lowmem and lots of anonymous in lowmem
that we should shift to highmem, then I think it's still the case
that we have to do that balancing via writing out to and reading
in from swap, because nobody has actually hooked up page migration
to do that when appropriate?  But that's an argument for extending
page migration, not for needing a swap controller.

 I think usual system-admin considers swap as some emergency spare of memory.

Yes, I do too.

 I'd like to allow this emergency spare to each cgroup.

We do allow that emergency spare to each cgroup.  Perhaps you're
saying you want to divide it up in advance between the cgroups?
But why?  Sounds like a nice idea (reminds me of what Paul said
about using temporary files), but a solution to what problem?

 (For example, swap is used even if vm.swappiness==0. This is for avoiding
 OOM-Killer under some situation, this behavior is added by Rik.)

Sorry, I don't know what you're referring to there, but again,
suspect it's a detail we don't need to be concerned with here.

 
 == following is another use case I explained to Rik at 23/May/08 ==
 
 IIRC, a man shown his motivation to control swap in OLS2007/BOF as following.
 Consider following system. (and there is no swap controller.) 
 Memory 4G. Swap 1G. with 2 cgroups A, B.
 
 state 1) swap is not used.
   A...memory limit to be 1G  no swap usage memory_usage=0M
   B...memory limit to be 1G  no swap usage memory_usage=0M
 
 state 2) Run a

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Paul Menage

On Tue, Jul 29, 2008 at 5:31 PM, Hugh Dickins [EMAIL PROTECTED] wrote:

 I don't see that I'm denying you a way to guarantee that (though I've
 been thinking more of the limits than the guarantees): I'm not saying
 that you cannot have a mem controller, I'm saying that you can also
 have a mem+swap controller; but that a swap-by-itself controller
 makes no sense to me.

OK, fair enough.


 I think that works until you get to fork: shared files and
 private/anonymous/swap behave differently from then on.


Good point. It works as long as you never do a plain fork() without
immediate execve() though.

Paul

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Hugh Dickins

On Fri, 25 Jul 2008, Paul Menage wrote:
 On Fri, Jul 25, 2008 at 12:46 PM, Hugh Dickins [EMAIL PROTECTED] wrote:
  No, I'm trying to say something stronger than that.  I'm saying,
  as I've said before, that I cannot imagine why anyone would want
  to control swap itself - what they want to control is the total
  of mem+swap.  Swap is a second-class citizen, nobody wants swap
  if they can have mem, so why control it separately?
 
 Scheduling jobs on to machines is much more straightforward when they
 request xGB of memory and yGB of swap rather than just (x+y)GB of
 (memory+swap). We want to be able to guarantee to jobs that they will
 be able to use xGB of real memory.

I don't see that I'm denying you a way to guarantee that (though I've
been thinking more of the limits than the guarantees): I'm not saying
that you cannot have a mem controller, I'm saying that you can also
have a mem+swap controller; but that a swap-by-itself controller
makes no sense to me.

 Actually my preferred approach to swap controlling would be something like:
 
 - allow malloc to support mmapping pages from a temporary file rather
 than mmapping anonymous memory

I think that works until you get to fork: shared files and
private/anonymous/swap behave differently from then on.

Hugh

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Hugh Dickins

On Fri, 25 Jul 2008, Balbir Singh wrote:
 
 I see what you're saying. When you look at Linux right now, we control swap
 independent of memory, so I am not totally opposed to setting swap, instead of
 swap+mem. I might not want to swap from a particular cgroup, in which case, I
 set swap to 0 and risk OOMing, which might be an acceptable trade-off
 depending on my setup. I could easily change this policy on demand and add
 swap if OOMing was no longer OK.

It's taken me a while to understand your point.  I think you're
saying that with a swap controller, you can set the swap limit to 0
on a cgroup if you want to keep it entirely in memory, without setting
any mem limit upon it; whereas with my mem+swap controller, you'd have
to set a mem limit then an equal mem+swap limit to achieve the same
never go to swap effect, and maybe you don't want to set a mem limit.

Hmm, but an unreachably high mem limit, and equal mem+swap limit,
would achieve that effect.  Sorry, I don't think I have understood
(and even if the unreachably high limit didn't work, this seems more
about setting a don't-swap flag than imposing a swap limit).

Hugh

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki

On Wed, 30 Jul 2008 01:16:17 +0100 (BST)
Hugh Dickins [EMAIL PROTECTED] wrote:

 On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
  On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
  Hugh Dickins [EMAIL PROTECTED] wrote:
  
   IIRC Rik expressed the same by pointing out that a cgroup at its
   swap limit would then be forced to grow in mem (until it hits its
   mem limit): so controlling the less precious resource would increase
   pressure on the more precious resource.  (Actually, that probably
   bears little relation to what he said - sorry, Rik!)  I don't recall
   what answer he got, perhaps I'd be persuaded if I heard it again.
   
  Added Nishimura to CC.
  
  IMHO, from user point of view, both of
   - having 2 controls as mem controller + swap controller
   - mem + swap controller
  doesn't have much difference. The users will use as they like.
 
 I'm not suggesting either one of those alternatives.
 
 I'm suggesting we have a mem controller (the thing we already have)
 and a mem+swap controller (which we don't yet have: a controller
 for the total mem+swap of a cgroup); the mem+swap controller likely
 making use of much that is in the mem controller, as Paul has said.
 
Ah, does "mem+swap controller" mean limiting mem+swap by a single limit?
That's an option for me. From the view of global LRU management, it's better.
If we can avoid an accident where the swap is fully used by some silly program,
anything is OK with me.

How about you, Nishimura-san?

The story I told is based on the assumption that there may not be enough swap
space relative to memory. We can ask customers to equip tons of swap when
memory is huge. BTW, what is the maximum swap size now?
Can we extend it if it's small?


<snip>
  state 4) Run a big program on B.
A...memory_usage=500M swap_usage=700M.
B...memory_usage=1G   swap_usage=300M
 If you believe a swap controller would make that better, what limits
 do you suggest?  If you assign A a swap limit of 700M or above, it
 changes nothing; if you assign A a swap limit below 700M, it cannot
 do all the work that it could do in the example.

Of course, set A's swap_limit to 300M, bring its swap pages back into memory,
free the swap entries, and keep A in memory (before B starts).

  But users think why A uses 700M of swap with 500M of free memory
 
 Because at this time A isn't actively using any of that 700M.

That's a weakness of doing everything by automatic detection and an ideal
algorithm. It's just a result of the LRU algorithm, which is not always what
users think is ideal.


Thanks,
-Kame


[Devel] Re: [RFC][PATCH 2/4] res_counter check usage under val

2008-07-29 Thread Pavel Emelyanov

 I get your point. Logically this lock is unnecessary.
 (And seems this patch itself is buggy..(maybe refresh miss))
 
 BTW, I'm sorry if I misunderstand. unsigned long long (on x86-32)
 can be compared safely ? 

Oops... Indeed. 
That discourages me, that we need a spinlock for simple comparisons :(
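
To illustrate why a plain read of an unsigned long long is not safe on x86-32:
the value is typically accessed as two 32-bit halves, so a reader can combine
one half from before an update with one half from after it. The sketch below
only simulates that interleaving in a single thread; the names are
hypothetical and it is not the res_counter code.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit counter as a 32-bit machine would access it: two halves. */
struct split64 {
    uint32_t lo;
    uint32_t hi;
};

static void store64(struct split64 *s, uint64_t v)
{
    s->lo = (uint32_t)v;
    s->hi = (uint32_t)(v >> 32);
}

/*
 * Simulate a reader that loads the low half, is interrupted by a
 * writer storing new_val, and then loads the high half.
 */
static uint64_t torn_read(uint64_t old_val, uint64_t new_val)
{
    struct split64 s;
    uint32_t lo, hi;

    store64(&s, old_val);
    lo = s.lo;              /* reader: first 32-bit load */
    store64(&s, new_val);   /* writer runs in between */
    hi = s.hi;              /* reader: second 32-bit load */
    return ((uint64_t)hi << 32) | lo;
}

static void torn_read_demo(void)
{
    /* The counter goes from 2^32 - 1 to 2^32; the reader sees neither. */
    assert(torn_read(0xFFFFFFFFULL, 0x100000000ULL) == 0x1FFFFFFFFULL);
}
```

The value observed is nearly twice either real value, so a "usage under
limit" check built on it could give the wrong answer, hence the lock.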

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki

On Wed, 30 Jul 2008 10:17:19 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

 On Wed, 30 Jul 2008 01:16:17 +0100 (BST)
 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
   On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
   Hugh Dickins [EMAIL PROTECTED] wrote:
   
IIRC Rik expressed the same by pointing out that a cgroup at its
swap limit would then be forced to grow in mem (until it hits its
mem limit): so controlling the less precious resource would increase
pressure on the more precious resource.  (Actually, that probably
bears little relation to what he said - sorry, Rik!)  I don't recall
what answer he got, perhaps I'd be persuaded if I heard it again.

   Added Nishimura to CC.
   
   IMHO, from user point of view, both of
- having 2 controls as mem controller + swap controller
- mem + swap controller
   doesn't have much difference. The users will use as they like.
  
  I'm not suggesting either one of those alternatives.
  
  I'm suggesting we have a mem controller (the thing we already have)
  and a mem+swap controller (which we don't yet have: a controller
  for the total mem+swap of a cgroup); the mem+swap controller likely
  making use of much that is in the mem controller, as Paul has said.
  
 Ah, what mem+swap controller means is limiting mem+swap by 'a' limit ?
 It's a choice for me. From view of global LRU management, it's better.
 If we can avoid an accident that the swap is fully used by some silly program,
 anything is ok to me.
 
Hmm.

mem+swap controller means that shrinking under the memory resource controller
(try_to_free_mem_cgroup_pages()) should drop only file caches.
(Because kicking pages out to swap will never change the usage.)

Right? Only global-lru can cause swap-out.
Maybe I can add an optimization to do this. Hmm. I should see how OOM works
in some situations.

Thanks,
-Kame






[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki

On Wed, 30 Jul 2008 11:52:26 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 mem+swap controller means a shrink to memory resource controller 
 (try_to_free_mem_cgroup_pages()) should drop only file caches.
 (Because kick-out-to-swap will never changes the usage.)
 
 right ? only global-lru can make a swap.
 maybe I can add optimization to do this. Hmm. I should see how OOM works
 under some situation.
 
(I'm sorry that I'm not a good writer of e-mail.)

A brief summary of the changes to the mem controller.

 - mem+swap controller limits the sum of the # of pages and swap_entries.
 - mem+swap controller just drops file caches when it reaches the limit.
 - under the mem+swap controller, reclaiming Anon pages makes no sense.
   Then,
  - LRU for Anon is not necessary.
  - LRU for tmpfs/shmem is not necessary.
  just showing the accounting is better.
 - we should look at try_to_free_mem_cgroup() again to avoid too much OOM.
   Maybe Retries=5 is too small because we never swap out by ourselves.
   A problem like getting stuck in the ext3 journal can easily make file-cache
   reclaim difficult.
 - need some changes to documentation.
 - Should we have an on/off switch for taking swap into account?
   Or should we implement the mem+swap controller under a different name than
   the memory controller?
   If swap is not accounted, we need to do swap-out in the memory reclaim path
   again.
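
The first three points can be seen in a toy accounting model. This is only a
sketch with hypothetical names, not the memory controller's data structures:
usage is the sum of charged pages and swap entries, so swapping an anonymous
page out moves a charge from one term to the other, while dropping file cache
actually lowers the sum.

```c
#include <assert.h>

/* Hypothetical mem+swap counter; units are pages. */
struct memsw_counter {
    unsigned long pages;         /* resident pages charged to the group */
    unsigned long swap_entries;  /* swap entries charged to the group */
    unsigned long file_pages;    /* part of 'pages' that is file cache */
};

/* The quantity the mem+swap limit applies to. */
static unsigned long memsw_usage(const struct memsw_counter *c)
{
    return c->pages + c->swap_entries;
}

/* Swap out n anonymous pages: the charge moves, the sum stays the same. */
static void swap_out(struct memsw_counter *c, unsigned long n)
{
    assert(n <= c->pages - c->file_pages);
    c->pages -= n;
    c->swap_entries += n;
}

/* Drop n file-cache pages: the only reclaim that reduces mem+swap usage. */
static void drop_file_cache(struct memsw_counter *c, unsigned long n)
{
    assert(n <= c->file_pages);
    c->file_pages -= n;
    c->pages -= n;
}
```

Starting from 100 pages of which 40 are file cache, swapping out 30 anonymous
pages leaves usage at 100, while dropping the 40 file pages brings it to 60.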
   

Thanks,
-Kame


[Devel] [RFC][PATCH 0/2] CR: save/restore a single, simple task

2008-07-29 Thread Oren Laadan


In the recent mini-summit at OLS 2008 and the following days it was
agreed to tackle the checkpoint/restart (CR) by beginning with a very
simple case: save and restore a single task, with simple memory
layout, disregarding other task state such as files, signals etc.

Following these discussions I coded a prototype that can do exactly
that, as a starter. This code adds two system calls - sys_checkpoint
and sys_restart - that a task can call to save and restore its state
respectively. It also demonstrates how the checkpoint image file can
be formatted, as well as show its nested nature (e.g. cr_write_mm()
-> cr_write_vma() nesting).
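
The header-then-payload framing and the mm/vma nesting mentioned above can be
sketched in user space roughly as follows. This is not the code from the
patch: struct rec_hdr, its field widths, the REC_* type codes, struct image
and all function names are assumptions made for illustration. Only the field
names type, len and id follow the way cr_hdr is used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header; the real layout is in ckpt_hdr.h. */
struct rec_hdr {
	int16_t type;  /* record type */
	int16_t len;   /* payload length in bytes */
	uint32_t id;   /* object id, 0 if unused */
};

enum { REC_MM = 1, REC_VMA = 2 };  /* illustrative type codes */

/* In-memory stand-in for the checkpoint image file. */
struct image {
	unsigned char buf[256];
	size_t pos;
};

/* Append n raw bytes to the image; returns 0 or -1 if it does not fit. */
static int image_write(struct image *img, const void *data, size_t n)
{
	assert(img != NULL);
	if (n > sizeof(img->buf) - img->pos)
		return -1;
	memcpy(img->buf + img->pos, data, n);
	img->pos += n;
	return 0;
}

/* Write one record: the header first, then the payload it describes. */
static int write_obj(struct image *img, int16_t type,
		     const void *payload, int16_t len)
{
	struct rec_hdr h = { .type = type, .len = len, .id = 0 };

	if (len < 0)
		return -1;
	if (image_write(img, &h, sizeof(h)) < 0)
		return -1;
	return image_write(img, payload, (size_t)len);
}

/* Nesting: one mm record, followed by one record per vma. */
static int write_mm(struct image *img, const uint64_t *vma_starts,
		    uint32_t nr_vmas)
{
	if (write_obj(img, REC_MM, &nr_vmas, sizeof(nr_vmas)) < 0)
		return -1;
	for (uint32_t i = 0; i < nr_vmas; i++) {
		if (write_obj(img, REC_VMA, &vma_starts[i],
			      sizeof(vma_starts[i])) < 0)
			return -1;
	}
	return 0;
}
```

A reader of such an image can walk it record by record using only type and
len, and the mm record's payload says how many vma records follow.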

The state that is saved/restored is the following:
* some of the task_struct
* some of the thread_struct and thread_info
* the cpu state (including FPU)
* the memory address space

[The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
of Linus's tree (uhhh.. don't ask why), but against tonight's head too].

In the current code, sys_checkpoint will checkpoint the current task,
although the logic exists to checkpoint other tasks (not in the
checkpointee's execution context). A simple loop will extend this to
handle multiple processes. sys_restart restarts the current task, and
with multiple tasks each task will call the syscall independently.
(Actually, to checkpoint outside the context of a task, it is also
necessary to handle restart-block logic when saving/restoring the
thread data).

It takes longer to describe what isn't implemented or supported by
this prototype ... basically everything that isn't as simple as the
above.

As for containers - since we still don't have a representation for a
container, this patch has no notion of a container. The tests for
consistent namespaces (and isolation) are also omitted.

Below are two example programs: one uses checkpoint (called ckpt) and
one uses restart (called rstr). Execute like this (as a superuser):

orenl:~/test$ ./ckpt > out.1
hello, world!  (ret=1)  <-- sys_checkpoint returns positive id
<-- ctrl-c
orenl:~/test$ ./ckpt > out.2
hello, world!  (ret=2)
<-- ctrl-c
orenl:~/test$ ./rstr < out.1
hello, world!  (ret=0)  <-- sys_restart return 0

(if you check the output of ps, you'll see that rstr changed its
name to ckpt, as expected).

Hoping this will accelerate the discussion. Comments are welcome.
Let the fun begin :)

Oren.


== ckpt.c 

#define _GNU_SOURCE	/* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd_32.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
	pid_t pid = getpid();
	int ret;

	/* checkpoint ourselves; the image is written to stdout */
	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
	if (ret < 0)
		perror("checkpoint");

	/* stderr, so the message does not end up in the image */
	fprintf(stderr, "hello, world!  (ret=%d)\n", ret);

	/* spin until interrupted (ctrl-c) */
	while (1)
		;

	return 0;
}

== rstr.c 

#define _GNU_SOURCE	/* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd_32.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
	pid_t pid = getpid();
	int ret;

	/* restart from the image read on stdin */
	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
	if (ret < 0)
		perror("restart");

	/* on success, execution continues in the restored task instead */
	printf("should not reach here !\n");

	return 0;
}

[Devel] [RFC][PATCH 1/2] CR: introduce sys_checkpoint and sys_restore

2008-07-29 Thread Oren Laadan


Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file.

First create a template for both syscalls: they take a file descriptor
(for the image file) and flags as arguments. For sys_checkpoint the
first argument identifies the target container; for sys_restart it will
identify the checkpoint image.

Signed-off-by: Oren Laadan [EMAIL PROTECTED]

---
  Makefile                           |    2 +-
  arch/x86/kernel/syscall_table_32.S |    2 ++
  include/asm-x86/unistd_32.h        |    2 ++
  3 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index 4bcd1cf..6e2334c 100644
--- a/Makefile
+++ b/Makefile
@@ -611,7 +611,7 @@ export mod_strip_cmd


  ifeq ($(KBUILD_EXTMOD),)
-core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y += kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/

  vmlinux-dirs  := $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
.long sys_dup3  /* 330 */
.long sys_pipe2
.long sys_inotify_init1
+   .long sys_checkpoint
+   .long sys_restart
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index d739467..88bdec4 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -338,6 +338,8 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_checkpoint	333
+#define __NR_restart		334

  #ifdef __KERNEL__

-- 
1.5.4.3


[Devel] [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-29 Thread Oren Laadan


Expand the template sys_checkpoint and sys_restart to be able to dump
and restore a single task. The task's address space may consist of only
private, simple vma's - anonymous or file-mapped.

This big patch adds a mechanism to transfer data between kernel or user
space to and from the file given by the caller (sys.c), alloc/setup/free
of the checkpoint/restart context (sys.c), output wrappers and basic
checkpoint handling (checkpoint.c), memory dump (ckpt_mem.c), input
wrappers and basic restart handling (restart.c), and finally the memory
restore (rstr_mem.c).

Signed-off-by: Oren Laadan [EMAIL PROTECTED]

---
  ckpt/Makefile |1 +
  ckpt/checkpoint.c |  366 ++
  ckpt/ckpt.h   |   78 ++
  ckpt/ckpt_hdr.h   |  143 ++
  ckpt/ckpt_mem.c   |  421 +
  ckpt/ckpt_mem.h   |   32 
  ckpt/restart.c|  328 +
  ckpt/rstr_mem.c   |  415 
  ckpt/sys.c|  239 ++
  9 files changed, 2023 insertions(+), 0 deletions(-)
  create mode 100644 ckpt/Makefile
  create mode 100644 ckpt/checkpoint.c
  create mode 100644 ckpt/ckpt.h
  create mode 100644 ckpt/ckpt_hdr.h
  create mode 100644 ckpt/ckpt_mem.c
  create mode 100644 ckpt/ckpt_mem.h
  create mode 100644 ckpt/restart.c
  create mode 100644 ckpt/rstr_mem.c
  create mode 100644 ckpt/sys.c

diff --git a/ckpt/Makefile b/ckpt/Makefile
new file mode 100644
index 000..41f205d
--- /dev/null
+++ b/ckpt/Makefile
@@ -0,0 +1 @@
+obj-y += sys.o checkpoint.o restart.o ckpt_mem.o rstr_mem.o
diff --git a/ckpt/checkpoint.c b/ckpt/checkpoint.c
new file mode 100644
index 000..1698a35
--- /dev/null
+++ b/ckpt/checkpoint.c
@@ -0,0 +1,366 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <asm/ptrace.h>
+
+#if defined (CONFIG_X86)
+#include <asm/i387.h>
+#endif
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/**
+ * cr_get_fname - return pathname of a given file
+ * @path: path of the file
+ * @root: root against which the pathname is resolved
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ *
+ * if the buffer provided by the caller is too small, allocate a new
+ * buffer; caller should call cr_put_fname() for cleanup
+ */
+char *cr_get_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	char *fname;
+
+	fname = __d_path(path, root, buf, *n);
+
+	if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) {
+		if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))
+			return ERR_PTR(-ENOMEM);
+		fname = __d_path(path, root, buf, PAGE_SIZE);
+		if (IS_ERR(fname))
+			free_pages((unsigned long) buf, 0);
+	}
+	if (!IS_ERR(fname))
+		*n = (buf + *n - fname);
+
+	return fname;
+}
+
+/**
+ * cr_put_fname - (possibly) cleanup pathname buffer
+ * @buf: original buffer that was given to cr_get_fname()
+ * @fname: resulting pathname from cr_get_fname()
+ * @n: length of original buffer
+ */
+void cr_put_fname(char *buf, char *fname, int n)
+{
+	if (fname && (fname < buf || fname >= buf + n))
+		free_pages((unsigned long) buf, 0);
+}
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	if ((ret = cr_kwrite(ctx, h, sizeof(*h))) < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_str - write a string record
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @n: string length
+ */
+int cr_write_str(struct cr_ctx *ctx, char *str, int n)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STR;
+	h.len = n;
+	h.id = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_hdr(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = ctx->tbuf;
+	struct timeval ktv;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.id = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = 0x00a2d200;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->version = 1;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	return cr_write_obj(ctx, &h, hh);
+}
+

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki

On Wed, 30 Jul 2008 12:11:15 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

 On Wed, 30 Jul 2008 11:52:26 +0900
 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
  mem+swap controller means a shrink to memory resource controller 
  (try_to_free_mem_cgroup_pages()) should drop only file caches.
  (Because kick-out-to-swap will never changes the usage.)
  
  right ? only global-lru can make a swap.
  maybe I can add optimization to do this. Hmm. I should see how OOM works
  under some situation.
  
 (I'm sorry that I'm not a good writer of e-mail.)
 
 A brief summary about changes to mem controller.
 
  - mem+swap controller which limits the sum of pages and swap_entries.
  - mem+swap controller just drops file caches when it reaches limit.
  - under mem+swap controller, reclaiming Anon pages makes no sense.
    Then,
   - LRU for Anon is not necessary.
   - LRU for tmpfs/shmem is not necessary.
   just showing account is better.
  - we should see try_to_free_mem_cgroup_pages() again to avoid too much OOM.
    Maybe Retries=5 is too small because we never do swap under us.
    a problem like stuck-in-ext3-journal can easily make file-cache reclaim
    difficult.
  - need some changes to documentation.
  - Should we have on/off switch of taking swap into account ?
    or should we implement mem+swap controller under a different name than
    memory controller ?
    If swap is not accounted, we need to do swap-out in memory reclaiming path,
    again.

Then, mem+swap controller finally means
 - under mem+swap controller, program works with no swap. Only global LRU
   may make pages swapped-out.
 - If swap-accounting-mode is off, swap can be used unlimitedly.

Hmm, sounds a bit different from what I want. How about others ?

Thanks,
-Kame



 
 Thanks,
 -Kame
 
 


[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Daisuke Nishimura

On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 On Wed, 30 Jul 2008 12:11:15 +0900
 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 
  On Wed, 30 Jul 2008 11:52:26 +0900
  KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
   mem+swap controller means a shrink to memory resource controller 
   (try_to_free_mem_cgroup_pages()) should drop only file caches.
   (Because kick-out-to-swap will never changes the usage.)
   
   right ? only global-lru can make a swap.
   maybe I can add optimization to do this. Hmm. I should see how OOM works
   under some situation.
   
I'm thinking mem+swap controller in a different way: an add-on to
mem controller, just as current swap controller.
I mean adding memory.(mem+swap)_limit.

  (I'm sorry that I'm not a good writer of e-mail.)
  
  A brief summary about changes to mem controller.
  
   - mem+swap controller which limits the sum of pages and swap_entries.
   - mem+swap controller just drops file caches when it reaches limit.
   - under mem+swap controller, reclaiming Anon pages makes no sense.
     Then,
    - LRU for Anon is not necessary.
    - LRU for tmpfs/shmem is not necessary.
    just showing account is better.
   - we should see try_to_free_mem_cgroup_pages() again to avoid too much OOM.
     Maybe Retries=5 is too small because we never do swap under us.
     a problem like stuck-in-ext3-journal can easily make file-cache reclaim
     difficult.
   - need some changes to documentation.
   - Should we have on/off switch of taking swap into account ?
     or should we implement mem+swap controller under a different name than
     memory controller ?
     If swap is not accounted, we need to do swap-out in memory reclaiming path,
     again.
 
 Then, mem+swap controller finally means
  - under mem+swap controller, program works with no swap. Only global LRU
may make pages swapped-out.
  - If swap-accounting-mode is off, swap can be used unlimitedly.
 
 Hmm, sounds a bit different from what I want. How about others ?
 

Thanks,
Daisuke Nishimura.

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-29 Thread KOSAKI Motohiro

Hi

 Expand the template sys_checkpoint and sys_restart to be able to dump
 and restore a single task. The task's address space may consist of only
 private, simple vma's - anonymous or file-mapped.
 
 This big patch adds a mechanism to transfer data between kernel or user
 space to and from the file given by the caller (sys.c), alloc/setup/free
 of the checkpoint/restart context (sys.c), output wrappers and basic
 checkpoint handling (checkpoint.c), memory dump (ckpt_mem.c), input
 wrappers and basic restart handling (restart.c), and finally the memory
 restore (rstr_mem.c).
 
 Signed-off-by: Oren Laadan [EMAIL PROTECTED]

Please write documentation describing the memory dump file format,
and split save and restore into two patches.



[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki

On Wed, 30 Jul 2008 13:58:03 +0900
Daisuke Nishimura [EMAIL PROTECTED] wrote:

 On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
  On Wed, 30 Jul 2008 12:11:15 +0900
  KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
  
   On Wed, 30 Jul 2008 11:52:26 +0900
   KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
mem+swap controller means a shrink to memory resource controller 
(try_to_free_mem_cgroup_pages()) should drop only file caches.
(Because kick-out-to-swap will never changes the usage.)

right ? only global-lru can make a swap.
maybe I can add optimization to do this. Hmm. I should see how OOM works
under some situation.

 I'm thinking mem+swap controller in a different way: an add-on to
 mem controller, just as current swap controller.
 I mean adding memory.(mem+swap)_limit.
 
Hmm ? adding a control file other than
 - memory.limit_in_bytes
?

Thanks,
-Kame


[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki

Sorry for many mails ;(

I think I misunderstood something...

Is the following right ?

A brief summary about changes in memory controller.
 - memory.limit_in_bytes works as it is now.
 - new parameter: memory.limit_in_bytes_includes_swap will be added.
   + memory.limit_in_bytes_includes_swap controls the total amount of
     RAM + SWAP,
   + memory.limit_in_bytes <= memory.limit_in_bytes_includes_swap

As a result.
 - memory controller works as it is but doesn't use too much swap.
 - global-lru cannot be affected by controller's parameter.


Hmm, seems reasonable. minor problem is how-to-handle 2 counts/limits ?
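
One possible way to handle the 2 counts/limits, as a rough user-space sketch
only (not kernel code). All names here are hypothetical, and it assumes the
RAM limit must not be larger than the RAM+swap limit. Each charge is checked
against both counters, and swap-out lowers only the RAM counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical usage/limit pair, counted in pages. */
struct counter {
	unsigned long usage;
	unsigned long limit;
};

/* Hypothetical group with two counters: RAM only, and RAM + swap. */
struct mem_group {
	struct counter mem;    /* RAM only */
	struct counter total;  /* RAM + swap */
};

enum charge_result {
	CHARGE_OK,
	CHARGE_OVER_MEM,    /* RAM limit hit: reclaim or swap-out may help */
	CHARGE_OVER_TOTAL,  /* RAM+swap limit hit: swap-out cannot help */
};

/* Set both limits; rejects a RAM limit above the RAM+swap limit. */
static bool group_init(struct mem_group *g, unsigned long mem_limit,
		       unsigned long total_limit)
{
	assert(g != NULL);
	if (mem_limit > total_limit)
		return false;
	g->mem.usage = 0;
	g->mem.limit = mem_limit;
	g->total.usage = 0;
	g->total.limit = total_limit;
	return true;
}

/* Charge n pages of RAM against both counters, or charge nothing. */
static enum charge_result group_charge(struct mem_group *g, unsigned long n)
{
	assert(g != NULL);
	if (g->total.usage + n > g->total.limit)
		return CHARGE_OVER_TOTAL;
	if (g->mem.usage + n > g->mem.limit)
		return CHARGE_OVER_MEM;
	g->mem.usage += n;
	g->total.usage += n;
	return CHARGE_OK;
}

/* Swap-out of n pages: RAM usage drops, RAM+swap usage stays the same. */
static bool group_swap_out(struct mem_group *g, unsigned long n)
{
	assert(g != NULL);
	if (n > g->mem.usage)
		return false;
	g->mem.usage -= n;
	return true;
}
```

The two failure codes matter for reclaim: on CHARGE_OVER_MEM swapping out is
still useful, on CHARGE_OVER_TOTAL only freeing memory (for example dropping
file cache) can help.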

BTW, does anyone have good names ?
  (example) memory.memory_limits_in_bytes.  (for accounting memory)
            memory.total_limits_in_bytes.   (for accounting memory+swap)

Thanks,
-Kame


On Wed, 30 Jul 2008 12:11:15 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 A brief summary about changes to mem controller.
 
  - mem+swap controller which limits the sum of pages and swap_entries.
  - mem+swap controller just drops file caches when it reaches limit.
  - under mem+swap controller, reclaiming Anon pages makes no sense.
    Then,
   - LRU for Anon is not necessary.
   - LRU for tmpfs/shmem is not necessary.
   just showing account is better.
  - we should see try_to_free_mem_cgroup_pages() again to avoid too much OOM.
    Maybe Retries=5 is too small because we never do swap under us.
    a problem like stuck-in-ext3-journal can easily make file-cache reclaim
    difficult.
  - need some changes to documentation.
  - Should we have on/off switch of taking swap into account ?
    or should we implement mem+swap controller under a different name than
    memory controller ?
    If swap is not accounted, we need to do swap-out in memory reclaiming path,
    again.

 
 Thanks,
 -Kame
 
 


