[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008 17:46:45 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote:
> IIRC Rik expressed the same by pointing out that a cgroup at its swap
> limit would then be forced to grow in mem (until it hits its mem limit):
> so controlling the less precious resource would increase pressure on the
> more precious resource. (Actually, that probably bears little relation to
> what he said - sorry, Rik!) I don't recall what answer he got, perhaps
> I'd be persuaded if I heard it again.

Added Nishimura to CC.

IMHO, from the user's point of view, there is not much difference between
 - having 2 controls, as a mem controller + a swap controller
 - a mem + swap controller
Users will use them as they like.

From the memory controller's point of view, treating mem+swap in the same controller makes sense. Because the memory controller can check whether we can use more swap or not, we can avoid hopeless scanning of anon pages when swap is short. (With split-LRU, I think we can do this avoidance.)

Another topic? In recent servers, memory is big and swap is (relatively) small. And under the memory resource controller, the whole swap is easily occupied by one group. I want to avoid that.

For users, swap is not precious because it's not fast. But for memory reclaim, swap is a precious resource for paging out anonymous/shmem/tmpfs memory.

I think the usual system admin considers swap an emergency spare of memory. I'd like to allow this emergency spare to each cgroup. (For example, swap is used even if vm.swappiness==0. This is to avoid the OOM killer in some situations; this behavior was added by Rik.)

== following is another use case I explained to Rik on 23/May/08 ==

IIRC, a man showed his motivation to control swap at the OLS2007 BOF as follows.

Consider the following system (with no swap controller): memory 4G, swap 1G, with 2 cgroups A and B.

state 1) Swap is not used.
  A: memory limit 1G, no swap usage, memory_usage=0M
  B: memory limit 1G, no swap usage, memory_usage=0M
state 2) Run a big program on A.
  A: memory limit 1G, tries to use 1.7G, so it uses 700 MB of swap.
     memory_usage=1G swap_usage=700M
  B: memory_usage=0M
state 3) Some of the programs in A end.
  A: memory_usage=500M swap_usage=700M
  B: memory_usage=0M
state 4) Run a big program on B.
  A: memory_usage=500M swap_usage=700M
  B: memory_usage=1G swap_usage=300M

Group B can only use 1.3G because of the unfair swap use of group A. But users will ask why A uses 700M of swap while it has 500M of free memory.
==

Thanks,
-Kame
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
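The arithmetic of the four states can be replayed with a small model. This is only a sketch under stated assumptions: sizes are in MB with 1G taken as 1000M (as the example's 1.7G / 700M figures imply), and the struct and function names are invented for illustration, not kernel interfaces.

```c
#include <assert.h>

/* Hypothetical model of one cgroup; all sizes in MB. */
struct group {
    long mem_limit;   /* per-cgroup memory limit */
    long mem_usage;   /* resident memory */
    long swap_usage;  /* swap held by this group */
};

/* One global swap area shared by every group (no swap controller). */
struct swap_pool {
    long total;
    long used;
};

/*
 * Grow a group by `want` MB: fill memory up to the limit first, then
 * spill into whatever swap is still free. Returns how much actually fit.
 */
static long grow(struct group *g, struct swap_pool *sp, long want)
{
    assert(want >= 0);

    long to_mem = g->mem_limit - g->mem_usage;
    if (to_mem > want)
        to_mem = want;
    g->mem_usage += to_mem;

    long to_swap = want - to_mem;
    long swap_free = sp->total - sp->used;
    if (to_swap > swap_free)
        to_swap = swap_free;
    g->swap_usage += to_swap;
    sp->used += to_swap;

    return to_mem + to_swap;
}

/* Programs exit and free resident memory; swapped-out pages stay in swap. */
static void release_mem(struct group *g, long amount)
{
    assert(amount >= 0 && amount <= g->mem_usage);
    g->mem_usage -= amount;
}
```

With a 1000M pool, growing A by 1700M, releasing 500M from A, and then growing B by 1700M leaves B with only 1300M, which is the unfairness the example describes.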
[Devel] Re: [PATCH 2/6] user namespaces: move user_ns from nsproxy into user struct
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> Serge E. Hallyn [EMAIL PROTECTED] writes:
>
> > From ec5f54faf5afd16cb6cef40ebaaf3da25989d185 Mon Sep 17 00:00:00 2001
> > From: Serge Hallyn [EMAIL PROTECTED]
> > Date: Thu, 24 Jul 2008 17:52:41 -0500
> > Subject: [PATCH 2/6] user namespaces: move user_ns from nsproxy into user struct
> >
> > When we get the sysfs support needed to support fair user scheduling
> > along with user namespaces, then we will need to be able to get the
> > user namespace from the user struct. So we need the user_ns to be a
> > part of struct user. Once we can access it from tsk->user, we no
> > longer have a use for tsk->nsproxy->user_ns.
>
> Is this true? Even in the general case of supporting setuid and setgid
> and everything else that potentially is in the user namespace?

Sure. At any time we can get tsk->user->user_ns, and from that we can get tsk->user->user_ns->creator.

> I certainly support the cleanups you have made for the reasons you
> describe. I think however that there are no technical reasons not to
> have nsproxy->user_ns after the changes have been made.

Well I ended up tossing it to clarify my thinking about the refcounting. With that done, I think we could safely not have nsproxy pin the user_ns, trusting the tsk->user to pin it instead. But keeping it in two places just seems needlessly complicated.

I also agree that there are no technical reasons for keeping nsproxy->user_ns at the moment. What is your preference?

> > When a user_namespace is created, the user which created it is marked
> > as its 'creator'. The user_namespace pins the creator. Each userid in
> > a user_ns pins the user_ns. This keeps refcounting nice and simple.
>
> Eric
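The pinning rule at the end of the patch description (a namespace pins its creator, each user pins its namespace) can be modelled in userspace. The following is a toy sketch, not the kernel's code: the types, helper names, and plain integer refcounts are assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_user_ns;

/* Toy stand-in for the kernel's per-uid structure. */
struct toy_user {
    int refcount;
    struct toy_user_ns *ns;     /* namespace this uid lives in (pinned) */
};

/* Toy stand-in for a user namespace. */
struct toy_user_ns {
    int refcount;
    struct toy_user *creator;   /* pinned; NULL for the initial namespace */
};

static void toy_user_put(struct toy_user *u);

static void toy_ns_put(struct toy_user_ns *ns)
{
    assert(ns->refcount > 0);
    if (--ns->refcount == 0) {
        if (ns->creator)
            toy_user_put(ns->creator);  /* drop the pin on the creator */
        free(ns);
    }
}

static void toy_user_put(struct toy_user *u)
{
    assert(u->refcount > 0);
    if (--u->refcount == 0) {
        struct toy_user_ns *ns = u->ns;
        free(u);
        toy_ns_put(ns);                 /* drop the pin on the namespace */
    }
}

/* A new user takes one reference on its namespace. Returned with refcount 1. */
static struct toy_user *toy_user_alloc(struct toy_user_ns *ns)
{
    struct toy_user *u = calloc(1, sizeof(*u));
    assert(u != NULL);
    u->refcount = 1;
    ns->refcount++;
    u->ns = ns;
    return u;
}

/* A new namespace takes one reference on its creator. Returned with refcount 1. */
static struct toy_user_ns *toy_ns_create(struct toy_user *creator)
{
    struct toy_user_ns *ns = calloc(1, sizeof(*ns));
    assert(ns != NULL);
    ns->refcount = 1;
    if (creator)
        creator->refcount++;
    ns->creator = creator;
    return ns;
}
```

In this model, dropping the last user of a child namespace unwinds the chain in order: the child namespace goes away, which releases its creator, which in turn releases the parent namespace.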
[Devel] Re: [PATCH 4/6] user namespaces: add user_ns to super block
Quoting Matt Helsley ([EMAIL PROTECTED]): On Mon, 2008-07-28 at 14:53 -0700, Eric W. Biederman wrote: Serge E. Hallyn [EMAIL PROTECTED] writes: From 420d6e81ce29d7a6fe3ab7b43c1171e105f8b697 Mon Sep 17 00:00:00 2001 From: Serge Hallyn [EMAIL PROTECTED] Date: Thu, 24 Jul 2008 18:00:54 -0500 Subject: [PATCH 4/6] user namespaces: add user_ns to super block Add a user_ns to the super_block, and set it to the user_ns of the process which mounted the fs. In generic_permission() compare the current user_ns to that of the user_ns which mounted the inode's filesystem. I don't think this is the right approach. When we had the conversation earlier this was conceptually rejected as it prevents nfs superblock unification. We really want to store this in the vfsmount and pass the user namespace down from there to where we are going to use it if at all possible. The vfsmount also appears necessary if we are ever going to support multiple user namespaces per filesystem as the filesystem still need to know which user namespace to interpret it's data in. The filesystem can figure that out based on current's context, no? With the per-sb user_ns, the default behavior is indeed very limited, but since you want to move all the user_ns functionality into the filesystem, the fs can tag vfsmounts based on the new remount you had talked about before. Would this require passing the vfsmount to the filesystems themselves, or would they be within the VFS code only? If not wholly within the VFS I wonder if Al Viro would object to this. He's resisted past attempts to pass the vfsmount structs into more filesystem code paths and I'm guessing that could affect whether or not this approach can be implemented. Right, that's the main reason we might want to pursue the per-sb approach. Otherwise I would prefer the per-vfsmount approach. Eric, if you think the per-vfsmount fight is worth fighting, then by all means let's do it and see what happens. 
So in that case ignore patches 3-5 from this set :) -serge
[Devel] Re: [PATCH 4/6] user namespaces: add user_ns to super block
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
> Matt Helsley [EMAIL PROTECTED] writes:
>
> > Would this require passing the vfsmount to the filesystems themselves,
> > or would they be within the VFS code only?
>
> The interesting bit is the user_namespace contained in the vfsmount. We
> can pass that down. I think semantically it makes sense for a filesystem
> mount to only operate in a single mount namespace.
>
> > If not wholly within the VFS I wonder if Al Viro would object to this.
> > He's resisted past attempts to pass the vfsmount structs into more
> > filesystem code paths and I'm guessing that could affect whether or
> > not this approach can be implemented.
>
> Dave Hansen raised that concern when we were talking about it earlier.
> Since we just care about a property of the mount it isn't a big deal.
>
> Actually thinking about this a little farther it may be simplest to have
> the mnt_namespace capture the user_namespace, although that doesn't seem
> to map semantically very well with cloning of the filesystem.

Interesting idea. I'm going to pursue that.

So at a do_new_mount(), mnt->user_ns = current->user_ns. At do_loopback(), we ask the fs whether the new_mnt->user_ns can be set to current->user_ns. If not, it keeps the original, meaning that current will always receive user nobody access to the fs. Otherwise, the fs is saying that it knows how to properly convert userids from current->user->user_ns to ones which make sense in the original_mnt->user_ns.

> This is very much a question of how do we map the uid/gids store in the
> filesystem into the uids/gids in the kernel. Which user namespace do
> they belong in.
>
> Especially in the case of read only mounts we can safely share a
> filesystem between user_namespaces with no changes to the filesystem.
> Which I suspect is the first case we want to allow as that is a
> tremendous savings in space if you have lots of instances of the same
> distro, and people have been doing it with /usr for years.
>
> Eric
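The do_new_mount() / do_loopback() rule proposed above can be sketched as a small userspace model. Everything here is hypothetical: the `toy_` types, the `can_map_user_ns` hook, and the "nobody" check are invented to illustrate the proposal and do not correspond to any existing VFS interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_user_ns { int id; };

struct toy_fs_type {
    /*
     * Hypothetical hook: can this filesystem translate ids from the
     * caller's namespace into the namespace of the original mount?
     */
    bool (*can_map_user_ns)(const struct toy_user_ns *from,
                            const struct toy_user_ns *to);
};

struct toy_mount {
    const struct toy_fs_type *fs;
    const struct toy_user_ns *user_ns;
};

/* Example hook for a filesystem that claims it can always translate. */
static bool toy_map_always(const struct toy_user_ns *from,
                           const struct toy_user_ns *to)
{
    (void)from;
    (void)to;
    return true;
}

/* New mount: tagged with the namespace of the mounting task. */
static struct toy_mount toy_new_mount(const struct toy_fs_type *fs,
                                      const struct toy_user_ns *current_ns)
{
    assert(fs != NULL && current_ns != NULL);
    struct toy_mount m = { fs, current_ns };
    return m;
}

/* Bind mount: adopt the caller's namespace only if the fs agrees. */
static struct toy_mount toy_loopback(const struct toy_mount *orig,
                                     const struct toy_user_ns *current_ns)
{
    struct toy_mount m = *orig;
    if (orig->fs->can_map_user_ns &&
        orig->fs->can_map_user_ns(current_ns, orig->user_ns))
        m.user_ns = current_ns;
    return m;
}

/* Callers outside the mount's namespace only get "nobody" access. */
static bool toy_treated_as_nobody(const struct toy_mount *m,
                                  const struct toy_user_ns *current_ns)
{
    return m->user_ns != current_ns;
}
```

A filesystem without the hook keeps the original namespace on a bind mount, so a task in another namespace is treated as nobody; a filesystem that provides the hook lets the bind mount be retagged.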
[Devel] Re: [PATCH 0/2] sys_restore prototype
Quoting Eric W. Biederman ([EMAIL PROTECTED]): Serge E. Hallyn [EMAIL PROTECTED] writes: We were talking this morning about what trivial patchset to begin with to get a start on checkpoint and restart. We thought that rather than start with checkpoint, maybe we should start with something that reads a checkpoint file and restarts a single task. In this case, restart means it sets the process id and executes the file which are found in the checkpoint file. So here's what we whipped up for a half hour this morning, and during some of Mark's talk this afternoon. It refuses to run if it isn't the container init, so you must unshare your pidns before calling sys_restore(). A reasonable approximation. Dave Hansen made a good point when he asked how do we graft a restored checkpoint into the rest of the system. Requiring us to unshare everything we intend to unshare before restore achieves this easily, and you are following in that model. That leads to an interesting implication. We don't need to set the pid of the first process. At most we can verify that the pid is the same. If we have unshared the pid namespace the pid will be 1 and the needed pid of the first process will be 1. More later. Good point. Sounds like our trivial prototype was still way over-featureful :) -serge ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 2/2] sys_restore: set the pid number
Quoting Pavel Emelyanov ([EMAIL PROTECTED]):
> Serge E. Hallyn wrote:
> > Set the pid number for a restored task. This is purely a toy,
> > as it only sets the pidnr in the lowest level pid namespace.
> >
> > Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
> > ---
> >  kernel/fork.c |    5 +
> >  kernel/pid.c  |   19 +++
> >  2 files changed, 24 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index 0e43f69..41c46d2 100644
> > --- a/kernel/fork.c
> > +++ b/kernel/fork.c
> > @@ -1685,6 +1685,8 @@ int unshare_files(struct files_struct **displaced)
> >  char buf[CKPT_SIZE];
> >  char exe_filename[PAGE_SIZE];
> >
> > +extern int choose_pidmap(struct pid *pid, int new);
> > +
> >  /*
> >   * userspace will already have made us a new pidns
> >   */
> > @@ -1717,6 +1719,9 @@ long do_restore(unsigned int fd,
> >  	if (nr_scanned != 2)
> >  		return -EINVAL;
> >
> > +	ret = choose_pidmap(task_pid(current), pid);
> > +	if (!ret)
> > +		return -EAGAIN;
> >  	ret = do_execve(exe_filename, argv, envp, regs);
> >
> >  out:
> > diff --git a/kernel/pid.c b/kernel/pid.c
> > index 30bd5d4..88a5e2a 100644
> > --- a/kernel/pid.c
> > +++ b/kernel/pid.c
> > @@ -122,6 +122,25 @@ static void free_pidmap(struct upid *upid)
> >  	atomic_inc(&map->nr_free);
> >  }
> >
> > +int choose_pidmap(struct pid *pid, int new)
> > +{
> > +	struct pidmap *map;
> > +	int level = pid->level;
> > +	struct upid *upid = &pid->numbers[level];
> > +	struct pid_namespace *pid_ns = upid->ns;
> > +	int old = upid->nr;
> > +
> > +	map = &pid_ns->pidmap[new/BITS_PER_PAGE];
> > +	if (!test_and_set_bit(new, map->page)) {
>
> This and...
>
> > +		map = &pid_ns->pidmap[old/BITS_PER_PAGE];
> > +		clear_bit(old, map->page);
>
> this are both a bit buggy, since the bit number on the page is not
> the same as the pid id itself ;)

Oops. Yes, I see.

thanks,
-serge

> > +		upid->nr = new;
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> >  static int alloc_pidmap(struct pid_namespace *pid_ns)
> >  {
> >  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
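Pavel's point is that the pid selects both a bitmap page and a bit within that page, so the raw pid cannot be used as the bit index. A userspace sketch of the corrected indexing follows; it is not the kernel code. The `toy_` names, the fixed 4096-byte page, and the non-atomic bit helpers are assumptions for illustration (the kernel uses atomic bitops on real pidmap pages).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define TOY_PAGE_SIZE       4096
#define TOY_BITS_PER_PAGE   (TOY_PAGE_SIZE * 8)
#define TOY_BITS_PER_LONG   (sizeof(unsigned long) * CHAR_BIT)
#define TOY_PIDMAP_ENTRIES  2   /* enough for pids 0..65535 */

struct toy_pidmap {
    unsigned long page[TOY_BITS_PER_PAGE / TOY_BITS_PER_LONG];
};

/* Non-atomic stand-in for test_and_set_bit(): returns the old bit value. */
static bool toy_test_and_set(unsigned long *bits, unsigned int nr)
{
    unsigned long mask = 1UL << (nr % TOY_BITS_PER_LONG);
    unsigned long *word = &bits[nr / TOY_BITS_PER_LONG];
    bool was_set = (*word & mask) != 0;

    *word |= mask;
    return was_set;
}

/* Non-atomic stand-in for clear_bit(). */
static void toy_clear(unsigned long *bits, unsigned int nr)
{
    bits[nr / TOY_BITS_PER_LONG] &= ~(1UL << (nr % TOY_BITS_PER_LONG));
}

/*
 * Move a reservation from old_nr to new_nr. The page is chosen by
 * nr / BITS_PER_PAGE and the bit inside it by nr % BITS_PER_PAGE.
 * Returns true on success, false if new_nr is already taken.
 */
static bool toy_choose_pid(struct toy_pidmap *maps, int old_nr, int new_nr)
{
    assert(old_nr >= 0 && old_nr / TOY_BITS_PER_PAGE < TOY_PIDMAP_ENTRIES);
    assert(new_nr >= 0 && new_nr / TOY_BITS_PER_PAGE < TOY_PIDMAP_ENTRIES);

    struct toy_pidmap *map = &maps[new_nr / TOY_BITS_PER_PAGE];
    if (toy_test_and_set(map->page, new_nr % TOY_BITS_PER_PAGE))
        return false;

    map = &maps[old_nr / TOY_BITS_PER_PAGE];
    toy_clear(map->page, old_nr % TOY_BITS_PER_PAGE);
    return true;
}
```

For example, pid 40000 lands in page 1 at bit 40000 % 32768; indexing page 1 with the raw value 40000 would run past the end of the page.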
[Devel] Re: [PATCH 4/6] user namespaces: add user_ns to super block
Serge E. Hallyn [EMAIL PROTECTED] writes: The filesystem can figure that out based on current's context, no? With the per-sb user_ns, the default behavior is indeed very limited, but since you want to move all the user_ns functionality into the filesystem, the fs can tag vfsmounts based on the new remount you had talked about before. I guess I want the filesystem to coordinate. Would this require passing the vfsmount to the filesystems themselves, or would they be within the VFS code only? If not wholly within the VFS I wonder if Al Viro would object to this. He's resisted past attempts to pass the vfsmount structs into more filesystem code paths and I'm guessing that could affect whether or not this approach can be implemented. Right, that's the main reason we might want to pursue the per-sb approach. Otherwise I would prefer the per-vfsmount approach. Eric, if you think the per-vfsmount fight is worth fighting, then by all means let's do it and see what happens. So in that case ignore patches 3-5 from this set :) My intuitive sense is that the treating the handling of different user namespaces in the same filesystem is a trivial case of the superblock merging that nfs performs. And that we will preserve existing semantics much better if the user namespace is stored in the vfsmount. This allows mount propagation and friends to work without surprises. The practical limitation I see of storing things outside of the vfsmount is when do you setup the mapping to handle a new user namespace? So yes. I think it is worth the discussion. Let's not move the vfsmount down, and just move the user namespace pointer down as that is fundamentally what we care about. Eric ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote: On Fri, 25 Jul 2008 17:46:45 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote: IIRC Rik expressed the same by pointing out that a cgroup at its swap limit would then be forced to grow in mem (until it hits its mem limit): so controlling the less precious resource would increase pressure on the more precious resource. (Actually, that probably bears little relation to what he said - sorry, Rik!) I don't recall what answer he got, perhaps I'd be persuaded if I heard it again. Added Nishimura to CC. IMHO, from user point of view, both of - having 2 controls as mem controller + swap controller - mem + swap controller doesn't have much difference. The users will use as they like. I'm not suggesting either one of those alternatives. I'm suggesting we have a mem controller (the thing we already have) and a mem+swap controller (which we don't yet have: a controller for the total mem+swap of a cgroup); the mem+swap controller likely making use of much that is in the mem controller, as Paul has said. (Unfortunately I don't have a good name for this mem+swap.) I happen to believe that the mem+swap controller would actually be a lot more useful than the current mem controller, and would expect many to run with mem+swap controller enabled but mem controller disabled or unlimited. How much is mem and how much is swap being left to global reclaim to decide, not imposed by any cgroup policy. What I don't like the sound of at all is a swap controller. Do you think that a mem controller (limit 1G) and a mem+swap controller (limit 2G) is equivalent to a mem controller (limit 1G) and a swap controller (limit 1G)? No: imagine memory pressure from outside the cgroup - with the mem+swap controller it can push as much as suits of the 2G out to swap; whereas with the swap controller, once 1G is out, it has to stop pushing any more of that cgroup out. 
I think that's absurd - but perhaps I just haven't looked, and I've totally misinterpreted the talk of a swap controller. From memory controller's point of view, treating mem+swap by the same controller makes sense. Because memory controller can check wheter we can use more swap or not, we can avoid hopeless-scanning of Anon at swap-shortage. (By split-lru, I think we can do this avoidance.) That's a detail I'm not concerned with on this level. Another-Topic? In recent servers, memory is big, swap is (relatively) small. You'll know much more about those common proportions than I do. I'd wonder why such big memory servers have any swap at all: to cope with VM management defects we should be fixing? And under memory resource controller, the whole swap is easily occupied by a group. I want to avoid it. Why? I presume because you're thinking it a precious resource. I don't think its relative smallness makes it more precious. For users, swap is not precious because it's not fast. Yes, and that's my view. But for memory reclaiming, swap is precious resource to page out anonymous/shmem/tmpfs memory. I see that makes swap a useful resource, I don't see that it makes it a precious resource. We page out to it precisely because it's less precious than the memory; both users and kernel would much prefer to keep all the data in memory, but sometimes there isn't enough memory so we go to swap. There is just one way in which I see swap as precious, and that is to get around some VM management stupidity. If, for example, on i386 there's a shortage of lowmem and lots of anonymous in lowmem that we should shift to highmem, then I think it's still the case that we have to do that balancing via writing out to and reading in from swap, because nobody has actually hooked up page migration to do that when appropriate? But that's an argument for extending page migration, not for needing a swap controller. I think usual system-admin considers swap as some emergency spare of memory. 
Yes, I do too. I'd like to allow this emergency spare to each cgroup. We do allow that emergency spare to each cgroup. Perhaps you're saying you want to divide it up in advance between the cgroups? But why? Sounds like a nice idea (reminds me of what Paul said about using temporary files), but a solution to what problem? (For example, swap is used even if vm.swappiness==0. This is for avoiding OOM-Killer under some situation, this behavior is added by Rik.) Sorry, I don't know what you're referring to there, but again, suspect it's a detail we don't need to be concerned with here. == following is another use case I explained to Rik at 23/May/08 == IIRC, a man shown his motivation to controll swap in OLS2007/BOF as following. Consider following system. (and there is no swap controller.) Memory 4G. Swap 1G. with 2 cgroups A, B. state 1) swap is not used. Amemory limit to be 1G no swap usage memory_usage=0M Bmemory limit to be 1G no swap usage memory_usage=0M state 2) Run a
[Devel] Re: memrlimit controller merge to mainline
On Tue, Jul 29, 2008 at 5:31 PM, Hugh Dickins [EMAIL PROTECTED] wrote:
> I don't see that I'm denying you a way to guarantee that (though I've
> been thinking more of the limits than the guarantees): I'm not saying
> that you cannot have a mem controller, I'm saying that you can also have
> a mem+swap controller; but that a swap-by-itself controller makes no
> sense to me.

OK, fair enough.

> I think that works until you get to fork: shared files and
> private/anonymous/swap behave differently from then on.

Good point. It works as long as you never do a plain fork() without immediate execve() though.

Paul
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Paul Menage wrote: On Fri, Jul 25, 2008 at 12:46 PM, Hugh Dickins [EMAIL PROTECTED] wrote: No, I'm trying to say something stronger than that. I'm saying, as I've said before, that I cannot imagine why anyone would want to control swap itself - what they want to control is the total of mem+swap. Swap is a second-class citizen, nobody wants swap if they can have mem, so why control it separately? Scheduling jobs on to machines is much more straightforward when they request xGB of memory and yGB of swap rather than just (x+y)GB of (memory+swap). We want to be able to guarantee to jobs that they will be able to use xGB of real memory. I don't see that I'm denying you a way to guarantee that (though I've been thinking more of the limits than the guarantees): I'm not saying that you cannot have a mem controller, I'm saying that you can also have a mem+swap controller; but that a swap-by-itself controller makes no sense to me. Actually my preferred approach to swap controlling would be something like: - allow malloc to support mmaping pages from a temporary file rather than mmapping anonymous memory I think that works until you get to fork: shared files and private/anonymous/swap behave differently from then on. Hugh ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Balbir Singh wrote: I see what your saying. When you look at Linux right now, we control swap independent of memory, so I am not totally opposed to setting swap, instead of swap+mem. I might not want to swap from a particular cgroup, in which case, I set swap to 0 and risk OOMing, which might be an acceptable trade-off depending on my setup. I could easily change this policy on demand and add swap if OOMing was no longer OK. It's taken me a while to understand your point. I think you're saying that with a swap controller, you can set the swap limit to 0 on a cgroup if you want to keep it entirely in memory, without setting any mem limit upon it; whereas with my mem+swap controller, you'd have to set a mem limit then an equal mem+swap limit to achieve the same never go to swap effect, and maybe you don't want to set a mem limit. Hmm, but an unreachably high mem limit, and equal mem+swap limit, would achieve that effect. Sorry, I don't think I have understood (and even if the unreachably high limit didn't work, this seems more about setting a don't-swap flag than imposing a swap limit). Hugh ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
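One way to read the two schemes being compared in this thread is as two different limit checks. The sketch below is an interpretation, not an implementation of either controller: the struct and function names are invented, sizes are in MB, and global swap availability is ignored.

```c
#include <assert.h>

/* Scheme 1 (hypothetical): a mem limit plus a limit on mem+swap. */
struct memsw_limits { long mem_limit; long memsw_limit; };

/* Scheme 2 (hypothetical): a mem limit plus a separate swap limit. */
struct split_limits { long mem_limit; long swap_limit; };

/* Swap a group can reach by growing past its own mem limit. */
static long own_swap_room_memsw(struct memsw_limits l)
{
    assert(l.memsw_limit >= l.mem_limit);
    return l.memsw_limit - l.mem_limit;
}

static long own_swap_room_split(struct split_limits l)
{
    return l.swap_limit;
}

/*
 * Under outside memory pressure, how many of the group's `mem` resident
 * MB may still be pushed out, given `swap` MB already swapped?
 */
static long pushable_memsw(struct memsw_limits l, long mem, long swap)
{
    assert(mem + swap <= l.memsw_limit);
    return mem;     /* moving mem to swap leaves mem+swap unchanged */
}

static long pushable_split(struct split_limits l, long mem, long swap)
{
    long room = l.swap_limit - swap;
    if (room < 0)
        room = 0;
    return mem < room ? mem : room;
}
```

With mem=1000 and swap=1000 already in use, limits of (1000, 2000) under the first scheme still allow the remaining 1000 to be pushed out, while limits of (1000, 1000) under the second allow nothing more. Setting the mem+swap limit equal to the mem limit gives zero swap room from the group's own growth, which is the effect a swap limit of 0 has in the second scheme.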
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 01:16:17 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote: On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote: On Fri, 25 Jul 2008 17:46:45 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote: IIRC Rik expressed the same by pointing out that a cgroup at its swap limit would then be forced to grow in mem (until it hits its mem limit): so controlling the less precious resource would increase pressure on the more precious resource. (Actually, that probably bears little relation to what he said - sorry, Rik!) I don't recall what answer he got, perhaps I'd be persuaded if I heard it again. Added Nishimura to CC. IMHO, from user point of view, both of - having 2 controls as mem controller + swap controller - mem + swap controller doesn't have much difference. The users will use as they like. I'm not suggesting either one of those alternatives. I'm suggesting we have a mem controller (the thing we already have) and a mem+swap controller (which we don't yet have: a controller for the total mem+swap of a cgroup); the mem+swap controller likely making use of much that is in the mem controller, as Paul has said. Ah, what mem+swap controller means is limitiing mem+swap by 'a' limit ? It's a choice for me. From view of global LRU management, it's better. If we can avoid an accident that the swap is fully used by some silly program, anything is ok to me. How about you, Nishimura-san ? A story I talked is based on the assumption that there may be not enough swap space against memory. We can ask cutomers to equip tons of swap when memory is huge. BTW, what is the maximum swap size now ? Can we extend it if it's small ? snip state 4) Run a big program on B. A...memory_usage=500M swap_usage=700M. B...memory_usage=1G swap_usage=300M If you believe a swap controller would make that better, what limits do you suggest? 
> If you assign A a swap limit of 700M or above, it changes nothing; if
> you assign A a swap limit below 700M, it cannot do all the work that it
> could do in the example.

Of course: set A's swap_limit to 300M, bring its swap pages back into memory, free the swap entries, and keep A in memory (before B starts).

> > But users think why A uses 700M of swap with 500M of free memory
>
> Because at this time A isn't actively using any of that 700M.

That's a weakness of doing everything by automatic detection and an ideal algorithm. It's just a result of the LRU algorithm, which is not always what users think is ideal.

Thanks,
-Kame
[Devel] Re: [RFC][PATCH 2/4] res_counter check usage under val
> I get your point. Logically this lock is unnecessary. (And it seems this
> patch itself is buggy, maybe from a missed refresh.)
>
> BTW, I'm sorry if I misunderstand, but can an unsigned long long be
> compared safely on x86-32?

Oops... Indeed. It is discouraging that we need a spinlock for simple comparisons :(
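The concern is that a 32-bit CPU typically reads and writes a 64-bit value as two separate 32-bit words, so an unlocked reader can see one half old and one half new. The sketch below simulates that interleaving deterministically in a single thread; the `split64` type and the store order (low word first) are assumptions chosen for illustration, since the real order depends on the compiler.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit counter as a 32-bit CPU handles it: two separate words. */
struct split64 {
    uint32_t lo;
    uint32_t hi;
};

static uint64_t split64_join(struct split64 v)
{
    return ((uint64_t)v.hi << 32) | v.lo;
}

/*
 * Simulate a writer updating old_val to new_val with two 32-bit stores
 * while an unlocked reader samples the value between the two stores.
 * Returns what that reader would have seen.
 */
static uint64_t observe_mid_update(uint64_t old_val, uint64_t new_val)
{
    struct split64 usage = { (uint32_t)old_val, (uint32_t)(old_val >> 32) };

    usage.lo = (uint32_t)new_val;               /* writer: first store */
    uint64_t seen = split64_join(usage);        /* reader runs here */
    usage.hi = (uint32_t)(new_val >> 32);       /* writer: second store */

    assert(split64_join(usage) == new_val);
    return seen;
}
```

Going from 0xFFFFFFFF to 0x100000000, the simulated reader sees 0, which is far below both the old and the new value, so a `usage > limit` check made at that moment could give the wrong answer. Taking the lock around the comparison rules this out.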
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 10:17:19 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Wed, 30 Jul 2008 01:16:17 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote: On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote: On Fri, 25 Jul 2008 17:46:45 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote: IIRC Rik expressed the same by pointing out that a cgroup at its swap limit would then be forced to grow in mem (until it hits its mem limit): so controlling the less precious resource would increase pressure on the more precious resource. (Actually, that probably bears little relation to what he said - sorry, Rik!) I don't recall what answer he got, perhaps I'd be persuaded if I heard it again. Added Nishimura to CC. IMHO, from user point of view, both of - having 2 controls as mem controller + swap controller - mem + swap controller doesn't have much difference. The users will use as they like. I'm not suggesting either one of those alternatives. I'm suggesting we have a mem controller (the thing we already have) and a mem+swap controller (which we don't yet have: a controller for the total mem+swap of a cgroup); the mem+swap controller likely making use of much that is in the mem controller, as Paul has said. Ah, what mem+swap controller means is limitiing mem+swap by 'a' limit ? It's a choice for me. From view of global LRU management, it's better. If we can avoid an accident that the swap is fully used by some silly program, anything is ok to me. Hmm. mem+swap controller means a shrink to memory resource controller (try_to_free_mem_cgroup_pages()) should drop only file caches. (Because kick-out-to-swap will never changes the usage.) right ? only global-lru can make a swap. maybe I can add optimization to do this. Hmm. I should see how OOM works under some situation. Thanks, -Kame ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 11:52:26 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
> mem+swap controller means a shrink to memory resource controller
> (try_to_free_mem_cgroup_pages()) should drop only file caches.
> (Because kick-out-to-swap will never changes the usage.) right ?
> only global-lru can make a swap. maybe I can add optimization to do this.
> Hmm. I should see how OOM works under some situation.

(I'm sorry that I'm not a good writer of e-mail.)

A brief summary of the changes to the mem controller:
 - The mem+swap controller limits the sum of the number of pages and swap entries.
 - The mem+swap controller just drops file caches when it reaches its limit.
 - Under the mem+swap controller, reclaiming anon pages makes no sense. Then,
   - an LRU for anon is not necessary.
   - an LRU for tmpfs/shmem is not necessary; just showing the accounting is better.
 - We should look at try_to_free_mem_cgroup() again to avoid too many OOMs.
   Maybe Retries=5 is too small, because we never swap out by ourselves.
   A problem like getting stuck in the ext3 journal can easily make file-cache reclaim difficult.
 - The documentation needs some changes.
 - Should we have an on/off switch for taking swap into account? Or should we implement the mem+swap controller under a different name than the memory controller? If swap is not accounted, we need to do swap-out in the memory reclaim path again.

Thanks,
-Kame
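The first three points of the summary follow from how the sum is charged: swapping an anon page out only converts a page charge into a swap-entry charge. A minimal model of that accounting is sketched below; the struct, the function names, and the page-granular counters are assumptions for illustration and not the memory controller's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cgroup state under a mem+swap limit (units: pages). */
struct memsw_group {
    long limit;         /* limit on pages + swap entries */
    long anon_pages;
    long file_pages;
    long swap_entries;
};

static long memsw_usage(const struct memsw_group *g)
{
    return g->anon_pages + g->file_pages + g->swap_entries;
}

/* Swap-out turns a page charge into a swap-entry charge: the sum is unchanged. */
static void memsw_swap_out(struct memsw_group *g, long n)
{
    assert(n >= 0 && n <= g->anon_pages);
    g->anon_pages -= n;
    g->swap_entries += n;
}

/* Dropping file cache is the only reclaim in this model that lowers usage. */
static long memsw_drop_file_cache(struct memsw_group *g, long n)
{
    if (n > g->file_pages)
        n = g->file_pages;
    g->file_pages -= n;
    return n;
}

/*
 * Charge n new anon pages, dropping file cache first if the limit would
 * be exceeded. Returns false when reclaim cannot make room (the OOM path).
 */
static bool memsw_charge_anon(struct memsw_group *g, long n)
{
    long over = memsw_usage(g) + n - g->limit;

    if (over > 0 && memsw_drop_file_cache(g, over) < over)
        return false;
    g->anon_pages += n;
    return true;
}
```

Once the file cache is exhausted, a group at its limit has nothing left to reclaim in this model, which is why the retry count and OOM behaviour come up in the list above.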
[Devel] [RFC][PATCH 0/2] CR: save/restore a single, simple task
In the recent mini-summit at OLS 2008 and the following days it was agreed to tackle checkpoint/restart (CR) by beginning with a very simple case: save and restore a single task, with a simple memory layout, disregarding other task state such as files, signals etc.

Following these discussions I coded a prototype that can do exactly that, as a starter. This code adds two system calls - sys_checkpoint and sys_restart - that a task can call to save and restore its state respectively. It also demonstrates how the checkpoint image file can be formatted, and shows its nested nature (e.g. cr_write_mm() -> cr_write_vma() nesting).

The state that is saved/restored is the following:
* some of the task_struct
* some of the thread_struct and thread_info
* the cpu state (including FPU)
* the memory address space

[The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43 of Linus's tree (uhhh.. don't ask why), but against tonight's head too].

In the current code, sys_checkpoint will checkpoint the current task, although the logic exists to checkpoint other tasks (not in the checkpointee's execution context). A simple loop will extend this to handle multiple processes. sys_restart restarts the current task, and with multiple tasks each task will call the syscall independently. (Actually, to checkpoint outside the context of a task, it is also necessary to handle restart-block logic when saving/restoring the thread data.)

It takes longer to describe what isn't implemented or supported by this prototype ... basically everything that isn't as simple as the above.

As for containers - since we still don't have a representation for a container, this patch has no notion of a container. The tests for consistent namespaces (and isolation) are also omitted.

Below are two example programs: one uses checkpoint (called ckpt) and one uses restart (called rstr). Execute like this (as a superuser):

orenl:~/test$ ./ckpt > out.1
hello, world! (ret=1)        <-- sys_checkpoint returns positive id
                             <-- ctrl-c
orenl:~/test$ ./ckpt > out.2
hello, world! (ret=2)
                             <-- ctrl-c
orenl:~/test$ ./rstr < out.1
hello, world! (ret=0)        <-- sys_restart return 0

(if you check the output of ps, you'll see that rstr changed its name to ckpt, as expected).

Hoping this will accelerate the discussion. Comments are welcome. Let the fun begin :)

Oren.

== ckpt.c

#define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd_32.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
	pid_t pid = getpid();
	int ret;

	/* write the checkpoint image of this task to stdout */
	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
	if (ret < 0)
		perror("checkpoint");

	fprintf(stderr, "hello, world! (ret=%d)\n", ret);

	while (1)
		;

	return 0;
}

== rstr.c

#define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <asm/unistd_32.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
	pid_t pid = getpid();
	int ret;

	/* read the checkpoint image from stdin and restart from it */
	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
	if (ret < 0)
		perror("restart");

	printf("should not reach here !\n");

	return 0;
}
[Devel] [RFC][PATCH 1/2] CR: introduce sys_checkpoint and sys_restore
Create trivial sys_checkpoint and sys_restart system calls. They will make it possible to checkpoint and restart an entire container, to and from a checkpoint image file.

First create a template for both syscalls: they take a file descriptor (for the image file) and flags as arguments. For sys_checkpoint the first argument identifies the target container; for sys_restart it will identify the checkpoint image.

Signed-off-by: Oren Laadan [EMAIL PROTECTED]
---
 Makefile                           |    2 +-
 arch/x86/kernel/syscall_table_32.S |    2 ++
 include/asm-x86/unistd_32.h        |    2 ++
 3 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/Makefile b/Makefile
index 4bcd1cf..6e2334c 100644
--- a/Makefile
+++ b/Makefile
@@ -611,7 +611,7 @@ export mod_strip_cmd
 ifeq ($(KBUILD_EXTMOD),)
-core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/
+core-y		+= kernel/ mm/ fs/ ipc/ security/ crypto/ block/ ckpt/
 vmlinux-dirs	:= $(patsubst %/,%,$(filter %/, $(init-y) $(init-m) \
 		     $(core-y) $(core-m) $(drivers-y) $(drivers-m) \
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index d44395f..5543136 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -332,3 +332,5 @@ ENTRY(sys_call_table)
 	.long sys_dup3			/* 330 */
 	.long sys_pipe2
 	.long sys_inotify_init1
+	.long sys_checkpoint
+	.long sys_restart
diff --git a/include/asm-x86/unistd_32.h b/include/asm-x86/unistd_32.h
index d739467..88bdec4 100644
--- a/include/asm-x86/unistd_32.h
+++ b/include/asm-x86/unistd_32.h
@@ -338,6 +338,8 @@
 #define __NR_dup3		330
 #define __NR_pipe2		331
 #define __NR_inotify_init1	332
+#define __NR_checkpoint		333
+#define __NR_restart		334
 #ifdef __KERNEL__
--
1.5.4.3
[Devel] [RFC][PATCH 2/2] CR: handle a single task with private memory maps
Expand the template sys_checkpoint and sys_restart to be able to dump and restore a single task. The task's address space may consist of only private, simple vma's - anonymous or file-mapped.

This big patch adds a mechanism to transfer data between kernel or user space to and from the file given by the caller (sys.c), alloc/setup/free of the checkpoint/restart context (sys.c), output wrappers and basic checkpoint handling (checkpoint.c), memory dump (ckpt_mem.c), input wrappers and basic restart handling (restart.c), and finally the memory restore (rstr_mem.c).

Signed-off-by: Oren Laadan [EMAIL PROTECTED]
---
 ckpt/Makefile     |    1
 ckpt/checkpoint.c |  366
 ckpt/ckpt.h       |   78
 ckpt/ckpt_hdr.h   |  143
 ckpt/ckpt_mem.c   |  421
 ckpt/ckpt_mem.h   |   32
 ckpt/restart.c    |  328
 ckpt/rstr_mem.c   |  415
 ckpt/sys.c        |  239
 9 files changed, 2023 insertions(+), 0 deletions(-)
 create mode 100644 ckpt/Makefile
 create mode 100644 ckpt/checkpoint.c
 create mode 100644 ckpt/ckpt.h
 create mode 100644 ckpt/ckpt_hdr.h
 create mode 100644 ckpt/ckpt_mem.c
 create mode 100644 ckpt/ckpt_mem.h
 create mode 100644 ckpt/restart.c
 create mode 100644 ckpt/rstr_mem.c
 create mode 100644 ckpt/sys.c

diff --git a/ckpt/Makefile b/ckpt/Makefile
new file mode 100644
index 000..41f205d
--- /dev/null
+++ b/ckpt/Makefile
@@ -0,0 +1 @@
+obj-y += sys.o checkpoint.o restart.o ckpt_mem.o rstr_mem.o
diff --git a/ckpt/checkpoint.c b/ckpt/checkpoint.c
new file mode 100644
index 000..1698a35
--- /dev/null
+++ b/ckpt/checkpoint.c
@@ -0,0 +1,366 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <asm/ptrace.h>
+
+#if defined (CONFIG_X86)
+#include <asm/i387.h>
+#endif
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/**
+ * cr_get_fname - return pathname of a given file
+ * @file: file pointer
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ *
+ * if the buffer provided by the caller is too small, allocate a new
+ * buffer; caller should call cr_put_fname() for cleanup
+ */
+char *cr_get_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+	char *fname;
+
+	fname = __d_path(path, root, buf, *n);
+
+	if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) {
+		if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))
+			return ERR_PTR(-ENOMEM);
+		fname = __d_path(path, root, buf, PAGE_SIZE);
+		if (IS_ERR(fname))
+			free_pages((unsigned long) buf, 0);
+	}
+	if (!IS_ERR(fname))
+		*n = (buf + *n - fname);
+
+	return fname;
+}
+
+/**
+ * cr_put_fname - (possibly) cleanup pathname buffer
+ * @buf: original buffer that was given to cr_get_fname()
+ * @fname: resulting pathname from cr_get_fname()
+ * @n: length of original buffer
+ */
+void cr_put_fname(char *buf, char *fname, int n)
+{
+	if (fname && (fname < buf || fname >= buf + n))
+		free_pages((unsigned long) buf, 0);
+}
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+	int ret;
+
+	if ((ret = cr_kwrite(ctx, h, sizeof(*h))) < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_str - write a string record
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @n: string length
+ */
+int cr_write_str(struct cr_ctx *ctx, char *str, int n)
+{
+	struct cr_hdr h;
+
+	h.type = CR_HDR_STR;
+	h.len = n;
+	h.id = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_hdr(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = ctx->tbuf;
+	struct timeval ktv;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.id = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = 0x00a2d200;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->version = 1;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	return cr_write_obj(ctx, &h, hh);
+}
+
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 12:11:15 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
On Wed, 30 Jul 2008 11:52:26 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

mem+swap controller means a shrink to memory resource controller (try_to_free_mem_cgroup_pages()) should drop only file caches. (Because kick-out-to-swap will never change the usage.)

right ? only global-lru can make a swap. maybe I can add optimization to do this. Hmm. I should see how OOM works under some situation.

(I'm sorry that I'm not a good writer of e-mail.)

A brief summary about changes to mem controller.
- mem+swap controller which limits the sum of the number of pages and swap_entries.
- mem+swap controller just drops file caches when it reaches limit.
- under mem+swap controller, reclaiming Anon pages makes no sense. Then,
  - LRU for Anon is not necessary.
  - LRU for tmpfs/shmem is not necessary. just showing account is better.
- we should see try_to_free_mem_cgroup() again to avoid too much OOM. Maybe Retries=5 is too small because we never do swap under us. a problem like stuck-in-ext3-journal can easily make file-cache reclaim difficult.
- need some changes to documentation.
- Should we have on/off switch of taking swap into account ? or should we implement mem+swap controller in different name than memory controller ? If swap is not accounted, we need to do swap-out in memory reclaiming path, again.

Then, mem+swap controller finally means
- under mem+swap controller, program works with no swap. Only global LRU may make pages swapped-out.
- If swap-accounting-mode is off, swap can be used unlimitedly.

Hmm, sounds a bit different from what I want. How about others ?

Thanks,
-Kame

Thanks,
-Kame
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
On Wed, 30 Jul 2008 12:11:15 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
On Wed, 30 Jul 2008 11:52:26 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

mem+swap controller means a shrink to memory resource controller (try_to_free_mem_cgroup_pages()) should drop only file caches. (Because kick-out-to-swap will never change the usage.)

right ? only global-lru can make a swap. maybe I can add optimization to do this. Hmm. I should see how OOM works under some situation.

I'm thinking mem+swap controller in a different way: an add-on to mem controller, just as current swap controller. I mean adding memory.(mem+swap)_limit.

(I'm sorry that I'm not a good writer of e-mail.)

A brief summary about changes to mem controller.
- mem+swap controller which limits the sum of the number of pages and swap_entries.
- mem+swap controller just drops file caches when it reaches limit.
- under mem+swap controller, reclaiming Anon pages makes no sense. Then,
  - LRU for Anon is not necessary.
  - LRU for tmpfs/shmem is not necessary. just showing account is better.
- we should see try_to_free_mem_cgroup() again to avoid too much OOM. Maybe Retries=5 is too small because we never do swap under us. a problem like stuck-in-ext3-journal can easily make file-cache reclaim difficult.
- need some changes to documentation.
- Should we have on/off switch of taking swap into account ? or should we implement mem+swap controller in different name than memory controller ? If swap is not accounted, we need to do swap-out in memory reclaiming path, again.

Then, mem+swap controller finally means
- under mem+swap controller, program works with no swap. Only global LRU may make pages swapped-out.
- If swap-accounting-mode is off, swap can be used unlimitedly.

Hmm, sounds a bit different from what I want. How about others ?

Thanks,
Daisuke Nishimura.
[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps
Hi

Expand the template sys_checkpoint and sys_restart to be able to dump and restore a single task. The task's address space may consist of only private, simple vma's - anonymous or file-mapped.

This big patch adds a mechanism to transfer data between kernel or user space to and from the file given by the caller (sys.c), alloc/setup/free of the checkpoint/restart context (sys.c), output wrappers and basic checkpoint handling (checkpoint.c), memory dump (ckpt_mem.c), input wrappers and basic restart handling (restart.c), and finally the memory restore (rstr_mem.c).

Signed-off-by: Oren Laadan [EMAIL PROTECTED]

Please write documentation describing the memory dump file format, and split save and restore into two patches.
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 13:58:03 +0900 Daisuke Nishimura [EMAIL PROTECTED] wrote:
On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
On Wed, 30 Jul 2008 12:11:15 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
On Wed, 30 Jul 2008 11:52:26 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

mem+swap controller means a shrink to memory resource controller (try_to_free_mem_cgroup_pages()) should drop only file caches. (Because kick-out-to-swap will never changes the usage.)

right ? only global-lru can make a swap. maybe I can add optimization to do this. Hmm. I should see how OOM works under some situation.

I'm thinking mem+swap controller in a different way: an add-on to mem controller, just as current swap controller. I mean adding memory.(mem+swap)_limit.

Hmm ? adding a control file other than
- memory.limit_in_bytes ?

Thanks,
-Kame
[Devel] Re: memrlimit controller merge to mainline
Sorry for many mails ;(

I think I misunderstood something... Is the following right ?

A brief summary about changes in memory controller.
- memory.limit_in_bytes works as it is now.
- new parameter: memory.limit_in_bytes_includes_swap will be added.
  + memory.limit_in_bytes_includes_swap controls the total amount of RAM + SWAP,
  + memory.limit_in_bytes <= memory.limit_in_bytes_includes_swap

As a result.
- memory controller works as it is but doesn't use too much swap.
- global-lru cannot be affected by controller's parameter.

Hmm, seems reasonable. A minor problem is how to handle 2 counts/limits ?

BTW, does anyone have good names ?
(example)
memory.memory_limits_in_bytes. (for accounting memory)
memory.total_limits_in_bytes. (for accounting memory+swap)

Thanks,
-Kame

On Wed, 30 Jul 2008 12:11:15 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

A brief summary about changes to mem controller.
- mem+swap controller which limits the sum of the number of pages and swap_entries.
- mem+swap controller just drops file caches when it reaches limit.
- under mem+swap controller, reclaiming Anon pages makes no sense. Then,
  - LRU for Anon is not necessary.
  - LRU for tmpfs/shmem is not necessary. just showing account is better.
- we should see try_to_free_mem_cgroup() again to avoid too much OOM. Maybe Retries=5 is too small because we never do swap under us. a problem like stuck-in-ext3-journal can easily make file-cache reclaim difficult.
- need some changes to documentation.
- Should we have on/off switch of taking swap into account ? or should we implement mem+swap controller in different name than memory controller ? If swap is not accounted, we need to do swap-out in memory reclaiming path, again.

Thanks,
-Kame