[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote: Hi Balbir, Andrew included the memrlimit controller in his latest set of patches to Linus for mainline. I've asked Linus to drop all 238 patches. I'll be resending them minus the offending memrlimit patches. Did I mention that conferences suck? ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Paul Menage wrote: So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. What do you (and others on the containers list) think? Should we ask Andrew/Linus to hold off on this for now? My preference would be to do that until we have someone who can stand up with a concrete scenario where they want to use this in a real environment. I see Andrew has already acted, so it's now moot. But I'd like to say that I do agree with you and the conclusion to hold off for now. I was a bit alarmed earlier to see those patches sailing on through; but realized that I'd done very little to substantiate my hatred of the whole thing, and decided that I didn't feel strongly enough to stand in the way now. But I am glad you've stepped in, thank you. (Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.) By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y but no use of it, when doing swapoff a week ago. Not investigated at all, I'm afraid, but at a guess it might come from memrlimit work placing too much faith in the mm_users count - swapoff is only one of several places which have to inc/dec mm_users for some reason. BUG: unable to handle kernel paging request at 6b6b6b8b IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 *pde = Oops: [#1] PREEMPT SMP last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device thermal ac battery button Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7) EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0 EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29 EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000 ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000) Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 7c033ec8 0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3 Call Trace: [78161323] ? exit_mmap+0xaf/0x133 [781226b1] ? mmput+0x4c/0xba [78165ce3] ? try_to_unuse+0x20b/0x3f5 [78371534] ? _spin_unlock+0x22/0x3c [7816636a] ? sys_swapoff+0x17b/0x37c [78102d95] ? sysenter_past_esp+0x6a/0xa5 === Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c Hugh ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
Paul Menage wrote: Hi Balbir, Andrew included the memrlimit controller in his latest set of patches to Linus for mainline. Although the memrlimit controller basically works as intended, my impression from the mini-summit on Tuesday is that our consensus is that this still doesn't have concrete practical use-cases yet: - avoiding swap over-use is better handled by the forthcoming swap controller - applications that can usefully handle mmap() returning NULL don't really exist yet (and since the system as a whole allows address space overcommit limits, if it was practical/useful to write such apps then presumably they would already exist) There are applications that can/need to handle overcommit, just that we are not aware of them fully. Immediately after our meeting, I was pointed to http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. I did disagree in the meeting and there is also the use case of the feature forming the infrastructure for other rlimit controllers. What do you (and others on the containers list) think? Should we ask Andrew/Linus to hold off on this for now? My preference would be to do that until we have someone who can stand up with a concrete scenario where they want to use this in a real environment. While we can argue about use cases, the feature needs more testing and I am OK holding off/reverting the merge to make it more stable and that would give us more time to argue on its usefulness. To say that overcommit handling is not useful is wrong. Meanwhile, I'll go back and look at the bug report that Hugh has posted and also look at building an mlock controller on top of memrlimits. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
Andrew Morton wrote: On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote: Hi Balbir, Andrew included the memrlimit controller in his latest set of patches to Linus for mainline. I've asked Linus to drop all 238 patches. I'll be resending them minus the offending memrlimit patches. Sorry for making your work more harder. Did I mention that conferences suck? Not yet, but we know now :) -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
Andrew Morton wrote: On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote: Hi Balbir, Andrew included the memrlimit controller in his latest set of patches to Linus for mainline. I've asked Linus to drop all 238 patches. I'll be resending them minus the offending memrlimit patches. Sorry for making your work more harder. Did I mention that conferences suck? Not yet, but we know now :) -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote: On Fri, 25 Jul 2008, Paul Menage wrote: So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. What do you (and others on the containers list) think? Should we ask Andrew/Linus to hold off on this for now? My preference would be to do that until we have someone who can stand up with a concrete scenario where they want to use this in a real environment. I see Andrew has already acted, so it's now moot. But I'd like to say that I do agree with you and the conclusion to hold off for now. I was a bit alarmed earlier to see those patches sailing on through; but realized that I'd done very little to substantiate my hatred of the whole thing, and decided that I didn't feel strongly enough to stand in the way now. But I am glad you've stepped in, thank you. (Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.) We will have a memory+swap controller working together. By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y but no use of it, when doing swapoff a week ago. Not investigated at all, I'm afraid, but at a guess it might come from memrlimit work placing too much faith in the mm_users count - swapoff is only one of several places which have to inc/dec mm_users for some reason. I'll try and reproduce the problem right away. I've been running some kernbench on top of memrlimit (but not with a lot of stress or trying to swapoff the swap device). BUG: unable to handle kernel paging request at 6b6b6b8b IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 *pde = Oops: [#1] PREEMPT SMP last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device thermal ac battery button Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7) EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0 EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29 EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000 ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000) Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 7c033ec8 0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3 Call Trace: [78161323] ? exit_mmap+0xaf/0x133 [781226b1] ? mmput+0x4c/0xba [78165ce3] ? try_to_unuse+0x20b/0x3f5 [78371534] ? _spin_unlock+0x22/0x3c [7816636a] ? sys_swapoff+0x17b/0x37c [78102d95] ? sysenter_past_esp+0x6a/0xa5 === Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c Hugh I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as() created the problem, it's most likely related to mm-owner not being correct and we are dereferencing the wrong memory. Thanks for the bug report, I'll look further. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote: (Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.) Agreed that a swap controller without a memory controller doesn't make much sense, but a memory controller without a swap controller can make sense on machines that don't intend to use swap. So if they were separate controllers, we'd use the proposed cgroup dependency features to make the swap controller depend on the memory controller - in which case you'd only be able to mount the swap controller on a hierarchy that also had the memory controller, and the swap controller would be able to make use of the page ownership information. It's more of a modularity issue than a functionality issue, I think - the swap controller and memory controller are tracking fundamentally different things (space on disk versus pages in memory), and the only dependency between them is the memory controller tracking the ownership of a page and providing it to the swap controller. By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y but no use of it, when doing swapoff a week ago. Not investigated at all, I'm afraid, but at a guess it might come from memrlimit work placing too much faith in the mm_users count - swapoff is only one of several places which have to inc/dec mm_users for some reason. BUG: unable to handle kernel paging request at 6b6b6b8b Possibly the mm-owner tracking breaks in that case, if the last user exits while swapoff is occurring without relinquishing ownership? That looks as though mm-owner points to a task that had been poisoned after being freed. That could be awkward to fix :-( Paul ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Fri, Jul 25, 2008 at 8:30 AM, Balbir Singh [EMAIL PROTECTED] wrote: There are applications that can/need to handle overcommit, just that we are not aware of them fully. Immediately after our meeting, I was pointed to http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit Thanks, that'll be interesting to take a look at. So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. I did disagree in the meeting Yes, but (my impression of) the overall feeling in the meeting was that it wasn't yet the right time to push it to mainline. and there is also the use case of the feature forming the infrastructure for other rlimit controllers. Agreed, but that's something for the future. Paul ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
Paul Menage wrote: On Fri, Jul 25, 2008 at 8:30 AM, Balbir Singh [EMAIL PROTECTED] wrote: There are applications that can/need to handle overcommit, just that we are not aware of them fully. Immediately after our meeting, I was pointed to http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit Thanks, that'll be interesting to take a look at. So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. I did disagree in the meeting Yes, but (my impression of) the overall feeling in the meeting was that it wasn't yet the right time to push it to mainline. Yes! I need to test it more and I'll focus more on that front. and there is also the use case of the feature forming the infrastructure for other rlimit controllers. Agreed, but that's something for the future. I'll work on the mlock controller and post that as well. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Paul Menage wrote: On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote: (Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.) Agreed that a swap controller without a memory controller doesn't make much sense, but a memory controller without a swap controller can make sense on machines that don't intend to use swap. I agree that a memory controller without a swap controller can make sense: I hope so, anyway, since that's what's in mainline. Even if swap is used, memory is a more precious resource than swap, and you were right to go about controlling memory first. So if they were separate controllers, we'd use the proposed cgroup dependency features to make the swap controller depend on the memory controller - in which case you'd only be able to mount the swap controller on a hierarchy that also had the memory controller, and the swap controller would be able to make use of the page ownership information. It's more of a modularity issue than a functionality issue, I think - the swap controller and memory controller are tracking fundamentally different things (space on disk versus pages in memory), and the only dependency between them is the memory controller tracking the ownership of a page and providing it to the swap controller. It sounds as if you're interpreting my mem+swap controller as a mem controller and a swap controller and the swap controller makes use of some of the mem controller infrastructure. No, I'm trying to say something stronger than that. I'm saying, as I've said before, that I cannot imagine why anyone would want to control swap itself - what they want to control is the total of mem+swap. Swap is a second-class citizen, nobody wants swap if they can have mem, so why control it separately? IIRC Rik expressed the same by pointing out that a cgroup at its swap limit would then be forced to grow in mem (until it hits its mem limit): so controlling the less precious resource would increase pressure on the more precious resource. (Actually, that probably bears little relation to what he said - sorry, Rik!) I don't recall what answer he got, perhaps I'd be persuaded if I heard it again. Hugh ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Balbir Singh wrote: I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as() created the problem, it's most likely related to mm-owner not being correct and we are dereferencing the wrong memory. Thanks for the bug report, I'll look further. Good luck! I have only seen it once, on a dual-core laptop; though I don't remember to try swapoff while busy as often as I should (be sure to alternate between a couple or more of swapareas, so you can swap a new one on just before swapping an old one off, to be pretty sure of success). May be easier to find in the source: my suspicion is that a bad mm_users assumption will come into it. But I realize now that it could be entirely unrelated to memrlimit, just that uncharge_as was the one to get hit by bad refcounting elsewhere. Oh, that reminds me, I never reported back on my res_counter warnings at shutdown: never saw them again, once I added in the set of changes you came up with shortly after that - thanks. Hugh ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote: On Fri, 25 Jul 2008, Balbir Singh wrote: I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as() created the problem, it's most likely related to mm-owner not being correct and we are dereferencing the wrong memory. Thanks for the bug report, I'll look further. Good luck! I have only seen it once, on a dual-core laptop; though I don't remember to try swapoff while busy as often as I should (be sure to alternate between a couple or more of swapareas, so you can swap a new one on just before swapping an old one off, to be pretty sure of success). Thanks, that's very useful information. I would have never tried juggling swap devices otherwise. May be easier to find in the source: my suspicion is that a bad mm_users assumption will come into it. But I realize now that it could be entirely unrelated to memrlimit, just that uncharge_as was the one to get hit by bad refcounting elsewhere. Oh, that reminds me, I never reported back on my res_counter warnings at shutdown: never saw them again, once I added in the set of changes you came up with shortly after that - thanks. I am glad those messages are gone, thanks for the bug report. I find bug fixing more exciting that kernel development on most occasions. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: C/R minisummit notes (namespace naming)
Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces. - indirect via processes - pids - names in the filesystem We discussed this a bit in the hallway track and pids are look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular and namespaces in general to exit. At a first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more it really is fundamentally just an exit status reaper of zombies, that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted? -serge ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote: On Fri, 25 Jul 2008, Paul Menage wrote: On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote: (Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.) Agreed that a swap controller without a memory controller doesn't make much sense, but a memory controller without a swap controller can make sense on machines that don't intend to use swap. I agree that a memory controller without a swap controller can make sense: I hope so, anyway, since that's what's in mainline. Even if swap is used, memory is a more precious resource than swap, and you were right to go about controlling memory first. Yes, I agree. So if they were separate controllers, we'd use the proposed cgroup dependency features to make the swap controller depend on the memory controller - in which case you'd only be able to mount the swap controller on a hierarchy that also had the memory controller, and the swap controller would be able to make use of the page ownership information. It's more of a modularity issue than a functionality issue, I think - the swap controller and memory controller are tracking fundamentally different things (space on disk versus pages in memory), and the only dependency between them is the memory controller tracking the ownership of a page and providing it to the swap controller. It sounds as if you're interpreting my mem+swap controller as a mem controller and a swap controller and the swap controller makes use of some of the mem controller infrastructure. No, I'm trying to say something stronger than that. I'm saying, as I've said before, that I cannot imagine why anyone would want to control swap itself - what they want to control is the total of mem+swap. Swap is a second-class citizen, nobody wants swap if they can have mem, so why control it separately? IIRC Rik expressed the same by pointing out that a cgroup at its swap limit would then be forced to grow in mem (until it hits its mem limit): so controlling the less precious resource would increase pressure on the more precious resource. (Actually, that probably bears little relation to what he said - sorry, Rik!) I don't recall what answer he got, perhaps I'd be persuaded if I heard it again. I see what your saying. When you look at Linux right now, we control swap independent of memory, so I am not totally opposed to setting swap, instead of swap+mem. I might not want to swap from a particular cgroup, in which case, I set swap to 0 and risk OOMing, which might be an acceptable trade-off depending on my setup. I could easily change this policy on demand and add swap if OOMing was no longer OK. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: C/R minisummit notes (namespace naming)
Serge E. Hallyn wrote: Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces. - indirect via processes - pids - names in the filesystem We discussed this a bit in the hallway track and pids are look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular and namespaces in general to exit. At a first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more it really is fundamentally just an exit status reaper of zombies, that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted? IMO, if we consider a container being an aggregation of different namespaces, we should consider the container dies when all the namespaces are dead. One good example is an application ran inside a container and doing a bulk data writing over the network. When the application finish its last call to send it will exits. At this point, there is no more processes running inside the container but we can not consider the container is dead because there are still some pending datas in the socket to be delivered to the peer. Eric will post a patch to automatically destroy the virtual devices when the netns is destroyed, so there is no way to know if a network namespace is dead or not as the uevent socket will not deliver an event outside of the container. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: C/R minisummit notes (namespace naming)
Quoting Daniel Lezcano ([EMAIL PROTECTED]): Serge E. Hallyn wrote: Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces. - indirect via processes - pids - names in the filesystem We discussed this a bit in the hallway track and pids are look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular and namespaces in general to exit. At a first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more it really is fundamentally just an exit status reaper of zombies, that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted? IMO, if we consider a container being an aggregation of different namespaces, we should consider the container dies when all the namespaces are dead. One good example is an application ran inside a container and doing a bulk data writing over the network. When the application finish its last call to send it will exits. At this point, there is no more processes running inside the container but we can not consider the container is dead because there are still some pending datas in the socket to be delivered to the peer. Eric will post a patch to automatically destroy the virtual devices when the netns is destroyed, so there is no way to know if a network namespace is dead or not as the uevent socket will not deliver an event outside of the container. My question remains: who cares? -serge ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: C/R minisummit notes (namespace naming)
Serge E. Hallyn wrote: Quoting Daniel Lezcano ([EMAIL PROTECTED]): Serge E. Hallyn wrote: Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces. - indirect via processes - pids - names in the filesystem We discussed this a bit in the hallway track and pids are look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular and namespaces in general to exit. At a first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more it really is fundamentally just an exit status reaper of zombies, that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted? IMO, if we consider a container being an aggregation of different namespaces, we should consider the container dies when all the namespaces are dead. One good example is an application ran inside a container and doing a bulk data writing over the network. When the application finish its last call to send it will exits. At this point, there is no more processes running inside the container but we can not consider the container is dead because there are still some pending datas in the socket to be delivered to the peer. Eric will post a patch to automatically destroy the virtual devices when the netns is destroyed, so there is no way to know if a network namespace is dead or not as the uevent socket will not deliver an event outside of the container. My question remains: who cares? In the context of CR, you'd care if you migrate a container including its network stack. In that case, you wanna make sure that: (1) you save sockets that have data in their (send) queue but otherwise not attached to any specific process, and (2) you disable these sockets at the source machine as soon as you enable the container on the target machine. Rethinking this, Serge is probably right because one you migrate the network to the target node, you disable the network (of that container) on the source node, so you don't care about #2 there anymore... Oren. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: C/R minisummit notes (namespace naming)
Serge E. Hallyn wrote: Quoting Daniel Lezcano ([EMAIL PROTECTED]): Serge E. Hallyn wrote: Quoting Eric W. Biederman ([EMAIL PROTECTED]): Currently we have three possibilities on how to name pid namespaces. - indirect via processes - pids - names in the filesystem We discussed this a bit in the hallway track and pids are look like the way to go. Pavel has a patch in progress to help sort this out. The practical problem we have today is that we need a way to wait for the network namespace in particular and namespaces in general to exit. At a first glance waitid(P_NS, pid,) looks like a useful way to achieve this. After looking at wait a bit more it really is fundamentally just an exit status reaper of zombies, that has the option of blocking when the zombies do not yet exist. In any kind of event loop you would wait for SIGCHLD either as a signal or with signalfd. So how shall we wait for a namespace to exit? My brainstorm tonight suggests inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE); Eric I'm sorry, I'm still not quite clear on... Why? You care about when the tasks exit, and you care about when network devices, for instance, need to be deleted (which you can presumably get uevents for, when they get moved back into init_net_ns). Why do you care when the struct net actually gets deleted? IMO, if we consider a container being an aggregation of different namespaces, we should consider the container dies when all the namespaces are dead. One good example is an application ran inside a container and doing a bulk data writing over the network. When the application finish its last call to send it will exits. At this point, there is no more processes running inside the container but we can not consider the container is dead because there are still some pending datas in the socket to be delivered to the peer. Eric will post a patch to automatically destroy the virtual devices when the netns is destroyed, so there is no way to know if a network namespace is dead or not as the uevent socket will not deliver an event outside of the container. My question remains: who cares? The container implementation in userspace. Let's imagine it sets some routes outside of the container to route the traffic to the container. It should remove these routes when the container dies. And the container should be considered as dead when the network has died and not when the last process of the container exits. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 1/2] introduce sys_restore
Create a useless (?) sys_restore system call. All it does is read a checkpoint file :) for a pid number and a file to execute. Since we don't take things like argv and envp and registers from the checkpoint file, in order to make this easily testable, we take those things as arguments. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] Signed-off-by: Dave Hansen [EMAIL PROTECTED] --- arch/x86/kernel/process_32.c | 10 arch/x86/kernel/syscall_table_32.S |2 + kernel/fork.c | 43 3 files changed, 55 insertions(+), 0 deletions(-) diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index 0c3927a..e11627d 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -688,6 +688,16 @@ out: return error; } +long do_restore(unsigned int fd, + char __user * __user *argv, +char __user * __user *envp, + struct pt_regs *regs); + +asmlinkage long sys_restore(struct pt_regs regs) +{ + return do_restore(regs.bx, regs.cx, regs.dx, regs); +} + #define top_esp(THREAD_SIZE - sizeof(unsigned long)) #define top_ebp(THREAD_SIZE - 2*sizeof(unsigned long)) diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S index adff556..019a8e4 100644 --- a/arch/x86/kernel/syscall_table_32.S +++ b/arch/x86/kernel/syscall_table_32.S @@ -326,3 +326,5 @@ ENTRY(sys_call_table) .long sys_fallocate .long sys_timerfd_settime /* 325 */ .long sys_timerfd_gettime + .long sys_restore + diff --git a/kernel/fork.c b/kernel/fork.c index adefc11..0e43f69 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1679,3 +1679,46 @@ int unshare_files(struct files_struct **displaced) task_unlock(task); return 0; } + + +#define CKPT_SIZE (PAGE_SIZE*4) +char buf[CKPT_SIZE]; +char exe_filename[PAGE_SIZE]; + +/* + * userspace will already have made us a new pidns + */ +long do_restore(unsigned int fd, + char __user * __user *argv, +char __user * __user *envp, + struct pt_regs *regs) +{ + int nr_scanned; +struct file *file; +long ret = -EBADF; +int fput_needed; + int pid; + + if (!is_container_init(current)) { + printk(I am not init\n); + return -EPERM; + } + +file = fget_light(fd, fput_needed); + if (!file) + goto out; + ret = kernel_read(file, 0, buf, CKPT_SIZE); + fput_light(file, fput_needed); + if (ret = 0) + goto out; + + nr_scanned = sscanf(buf, %d %s, pid, exe_filename); + + if (nr_scanned != 2) + return -EINVAL; + + + ret = do_execve(exe_filename, argv, envp, regs); +out: +return ret; +} -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 2/2] sys_restore: set the pid number
Set the pid number for a restored task. This is purely a toy, as it only sets the pidnr in the lowest level pid namespace. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- kernel/fork.c |5 + kernel/pid.c | 19 +++ 2 files changed, 24 insertions(+), 0 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index 0e43f69..41c46d2 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1685,6 +1685,8 @@ int unshare_files(struct files_struct **displaced) char buf[CKPT_SIZE]; char exe_filename[PAGE_SIZE]; +extern int choose_pidmap(struct pid *pid, int new); + /* * userspace will already have made us a new pidns */ @@ -1717,6 +1719,9 @@ long do_restore(unsigned int fd, if (nr_scanned != 2) return -EINVAL; + ret = choose_pidmap(task_pid(current), pid); + if (!ret) + return -EAGAIN; ret = do_execve(exe_filename, argv, envp, regs); out: diff --git a/kernel/pid.c b/kernel/pid.c index 30bd5d4..88a5e2a 100644 --- a/kernel/pid.c +++ b/kernel/pid.c @@ -122,6 +122,25 @@ static void free_pidmap(struct upid *upid) atomic_inc(map-nr_free); } +int choose_pidmap(struct pid *pid, int new) +{ + struct pidmap *map; + int level = pid-level; + struct upid *upid = pid-numbers[level]; + struct pid_namespace *pid_ns = upid-ns; + int old = upid-nr; + + map = pid_ns-pidmap[new/BITS_PER_PAGE]; + if (!test_and_set_bit(new, map-page)) { + map = pid_ns-pidmap[old/BITS_PER_PAGE]; + clear_bit(old, map-page); + upid-nr = new; + return 1; + } + + return 0; +} + static int alloc_pidmap(struct pid_namespace *pid_ns) { int i, offset, max_scan, pid, last = pid_ns-last_pid; -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 0/6] user namespaces: introduction
Following is a set of user namespace patches I've been playing with this week. The first two patches are I believe fixes which should go in regardless of which direction user namespaces take. The rest of the patches are one approach to providing default cross-userns isolation for files. Any filesystem can provide its own intelligent cross-userns userid equivalence checks by defining its own permission function, which is what Eric and I have been talking about doing. The next step is probably to handle some of the task-to-task cross-userns checks. Comments appreciated. thanks, -serge ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 2/6] user namespaces: move user_ns from nsproxy into user struct
From ec5f54faf5afd16cb6cef40ebaaf3da25989d185 Mon Sep 17 00:00:00 2001 From: Serge Hallyn [EMAIL PROTECTED] Date: Thu, 24 Jul 2008 17:52:41 -0500 Subject: [PATCH 2/6] user namespaces: move user_ns from nsproxy into user struct When we get the sysfs support needed to support fair user scheduling along with user namespaces, then we will need to be able to get the user namespace from the user struct. So we need the user_ns to be a part of struct user. Once we can access it from tsk-user, we no longer have a use for tsk-nsproxy-user_ns. When a user_namespace is created, the user which created it is marked as its 'creator'. The user_namespace pins the creator. Each userid in a user_ns pins the user_ns. This keeps refcounting nice and simple. At the same time, this patch simplifies the refcounting. The current user and userns locking works as follows: The task pins the user struct. The task pins the nsproxy, the nsproxy pins the user_ns. When a new user_ns is created, it creates a root user for it, and pins it. When the nsproxy releases the user_ns, the userns tries to release all its user structs. So you see that the refcounting works for now, but only because the nsproxy (and therefore usr_ns) and user structs will be freed at the same time - when the last task using them is released. Now we need to put the user_ns in the struct user. You can see that will mess up the refcounting. Fortunately, once the user_ns is available from tsk-user, we don't need it in nsproxy. So here is how the refcounting *should* be done: The task pins the user struct. The user struct pins its user namespace. The user namespace pins the struct user which created it. A user namespace now doesn't need to release its userids, because it is only released when its last user disappears. This patch makes those changes. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- include/linux/init_task.h |1 - include/linux/key.h|3 ++ include/linux/nsproxy.h|1 - include/linux/sched.h |3 +- include/linux/user_namespace.h | 10 +++ kernel/fork.c |3 +- kernel/nsproxy.c | 10 +-- kernel/sys.c |4 +- kernel/user.c | 52 +++ kernel/user_namespace.c| 36 +++ security/keys/process_keys.c |7 - 11 files changed, 45 insertions(+), 85 deletions(-) diff --git a/include/linux/init_task.h b/include/linux/init_task.h index 93c45ac..ce87b8a 100644 --- a/include/linux/init_task.h +++ b/include/linux/init_task.h @@ -57,7 +57,6 @@ extern struct nsproxy init_nsproxy; .mnt_ns = NULL, \ INIT_NET_NS(net_ns) \ INIT_IPC_NS(ipc_ns) \ - .user_ns= init_user_ns,\ } #define INIT_SIGHAND(sighand) { \ diff --git a/include/linux/key.h b/include/linux/key.h index c45c962..ba53aef 100644 --- a/include/linux/key.h +++ b/include/linux/key.h @@ -277,6 +277,8 @@ extern ctl_table key_sysctls[]; * the userspace interface */ extern void switch_uid_keyring(struct user_struct *new_user); +extern void switch_uid_keyring_task(struct task_struct *tsk, + struct user_struct *new_user); extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk); extern int copy_thread_group_keys(struct task_struct *tsk); extern void exit_keys(struct task_struct *tsk); @@ -305,6 +307,7 @@ extern void key_init(void); #define key_ref_to_ptr(k) ({ NULL; }) #define is_key_possessed(k)0 #define switch_uid_keyring(u) do { } while(0) +#define switch_uid_keyring_task(t,u) do { } while(0) #define __install_session_keyring(t, k)({ NULL; }) #define copy_keys(f,t) 0 #define copy_thread_group_keys(t) 0 diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h index 0e66b57..7c7fb20 100644 --- a/include/linux/nsproxy.h +++ b/include/linux/nsproxy.h @@ -27,7 +27,6 @@ struct nsproxy { struct ipc_namespace *ipc_ns; struct mnt_namespace *mnt_ns; struct pid_namespace *pid_ns; - struct user_namespace *user_ns; struct net *net_ns; }; extern struct nsproxy init_nsproxy; diff --git a/include/linux/sched.h b/include/linux/sched.h index cf36e14..2c2f036 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -600,7 +600,7 @@ struct user_struct { /* Hash table maintenance information */ struct hlist_node uidhash_node; uid_t uid; - struct user_namespace *user_namespace; + struct
[Devel] [PATCH 1/6] user namespaces: introduce user_struct-user_namespace relationship
From 9382f22a6c751e90baa4e7f3ba24c509e50a47a8 Mon Sep 17 00:00:00 2001 From: Serge Hallyn [EMAIL PROTECTED] Date: Tue, 22 Jul 2008 13:31:37 -0500 Subject: [PATCH 1/6] user namespaces: introduce user_struct-user_namespace relationship When a task does clone(CLONE_NEWNS), the task's user is the 'creator' of the new user_namespace, and the user_namespace is tacked onto a list of those created by this user. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- include/linux/sched.h |1 + include/linux/user_namespace.h |1 + kernel/user.c |7 +++ kernel/user_namespace.c| 20 +++- 4 files changed, 20 insertions(+), 9 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index dc7e592..cf36e14 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -600,6 +600,7 @@ struct user_struct { /* Hash table maintenance information */ struct hlist_node uidhash_node; uid_t uid; + struct user_namespace *user_namespace; #ifdef CONFIG_USER_SCHED struct task_group *tg; diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index b5f41d4..f9477c3 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -13,6 +13,7 @@ struct user_namespace { struct kref kref; struct hlist_head uidhash_table[UIDHASH_SZ]; struct user_struct *root_user; + struct user_struct *creator; }; extern struct user_namespace init_user_ns; diff --git a/kernel/user.c b/kernel/user.c index 865ecf5..cfe8309 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -22,6 +22,7 @@ struct user_namespace init_user_ns = { .refcount = ATOMIC_INIT(2), }, .root_user = root_user, + .creator = root_user, }; EXPORT_SYMBOL_GPL(init_user_ns); @@ -53,6 +54,7 @@ struct user_struct root_user = { .files = ATOMIC_INIT(0), .sigpending = ATOMIC_INIT(0), .locked_shm = 0, + .user_namespace = init_user_ns, #ifdef CONFIG_USER_SCHED .tg = init_task_group, #endif @@ -321,6 +323,7 @@ done: */ static inline void free_user(struct user_struct *up, unsigned long flags) { + put_user_ns(up-user_namespace); /* restore back the count */ atomic_inc(up-__count); spin_unlock_irqrestore(uidhash_lock, flags); @@ -347,6 +350,7 @@ static inline void free_user(struct user_struct *up, unsigned long flags) sched_destroy_user(up); key_put(up-uid_keyring); key_put(up-session_keyring); + put_user_ns(up-user_namespace); kmem_cache_free(uid_cachep, up); } @@ -409,6 +413,8 @@ struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid) if (sched_create_user(new) 0) goto out_free_user; + new-user_namespace = get_user_ns(ns); + if (uids_user_create(new)) goto out_destoy_sched; @@ -441,6 +447,7 @@ struct user_struct *alloc_uid(struct user_namespace *ns, uid_t uid) out_destoy_sched: sched_destroy_user(new); + put_user_ns(new-user_namespace); out_free_user: kmem_cache_free(uid_cachep, new); out_unlock: diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index a9ab059..e8db443 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -19,7 +19,6 @@ static struct user_namespace *clone_user_ns(struct user_namespace *old_ns) { struct user_namespace *ns; - struct user_struct *new_user; int n; ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL); @@ -38,15 +37,17 @@ static struct user_namespace *clone_user_ns(struct user_namespace *old_ns) return ERR_PTR(-ENOMEM); } - /* Reset current-user with a new one */ - new_user = alloc_uid(ns, current-uid); - if (!new_user) { - free_uid(ns-root_user); - kfree(ns); - return ERR_PTR(-ENOMEM); - } + /* pin the creating user */ + ns-creator = current-user; + atomic_inc(ns-creator-__count); + + /* +* The alloc_uid() incremented the userns refcount, +* so drop it again +*/ + put_user_ns(ns); - switch_uid(new_user); + switch_uid(ns-root_user); return ns; } @@ -72,6 +73,7 @@ void free_user_ns(struct kref *kref) ns = container_of(kref, struct user_namespace, kref); release_uids(ns); + free_uid(ns-creator); kfree(ns); } EXPORT_SYMBOL(free_user_ns); -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 3/6] user namespaces: rig generic_permission for simple userns check
From f6d09b06a1106936010bffd420267f5b7ee66238 Mon Sep 17 00:00:00 2001 From: Serge Hallyn [EMAIL PROTECTED] Date: Wed, 23 Jul 2008 17:01:09 -0500 Subject: [PATCH 3/6] user namespaces: rig generic_permission for simple userns check Filesystems can provide their own permission() functions to do advanced inter-user_namespace userid equivalence checks. For those filesystems which do not support that, we will do a simple check that current's user namespace is equivalent to the user_namespace which mounted the filesystem. If it is not equivalent, then the task can only have user nobody (that is, the 'other') permissions to a file. For now, we actually just compare the user's user_ns to the init_user_ns. Next we will set the sb-user_ns to that of the task mounting a filesystem, and use inode-i_sb-user_ns instead of init_user_ns. By punting even on that, the implications, and therefore (in)correctness of this patch should be all the easier to verify. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/namei.c | 14 +- 1 files changed, 13 insertions(+), 1 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index 01e67dd..d5336fd 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -31,6 +31,7 @@ #include linux/file.h #include linux/fcntl.h #include linux/device_cgroup.h +#include linux/nsproxy.h #include asm/namei.h #include asm/uaccess.h @@ -168,7 +169,7 @@ void putname(const char *name) EXPORT_SYMBOL(putname); #endif - +extern struct user_namespace init_user_ns; /** * generic_permission - check for access rights on a Posix-like filesystem * @inode: inode to check access rights for @@ -184,7 +185,15 @@ int generic_permission(struct inode *inode, int mask, int (*check_acl)(struct inode *inode, int mask)) { umode_t mode = inode-i_mode; + int same_userns = (current-user-user_ns == init_user_ns); + /* +* If we're not in the inode's user namespace, we get +* user nobody permissions, and we ignore acls +* (bc serge doesn't know how to handle acls in this case) +*/ + if (!same_userns) + goto check; if (current-fsuid == inode-i_uid) mode = 6; else { @@ -200,11 +209,14 @@ int generic_permission(struct inode *inode, int mask, mode = 3; } +check: /* * If the DACs are ok we don't need any capability check. */ if (((mode mask (MAY_READ|MAY_WRITE|MAY_EXEC)) == mask)) return 0; + if (!same_userns) + return -EACCES; check_capabilities: /* -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 5/6] user namespaces: refuse create in other user_ns
From 4d2c23452a67e25856893ab16fefd0f6e5aa58df Mon Sep 17 00:00:00 2001 From: Serge Hallyn [EMAIL PROTECTED] Date: Thu, 24 Jul 2008 06:37:43 -0500 Subject: [PATCH 5/6] user namespaces: refuse create in other user_ns Refuse writing to a directory in another user_ns. We can't support file creation because we wouldn't know who should own the file. This refuses file deletion as well - which I think is the sensible thing to do. File writing is still allowed if the 'user other' permissions include write. That again probably makes sense for logging and such. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- fs/namei.c |6 ++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/fs/namei.c b/fs/namei.c index adf5f1b..b39a990 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -213,6 +213,12 @@ int generic_permission(struct inode *inode, int mask, check: /* +* Can't write to a directory in another user_ns +* We wouldn't know who to make the owner! +*/ + if (!same_userns S_ISDIR(inode-i_mode) (maskMAY_WRITE)) + return -EACCES; + /* * If the DACs are ok we don't need any capability check. */ if (((mode mask (MAY_READ|MAY_WRITE|MAY_EXEC)) == mask)) -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 6/6] user_namespace: move put_user_ns outside lock
Oops, this one was supposed to get folded into the first patch, obviously. Sorry. -serge From daa76939c72adf146ded8a39e0211886e2bc943a Mon Sep 17 00:00:00 2001 From: Serge Hallyn [EMAIL PROTECTED] Date: Fri, 25 Jul 2008 14:24:32 -0500 Subject: [PATCH 6/6] user_namespace: move put_user_ns outside lock Since put_user_ns calls free_user on the creator, we shouldn't call it from inside the lock in free_user. Signed-off-by: Serge Hallyn [EMAIL PROTECTED] --- kernel/user.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/kernel/user.c b/kernel/user.c index 034dae5..d3dd353 100644 --- a/kernel/user.c +++ b/kernel/user.c @@ -323,11 +323,11 @@ done: */ static inline void free_user(struct user_struct *up, unsigned long flags) { - put_user_ns(up-user_ns); /* restore back the count */ atomic_inc(up-__count); spin_unlock_irqrestore(uidhash_lock, flags); + put_user_ns(up-user_ns); INIT_WORK(up-work, remove_user_sysfs_dir); schedule_work(up-work); } -- 1.5.4.3 ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 0/2] sys_restore prototype
We were talking this morning about what trivial patchset to begin with to get a start on checkpoint and restart. We thought that rather than start with checkpoint, maybe we should start with something that reads a checkpoint file and restarts a single task. In this case, restart means it sets the process id and executes the file which are found in the checkpoint file. So here's what we whipped up for a half hour this morning, and during some of Mark's talk this afternoon. It refuses to run if it isn't the container init, so you must unshare your pidns before calling sys_restore(). To test, I did: [EMAIL PROTECTED] ~]# cat mycheckpoint 99 /root/whoami [EMAIL PROTECTED] ~]# cat restore.c #include stdio.h #include sys/types.h #include sys/stat.h #include fcntl.h int main() { int ret; char *argv[3]; char *envp[1]; //argv[0] = argv[1] = /bin/bash; //argv[2] = envp[0] = NULL; argv[0] = /root/whoami; argv[1] = envp[0] = NULL; int fd = open(/root/mycheckpoint, O_RDONLY); if (fd 0) perror(open checkpoint file); ret = syscall(327, fd, argv, envp); printf(syscall returned %d\n, ret); perror(syscall for checkpoint); close(fd); return ret; } [EMAIL PROTECTED] ~]# cat whoami.c #include stdio.h #include sys/types.h #include unistd.h int main() { printf(I am %d\n, getpid()); return 0; } Next, I create a new pid namespace, remount /proc, and execute 'restore' using 'exec' so that pid 1 is doing it: [EMAIL PROTECTED] ~]# /home/hallyn/cryo/utils/ns_exec -cp /bin/bash about to clone with 2002 [EMAIL PROTECTED] ~]# mount -t proc none /proc [EMAIL PROTECTED] ~]# ps -ef UIDPID PPID C STIME TTY TIME CMD root 1 0 1 18:46 pts/000:00:00 /bin/bash root26 1 0 18:46 pts/000:00:00 ps -ef [EMAIL PROTECTED] ~]# exec ./restore I am 99 Seems to work. -serge ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 1/6] user namespaces: introduce user_struct-user_namespace relationship
On Fri, Jul 25, 2008 at 07:27:25PM -0500, Serge E. Hallyn wrote: --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -600,6 +600,7 @@ struct user_struct { /* Hash table maintenance information */ struct hlist_node uidhash_node; uid_t uid; + struct user_namespace *user_namespace; Call it user_ns please, and file user_ns.c. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
Re: [Devel] [PATCH 1/6] user namespaces: introduce user_struct-user_namespace relationship
Quoting Alexey Dobriyan ([EMAIL PROTECTED]): On Fri, Jul 25, 2008 at 07:27:25PM -0500, Serge E. Hallyn wrote: --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -600,6 +600,7 @@ struct user_struct { /* Hash table maintenance information */ struct hlist_node uidhash_node; uid_t uid; + struct user_namespace *user_namespace; Call it user_ns please, Odd, I had the same thought, but seem to have applied at the wrong patch (please see patch 2). Thanks. and file user_ns.c. I'm ok with that, but that file has existed for quite some time by now, so I'm not introducing it. If noone objects, I'll rename the file (as well as include/linux/user_namespace.h). thanks, -serge ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel