[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Andrew Morton
On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote:

 Hi Balbir,
 
 Andrew included the memrlimit controller in his latest set of patches
 to Linus for mainline.

I've asked Linus to drop all 238 patches.  I'll be resending them minus
the offending memrlimit patches.

Did I mention that conferences suck?

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Hugh Dickins
On Fri, 25 Jul 2008, Paul Menage wrote:
 
 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.
 
 What do you (and others on the containers list) think? Should we ask
 Andrew/Linus to hold off on this for now? My preference would be to do
 that until we have someone who can stand up with a concrete scenario
 where they want to use this in a real environment.

I see Andrew has already acted, so it's now moot.  But I'd like to
say that I do agree with you and the conclusion to hold off for now.

I was a bit alarmed earlier to see those patches sailing on through;
but realized that I'd done very little to substantiate my hatred of
the whole thing, and decided that I didn't feel strongly enough to
stand in the way now.  But I am glad you've stepped in, thank you.

(Different topic, but one day I ought to get around to saying again
how absurd I think a swap controller; whereas a mem+swap controller
makes plenty of sense.  I think Rik and others said the same.)

By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y
but no use of it, when doing swapoff a week ago.  Not investigated
at all, I'm afraid, but at a guess it might come from memrlimit work
placing too much faith in the mm_users count - swapoff is only one
of several places which have to inc/dec mm_users for some reason.

BUG: unable to handle kernel paging request at 6b6b6b8b
IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
*pde =  
Oops:  [#1] PREEMPT SMP 
last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq 
snd_seq_device thermal ac battery button

Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518  
    7c033ec8  0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 
   7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3 
Call Trace:
 [78161323] ? exit_mmap+0xaf/0x133
 [781226b1] ? mmput+0x4c/0xba
 [78165ce3] ? try_to_unuse+0x20b/0x3f5
 [78371534] ? _spin_unlock+0x22/0x3c
 [7816636a] ? sys_swapoff+0x17b/0x37c
 [78102d95] ? sysenter_past_esp+0x6a/0xa5
 ===
Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 
08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 
50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b 
EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hugh
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Paul Menage wrote:
 Hi Balbir,
 
 Andrew included the memrlimit controller in his latest set of patches
 to Linus for mainline.
 
 Although the memrlimit controller basically works as intended, my
 impression from the mini-summit on Tuesday is that our consensus is
 that this still doesn't have concrete practical use-cases yet:
 
 - avoiding swap over-use is better handled by the forthcoming swap controller
 
 - applications that can usefully handle mmap() returning NULL don't
 really exist yet (and since the system as a whole allows address space
 overcommit limits, if it was practical/useful to write such apps then
 presumably they would already exist)
 

There are applications that can/need to handle overcommit, just that we are not
aware of them fully. Immediately after our meeting, I was pointed to
http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit

 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.
 

I did disagree in the meeting and there is also the use case of the feature
forming the infrastructure for other rlimit controllers.

 What do you (and others on the containers list) think? Should we ask
 Andrew/Linus to hold off on this for now? My preference would be to do
 that until we have someone who can stand up with a concrete scenario
 where they want to use this in a real environment.

While we can argue about use cases, the feature needs more testing and I am OK
holding off/reverting the merge to make it more stable and that would give us
more time to argue on its usefulness. To say that overcommit handling is not
useful is wrong. Meanwhile, I'll go back and look at the bug report that Hugh
has posted and also look at building an mlock controller on top of memrlimits.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Andrew Morton wrote:
 On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote:
 
 Hi Balbir,

 Andrew included the memrlimit controller in his latest set of patches
 to Linus for mainline.
 
 I've asked Linus to drop all 238 patches.  I'll be resending them minus
 the offending memrlimit patches.
 

Sorry for making your work more harder.

 Did I mention that conferences suck?

Not yet, but we know now :)

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Andrew Morton wrote:
 On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote:
 
 Hi Balbir,

 Andrew included the memrlimit controller in his latest set of patches
 to Linus for mainline.
 
 I've asked Linus to drop all 238 patches.  I'll be resending them minus
 the offending memrlimit patches.
 

Sorry for making your work more harder.

 Did I mention that conferences suck?

Not yet, but we know now :)

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Hugh Dickins wrote:
 On Fri, 25 Jul 2008, Paul Menage wrote:
 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.

 What do you (and others on the containers list) think? Should we ask
 Andrew/Linus to hold off on this for now? My preference would be to do
 that until we have someone who can stand up with a concrete scenario
 where they want to use this in a real environment.
 
 I see Andrew has already acted, so it's now moot.  But I'd like to
 say that I do agree with you and the conclusion to hold off for now.
 
 I was a bit alarmed earlier to see those patches sailing on through;
 but realized that I'd done very little to substantiate my hatred of
 the whole thing, and decided that I didn't feel strongly enough to
 stand in the way now.  But I am glad you've stepped in, thank you.
 
 (Different topic, but one day I ought to get around to saying again
 how absurd I think a swap controller; whereas a mem+swap controller
 makes plenty of sense.  I think Rik and others said the same.)
 

We will have a memory+swap controller working together.

 By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y
 but no use of it, when doing swapoff a week ago.  Not investigated
 at all, I'm afraid, but at a guess it might come from memrlimit work
 placing too much faith in the mm_users count - swapoff is only one
 of several places which have to inc/dec mm_users for some reason.
 

I'll try and reproduce the problem right away. I've been running some kernbench
on top of memrlimit (but not with a lot of stress or trying to swapoff the swap
device).

 BUG: unable to handle kernel paging request at 6b6b6b8b
 IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
 *pde =  
 Oops:  [#1] PREEMPT SMP 
 last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
 Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq 
 snd_seq_device thermal ac battery button
 
 Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
 EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
 EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
 EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
 ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
 Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 
  
 7c033ec8  0089 7963215c 9fbb1aa0 9fbb1b28 
 a272f040 
7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 
 78165ce3 
 Call Trace:
  [78161323] ? exit_mmap+0xaf/0x133
  [781226b1] ? mmput+0x4c/0xba
  [78165ce3] ? try_to_unuse+0x20b/0x3f5
  [78371534] ? _spin_unlock+0x22/0x3c
  [7816636a] ? sys_swapoff+0x17b/0x37c
  [78102d95] ? sysenter_past_esp+0x6a/0xa5
  ===
 Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 
 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 
 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b 
 EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c
 
 Hugh

I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as()
created the problem, it's most likely related to mm-owner not being correct and
we are dereferencing the wrong memory.

Thanks for the bug report, I'll look further.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Paul Menage
On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote:

 (Different topic, but one day I ought to get around to saying again
 how absurd I think a swap controller; whereas a mem+swap controller
 makes plenty of sense.  I think Rik and others said the same.)

Agreed that a swap controller without a memory controller doesn't make
much sense, but a memory controller without a swap controller can make
sense on machines that don't intend to use swap.

So if they were separate controllers, we'd use the proposed cgroup
dependency features to make the swap controller depend on the memory
controller - in which case you'd only be able to mount the swap
controller on a hierarchy that also had the memory controller, and the
swap controller would be able to make use of the page ownership
information.

It's more of a modularity issue than a functionality issue, I think -
the swap controller and memory controller are tracking fundamentally
different things (space on disk versus pages in memory), and the only
dependency between them is the memory controller tracking the
ownership of a page and providing it to the swap controller.


 By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y
 but no use of it, when doing swapoff a week ago.  Not investigated
 at all, I'm afraid, but at a guess it might come from memrlimit work
 placing too much faith in the mm_users count - swapoff is only one
 of several places which have to inc/dec mm_users for some reason.

 BUG: unable to handle kernel paging request at 6b6b6b8b

Possibly the mm-owner tracking breaks in that case, if the last user
exits while swapoff is occurring without relinquishing ownership?

That looks as though mm-owner points to a task that had been poisoned
after being freed. That could be awkward to fix :-(

Paul
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Paul Menage
On Fri, Jul 25, 2008 at 8:30 AM, Balbir Singh [EMAIL PROTECTED] wrote:

 There are applications that can/need to handle overcommit, just that we are 
 not
 aware of them fully. Immediately after our meeting, I was pointed to
 http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit

Thanks, that'll be interesting to take a look at.


 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.


 I did disagree in the meeting

Yes, but (my impression of) the overall feeling in the meeting was
that it wasn't yet the right time to push it to mainline.

 and there is also the use case of the feature
 forming the infrastructure for other rlimit controllers.

Agreed, but that's something for the future.

Paul
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Paul Menage wrote:
 On Fri, Jul 25, 2008 at 8:30 AM, Balbir Singh [EMAIL PROTECTED] wrote:
 There are applications that can/need to handle overcommit, just that we are 
 not
 aware of them fully. Immediately after our meeting, I was pointed to
 http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit
 
 Thanks, that'll be interesting to take a look at.
 
 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.

 I did disagree in the meeting
 
 Yes, but (my impression of) the overall feeling in the meeting was
 that it wasn't yet the right time to push it to mainline.
 

Yes! I need to test it more and I'll focus more on that front.

 and there is also the use case of the feature
 forming the infrastructure for other rlimit controllers.
 
 Agreed, but that's something for the future.

I'll work on the mlock controller and post that as well.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Hugh Dickins
On Fri, 25 Jul 2008, Paul Menage wrote:
 On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote:
 
  (Different topic, but one day I ought to get around to saying again
  how absurd I think a swap controller; whereas a mem+swap controller
  makes plenty of sense.  I think Rik and others said the same.)
 
 Agreed that a swap controller without a memory controller doesn't make
 much sense, but a memory controller without a swap controller can make
 sense on machines that don't intend to use swap.

I agree that a memory controller without a swap controller can
make sense: I hope so, anyway, since that's what's in mainline.
Even if swap is used, memory is a more precious resource than swap,
and you were right to go about controlling memory first.

 
 So if they were separate controllers, we'd use the proposed cgroup
 dependency features to make the swap controller depend on the memory
 controller - in which case you'd only be able to mount the swap
 controller on a hierarchy that also had the memory controller, and the
 swap controller would be able to make use of the page ownership
 information.
 
 It's more of a modularity issue than a functionality issue, I think -
 the swap controller and memory controller are tracking fundamentally
 different things (space on disk versus pages in memory), and the only
 dependency between them is the memory controller tracking the
 ownership of a page and providing it to the swap controller.

It sounds as if you're interpreting my mem+swap controller as a
mem controller and a swap controller and the swap controller makes
use of some of the mem controller infrastructure.

No, I'm trying to say something stronger than that.  I'm saying,
as I've said before, that I cannot imagine why anyone would want
to control swap itself - what they want to control is the total
of mem+swap.  Swap is a second-class citizen, nobody wants swap
if they can have mem, so why control it separately?

IIRC Rik expressed the same by pointing out that a cgroup at its
swap limit would then be forced to grow in mem (until it hits its
mem limit): so controlling the less precious resource would increase
pressure on the more precious resource.  (Actually, that probably
bears little relation to what he said - sorry, Rik!)  I don't recall
what answer he got, perhaps I'd be persuaded if I heard it again.

Hugh
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Hugh Dickins
On Fri, 25 Jul 2008, Balbir Singh wrote:
 
 I'll try and recreate the problem and fix it. If 
 memrlimit_cgroup_uncharge_as()
 created the problem, it's most likely related to mm-owner not being correct 
 and
 we are dereferencing the wrong memory.
 
 Thanks for the bug report, I'll look further.

Good luck!  I have only seen it once, on a dual-core laptop; though
I don't remember to try swapoff while busy as often as I should (be
sure to alternate between a couple or more of swapareas, so you can
swap a new one on just before swapping an old one off, to be pretty
sure of success).

May be easier to find in the source: my suspicion is that a bad
mm_users assumption will come into it.  But I realize now that it
could be entirely unrelated to memrlimit, just that uncharge_as
was the one to get hit by bad refcounting elsewhere.

Oh, that reminds me, I never reported back on my res_counter warnings
at shutdown: never saw them again, once I added in the set of changes
you came up with shortly after that - thanks.

Hugh
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Hugh Dickins wrote:
 On Fri, 25 Jul 2008, Balbir Singh wrote:
 I'll try and recreate the problem and fix it. If 
 memrlimit_cgroup_uncharge_as()
 created the problem, it's most likely related to mm-owner not being correct 
 and
 we are dereferencing the wrong memory.

 Thanks for the bug report, I'll look further.
 
 Good luck!  I have only seen it once, on a dual-core laptop; though
 I don't remember to try swapoff while busy as often as I should (be
 sure to alternate between a couple or more of swapareas, so you can
 swap a new one on just before swapping an old one off, to be pretty
 sure of success).
 

Thanks, that's very useful information. I would have never tried juggling swap
devices otherwise.

 May be easier to find in the source: my suspicion is that a bad
 mm_users assumption will come into it.  But I realize now that it
 could be entirely unrelated to memrlimit, just that uncharge_as
 was the one to get hit by bad refcounting elsewhere.
 
 Oh, that reminds me, I never reported back on my res_counter warnings
 at shutdown: never saw them again, once I added in the set of changes
 you came up with shortly after that - thanks.
 

I am glad those messages are gone, thanks for the bug report. I find bug fixing
more exciting that kernel development on most occasions.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Serge E. Hallyn
Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem
 
 We discussed this a bit in the hallway track and pids are look like the way
 to go. Pavel has a patch in progress to help sort this out.
 
 The practical problem we have today is that we need a way to wait for the 
 network
 namespace in particular and namespaces in general to exit.
 
 At a first glance waitid(P_NS, pid,) looks like a useful way to achieve
 this.  After looking at wait a bit more it really is fundamentally just an 
 exit
 status reaper of zombies, that has the option of blocking when the zombies
 do not yet exist.  In any kind of event loop you would wait for SIGCHLD either
 as a signal or with signalfd.
 
 So how shall we wait for a namespace to exit?  My brainstorm tonight suggests
 inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);
 
 Eric

I'm sorry, I'm still not quite clear on...

Why?

You care about when the tasks exit, and you care about when network
devices, for instance, need to be deleted (which you can presumably
get uevents for, when they get moved back into init_net_ns).

Why do you care when the struct net actually gets deleted?

-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Hugh Dickins wrote:
 On Fri, 25 Jul 2008, Paul Menage wrote:
 On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote:
 (Different topic, but one day I ought to get around to saying again
 how absurd I think a swap controller; whereas a mem+swap controller
 makes plenty of sense.  I think Rik and others said the same.)
 Agreed that a swap controller without a memory controller doesn't make
 much sense, but a memory controller without a swap controller can make
 sense on machines that don't intend to use swap.
 
 I agree that a memory controller without a swap controller can
 make sense: I hope so, anyway, since that's what's in mainline.
 Even if swap is used, memory is a more precious resource than swap,
 and you were right to go about controlling memory first.
 

Yes, I agree.

 So if they were separate controllers, we'd use the proposed cgroup
 dependency features to make the swap controller depend on the memory
 controller - in which case you'd only be able to mount the swap
 controller on a hierarchy that also had the memory controller, and the
 swap controller would be able to make use of the page ownership
 information.

 It's more of a modularity issue than a functionality issue, I think -
 the swap controller and memory controller are tracking fundamentally
 different things (space on disk versus pages in memory), and the only
 dependency between them is the memory controller tracking the
 ownership of a page and providing it to the swap controller.
 
 It sounds as if you're interpreting my mem+swap controller as a
 mem controller and a swap controller and the swap controller makes
 use of some of the mem controller infrastructure.
 
 No, I'm trying to say something stronger than that.  I'm saying,
 as I've said before, that I cannot imagine why anyone would want
 to control swap itself - what they want to control is the total
 of mem+swap.  Swap is a second-class citizen, nobody wants swap
 if they can have mem, so why control it separately?
 
 IIRC Rik expressed the same by pointing out that a cgroup at its
 swap limit would then be forced to grow in mem (until it hits its
 mem limit): so controlling the less precious resource would increase
 pressure on the more precious resource.  (Actually, that probably
 bears little relation to what he said - sorry, Rik!)  I don't recall
 what answer he got, perhaps I'd be persuaded if I heard it again.
 

I see what your saying. When you look at Linux right now, we control swap
independent of memory, so I am not totally opposed to setting swap, instead of
swap+mem. I might not want to swap from a particular cgroup, in which case, I
set swap to 0 and risk OOMing, which might be an acceptable trade-off depending
on my setup. I could easily change this policy on demand and add swap if OOMing
was no longer OK.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Daniel Lezcano
Serge E. Hallyn wrote:
 Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem

 We discussed this a bit in the hallway track and pids are look like the way
 to go. Pavel has a patch in progress to help sort this out.

 The practical problem we have today is that we need a way to wait for the 
 network
 namespace in particular and namespaces in general to exit.

 At a first glance waitid(P_NS, pid,) looks like a useful way to achieve
 this.  After looking at wait a bit more it really is fundamentally just an 
 exit
 status reaper of zombies, that has the option of blocking when the zombies
 do not yet exist.  In any kind of event loop you would wait for SIGCHLD 
 either
 as a signal or with signalfd.

 So how shall we wait for a namespace to exit?  My brainstorm tonight suggests
 inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);

 Eric
 
 I'm sorry, I'm still not quite clear on...
 
 Why?
 
 You care about when the tasks exit, and you care about when network
 devices, for instance, need to be deleted (which you can presumably
 get uevents for, when they get moved back into init_net_ns).
 
 Why do you care when the struct net actually gets deleted?

IMO, if we consider a container being an aggregation of different 
namespaces, we should consider the container dies when all the 
namespaces are dead.

One good example is an application ran inside a container and doing a 
bulk data writing over the network. When the application finish its last 
  call to send it will exits. At this point, there is no more 
processes running inside the container but we can not consider the 
container is dead because there are still some pending datas in the 
socket to be delivered to the peer.

Eric will post a patch to automatically destroy the virtual devices when 
the netns is destroyed, so there is no way to know if a network 
namespace is dead or not as the uevent socket will not deliver an event 
outside of the container.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Serge E. Hallyn
Quoting Daniel Lezcano ([EMAIL PROTECTED]):
 Serge E. Hallyn wrote:
 Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem

 We discussed this a bit in the hallway track and pids are look like the way
 to go. Pavel has a patch in progress to help sort this out.

 The practical problem we have today is that we need a way to wait for the 
 network
 namespace in particular and namespaces in general to exit.

 At a first glance waitid(P_NS, pid,) looks like a useful way to 
 achieve
 this.  After looking at wait a bit more it really is fundamentally just an 
 exit
 status reaper of zombies, that has the option of blocking when the zombies
 do not yet exist.  In any kind of event loop you would wait for SIGCHLD 
 either
 as a signal or with signalfd.

 So how shall we wait for a namespace to exit?  My brainstorm tonight 
 suggests
 inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);

 Eric

 I'm sorry, I'm still not quite clear on...

 Why?

 You care about when the tasks exit, and you care about when network
 devices, for instance, need to be deleted (which you can presumably
 get uevents for, when they get moved back into init_net_ns).

 Why do you care when the struct net actually gets deleted?

 IMO, if we consider a container being an aggregation of different  
 namespaces, we should consider the container dies when all the  
 namespaces are dead.

 One good example is an application ran inside a container and doing a  
 bulk data writing over the network. When the application finish its last  
  call to send it will exits. At this point, there is no more processes 
 running inside the container but we can not consider the container is 
 dead because there are still some pending datas in the socket to be 
 delivered to the peer.

 Eric will post a patch to automatically destroy the virtual devices when  
 the netns is destroyed, so there is no way to know if a network  
 namespace is dead or not as the uevent socket will not deliver an event  
 outside of the container.

My question remains: who cares?

-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Oren Laadan


Serge E. Hallyn wrote:
 Quoting Daniel Lezcano ([EMAIL PROTECTED]):
 Serge E. Hallyn wrote:
 Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem

 We discussed this a bit in the hallway track and pids are look like the way
 to go. Pavel has a patch in progress to help sort this out.

 The practical problem we have today is that we need a way to wait for the 
 network
 namespace in particular and namespaces in general to exit.

 At a first glance waitid(P_NS, pid,) looks like a useful way to 
 achieve
 this.  After looking at wait a bit more it really is fundamentally just an 
 exit
 status reaper of zombies, that has the option of blocking when the zombies
 do not yet exist.  In any kind of event loop you would wait for SIGCHLD 
 either
 as a signal or with signalfd.

 So how shall we wait for a namespace to exit?  My brainstorm tonight 
 suggests
 inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);

 Eric
 I'm sorry, I'm still not quite clear on...

 Why?

 You care about when the tasks exit, and you care about when network
 devices, for instance, need to be deleted (which you can presumably
 get uevents for, when they get moved back into init_net_ns).

 Why do you care when the struct net actually gets deleted?
 IMO, if we consider a container being an aggregation of different  
 namespaces, we should consider the container dies when all the  
 namespaces are dead.

 One good example is an application ran inside a container and doing a  
 bulk data writing over the network. When the application finish its last  
  call to send it will exits. At this point, there is no more processes 
 running inside the container but we can not consider the container is 
 dead because there are still some pending datas in the socket to be 
 delivered to the peer.

 Eric will post a patch to automatically destroy the virtual devices when  
 the netns is destroyed, so there is no way to know if a network  
 namespace is dead or not as the uevent socket will not deliver an event  
 outside of the container.
 
 My question remains: who cares?
 

In the context of CR, you'd care if you migrate a container including its
network stack. In that case, you wanna make sure that:
(1) you save sockets that have data in their (send) queue but otherwise
not attached to any specific process, and
(2) you disable these sockets at the source machine as soon as you enable
the container on the target machine.

Rethinking this, Serge is probably right because one you migrate the network
to the target node, you disable the network (of that container) on the source
node, so you don't care about #2 there anymore...

Oren.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: C/R minisummit notes (namespace naming)

2008-07-25 Thread Daniel Lezcano
Serge E. Hallyn wrote:
 Quoting Daniel Lezcano ([EMAIL PROTECTED]):
 Serge E. Hallyn wrote:
 Quoting Eric W. Biederman ([EMAIL PROTECTED]):
 Currently we have three possibilities on how to name pid namespaces.
 - indirect via processes
 - pids
 - names in the filesystem

 We discussed this a bit in the hallway track and pids are look like the way
 to go. Pavel has a patch in progress to help sort this out.

 The practical problem we have today is that we need a way to wait for the 
 network
 namespace in particular and namespaces in general to exit.

 At a first glance waitid(P_NS, pid,) looks like a useful way to 
 achieve
 this.  After looking at wait a bit more it really is fundamentally just an 
 exit
 status reaper of zombies, that has the option of blocking when the zombies
 do not yet exist.  In any kind of event loop you would wait for SIGCHLD 
 either
 as a signal or with signalfd.

 So how shall we wait for a namespace to exit?  My brainstorm tonight 
 suggests
 inotify_add_watch(ifd, /proc/ns/pid, IN_DELETE);

 Eric
 I'm sorry, I'm still not quite clear on...

 Why?

 You care about when the tasks exit, and you care about when network
 devices, for instance, need to be deleted (which you can presumably
 get uevents for, when they get moved back into init_net_ns).

 Why do you care when the struct net actually gets deleted?
 IMO, if we consider a container being an aggregation of different  
 namespaces, we should consider the container dies when all the  
 namespaces are dead.

 One good example is an application ran inside a container and doing a  
 bulk data writing over the network. When the application finish its last  
  call to send it will exits. At this point, there is no more processes 
 running inside the container but we can not consider the container is 
 dead because there are still some pending datas in the socket to be 
 delivered to the peer.

 Eric will post a patch to automatically destroy the virtual devices when  
 the netns is destroyed, so there is no way to know if a network  
 namespace is dead or not as the uevent socket will not deliver an event  
 outside of the container.
 
 My question remains: who cares?

The container implementation in userspace. Let's imagine it sets some 
routes outside of the container to route the traffic to the container. 
It should remove these routes when the container dies. And the container 
should be considered as dead when the network has died and not when the 
last process of the container exits.
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 1/2] introduce sys_restore

2008-07-25 Thread Serge E. Hallyn
Create a useless (?) sys_restore system call.  All it does
is read a checkpoint file :) for a pid number and a file
to execute.

Since we don't take things like argv and envp and registers
from the checkpoint file, in order to make this easily
testable, we take those things as arguments.
Signed-off-by: Serge Hallyn [EMAIL PROTECTED] 
Signed-off-by: Dave Hansen [EMAIL PROTECTED]
---
 arch/x86/kernel/process_32.c   |   10 
 arch/x86/kernel/syscall_table_32.S |2 +
 kernel/fork.c  |   43 
 3 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 0c3927a..e11627d 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -688,6 +688,16 @@ out:
return error;
 }
 
+long do_restore(unsigned int fd,
+   char __user * __user *argv,
+char __user * __user *envp,
+   struct pt_regs *regs);
+
+asmlinkage long sys_restore(struct pt_regs regs)
+{
+   return do_restore(regs.bx, regs.cx, regs.dx, regs);
+}
+
 #define top_esp(THREAD_SIZE - sizeof(unsigned long))
 #define top_ebp(THREAD_SIZE - 2*sizeof(unsigned long))
 
diff --git a/arch/x86/kernel/syscall_table_32.S 
b/arch/x86/kernel/syscall_table_32.S
index adff556..019a8e4 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -326,3 +326,5 @@ ENTRY(sys_call_table)
.long sys_fallocate
.long sys_timerfd_settime   /* 325 */
.long sys_timerfd_gettime
+   .long sys_restore
+   
diff --git a/kernel/fork.c b/kernel/fork.c
index adefc11..0e43f69 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1679,3 +1679,46 @@ int unshare_files(struct files_struct **displaced)
task_unlock(task);
return 0;
 }
+
+
+#define CKPT_SIZE (PAGE_SIZE*4)
+char buf[CKPT_SIZE];
+char exe_filename[PAGE_SIZE];
+
+/*
+ * userspace will already have made us a new pidns
+ */
+long do_restore(unsigned int fd,
+   char __user * __user *argv,
+char __user * __user *envp,
+   struct pt_regs *regs)
+{
+   int nr_scanned;
+struct file *file;
+long ret = -EBADF;
+int fput_needed;
+   int pid;
+
+   if (!is_container_init(current)) {
+   printk(I am not init\n);
+   return -EPERM;
+   }
+
+file = fget_light(fd, fput_needed);
+   if (!file)
+   goto out;
+   ret = kernel_read(file, 0, buf, CKPT_SIZE);
+   fput_light(file, fput_needed);
+   if (ret = 0)
+   goto out;
+   
+   nr_scanned = sscanf(buf, %d %s, pid, exe_filename);
+
+   if (nr_scanned != 2)
+   return -EINVAL;
+
+
+   ret = do_execve(exe_filename, argv, envp, regs);
+out:
+return ret;
+}
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/2] sys_restore: set the pid number

2008-07-25 Thread Serge E. Hallyn
Set the pid number for a restored task.  This is purely a toy, as it
only sets the pidnr in the lowest level pid namespace.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 kernel/fork.c |5 +
 kernel/pid.c  |   19 +++
 2 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 0e43f69..41c46d2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1685,6 +1685,8 @@ int unshare_files(struct files_struct **displaced)
 char buf[CKPT_SIZE];
 char exe_filename[PAGE_SIZE];
 
+extern int choose_pidmap(struct pid *pid, int new);
+
 /*
  * userspace will already have made us a new pidns
  */
@@ -1717,6 +1719,9 @@ long do_restore(unsigned int fd,
if (nr_scanned != 2)
return -EINVAL;
 
+   ret = choose_pidmap(task_pid(current), pid);
+   if (!ret)
+   return -EAGAIN;
 
ret = do_execve(exe_filename, argv, envp, regs);
 out:
diff --git a/kernel/pid.c b/kernel/pid.c
index 30bd5d4..88a5e2a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,25 @@ static void free_pidmap(struct upid *upid)
atomic_inc(map-nr_free);
 }
 
+int choose_pidmap(struct pid *pid, int new)
+{
+   struct pidmap *map;
+   int level = pid-level;
+   struct upid *upid = pid-numbers[level];
+   struct pid_namespace *pid_ns = upid-ns;
+   int old = upid-nr;
+
+   map = pid_ns-pidmap[new/BITS_PER_PAGE];
+   if (!test_and_set_bit(new, map-page)) {
+   map = pid_ns-pidmap[old/BITS_PER_PAGE];
+   clear_bit(old, map-page);
+   upid-nr = new;
+   return 1;
+   }
+
+   return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
int i, offset, max_scan, pid, last = pid_ns-last_pid;
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 0/6] user namespaces: introduction

2008-07-25 Thread Serge E. Hallyn
Following is a set of user namespace patches I've been playing with
this week.

The first two patches are I believe fixes which should go in regardless
of which direction user namespaces take.

The rest of the patches are one approach to providing default cross-userns
isolation for files.  Any filesystem can provide its own intelligent
cross-userns userid equivalence checks by defining its own permission
function, which is what Eric and I have been talking about doing.

The next step is probably to handle some of the task-to-task
cross-userns checks.

Comments appreciated.

thanks,
-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 2/6] user namespaces: move user_ns from nsproxy into user struct

2008-07-25 Thread Serge E. Hallyn
From ec5f54faf5afd16cb6cef40ebaaf3da25989d185 Mon Sep 17 00:00:00 2001
From: Serge Hallyn [EMAIL PROTECTED]
Date: Thu, 24 Jul 2008 17:52:41 -0500
Subject: [PATCH 2/6] user namespaces: move user_ns from nsproxy into user struct

When we get the sysfs support needed to support fair user scheduling
along with user namespaces, then we will need to be able to get the
user namespace from the user struct.

So we need the user_ns to be a part of struct user.  Once we can
access it from tsk-user, we no longer have a use for
tsk-nsproxy-user_ns.

When a user_namespace is created, the user which created it is
marked as its 'creator'.  The user_namespace pins the creator.
Each userid in a user_ns pins the user_ns.  This keeps refcounting
nice and simple.

At the same time, this patch simplifies the refcounting.  The current
user and userns locking works as follows:

The task pins the user struct.

The task pins the nsproxy, the nsproxy pins the user_ns.

When a new user_ns is created, it creates a root user for
it, and pins it.  When the nsproxy releases the user_ns,
the userns tries to release all its user structs.

So you see that the refcounting works for now, but only
because the nsproxy (and therefore usr_ns) and user structs
will be freed at the same time - when the last task using
them is released.

Now we need to put the user_ns in the struct user.  You can
see that will mess up the refcounting.

Fortunately, once the user_ns is available from tsk-user,
we don't need it in nsproxy.

So here is how the refcounting *should* be done:

The task pins the user struct.

The user struct pins its user namespace.

The user namespace pins the struct user which created it.

A user namespace now doesn't need to release its userids,
because it is only released when its last user disappears.

This patch makes those changes.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 include/linux/init_task.h  |1 -
 include/linux/key.h|3 ++
 include/linux/nsproxy.h|1 -
 include/linux/sched.h  |3 +-
 include/linux/user_namespace.h |   10 +++
 kernel/fork.c  |3 +-
 kernel/nsproxy.c   |   10 +--
 kernel/sys.c   |4 +-
 kernel/user.c  |   52 +++
 kernel/user_namespace.c|   36 +++
 security/keys/process_keys.c   |7 -
 11 files changed, 45 insertions(+), 85 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 93c45ac..ce87b8a 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -57,7 +57,6 @@ extern struct nsproxy init_nsproxy;
.mnt_ns = NULL, \
INIT_NET_NS(net_ns) \
INIT_IPC_NS(ipc_ns) \
-   .user_ns= init_user_ns,\
 }
 
 #define INIT_SIGHAND(sighand) {
\
diff --git a/include/linux/key.h b/include/linux/key.h
index c45c962..ba53aef 100644
--- a/include/linux/key.h
+++ b/include/linux/key.h
@@ -277,6 +277,8 @@ extern ctl_table key_sysctls[];
  * the userspace interface
  */
 extern void switch_uid_keyring(struct user_struct *new_user);
+extern void switch_uid_keyring_task(struct task_struct *tsk,
+   struct user_struct *new_user);
 extern int copy_keys(unsigned long clone_flags, struct task_struct *tsk);
 extern int copy_thread_group_keys(struct task_struct *tsk);
 extern void exit_keys(struct task_struct *tsk);
@@ -305,6 +307,7 @@ extern void key_init(void);
 #define key_ref_to_ptr(k)  ({ NULL; })
 #define is_key_possessed(k)0
 #define switch_uid_keyring(u)  do { } while(0)
+#define switch_uid_keyring_task(t,u)   do { } while(0)
 #define __install_session_keyring(t, k)({ NULL; })
 #define copy_keys(f,t) 0
 #define copy_thread_group_keys(t)  0
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index 0e66b57..7c7fb20 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -27,7 +27,6 @@ struct nsproxy {
struct ipc_namespace *ipc_ns;
struct mnt_namespace *mnt_ns;
struct pid_namespace *pid_ns;
-   struct user_namespace *user_ns;
struct net   *net_ns;
 };
 extern struct nsproxy init_nsproxy;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index cf36e14..2c2f036 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -600,7 +600,7 @@ struct user_struct {
/* Hash table maintenance information */
struct hlist_node uidhash_node;
uid_t uid;
-   struct user_namespace *user_namespace;
+   struct 

[Devel] [PATCH 1/6] user namespaces: introduce user_struct-user_namespace relationship

2008-07-25 Thread Serge E. Hallyn
From 9382f22a6c751e90baa4e7f3ba24c509e50a47a8 Mon Sep 17 00:00:00 2001
From: Serge Hallyn [EMAIL PROTECTED]
Date: Tue, 22 Jul 2008 13:31:37 -0500
Subject: [PATCH 1/6] user namespaces: introduce user_struct-user_namespace 
relationship

When a task does clone(CLONE_NEWNS), the task's user is the 'creator' of the
new user_namespace, and the user_namespace is tacked onto a list of those
created by this user.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 include/linux/sched.h  |1 +
 include/linux/user_namespace.h |1 +
 kernel/user.c  |7 +++
 kernel/user_namespace.c|   20 +++-
 4 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc7e592..cf36e14 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -600,6 +600,7 @@ struct user_struct {
/* Hash table maintenance information */
struct hlist_node uidhash_node;
uid_t uid;
+   struct user_namespace *user_namespace;
 
 #ifdef CONFIG_USER_SCHED
struct task_group *tg;
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index b5f41d4..f9477c3 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -13,6 +13,7 @@ struct user_namespace {
struct kref kref;
struct hlist_head   uidhash_table[UIDHASH_SZ];
struct user_struct  *root_user;
+   struct user_struct  *creator;
 };
 
 extern struct user_namespace init_user_ns;
diff --git a/kernel/user.c b/kernel/user.c
index 865ecf5..cfe8309 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -22,6 +22,7 @@ struct user_namespace init_user_ns = {
.refcount   = ATOMIC_INIT(2),
},
.root_user = root_user,
+   .creator = root_user,
 };
 EXPORT_SYMBOL_GPL(init_user_ns);
 
@@ -53,6 +54,7 @@ struct user_struct root_user = {
.files  = ATOMIC_INIT(0),
.sigpending = ATOMIC_INIT(0),
.locked_shm = 0,
+   .user_namespace = init_user_ns,
 #ifdef CONFIG_USER_SCHED
.tg = init_task_group,
 #endif
@@ -321,6 +323,7 @@ done:
  */
 static inline void free_user(struct user_struct *up, unsigned long flags)
 {
+   put_user_ns(up-user_namespace);
/* restore back the count */
atomic_inc(up-__count);
spin_unlock_irqrestore(uidhash_lock, flags);
@@ -347,6 +350,7 @@ static inline void free_user(struct user_struct *up, 
unsigned long flags)
sched_destroy_user(up);
key_put(up-uid_keyring);
key_put(up-session_keyring);
+   put_user_ns(up-user_namespace);
kmem_cache_free(uid_cachep, up);
 }
 
@@ -409,6 +413,8 @@ struct user_struct *alloc_uid(struct user_namespace *ns, 
uid_t uid)
if (sched_create_user(new)  0)
goto out_free_user;
 
+   new-user_namespace = get_user_ns(ns);
+
if (uids_user_create(new))
goto out_destoy_sched;
 
@@ -441,6 +447,7 @@ struct user_struct *alloc_uid(struct user_namespace *ns, 
uid_t uid)
 
 out_destoy_sched:
sched_destroy_user(new);
+   put_user_ns(new-user_namespace);
 out_free_user:
kmem_cache_free(uid_cachep, new);
 out_unlock:
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index a9ab059..e8db443 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -19,7 +19,6 @@
 static struct user_namespace *clone_user_ns(struct user_namespace *old_ns)
 {
struct user_namespace *ns;
-   struct user_struct *new_user;
int n;
 
ns = kmalloc(sizeof(struct user_namespace), GFP_KERNEL);
@@ -38,15 +37,17 @@ static struct user_namespace *clone_user_ns(struct 
user_namespace *old_ns)
return ERR_PTR(-ENOMEM);
}
 
-   /* Reset current-user with a new one */
-   new_user = alloc_uid(ns, current-uid);
-   if (!new_user) {
-   free_uid(ns-root_user);
-   kfree(ns);
-   return ERR_PTR(-ENOMEM);
-   }
+   /* pin the creating user */
+   ns-creator = current-user;
+   atomic_inc(ns-creator-__count);
+
+   /*
+* The alloc_uid() incremented the userns refcount,
+* so drop it again 
+*/
+   put_user_ns(ns);
 
-   switch_uid(new_user);
+   switch_uid(ns-root_user);
return ns;
 }
 
@@ -72,6 +73,7 @@ void free_user_ns(struct kref *kref)
 
ns = container_of(kref, struct user_namespace, kref);
release_uids(ns);
+   free_uid(ns-creator);
kfree(ns);
 }
 EXPORT_SYMBOL(free_user_ns);
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 3/6] user namespaces: rig generic_permission for simple userns check

2008-07-25 Thread Serge E. Hallyn
From f6d09b06a1106936010bffd420267f5b7ee66238 Mon Sep 17 00:00:00 2001
From: Serge Hallyn [EMAIL PROTECTED]
Date: Wed, 23 Jul 2008 17:01:09 -0500
Subject: [PATCH 3/6] user namespaces: rig generic_permission for simple userns 
check

Filesystems can provide their own permission() functions to do
advanced inter-user_namespace userid equivalence checks.

For those filesystems which do not support that, we will do
a simple check that current's user namespace is equivalent to
the user_namespace which mounted the filesystem.  If it is
not equivalent, then the task can only have user nobody (that
is, the 'other') permissions to a file.

For now, we actually just compare the user's user_ns to the
init_user_ns.  Next we will set the sb-user_ns to that of
the task mounting a filesystem, and use inode-i_sb-user_ns
instead of init_user_ns.  By punting even on that, the
implications, and therefore (in)correctness of this patch should
be all the easier to verify.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 fs/namei.c |   14 +-
 1 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 01e67dd..d5336fd 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -31,6 +31,7 @@
 #include linux/file.h
 #include linux/fcntl.h
 #include linux/device_cgroup.h
+#include linux/nsproxy.h
 #include asm/namei.h
 #include asm/uaccess.h
 
@@ -168,7 +169,7 @@ void putname(const char *name)
 EXPORT_SYMBOL(putname);
 #endif
 
-
+extern struct user_namespace init_user_ns;
 /**
  * generic_permission  -  check for access rights on a Posix-like filesystem
  * @inode: inode to check access rights for
@@ -184,7 +185,15 @@ int generic_permission(struct inode *inode, int mask,
int (*check_acl)(struct inode *inode, int mask))
 {
umode_t mode = inode-i_mode;
+   int same_userns = (current-user-user_ns == init_user_ns);
 
+   /*
+* If we're not in the inode's user namespace, we get
+* user nobody permissions, and we ignore acls
+* (bc serge doesn't know how to handle acls in this case)
+*/
+   if (!same_userns)
+   goto check;
if (current-fsuid == inode-i_uid)
mode = 6;
else {
@@ -200,11 +209,14 @@ int generic_permission(struct inode *inode, int mask,
mode = 3;
}
 
+check:
/*
 * If the DACs are ok we don't need any capability check.
 */
if (((mode  mask  (MAY_READ|MAY_WRITE|MAY_EXEC)) == mask))
return 0;
+   if (!same_userns)
+   return -EACCES;
 
  check_capabilities:
/*
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 5/6] user namespaces: refuse create in other user_ns

2008-07-25 Thread Serge E. Hallyn
From 4d2c23452a67e25856893ab16fefd0f6e5aa58df Mon Sep 17 00:00:00 2001
From: Serge Hallyn [EMAIL PROTECTED]
Date: Thu, 24 Jul 2008 06:37:43 -0500
Subject: [PATCH 5/6] user namespaces: refuse create in other user_ns

Refuse writing to a directory in another user_ns.  We can't
support file creation because we wouldn't know who should own
the file.  This refuses file deletion as well - which I think
is the sensible thing to do.

File writing is still allowed if the 'user other' permissions
include write.  That again probably makes sense for logging
and such.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 fs/namei.c |6 ++
 1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index adf5f1b..b39a990 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -213,6 +213,12 @@ int generic_permission(struct inode *inode, int mask,
 
 check:
/*
+* Can't write to a directory in another user_ns
+* We wouldn't know who to make the owner!
+*/
+   if (!same_userns  S_ISDIR(inode-i_mode)  (maskMAY_WRITE))
+   return -EACCES;
+   /*
 * If the DACs are ok we don't need any capability check.
 */
if (((mode  mask  (MAY_READ|MAY_WRITE|MAY_EXEC)) == mask))
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 6/6] user_namespace: move put_user_ns outside lock

2008-07-25 Thread Serge E. Hallyn
Oops, this one was supposed to get folded into the first patch,
obviously.  Sorry.

-serge

From daa76939c72adf146ded8a39e0211886e2bc943a Mon Sep 17 00:00:00 2001
From: Serge Hallyn [EMAIL PROTECTED]
Date: Fri, 25 Jul 2008 14:24:32 -0500
Subject: [PATCH 6/6] user_namespace: move put_user_ns outside lock

Since put_user_ns calls free_user on the creator, we shouldn't
call it from inside the lock in free_user.

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 kernel/user.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/user.c b/kernel/user.c
index 034dae5..d3dd353 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -323,11 +323,11 @@ done:
  */
 static inline void free_user(struct user_struct *up, unsigned long flags)
 {
-   put_user_ns(up-user_ns);
/* restore back the count */
atomic_inc(up-__count);
spin_unlock_irqrestore(uidhash_lock, flags);
 
+   put_user_ns(up-user_ns);
INIT_WORK(up-work, remove_user_sysfs_dir);
schedule_work(up-work);
 }
-- 
1.5.4.3

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [PATCH 0/2] sys_restore prototype

2008-07-25 Thread Serge E. Hallyn
We were talking this morning about what trivial patchset to begin
with to get a start on checkpoint and restart.  We thought that
rather than start with checkpoint, maybe we should start with
something that reads a checkpoint file and restarts a single
task.  In this case, restart means it sets the process id and
executes the file which are found in the checkpoint file.

So here's what we whipped up for a half hour this morning,
and during some of Mark's talk this afternoon.

It refuses to run if it isn't the container init, so you must
unshare your pidns before calling sys_restore().

To test, I did:

[EMAIL PROTECTED] ~]# cat mycheckpoint 
99 /root/whoami
[EMAIL PROTECTED] ~]# cat restore.c
#include stdio.h
#include sys/types.h
#include sys/stat.h
#include fcntl.h


int main()
{
int ret;
char *argv[3];
char *envp[1];

//argv[0] = argv[1] = /bin/bash;
//argv[2] = envp[0] = NULL;
argv[0] = /root/whoami;
argv[1] = envp[0] = NULL;
int fd = open(/root/mycheckpoint, O_RDONLY);
if (fd  0)
perror(open checkpoint file);
ret = syscall(327, fd, argv, envp);
printf(syscall returned %d\n, ret);
perror(syscall for checkpoint);
close(fd);
return ret;
}
[EMAIL PROTECTED] ~]# cat whoami.c
#include stdio.h
#include sys/types.h
#include unistd.h

int main()
{
printf(I am %d\n, getpid());
return 0;
}

Next, I create a new pid namespace, remount /proc, and
 execute 'restore' using 'exec' so that pid 1 is doing it:

[EMAIL PROTECTED] ~]# /home/hallyn/cryo/utils/ns_exec -cp /bin/bash
about to clone with 2002
[EMAIL PROTECTED] ~]# mount -t proc none /proc
[EMAIL PROTECTED] ~]# ps -ef
UIDPID  PPID  C STIME TTY  TIME CMD
root 1 0  1 18:46 pts/000:00:00 /bin/bash
root26 1  0 18:46 pts/000:00:00 ps -ef
[EMAIL PROTECTED] ~]# exec ./restore
I am 99

Seems to work.

-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 1/6] user namespaces: introduce user_struct-user_namespace relationship

2008-07-25 Thread Alexey Dobriyan
On Fri, Jul 25, 2008 at 07:27:25PM -0500, Serge E. Hallyn wrote:
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -600,6 +600,7 @@ struct user_struct {
   /* Hash table maintenance information */
   struct hlist_node uidhash_node;
   uid_t uid;
 + struct user_namespace *user_namespace;

Call it user_ns please, and file user_ns.c.

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


Re: [Devel] [PATCH 1/6] user namespaces: introduce user_struct-user_namespace relationship

2008-07-25 Thread Serge E. Hallyn
Quoting Alexey Dobriyan ([EMAIL PROTECTED]):
 On Fri, Jul 25, 2008 at 07:27:25PM -0500, Serge E. Hallyn wrote:
  --- a/include/linux/sched.h
  +++ b/include/linux/sched.h
  @@ -600,6 +600,7 @@ struct user_struct {
  /* Hash table maintenance information */
  struct hlist_node uidhash_node;
  uid_t uid;
  +   struct user_namespace *user_namespace;
 
 Call it user_ns please,

Odd, I had the same thought, but seem to have applied at the wrong
patch (please see patch 2).

Thanks.

  and file user_ns.c.

I'm ok with that, but that file has existed for quite some time
by now, so I'm not introducing it.

If noone objects, I'll rename the file (as well as
include/linux/user_namespace.h).

thanks,
-serge
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel