[Devel] Re: [PATCH 1/2] CFS CGroup: Code cleanup

2007-10-24 Thread Ingo Molnar

* Srivatsa Vaddagiri [EMAIL PROTECTED] wrote:

 On Tue, Oct 23, 2007 at 07:32:27PM -0700, Paul Menage wrote:
  Clean up some CFS CGroup code
  
  - replace cont with cgrp in a few places in the CFS cgroup code, 
  - use write_uint rather than write for cpu.shares write function
  
  Signed-off-by: Paul Menage [EMAIL PROTECTED]
 
 Acked-by: Srivatsa Vaddagiri [EMAIL PROTECTED]

thanks, applied.

Ingo
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


Re: [Devel] code handles migration

2007-10-24 Thread Kirill Korotaev
99% of the checkpointing code is concentrated in kernel/cpt/*

Kirill


Yi Wang wrote:
 Hello,
 
 I'm new to OpenVZ. I'm interested in learning more about the way OpenVZ
 handles its VPS migration, especially the virtualized network stack.
 
 I just downloaded the linux-2.6.18-openvz source code. Can anybody help
 me get up to speed by giving some pointers on where in the source tree
 to focus?
 
 Thanks a bunch!
 Yi
 
 
 



[Devel] Re: [PATCH] Bogus KERN_ALERT on oops

2007-10-24 Thread Ingo Molnar

* Alexey Dobriyan [EMAIL PROTECTED] wrote:

 printing eip: f881b9f3 *pdpt = 3001 1*pde = 0480a067 
 *pte = 
   ^^^

thanks, added this to the x86 queue.

Ingo



[Devel] Re: [PATCH] Bogus KERN_ALERT on oops

2007-10-24 Thread Alexey Dobriyan
On Wed, Oct 24, 2007 at 12:33:18PM +0200, Ingo Molnar wrote:
 
 * Pekka Enberg [EMAIL PROTECTED] wrote:
 
  -	printk(KERN_ALERT "*pde = %016Lx ", page);
  +	printk("*pde = %016Lx ", page);
  
  Use the new KERN_CONT annotation here?
 
 indeed - i changed the patch to do that.

Might as well change comment around KERN_CONT -- for starters it lied
about early bootup phase since day one.

Proposed text:

/*
 * Annotation for a continued line of log printout (only done
 * after a line that had no enclosing \n).
 *
 * Introduced because checkpatch.pl couldn't be arsed to learn C
 * and distinguish continued printk() from the one that starts
 * new line.
 *
 * Caveat #1: Empty string-literal, so compiler can't check for
 *KERN_CONT misuse.
 * Caveat #2: Empty string-literal, so it can't be used in
 *printk(var); situations.
 * Caveat #3: takes characters on the screen, so code is harder
 *to read.
 * Caveat #4: checkpatch.pl doesn't know C, so it can't check
 *for KERN_CONT misuse, anyway.
 */



[Devel] Re: [PATCH] Bogus KERN_ALERT on oops

2007-10-24 Thread Ingo Molnar

* Pekka Enberg [EMAIL PROTECTED] wrote:

  -	printk(KERN_ALERT "*pde = %016Lx ", page);
  +	printk("*pde = %016Lx ", page);
 
 Use the new KERN_CONT annotation here?

indeed - i changed the patch to do that.

Ingo



[Devel] Re: [RFC][PATCH] memory cgroup enhancements updated [2/10] force empty interface

2007-10-24 Thread Balbir Singh
KAMEZAWA Hiroyuki wrote:
 This patch adds an interface, memory.force_empty.
 Any write to this file will drop all charges in this cgroup if
 there are no tasks under it.
 
 %echo 1 > /../memory.force_empty
 
 will drop all charges of the memory cgroup if the cgroup's task list is empty.
 
 This is useful for invoking rmdir() against a memory cgroup successfully.
 
 Tested and worked well on x86_64/fake-NUMA system.
 
 Changelog v4 -> v5:
   - added comments to mem_cgroup_force_empty()
   - made mem_force_empty_read return -EINVAL
   - cleanup mem_cgroup_force_empty_list()
   - removed SWAP_CLUSTER_MAX
 
 Changelog v3 -> v4:
   - adjusted to 2.6.23-mm1
   - fixed typo
   - changed buf[2]=0 to static const
 
 Changelog v2 -> v3:
   - changed the name from force_reclaim to force_empty.
 
 Changelog v1 -> v2:
   - added a new interface force_reclaim.
   - changed spin_lock to spin_lock_irqsave().
 
 
 Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
 
 
  mm/memcontrol.c |  110 
 
  1 file changed, 103 insertions(+), 7 deletions(-)
 
 Index: devel-2.6.23-mm1/mm/memcontrol.c
 ===
 --- devel-2.6.23-mm1.orig/mm/memcontrol.c
 +++ devel-2.6.23-mm1/mm/memcontrol.c
 @@ -480,6 +480,7 @@ void mem_cgroup_uncharge(struct page_cgr
 	page = pc->page;
 	/*
 	 * get page->cgroup and clear it under lock.
 +	 * force_empty can drop page->cgroup without checking refcnt.
 	 */
 	if (clear_page_cgroup(page, pc) == pc) {
 		mem = pc->mem_cgroup;
 @@ -489,13 +490,6 @@ void mem_cgroup_uncharge(struct page_cgr
 		list_del_init(&pc->lru);
 		spin_unlock_irqrestore(&mem->lru_lock, flags);
 		kfree(pc);
 -	} else {
 -		/*
 -		 * Note: This will be removed when the force-empty patch is
 -		 * applied. just show warning here.
 -		 */
 -		printk(KERN_ERR "Race in mem_cgroup_uncharge() ?");
 -		dump_stack();
 	}
 	}
 }
 @@ -543,6 +537,76 @@ retry:
   return;
  }
 
 +/*
 + * This routine traverses the page_cgroups in the given list and drops them all.
 + * This routine ignores page_cgroup->ref_cnt.
 + * *And* this routine doesn't reclaim the page itself, just removes the page_cgroup.
 + */
 +#define FORCE_UNCHARGE_BATCH	(128)
 +static void
 +mem_cgroup_force_empty_list(struct mem_cgroup *mem, struct list_head *list)
 +{
 +	struct page_cgroup *pc;
 +	struct page *page;
 +	int count;
 +	unsigned long flags;
 +
 +retry:
 +	count = FORCE_UNCHARGE_BATCH;
 +	spin_lock_irqsave(&mem->lru_lock, flags);
 +
 +	while (--count && !list_empty(list)) {
 +		pc = list_entry(list->prev, struct page_cgroup, lru);
 +		page = pc->page;
 +		/* Avoid race with charge */
 +		atomic_set(&pc->ref_cnt, 0);
 +		if (clear_page_cgroup(page, pc) == pc) {
 +			css_put(&mem->css);
 +			res_counter_uncharge(&mem->res, PAGE_SIZE);
 +			list_del_init(&pc->lru);
 +			kfree(pc);
 +		} else	/* being uncharged ? ...do relax */
 +			break;
 +	}
 +	spin_unlock_irqrestore(&mem->lru_lock, flags);
 +	if (!list_empty(list)) {
 +		cond_resched();
 +		goto retry;
 +	}
 +	return;
 +}
 +

We could potentially share some of this code with the background reclaim
code being worked upon by YAMAMOTO-San.

 +/*
 + * make mem_cgroup's charge to be 0 if there is no task.
 + * This enables deleting this mem_cgroup.
 + */
 +
 +int mem_cgroup_force_empty(struct mem_cgroup *mem)
 +{
 +	int ret = -EBUSY;
 +	css_get(&mem->css);
 +	/*
 +	 * page reclaim code (kswapd etc..) will move pages between
 +	 * active_list <-> inactive_list while we don't take a lock.
 +	 * So, we have to do loop here until all lists are empty.
 +	 */
 +	while (!(list_empty(&mem->active_list) &&
 +		 list_empty(&mem->inactive_list))) {
 +		if (atomic_read(&mem->css.cgroup->count) > 0)
 +			goto out;
 +		/* drop all page_cgroup in active_list */
 +		mem_cgroup_force_empty_list(mem, &mem->active_list);
 +		/* drop all page_cgroup in inactive_list */
 +		mem_cgroup_force_empty_list(mem, &mem->inactive_list);
 +	}
 +	ret = 0;
 +out:
 +	css_put(&mem->css);
 +	return ret;
 +}
 +
 +
 +
  int mem_cgroup_write_strategy(char *buf, unsigned long long *tmp)
  {
 	*tmp = memparse(buf, &buf);
 @@ -628,6 +692,33 @@ static ssize_t mem_control_type_read(str
   ppos, buf, s - buf);
  }
 
 +
 +static ssize_t mem_force_empty_write(struct cgroup *cont,
 + struct cftype *cft, struct file *file,
 +  

[Devel] Help required regarding tool for OpenVZ

2007-10-24 Thread Khyati Sanghvi
Hi,

I'm new to OpenVZ. I'm interested in making a tool which can give
performance statistics for OpenVZ, helping users get details about
CPU usage, memory usage, etc. for each VE.

  For this task, one file which can act as a source is /proc/vz/vzstat. Can
somebody guide me on which performance information a user or performance
analyzer might want to know? It would also be great if somebody could
point out other user-information tools which need to be developed
for OpenVZ.

Thanks in advance...

With regards,
Khyati


[Devel] Re: [RFC][PATCH] memory cgroup enhancements updated [8/10] add pre_destroy handler

2007-10-24 Thread Balbir Singh
KAMEZAWA Hiroyuki wrote:
 My main purpose of this patch is for memory controller..
 
 This patch adds a handler pre_destroy to cgroup_subsys.
 It is called before cgroup_rmdir() checks all subsys's refcnt.
 
 I think this is useful for subsystems which have some extra refs
 even if there are no tasks in the cgroup.
 
 Signed-off-by: KAMEZAWA Hiroyuki [EMAIL PROTECTED]
 
  include/linux/cgroup.h |1 +
  kernel/cgroup.c|7 +++
  2 files changed, 8 insertions(+)
 
 Index: devel-2.6.23-mm1/include/linux/cgroup.h
 ===
 --- devel-2.6.23-mm1.orig/include/linux/cgroup.h
 +++ devel-2.6.23-mm1/include/linux/cgroup.h
 @@ -233,6 +233,7 @@ int cgroup_is_descendant(const struct cg
  struct cgroup_subsys {
   struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
 struct cgroup *cont);
 + void (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cont);
   void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cont);
   int (*can_attach)(struct cgroup_subsys *ss,
 struct cgroup *cont, struct task_struct *tsk);
 Index: devel-2.6.23-mm1/kernel/cgroup.c
 ===
 --- devel-2.6.23-mm1.orig/kernel/cgroup.c
 +++ devel-2.6.23-mm1/kernel/cgroup.c
 @@ -2158,6 +2158,13 @@ static int cgroup_rmdir(struct inode *un
 	parent = cont->parent;
 	root = cont->root;
 	sb = root->sb;
 +	/*
 +	 * Notify subsystems that an rmdir() request has come.
 +	 */
 +	for_each_subsys(root, ss) {
 +		if ((cont->subsys[ss->subsys_id]) && ss->pre_destroy)
 +			ss->pre_destroy(ss, cont);
 +	}
 

Is pre_destroy really required? Can't we do what we do here in destroy?

 	if (cgroup_has_css_refs(cont)) {
 		mutex_unlock(&cgroup_mutex);
 


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[Devel] Re: [RFC][PATCH] memory cgroup enhancements updated [3/10] remember pagecache

2007-10-24 Thread KAMEZAWA Hiroyuki
On Wed, 24 Oct 2007 19:26:34 +0530
Balbir Singh [EMAIL PROTECTED] wrote:

 Could we define
 
 enum {
   MEM_CGROUP_CHARGE_TYPE_CACHE  = 0,
   MEM_CGROUP_CHARGE_TYPE_MAPPED = 1,
 };
 
 and use the enums here and below.
 
Okay, I'll use this approach. 

Thanks,
-Kame


[Devel] Re: [RFC][PATCH] memory cgroup enhancements updated [10/10] NUMA aware account

2007-10-24 Thread KAMEZAWA Hiroyuki
On Wed, 24 Oct 2007 20:29:08 +0530
Balbir Singh [EMAIL PROTECTED] wrote:

  +	for_each_possible_cpu(cpu) {
  +		int nid = cpu_to_node(cpu);
  +		struct mem_cgroup_stat_cpu *mcsc;
  +		if (sizeof(*mcsc) < PAGE_SIZE)
  +			mcsc = kmalloc_node(sizeof(*mcsc), GFP_KERNEL, nid);
  +		else
  +			mcsc = vmalloc_node(sizeof(*mcsc), nid);
 
 Do we need to use the vmalloc() pool? I think we might be better off
 using a dedicated slab for ourselves
 
I admit this part is complicated. But ia64's MAX_NUMNODES=1024, and the stat
structure can grow, so we need vmalloc. I'll rewrite this part to look
better.

  +		memset(mcsc, 0, sizeof(*mcsc));
  +		mem->stat.cpustat[cpu] = mcsc;
  +	}
  	return &mem->css;
   }
  
  @@ -969,7 +1006,15 @@ static void mem_cgroup_pre_destroy(struc
   static void mem_cgroup_destroy(struct cgroup_subsys *ss,
  struct cgroup *cont)
   {
  -   kfree(mem_cgroup_from_cont(cont));
  +   struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
  +   int cpu;
  +	for_each_possible_cpu(cpu) {
  +		if (sizeof(struct mem_cgroup_stat_cpu) < PAGE_SIZE)
  +			kfree(mem->stat.cpustat[cpu]);
  +		else
  +			vfree(mem->stat.cpustat[cpu]);
  +	}
  +   kfree(mem);
   }
  
   static int mem_cgroup_populate(struct cgroup_subsys *ss,
  @@ -1021,5 +1066,5 @@ struct cgroup_subsys mem_cgroup_subsys =
  .destroy = mem_cgroup_destroy,
  .populate = mem_cgroup_populate,
  .attach = mem_cgroup_move_task,
  -   .early_init = 1,
  +   .early_init = 0,
 
 I don't understand why this change is required here?
 
If early_init = 1, we cannot call kmalloc/vmalloc while initializing 
init_mem_cgroup.
It's too early.

Thanks,
-Kame


[Devel] [PATCH] namespaces: introduce sys_hijack (v7)

2007-10-24 Thread Serge E. Hallyn
This is just a first stab at doing hijack by cgroup files.  I force
using the 'tasks' file just so that I can (a) predict and check
the name of the file, (b) make sure it's a cgroup file, and then
(c) trust that taking __d_cont(dentry->d_parent) gives me a legitimate
container.

Seems to work at least.

Paul, does this look reasonable?  task_from_cgroup_fd() in particular.

thanks,
-serge

From 8ba6a4ae33da18b95d967a799c5edce4266d1f1f Mon Sep 17 00:00:00 2001
From: [EMAIL PROTECTED] [EMAIL PROTECTED]
Date: Tue, 16 Oct 2007 09:36:49 -0700
Subject: [PATCH] namespaces: introduce sys_hijack (v7)

Move most of do_fork() into a new do_fork_task() which acts on
a new argument, task, rather than on current.  do_fork() becomes
a call to do_fork_task(current, ...).

Introduce sys_hijack (for x86 only so far).  It is like clone, but
in place of a stack pointer (which is assumed null) it accepts a
pid.  The process identified by that pid is the one which is
actually cloned.  Some state - including the file table, the signals
and sighand (and hence tty), and the ->parent - is taken from the
calling process.

A process to be hijacked may be identified by process id.
Alternatively, an open fd for a cgroup 'tasks' file may be
specified.  The first available task in that cgroup will then
be hijacked.

In order to hijack a process, the calling process must be
allowed to ptrace the target.

The effect is a sort of namespace enter.  The following program
uses sys_hijack to 'enter' all namespaces of the specified task.
For instance in one terminal, do

mount -t cgroup -o ns cgroup /cgroup
hostname
  qemu
ns_exec -u /bin/sh
  hostname serge
  echo $$
1073
  cat /proc/$$/cgroup
ns:/node_1073

In another terminal then do

hostname
  qemu
cat /proc/$$/cgroup
  ns:/
hijack pid 1073
  hostname
serge
  cat /proc/$$/cgroup
ns:/node_1073
hijack cgroup /cgroup/node_1073/tasks

Changelog:
Aug 23: send a stop signal to the hijacked process
(like ptrace does).
Oct 09: Update for 2.6.23-rc8-mm2 (mainly pidns)
Don't take task_lock under rcu_read_lock
Send hijacked process to cgroup_fork() as
the first argument.
Removed some unneeded task_locks.
Oct 16: Fix bug introduced into alloc_pid.
Oct 16: Add 'int which' argument to sys_hijack to
allow later expansion to use cgroup in place
of pid to specify what to hijack.
Oct 24: Implement hijack by open cgroup file.

==
hijack.c
==
#define _BSD_SOURCE
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

void usage(char *me)
{
	printf("Usage: %s pid <pid> | %s cgroup <cgroup_tasks_file>\n", me, me);
}

int exec_shell(void)
{
	return execl("/bin/sh", "/bin/sh", NULL);
}

int main(int argc, char *argv[])
{
	int id;
	int ret;
	int status;
	int use_pid = 0;

	if (argc < 3 || !strcmp(argv[1], "-h")) {
		usage(argv[0]);
		return 1;
	}
	if (strcmp(argv[1], "pid") == 0)
		use_pid = 1;

	if (use_pid)
		id = atoi(argv[2]);
	else {
		id = open(argv[2], O_RDONLY);
		if (id == -1) {
			perror("cgroup open");
			return 1;
		}
	}

	ret = syscall(327, SIGCHLD, use_pid ? 1 : 2, (unsigned long)id);

	if (!use_pid)
		close(id);
	if (ret == 0) {
		return exec_shell();
	} else if (ret < 0) {
		perror("sys_hijack");
	} else {
		printf("waiting on cloned process %d\n", ret);
		//	ret = waitpid(ret, &status, __WALL);
		while (waitpid(-1, &status, __WALL) != -1)
			;
		printf("cloned process exited with %d (waitpid ret %d)\n",
			status, ret);
	}

	return ret;
}
==

Signed-off-by: Serge Hallyn [EMAIL PROTECTED]
---
 arch/i386/kernel/process.c   |   87 +-
 arch/i386/kernel/syscall_table.S |1 +
 arch/s390/kernel/process.c   |   12 -
 include/asm-i386/unistd.h|3 +-
 include/linux/cgroup.h   |   10 -
 include/linux/ptrace.h   |1 +
 include/linux/sched.h|8 
 include/linux/syscalls.h |1 +
 kernel/cgroup.c  |   69 --
 kernel/fork.c|   67 +
 kernel/ptrace.c 

[Devel] Re: LSM and Containers

2007-10-24 Thread Serge E. Hallyn
Quoting Peter Dolding ([EMAIL PROTECTED]):
 On 10/25/07, Crispin Cowan [EMAIL PROTECTED] wrote:
  Peter Dolding wrote:
   The other thing you have not thought of, and it is critical: if LSM is the
   same LSM across all containers, what happens if that is breached and
   tripped to disable?  You only want to lose one container to a breach,
   not the whole box and dice in one hit.  It's also the reason why my
   design does not have a direct link between controllers.  No cascade
   through the system to take box and dice.
  
  Sorry, but I totally disagree.
 
  If you obtain enough privilege to disable the LSM in one container, you
  also obtain enough privilege to disable *other* LSMs that might be
  operating in different containers. This is a limitation of the
  Containers feature, not of LSM.
 
  That is not a Container feature.  Having enough privilege does
  not mean you can.  A root user in a container cannot play
  with other containers' applications.  There is a security split at the
  container edge when doing virtual servers, which by using one LSM you
  are disregarding.
  
  Simple point: if one LSM is disabled in a container, it can only get the
  max rights of that container, so it cannot see the other LSMs on the
  system below it.  That is also why in my model the layout is the same
  whether there is 1 or 1000 stacked, so an attacker cannot tell how deep
  they are in and whether there is anything to be gained by digging.  You
  have to break
You're sometimes hard to parse, but here are a few basic facts within
which to constrain our discussions:

1. LSMs are a part of the kernel.  The entire kernel is in the
   same trusted computing base
2. containers all run on the same kernel
3. whether an lsm is compromised, or a tty driver, or anything
   else which is in the TCB, all containers are compromised
4. it is very explicitly NOT a goal to hide from a container
   the fact that it is in a container.  So your 'cannot tell how
   deep they are' is not a goal.

If you want to be able to 'plug' LSMs in per container, by all means
feel free to write a proof of concept.  It is kind of a cool idea.  But
be clear about what you'll gain:  you allow the container admin to
constrain data access within his container in the way he chooses, using
the model with which he is comfortable.  It does nothing to protect one
container from another, does nothing to protect against kernel exploits,
and absolutely does nothing to protect a container from the 'host'.

Also please keep in mind that the container security framework is not
only not yet complete, it's pretty much not started.  My own idea for
how to best do it are outlined in emails which are in the containers
list archive.  But in terms of LSM they follow the idea Crispin
outlines, namely that the LSMs support containers themselves.  And a
process in a container started without CAP_NS_OVERRIDE (which does not
yet exist :) in its capability bounding set will only be able to access
files in another container as DAC user 'other' (i.e. if perms are 754, it
will get read access; if 750, then none), even if it has
CAP_DAC_OVERRIDE (unless it gets an authorization key for the owning
user in the target namespace, but *that's* probably *years* off).

-serge