[Devel] Re: Checkpoint/Restart mini-summit

2008-07-31 Thread C. Craig Ross
Hi Serge,

I can confirm the room, and we can get you some type of paper or white
board.  I should know tomorrow about the speakerphone and Internet
access.

C.


 Hi C.,

 I'm sorry after all the emails I'm not straight on what we're actually
 getting.  We have a U-shaped room, but do we have a blackboard or
 whiteboard?  Is there a speakerphone?  Wireless or at least wired
 internet?

 thanks,
 -serge



[Devel] Re: mini-summit conf#

2008-07-31 Thread Vivek Goyal
On Wed, Jul 09, 2008 at 10:57:10AM -0500, Serge E. Hallyn wrote:
 Before I forget,
 
 we're planning on having a call-in for the containers mini-summit.  The
 phone number is:
   US toll: 1-770-615-1382
   US toll-free: 1-877-421-0023
 
 When asked, the passcode is: 499229
 

I am interested in joining the discussion regarding cgroup/libcg,
especially the discussion on how to go about automatic placement of
tasks into the right cgroups based on defined rules.

Are there any plans to discuss the above topic? If yes, can I join in over
the phone? Do I have to register for OLS if I want to join in over the phone?

Thanks
Vivek


[Devel] Re: memrlimit controller merge to mainline

2008-07-31 Thread Joe MacDonald
2008/7/25 Balbir Singh [EMAIL PROTECTED]:

 There are applications that can/need to handle overcommit; it's just that we
 are not fully aware of them. Immediately after our meeting, I was pointed to
 http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit

I need to get caught up on this thread, but I did promise Balbir at
the mini-summit that I would appear soon-ish with actual use-cases on
this from some of the CGL folks.  Specifically the case I was thinking
of, other than the CGL requirement for VM Strict Overcommit, was finer-grained
rlimit accounting.  It started out in the Collaboration Summit
meeting in Austin as a discussion about the SCOPE gaps document and
CGOS-4.5 (curiously called Coarse Resource Enforcement, when it's
really trying to address per-thread limits).

The full document is here in PDF form:

http://www.scope-alliance.org/pr/SCOPE_CGOS_GAPS_PROFILE_v2.pdf

Having re-read the requirement from SCOPE and the memrlimit discussion,
though, I now suspect they may in fact be disjoint sets of functionality.

-J.


Re: [Devel] [RFC][PATCH 0/2] CR: save/restore a single, simple task

2008-07-31 Thread Andrey Mirkin
Hello Oren,

It is great that you have proposed your version of checkpointing/restart.
In a few days I will send a patchset with the OpenVZ checkpointing/restart.
That way we will be able to compare our approaches and take the best parts
from both.

Regards,
Andrey

On Wednesday 30 July 2008 07:24 Oren Laadan wrote:
 In the recent mini-summit at OLS 2008 and the following days it was
 agreed to tackle the checkpoint/restart (CR) by beginning with a very
 simple case: save and restore a single task, with simple memory
 layout, disregarding other task state such as files, signals etc.

 Following these discussions I coded a prototype that can do exactly
 that, as a starter. This code adds two system calls - sys_checkpoint
 and sys_restart - that a task can call to save and restore its state
 respectively. It also demonstrates how the checkpoint image file can
 be formatted, as well as showing its nested nature (e.g. cr_write_mm()
 -> cr_write_vma() nesting).

 The state that is saved/restored is the following:
 * some of the task_struct
 * some of the thread_struct and thread_info
 * the cpu state (including FPU)
 * the memory address space

 [The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
 of Linus's tree (uhhh.. don't ask why), but it applies against tonight's head too].

 In the current code, sys_checkpoint will checkpoint the current task,
 although the logic exists to checkpoint other tasks (not in the
 checkpointee's execution context). A simple loop will extend this to
 handle multiple processes. sys_restart restarts the current task, and
 with multiple tasks each task will call the syscall independently.
 (Actually, to checkpoint outside the context of a task, it is also
 necessary to handle restart-block logic when saving/restoring the
 thread data).

 It takes longer to describe what isn't implemented or supported by
 this prototype ... basically everything that isn't as simple as the
 above.

 As for containers - since we still don't have a representation for a
 container, this patch has no notion of a container. The tests for
 consistent namespaces (and isolation) are also omitted.

 Below are two example programs: one uses checkpoint (called ckpt) and
 one uses restart (called rstr). Execute like this (as a superuser):

 orenl:~/test$ ./ckpt > out.1
 hello, world!  (ret=1)    <-- sys_checkpoint returns positive id
   <-- ctrl-c
 orenl:~/test$ ./ckpt > out.2
 hello, world!  (ret=2)
   <-- ctrl-c
 orenl:~/test$ ./rstr < out.1
 hello, world!  (ret=0)    <-- sys_restart returns 0

 (if you check the output of ps, you'll see that rstr changed its
 name to ckpt, as expected).

 Hoping this will accelerate the discussion. Comments are welcome.
 Let the fun begin :)

 Oren.


 == ckpt.c 

 #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */

 #include <stdio.h>
 #include <stdlib.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <asm/unistd_32.h>
 #include <sys/syscall.h>

 int main(int argc, char *argv[])
 {
 	pid_t pid = getpid();
 	int ret;

 	ret = syscall(__NR_checkpoint, pid, STDOUT_FILENO, 0);
 	if (ret < 0)
 		perror("checkpoint");

 	fprintf(stderr, "hello, world!  (ret=%d)\n", ret);

 	while (1)
 		;

 	return 0;
 }

 == rstr.c 

 #define _GNU_SOURCE        /* or _BSD_SOURCE or _SVID_SOURCE */

 #include <stdio.h>
 #include <stdlib.h>
 #include <errno.h>
 #include <fcntl.h>
 #include <unistd.h>
 #include <asm/unistd_32.h>
 #include <sys/syscall.h>

 int main(int argc, char *argv[])
 {
 	pid_t pid = getpid();
 	int ret;

 	ret = syscall(__NR_restart, pid, STDIN_FILENO, 0);
 	if (ret < 0)
 		perror("restart");

 	printf("should not reach here !\n");

 	return 0;
 }


[Devel] Re: [RFC][PATCH 0/2] CR: save/restore a single, simple task

2008-07-31 Thread Daniel Lezcano
Oren Laadan wrote:
 Disclaimer: long reply :)
 
 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 In the recent mini-summit at OLS 2008 and the following days it was
 agreed to tackle the checkpoint/restart (CR) by beginning with a very
 simple case: save and restore a single task, with simple memory
 layout, disregarding other task state such as files, signals etc.

 Following these discussions I coded a prototype that can do exactly
 that, as a starter. This code adds two system calls - sys_checkpoint
 and sys_restart - that a task can call to save and restore its state
 respectively. It also demonstrates how the checkpoint image file can
 be formatted, as well as show its nested nature (e.g. cr_write_mm()
  -> cr_write_vma() nesting).

 The state that is saved/restored is the following:
 * some of the task_struct
 * some of the thread_struct and thread_info
 * the cpu state (including FPU)
 * the memory address space

 [The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
 of Linus's tree (uhhh.. don't ask why), but against tonight's head too].

 In the current code, sys_checkpoint will checkpoint the current task,
 although the logic exists to checkpoint other tasks (not in the
 checkpointee's execution context). A simple loop will extend this to
 handle multiple processes. sys_restart restarts the current tasks, and
 with multiple tasks each task will call the syscall independently.
 I assume that approach worked in Zap, so there must be a simple solution
 to this, but I don't see how having each process in a container
 independently call sys_restart works for sharing.  Oh, or is that where
 
 The main reason to do that (and I thought openvz works similarly ?) is
 that I want to re-use as much as possible the existing kernel functionality.
 Restart differs from checkpoint in that you have to construct new resources
 as opposed to only inspect existing resources. To inspect - you only need
 a reference to the object and then to obtain its state by accessing it. In
 contrast, to construct, you need to create a new resource.
 
 In almost all cases, creating a resource for a process is easiest if done by
 the process itself. For instance - to restore the memory map, you want the
 process that owns the target mm to call mmap() (in particular, the lower
 level and more convenient for us do_mmap_pgoff() function). If the process
 that restores a given vma didn't own that mm, it would take much more pain
 to build the vma into a foreign mm.
 
 Thus, there is a huge advantage of doing everything in-context of the target
 process, that is - we can re-use the existing kernel code (and spirit) to
 create the resources, instead of having to hand-craft them carefully with
 specialized code.
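
 (For illustration only: a rough sketch, not code from this patch, where the
 cr_hdr_vma fields and the cr_read_vma_pages() helper are assumed, of restoring
 one private anonymous vma in-context via do_mmap_pgoff():)

 static int cr_restore_vma(struct cr_ctx *ctx, struct cr_hdr_vma *hh)
 {
 	unsigned long addr;

 	/* we are the task that owns the target mm, so the plain mmap path works */
 	down_write(&current->mm->mmap_sem);
 	addr = do_mmap_pgoff(NULL, hh->vm_start, hh->vm_end - hh->vm_start,
 			     hh->prot, MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, 0);
 	up_write(&current->mm->mmap_sem);
 	if (IS_ERR_VALUE(addr))
 		return (int)addr;

 	/* then read the saved page contents back into the new mapping */
 	return cr_read_vma_pages(ctx, hh);	/* hypothetical helper */
 }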
 
 a 'container restart context' comes in?  An nsproxy has a pointer to a
 
 More or less. At a first approximation, this is how I envision it:
 
 0) in user space, a new (empty) container will be created with all the
 needed settings for the file system etc (mounts .. and the like)
 
 1) the first task (container init) will call sys_restart with the checkpoint
 image file.
 
 2) the code will verify the header, then read in the global section; it will
 create a restart-context which will be referenced from the container-object
 (one option we considered is to have the freezer-cgroup be that object).
 
 3) using the info from that section, it will create the task tree (forest)
 to be restored. In particular, new tasks will be created and each will end
 up in do_restart_task() inside the kernel.
 
 [note that in Zap, step 3 is still done in user space...]
 
 Since all tasks live in the container, they will all have access to the
 restart-context, through which all coordination is done.
 
 At first, the restart will be performed _one task at a time_, in the order
 the tasks were dumped. So while the init task restores itself, the remaining
 tasks sleep. When the init task finishes - it will wake the next in line,
 and so on. The last one will wake the init task to finalize the work. So:

 4) each task waits (sleeps) until it is prompted to restore its own state.
 When it completes, it wakes up the next task in line and goes to a freeze
 state.

 5) the init task finalizes the restart, and either completes the freeze or
 unfreezes the container, depending on what the user requested.

 This scheme makes sense because we assume that the data is streamed. So it
 does not make much sense to try to restart the 5th job before the 2nd job,
 because the data isn't there yet. Moreover, if they refer to the same shared
 object, job#5 will have to wait for job#2 to create the object, since its
 state was saved with that job.
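
 (For illustration only: a minimal sketch of that hand-off, not from any
 posted patch, with all names assumed. Each task sleeps until the shared
 restart context says it is its turn, restores itself, then wakes the next.)

 struct cr_restart_ctx {
 	wait_queue_head_t waitq;
 	int active;		/* position of the task whose turn it is */
 };

 static int cr_restart_wait_my_turn(struct cr_restart_ctx *rctx, int my_pos)
 {
 	int ret;

 	/* sleep until the previous task has finished restoring itself */
 	wait_event(rctx->waitq, rctx->active == my_pos);

 	ret = cr_read_my_state(rctx);	/* hypothetical: restore own state */

 	/* hand the turn to the next task in line (locking omitted) */
 	rctx->active++;
 	wake_up_all(&rctx->waitq);
 	return ret;
 }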
 
 In the future, to speed up the process by restarting multiple tasks
 concurrently, we'll have to read data from the stream into a buffer
 (read-ahead), and then restarting tasks could skip data that doesn't belong
 to them; while they may still need to wait for shared resources to be created, 

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Serge E. Hallyn
Quoting Louis Rilling ([EMAIL PROTECTED]):
 On Wed, Jul 30, 2008 at 10:40:35AM -0700, Dave Hansen wrote:
  On Wed, 2008-07-30 at 11:52 -0500, Serge E. Hallyn wrote:
   
   This list is getting on my nerves.  Louis, I'm sorry the threading
   is going to get messed up.
  
  I think I just cleared out the mime type filtering.
 
 Could the digital signature be the guilty part of my email?

Yeah, that was Dave's guess, and it seems likely.  Dave thinks he
unset whatever setting caused the bounce, so you should be fine to
keep the signatures in there.

thanks,
-serge


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Oren Laadan


Louis Rilling wrote:
 On Wed, Jul 30, 2008 at 02:27:52PM -0400, Oren Laadan wrote:

 Louis Rilling wrote:
 +/**
 + * cr_vma_fill_pgarr - fill a page-array with addr/page tuples for a vma
 + * @ctx - checkpoint context
 + * @pgarr - page-array to fill
 + * @vma - vma to scan
 + * @start - start address (updated)
 + */
 +static int cr_vma_fill_pgarr(struct cr_ctx *ctx, struct cr_pgarr *pgarr,
 +			struct vm_area_struct *vma, unsigned long *start)
 +{
 +	unsigned long end = vma->vm_end;
 +	unsigned long addr = *start;
 +	struct page **pagep;
 +	unsigned long *addrp;
 +	int cow, nr, ret = 0;
 +
 +	nr = pgarr->nleft;
 +	pagep = &pgarr->pages[pgarr->nused];
 +	addrp = &pgarr->addrs[pgarr->nused];
 +	cow = !!vma->vm_file;
 +
 +	while (addr < end) {
 +		struct page *page;
 +
 +		/* simplified version of get_user_pages(): already have vma,
 +		 * only need FOLL_TOUCH, and (for now) ignore fault stats */
 +
 +		cond_resched();
 +		while (!(page = follow_page(vma, addr, FOLL_TOUCH))) {
 +			ret = handle_mm_fault(vma->vm_mm, vma, addr, 0);
 +			if (ret & VM_FAULT_ERROR) {
 +				if (ret & VM_FAULT_OOM)
 +					ret = -ENOMEM;
 +				else if (ret & VM_FAULT_SIGBUS)
 +					ret = -EFAULT;
 +				else
 +					BUG();
 +				break;
 +			}
 +			cond_resched();
 +		}
 I guess that 'ret' should be checked somewhere after this loop.
 yes; this is where a break(2) construct in C would come handy :)
 
 Alternatively, putting the inner loop in a separate function often helps to
 handle errors in a cleaner way.

Also true. I opted to keep it that way to keep the code as similar as
possible to get_user_pages().

Note that the logic can be optimized by, instead of traversing the page
table once for each page, we could aggregate a few pages in each round.
I wanted to keep the code simple.

 
 +
 +		if (IS_ERR(page)) {
 +			ret = PTR_ERR(page);
 +			break;
 +		}
 +
 +		if (page == ZERO_PAGE(0))
 +			page = NULL;	/* zero page: ignore */
 +		else if (cow && page_mapping(page) != NULL)
 +			page = NULL;	/* clean cow: ignore */
 +		else {
 +			get_page(page);
 +			*(addrp++) = addr;
 +			*(pagep++) = page;
 +			if (--nr == 0) {
 +				addr += PAGE_SIZE;
 +				break;
 +			}
 +		}
 +
 +		addr += PAGE_SIZE;
 +	}
 +
 +	if (unlikely(ret < 0)) {
 +		nr = pgarr->nleft - nr;
 +		while (nr--)
 +			page_cache_release(*(--pagep));
 +		return ret;
 +	}
 +
 +	*start = addr;
 +	return (pgarr->nleft - nr);
 +}
 +
 
 
 +int cr_write_mm(struct cr_ctx *ctx, struct task_struct *t)
 +{
 +	struct cr_hdr h;
 +	struct cr_hdr_mm *hh = ctx->tbuf;
 +	struct mm_struct *mm;
 +	struct vm_area_struct *vma;
 +	int ret;
 +
 +	h.type = CR_HDR_MM;
 +	h.len = sizeof(*hh);
 +	h.id = ctx->pid;
 +
 +	mm = get_task_mm(t);
 +
 +	hh->tag = 1;	/* non-zero will mean first time encounter */
 +
 +	hh->start_code = mm->start_code;
 +	hh->end_code = mm->end_code;
 +	hh->start_data = mm->start_data;
 +	hh->end_data = mm->end_data;
 +	hh->start_brk = mm->start_brk;
 +	hh->brk = mm->brk;
 +	hh->start_stack = mm->start_stack;
 +	hh->arg_start = mm->arg_start;
 +	hh->arg_end = mm->arg_end;
 +	hh->env_start = mm->env_start;
 +	hh->env_end = mm->env_end;
 +
 +	hh->map_count = mm->map_count;
 Some fields above should also be protected with mmap_sem, like ->brk,
 ->map_count, and possibly others (I'm not a memory expert though).
 true; keep in mind, though, that the container will be frozen during
 this time, so nothing should change at all. The only exception would
 be if, for instance, someone is killing the container while we save
 its state.
 
 Sure. So you think that taking mm->mmap_sem below is useless? I tend to
 believe so, since no other task should share this mm_struct at this time, and
 we could state that ptrace should not interfere during restart. However, I'm
 never confident when ptrace considerations come in...

Not quite.

Probing the value of mm->brk is always safe, although it may turn out to
yield an incorrect value. Traversing the vma's isn't safe, because if, for
instance, the target task dies in the middle, it may alter the vma list.
So the mmap_sem protects against the latter.

Anyway, it won't hurt to be extra safe and take the semaphore earlier.
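
As a sketch only (not patch code; cr_write_vmas() is a hypothetical helper
standing in for the vma loop), "taking the semaphore earlier" would roughly
mean snapshotting the mm fields and walking the vma list under one down_read():

static int cr_write_mm_locked(struct cr_ctx *ctx, struct mm_struct *mm,
			      struct cr_hdr_mm *hh)
{
	int ret;

	down_read(&mm->mmap_sem);
	hh->start_brk = mm->start_brk;
	hh->brk = mm->brk;
	hh->map_count = mm->map_count;
	/* ... the remaining fields from cr_write_mm() above ... */
	ret = cr_write_vmas(ctx, mm);
	up_read(&mm->mmap_sem);
	return ret;
}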

Ptrace, btw, cannot come in because the container is (supposedly) frozen.

Oren.

 +
 +	/* FIX: need also mm->flags */
 +
 +	ret = cr_write_obj(ctx, &h, hh);
 +	if (ret < 0)
 +		goto out;
 +
 +	/* write the vma's */
 +	down_read(&mm->mmap_sem);
 +	for (vma = 

[Devel] Re: [RFC][PATCH 0/2] CR: save/restore a single, simple task

2008-07-31 Thread Oren Laadan


Daniel Lezcano wrote:
 Oren Laadan wrote:
 Disclaimer: long reply :)

 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 In the recent mini-summit at OLS 2008 and the following days it was
 agreed to tackle the checkpoint/restart (CR) by beginning with a very
 simple case: save and restore a single task, with simple memory
 layout, disregarding other task state such as files, signals etc.

 Following these discussions I coded a prototype that can do exactly
 that, as a starter. This code adds two system calls - sys_checkpoint
 and sys_restart - that a task can call to save and restore its state
 respectively. It also demonstrates how the checkpoint image file can
 be formatted, as well as show its nested nature (e.g. cr_write_mm()
  -> cr_write_vma() nesting).

 The state that is saved/restored is the following:
 * some of the task_struct
 * some of the thread_struct and thread_info
 * the cpu state (including FPU)
 * the memory address space

 [The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
 of Linus's tree (uhhh.. don't ask why), but against tonight's head 
 too].

 In the current code, sys_checkpoint will checkpoint the current task,
 although the logic exists to checkpoint other tasks (not in the
 checkpointee's execution context). A simple loop will extend this to
 handle multiple processes. sys_restart restarts the current tasks, and
 with multiple tasks each task will call the syscall independently.
 I assume that approach worked in Zap, so there must be a simple solution
 to this, but I don't see how having each process in a container
 independently call sys_restart works for sharing.  Oh, or is that where

 The main reason to do that (and I thought openvz works similarly ?) is
 that I want to re-use as much as possible the existing kernel 
 functionality.
 Restart differs from checkpoint in that you have to construct new 
 resources
 as opposed to only inspect existing resources. To inspect - you only need
 a reference to the object and then to obtain its state by accessing 
 it. In
 contrast, to construct, you need to create a new resource.

 In almost all cases, creating a resource for a process is easiest if 
 done by
 the process itself. For instance - to restore the memory map, you want 
 the
 process that owns the target mm to call mmap() (in particular, the lower
 level and more convenient for us do_mmap_pgoff() function). If the 
 process
 that restores a given vma didn't own that mm, it would take much more 
 pain
 to build the vma into a foreign mm.

 Thus, there is a huge advantage of doing everything in-context of the 
 target
 process, that is - we can re-use the existing kernel code (and spirit) to
 create the resources, instead of having to hand-craft them carefully with
 specialized code.

 a 'container restart context' comes in?  An nsproxy has a pointer to a

 More or less. At a first approximation, this is how I envision it:

 0) in user space, a new (empty) container will be created with all the
 needed settings for the file system etc (mounts .. and the like)

 1) the first task (container init) will call sys_restart with the 
 checkpoint
 image file.

 2) the code will verify the header, then read in the global section; 
 it will
 create a restart-context which will be referenced from the 
 container-object
 (one option we considered is to have the freezer-cgroup be that object).

 3) using the info from that section, it will create the task tree 
 (forest)
 to be restored. In particular, new tasks will be created and each will 
 end
 up in do_restart_task() inside the kernel.

 [note that in Zap, step 3 is still done in user space...]

 Since all tasks live in the container, they will all have access to the
 restart-context, through which all coordination is done.

 At first, the restart will be performed _one task at a time_, at the 
 order
 they were dumped. So while the init task restores itself, the remaining
 tasks sleep. When the init task finishes - it will wake the next in line
 and so on. The last one will wake the init task to finalize the work. So:

 4) each task waits (sleeps) until it is prompted to restore its own 
 state.
 When it completes, it wakes up the next task in line and goes to a freeze
 state.

 5) the init task finalized the restart, and either completes the 
 freeze or
 unfreezes the container, depending on what the user requested.

 This scheme makes sense because we assume that the data is streamed. 
 So it
 does not make much sense to try to restart the 5th job before the 2nd job
 because the data isn't there yet. Moreover, if they refer to the same 
 shared
 object, job#5 will have to wait to job#2 to create the object, since its
 state was saved with that job.

 In the future, to speed the process by concurrent restarting multiple 
 tasks,
 we'll have to read in data from the stream into a buffer (read-ahead) and
 then restarting tasks could skip data that doesn't belongs to them; while
 they may still 

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Oren Laadan


Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 11:09:54AM -0400, Oren Laadan wrote:

 Louis Rilling wrote:
 On Wed, Jul 30, 2008 at 06:20:32PM -0400, Oren Laadan wrote:
 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 +int do_checkpoint(struct cr_ctx *ctx)
 +{
 +	int ret;
 +
 +	/* FIX: need to test whether container is checkpointable */
 +
 +	ret = cr_write_hdr(ctx);
 +	if (!ret)
 +		ret = cr_write_task(ctx, current);
 +	if (!ret)
 +		ret = cr_write_tail(ctx);
 +
 +	/* on success, return (unique) checkpoint identifier */
 +	if (!ret)
 +		ret = ctx->crid;
 Does this crid have a purpose?
 yes, at least three; all are for the future, but important to set the
 meaning of the return value of the syscall already now. The crid is
 the CR-identifier that identifies the checkpoint. Every checkpoint is
 assigned a unique number (using an atomic counter).

 1) if a checkpoint is taken and kept in memory (instead of to a file) then
 this will be the identifier with which the restart (or cleanup) would refer
 to the (in memory) checkpoint image

 2) to reduce downtime of the checkpoint, data will be aggregated on the
 checkpoint context, as well as referenced to (cow-ed) pages. This data can
 persist between calls to sys_checkpoint(), and the 'crid', again, will be
 used to identify the (in-memory-to-be-dumped-to-storage) context.

 3) for incremental checkpoint (where a successive checkpoint will only
 save what has changed since the previous checkpoint) there will be a need
 to identify the previous checkpoints (to be able to know where to take
 data from during restart). Again, a 'crid' is handy.

 [in fact, for the 3rd use, it will make sense to write that number as
 part of the checkpoint image header]

 Note that by doing so, a process that checkpoints itself (in its own
 context), can use code that is similar to the logic of fork():

	...
	crid = checkpoint(...);
	switch (crid) {
	case -1:
		perror("checkpoint failed");
		break;
	default:
		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
		/* proceed with execution after checkpoint */
		...
		break;
	case 0:
		fprintf(stderr, "returned after restart\n");
		/* proceed with action required following a restart */
		...
		break;
	}
	...
 If I understand correctly, this crid can live for quite a long time. So 
 many of
 them could be generated while some container would accumulate incremental
 checkpoints on, say crid 5, and possibly crid 5 could be reused for another
 unrelated checkpoint during that time. This brings the issue of allocating 
 crids
 reliably (using something like a pidmap for instance). Moreover, if such 
 ids are
 exposed to userspace, we need to remember which ones are allocated across
 reboots and migrations.

 I'm afraid that this becomes too complex...
 And I'm afraid I didn't explain myself well. So let me rephrase:

 CRIDs are always _local_ to a specific node. The local CRID counter is
 bumped (atomically) with each checkpoint attempt. The main use case is
 for when the checkpoint is kept in memory either shortly (until it is
 written back to disk) or for a longer time (use-cases that want to keep
 it there). It only remains valid as long as the checkpoint image is
 still in memory and has not been committed to storage/network. Think
 of it as a way to identify the operation instance.

 So they can live quite a long time, but only as long as the original
 node is still alive and the checkpoint is still kept in memory. They
 are meaningless across reboots and migrations. I don't think a wrap
 around is a concern, but we can use 64 bit if that is the case.

 Finally, the incremental checkpoint use-case: imagine a container that
 is checkpointed regularly every minutes. The first checkpoint will be
 a full checkpoint, say CRID=1. The second will be incremental with
 respect to the first, with CRID=2, and so on for the third and the fourth.
 Userspace could use these CRID to name the image files (for example,
 app.img.CRID). Assume that we decide (big if) that the convention is
 that the last part of the filename must be the CRID, and if we decide
 (another big if) to save the CRID as part of the checkpoint image --
 the part that describe the incremental nature of a new checkpoint.
 (That part would specify where to get state that wasn't really saved
 in the new checkpoint but instead can be retrieved from older ones).
 If that was the case, then the logic in the kernel would be fairly simple:
 find (and access) the actual files that hold the data. Note that
 in this case the CRIDs are guaranteed to be unique per series of
 incremental checkpoints, and an incremental checkpoint is meaningless
 across reboots (and we can require that across migration too).
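
 (Purely as an illustration of that naming convention, not from any patch;
 the __NR_checkpoint usage is borrowed from the ckpt example earlier in the
 thread, and the snippet needs <stdio.h>, <fcntl.h>, <unistd.h> and
 <sys/syscall.h>:)

 int checkpoint_to_named_image(const char *base)
 {
 	char tmp[256], name[256];
 	int fd, crid;

 	snprintf(tmp, sizeof(tmp), "%s.img.tmp", base);
 	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
 	if (fd < 0)
 		return -1;

 	crid = syscall(__NR_checkpoint, getpid(), fd, 0);
 	close(fd);
 	if (crid <= 0)
 		return -1;

 	/* name the image by its crid so later (incremental) checkpoints
 	 * can refer back to it */
 	snprintf(name, sizeof(name), "%s.img.%d", base, crid);
 	return rename(tmp, name);
 }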
 
 Letting the kernel guess where to find the missing data of an 

[Devel] Re: [RFC][PATCH 0/2] CR: save/restore a single, simple task

2008-07-31 Thread Daniel Lezcano
Oren Laadan wrote:
 
 Daniel Lezcano wrote:
 Oren Laadan wrote:
 Disclaimer: long reply :)

 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 In the recent mini-summit at OLS 2008 and the following days it was
 agreed to tackle the checkpoint/restart (CR) by beginning with a very
 simple case: save and restore a single task, with simple memory
 layout, disregarding other task state such as files, signals etc.

 Following these discussions I coded a prototype that can do exactly
 that, as a starter. This code adds two system calls - sys_checkpoint
 and sys_restart - that a task can call to save and restore its state
 respectively. It also demonstrates how the checkpoint image file can
 be formatted, as well as show its nested nature (e.g. cr_write_mm()
  -> cr_write_vma() nesting).

 The state that is saved/restored is the following:
 * some of the task_struct
 * some of the thread_struct and thread_info
 * the cpu state (including FPU)
 * the memory address space

 [The patch is against commit fb2e405fc1fc8b20d9c78eaa1c7fd5a297efde43
 of Linus's tree (uhhh.. don't ask why), but against tonight's head 
 too].

 In the current code, sys_checkpoint will checkpoint the current task,
 although the logic exists to checkpoint other tasks (not in the
 checkpointee's execution context). A simple loop will extend this to
 handle multiple processes. sys_restart restarts the current tasks, and
 with multiple tasks each task will call the syscall independently.
 I assume that approach worked in Zap, so there must be a simple solution
 to this, but I don't see how having each process in a container
 independently call sys_restart works for sharing.  Oh, or is that where
 The main reason to do that (and I thought openvz works similarly ?) is
 that I want to re-use as much as possible the existing kernel 
 functionality.
 Restart differs from checkpoint in that you have to construct new 
 resources
 as opposed to only inspect existing resources. To inspect - you only need
 a reference to the object and then to obtain its state by accessing 
 it. In
 contrast, to construct, you need to create a new resource.

 In almost all cases, creating a resource for a process is easiest if 
 done by
 the process itself. For instance - to restore the memory map, you want 
 the
 process that owns the target mm to call mmap() (in particular, the lower
 level and more convenient for us do_mmap_pgoff() function). If the 
 process
 that restores a given vma didn't own that mm, it would take much more 
 pain
 to build the vma into a foreign mm.

 Thus, there is a huge advantage of doing everything in-context of the 
 target
 process, that is - we can re-use the existing kernel code (and spirit) to
 create the resources, instead of having to hand-craft them carefully with
 specialized code.

 a 'container restart context' comes in?  An nsproxy has a pointer to a
 More or less. At a first approximation, this is how I envision it:

 0) in user space, a new (empty) container will be created with all the
 needed settings for the file system etc (mounts .. and the like)

 1) the first task (container init) will call sys_restart with the 
 checkpoint
 image file.

 2) the code will verify the header, then read in the global section; 
 it will
 create a restart-context which will be referenced from the 
 container-object
 (one option we considered is to have the freezer-cgroup be that object).

 3) using the info from that section, it will create the task tree 
 (forest)
 to be restored. In particular, new tasks will be created and each will 
 end
 up in do_restart_task() inside the kernel.

 [note that in Zap, step 3 is still done in user space...]

 Since all tasks live in the container, they will all have access to the
 restart-context, through which all coordination is done.

 At first, the restart will be performed _one task at a time_, at the 
 order
 they were dumped. So while the init task restores itself, the remaining
 tasks sleep. When the init task finishes - it will wake the next in line
 and so on. The last one will wake the init task to finalize the work. So:

 4) each task waits (sleeps) until it is prompted to restore its own 
 state.
 When it completes, it wakes up the next task in line and goes to a freeze
 state.

 5) the init task finalized the restart, and either completes the 
 freeze or
 unfreezes the container, depending on what the user requested.

 This scheme makes sense because we assume that the data is streamed. 
 So it
 does not make much sense to try to restart the 5th job before the 2nd job
 because the data isn't there yet. Moreover, if they refer to the same 
 shared
 object, job#5 will have to wait to job#2 to create the object, since its
 state was saved with that job.

 In the future, to speed the process by concurrent restarting multiple 
 tasks,
 we'll have to read in data from the stream into a buffer (read-ahead) and
 then restarting tasks could skip data that doesn't belongs to them; while
 

[Devel] [PATCH 1/1] namespaces: introduce sys_hijack (v11)

2008-07-31 Thread Serge E. Hallyn
Hi Pavel,

Here is the 'hijack' patch that was mentioned during the namespaces
part of the containers mini-summit.  It's a proposed way of entering
namespaces.

It's been rotting for awhile as you can see by the changelog, but
hopefully I updated it sufficiently and correctly.

-serge

From 9a7e1c11cd96435d0d27d28e4508f887d6dbf7ed Mon Sep 17 00:00:00 2001
From: Serge E. Hallyn [EMAIL PROTECTED]
Date: Thu, 10 Jul 2008 11:51:38 -0500
Subject: [PATCH 1/1] namespaces: introduce sys_hijack (v11)

Move most of do_fork() into a new do_fork_task() which acts on
a new argument, cgroup, rather than on current.  The original
process actually forks, and if passed a non-NULL cgroup, the
new process's cgroup and namespaces are taken from the specified
target cgroup.  If passed a NULL cgroup, fork
behaves exactly as before, thus do_fork() becomes a call to
do_fork_task(NULL, ...).

Introduce sys_hijack (for i386 and s390 only so far).  An open
fd for a cgroup 'tasks' file is specified.  The main purpose
is to allow entering an empty cgroup without having to keep a
task alive in the target cgroup.  Only the cgroup and nsproxy
are copied from the cgroup.  Security, user, and rootfs info
is not retained in the cgroups and so cannot be copied to the
child task.

In order to hijack a cgroup, you must have CAP_SYS_ADMIN and
be entering a descendant of your current cgroup.

The effect is a sort of namespace enter.  The following program
uses sys_hijack to 'enter' all namespaces of the specified
cgroup. For instance in one terminal, do

mount -t cgroup -o ns cgroup /cgroup
hostname
  qemu
ns_exec -u /bin/sh
  hostname serge
  echo $$
2996
  cat /proc/$$/cgroup
ns:/node_2996

In another terminal then do

hostname
  qemu
cat /proc/$$/cgroup
  ns:/
hijack /cgroup/node_2996/tasks
  hostname
serge
  cat /proc/$$/cgroup
ns:/node_2996

Changelog:
  Jul 31 2008:  Put fs_struct in ns_cgroup, and hijack it in
addition to the nsproxy.
  Jul 10 2008:  Port to recent -mm (cope with cgroup changes)
  Aug 23 2007:  send a stop signal to the hijacked process
(like ptrace does).
  Oct 09 2007:  Update for 2.6.23-rc8-mm2 (mainly pidns)
Don't take task_lock under rcu_read_lock
Send hijacked process to cgroup_fork() as
the first argument.
Removed some unneeded task_locks.
  Oct 16 2007:  Fix bug introduced into alloc_pid.
  Oct 16 2007:  Add 'int which' argument to sys_hijack to
allow later expansion to use cgroup in place
of pid to specify what to hijack.
  Oct 24 2007:  Implement hijack by open cgroup file.
  Nov 02 2007:  Switch copying of task info: do full copy
from current, then copy relevant pieces from
hijacked task.
  Nov 06 2007:  Verbatim task_struct copy now comes from current,
after which copy_hijackable_taskinfo() copies
relevant context pieces from the hijack source.
  Nov 07 2007:  Move arch-independent hijack code to kernel/fork.c
  Nov 07 2007:  powerpc and x86_64 support (Mark Nelson)
  Nov 07 2007:  Don't allow hijacking members of same session.
  Nov 07 2007:  introduce cgroup_may_hijack, and may_hijack hook to
cgroup subsystems.  The ns subsystem uses this to
enforce the rule that one may only hijack descendant
namespaces.
  Nov 07 2007:  s390 support
  Nov 08 2007:  don't send SIGSTOP to hijack source task
  Nov 10 2007:  cache reference to nsproxy in ns cgroup for use in

hijacking an empty cgroup.
  Nov 10 2007:  allow partial hijack of empty cgroup
  Nov 13 2007:  don't double-get cgroup for hijack_ns
find_css_set() actually returns the set with a
reference already held, so cgroup_fork_fromcgroup()
by doing a get_css_set() was getting a second
reference.  Therefore after exiting the hijack
task we could not rmdir the csgroup.
  Nov 22 2007:  temporarily remove x86_64 and powerpc support
  Nov 27 2007:  rebased on 2.6.24-rc3
  Jan 09 2008:  removed hijack pid and hijack cgroup options
  Jan 11 2008:  renamed cgroup_fork_fromcgroup() to be
cgroup_fork_into_cgroup()

==
hijack.c
==
 #include <stdio.h>
 #include <sys/types.h>
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <signal.h>
 #include <sys/wait.h>
 #include <stdlib.h>
 #include <unistd.h>

 #define __NR_hijack 333

/*
 *  hijack /cgroup/node_1078/tasks
 */

void usage(char *me)
{
	printf("Usage: %s cgroup_tasks_file\n", me);
	exit(1);
}

int exec_shell(void)
{
	execl("/bin/sh", "/bin/sh", NULL);
}

int main(int argc, char *argv[])
{
  

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Oren Laadan


Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 12:28:57PM -0400, Oren Laadan wrote:

 Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 11:09:54AM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Wed, Jul 30, 2008 at 06:20:32PM -0400, Oren Laadan wrote:
 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 +int do_checkpoint(struct cr_ctx *ctx)
 +{
 +  int ret;
 +
 +  /* FIX: need to test whether container is checkpointable */
 +
 +  ret = cr_write_hdr(ctx);
 +  if (!ret)
 +  ret = cr_write_task(ctx, current);
 +  if (!ret)
 +  ret = cr_write_tail(ctx);
 +
 +  /* on success, return (unique) checkpoint identifier */
 +  if (!ret)
 +		ret = ctx->crid;
 Does this crid have a purpose?
  yes, at least three; all are for the future, but important to set the
 meaning of the return value of the syscall already now. The crid is
 the CR-identifier that identifies the checkpoint. Every checkpoint is
 assigned a unique number (using an atomic counter).

 1) if a checkpoint is taken and kept in memory (instead of to a file) 
 then
 this will be the identifier with which the restart (or cleanup) would 
 refer
 to the (in memory) checkpoint image

 2) to reduce downtime of the checkpoint, data will be aggregated on the
 checkpoint context, as well as referenced to (cow-ed) pages. This data 
 can
 persist between calls to sys_checkpoint(), and the 'crid', again, will be
 used to identify the (in-memory-to-be-dumped-to-storage) context.

 3) for incremental checkpoint (where a successive checkpoint will only
 save what has changed since the previous checkpoint) there will be a need
 to identify the previous checkpoints (to be able to know where to take
 data from during restart). Again, a 'crid' is handy.

 [in fact, for the 3rd use, it will make sense to write that number as
 part of the checkpoint image header]

 Note that by doing so, a process that checkpoints itself (in its own
 context), can use code that is similar to the logic of fork():

  ...
  crid = checkpoint(...);
  switch (crid) {
  case -1:
   		perror("checkpoint failed");
  break;
  default:
   		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
  /* proceed with execution after checkpoint */
  ...
  break;
  case 0:
   		fprintf(stderr, "returned after restart\n");
  /* proceed with action required following a restart */
  ...
  break;
  }
  ...
 If I understand correctly, this crid can live for quite a long time. So 
 many of
 them could be generated while some container would accumulate incremental
 checkpoints on, say crid 5, and possibly crid 5 could be reused for 
 another
 unrelated checkpoint during that time. This brings the issue of 
 allocating crids
 reliably (using something like a pidmap for instance). Moreover, if such 
 ids are
  exposed to userspace, we need to remember which ones are allocated across
 reboots and migrations.

 I'm afraid that this becomes too complex...
 And I'm afraid I didn't explain myself well. So let me rephrase:

 CRIDs are always _local_ to a specific node. The local CRID counter is
 bumped (atomically) with each checkpoint attempt. The main use case is
  for when the checkpoint is kept in memory either shortly (until it is
 written back to disk) or for a longer time (use-cases that want to keep
 it there). It only remains valid as long as the checkpoint image is
  still in memory and has not been committed to storage/network. Think
 of it as a way to identify the operation instance.

 So they can live quite a long time, but only as long as the original
 node is still alive and the checkpoint is still kept in memory. They
 are meaningless across reboots and migrations. I don't think a wrap
 around is a concern, but we can use 64 bit if that is the case.

 Finally, the incremental checkpoint use-case: imagine a container that
 is checkpointed regularly every minutes. The first checkpoint will be
 a full checkpoint, say CRID=1. The second will be incremental with
  respect to the first, with CRID=2, and so on for the third and the fourth.
 Userspace could use these CRID to name the image files (for example,
 app.img.CRID). Assume that we decide (big if) that the convention is
 that the last part of the filename must be the CRID, and if we decide
 (another big if) to save the CRID as part of the checkpoint image --
 the part that describe the incremental nature of a new checkpoint.
 (That part would specify where to get state that wasn't really saved
 in the new checkpoint but instead can be retrieved from older ones).
  If that was the case, then the logic in the kernel would be fairly simple:
  find (and access) the actual files that hold the data. Note, that
 in this case - the CRID are guaranteed to be unique per series of
  incremental checkpoints, and incremental checkpoint is meaningless
 across reboots (and we can require that across migration too).
 Letting the kernel guess 

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-07-31 Thread Serge E. Hallyn
Quoting Oren Laadan ([EMAIL PROTECTED]):


 Serge E. Hallyn wrote:
 Quoting Oren Laadan ([EMAIL PROTECTED]):
 +int do_checkpoint(struct cr_ctx *ctx)
 +{
 +   int ret;
 +
 +   /* FIX: need to test whether container is checkpointable */
 +
 +   ret = cr_write_hdr(ctx);
 +   if (!ret)
 +   ret = cr_write_task(ctx, current);
 +   if (!ret)
 +   ret = cr_write_tail(ctx);
 +
 +   /* on success, return (unique) checkpoint identifier */
 +   if (!ret)
 +		ret = ctx->crid;

 Does this crid have a purpose?

  yes, at least three; all are for the future, but important to set the
 meaning of the return value of the syscall already now. The crid is
 the CR-identifier that identifies the checkpoint. Every checkpoint is
 assigned a unique number (using an atomic counter).

 1) if a checkpoint is taken and kept in memory (instead of to a file) then
 this will be the identifier with which the restart (or cleanup) would refer
 to the (in memory) checkpoint image

 2) to reduce downtime of the checkpoint, data will be aggregated on the
 checkpoint context, as well as referenced to (cow-ed) pages. This data can
 persist between calls to sys_checkpoint(), and the 'crid', again, will be
 used to identify the (in-memory-to-be-dumped-to-storage) context.

 3) for incremental checkpoint (where a successive checkpoint will only
 save what has changed since the previous checkpoint) there will be a need
 to identify the previous checkpoints (to be able to know where to take
 data from during restart). Again, a 'crid' is handy.

 [in fact, for the 3rd use, it will make sense to write that number as
 part of the checkpoint image header]

 Note that by doing so, a process that checkpoints itself (in its own
 context), can use code that is similar to the logic of fork():

   ...
   crid = checkpoint(...);
   switch (crid) {
   case -1:
    		perror("checkpoint failed");
   break;
   default:
    		fprintf(stderr, "checkpoint succeeded, CRID=%d\n", crid);
   /* proceed with execution after checkpoint */
   ...
   break;
   case 0:
    		fprintf(stderr, "returned after restart\n");
   /* proceed with action required following a restart */
   ...
   break;
   }
   ...

Thanks - for this and the later explanations in replies to Louis.

Really I had no doubt it had a purpose :)  but wasn't sure what it was.
Quite clear now.  Thanks.

-serge


Re: [Devel] [RFC][PATCH 0/2] CR: save/restore a single, simple task

2008-07-31 Thread Serge E. Hallyn
Quoting Andrey Mirkin ([EMAIL PROTECTED]):
 Hello Oren,
 
 That is great, that you have proposed your version of checkpointing/restart.
 In a few days I will send a patchset with OpenVZ checkpointing/restart.
 So, we will be able to compare our approaches and take the best parts from 
 both.

Excellent, looking forward to it!  Are you going to stick to the same
limitations as Oren did?  (I think it would be best)

-serge



[Devel] 2.6.26 panic (skb-dev==NULL), NFS support

2008-07-31 Thread Karel Tuma
Hello list,

I hope this is the right place to post.

I've recently moved to 2.6.26 git (I hope that's the bleeding edge) OpenVZ
to find out how usable it is. I'm running it on a box under fair IO load
(100-300 BIO/s). The thing panics in net/ipv4/tcp_ipv4.c:tcp_v4_send_ack()
once every couple of hours, during IO spikes. Apparently skb->dev is NULL,
inside an IRQ context on top of that. It's very hard to pinpoint the actual
trigger. Vanilla 2.6.26 runs just fine.

Treating the symptom, rather than the reason behind it, by testing
the pointer seems to work OK without disrupting any service. However,
it would be nice to figure out the real cause and avoid such a hideous hack.

NFS seems to work (mounting from CT0 so far) after minor changes to the
/proc interfacing (it was throwing ENOMEM). It's mounted over UDP, so it's
unlikely to be the cause of the TCP panic.

The oops, the ugly fix, and the NFS patches are attached.

With regards,

// Karel, http://leet.cz

diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index f2a092c..77b47de 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1502,7 +1502,7 @@ int __init nfs_fs_proc_init(void)
 {
struct proc_dir_entry *p;
 
-	proc_fs_nfs = proc_mkdir("fs/nfsfs", NULL);
+	proc_fs_nfs = proc_mkdir("fs/nfsfs", glob_proc_root);
if (!proc_fs_nfs)
goto error_0;
 
@@ -1536,7 +1536,7 @@ void nfs_fs_proc_exit(void)
 {
 	remove_proc_entry("volumes", proc_fs_nfs);
 	remove_proc_entry("servers", proc_fs_nfs);
-	remove_proc_entry("fs/nfsfs", NULL);
+	remove_proc_entry("fs/nfsfs", glob_proc_root);
 }
 
 #endif /* CONFIG_PROC_FS */

BUG: unable to handle kernel NULL pointer dereference at 03a8
IP: [804337a1] tcp_v4_send_ack+0xf1/0x170
PGD 712ca067 PUD 7eec8067 PMD 0
Oops:  [1]
CPU: 0
Modules linked in: simfs vzethdev vzmon vzdquota vzdev nfs lockd nfs_acl sunrpc 
dm_drivenc tun 8021q bridge llc ipv6 fuse dm_snapshot dm_mirror dm_log dm_mod 
netconsole loop evdev pcspkr serio_raw i2c_i801 sky2 e1000 [last unloaded: 
firmware_class]
Pid: 0, comm: swapper Not tainted 2.6.26 #6 036test001
RIP: 0010:[804337a1]  [804337a1] tcp_v4_send_ack+0xf1/0x170
RSP: 0018:805c1cd0  EFLAGS: 00010246
RAX:  RBX:  RCX: 0014
RDX: 805c1cd0 RSI: 81007efaa580 RDI: 
RBP: 810030a12800 R08: 16d0 R09: 0014
R10: 8100713ee434 R11: 5b139152 R12: 
R13: 81007efaa580 R14: 810073705340 R15: 8100711fa468
FS:  () GS:8057d000() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 03a8 CR3: 30b4 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process swapper (pid: 0, veid=0, threadinfo 80584000, task 
80543340)
Stack:  805c1cf0 0014 00082ddef414 
5b1391520604bd01 d016105095bdbbf6 8100 810073705340
 80437068 a00bd050 8000
Call Trace:
IRQ  [80437068] ? tcp_check_req+0x368/0x3f0
[a00bd050] ? :bridge:br_nf_pre_routing_finish+0x0/0x300
[80434988] ? tcp_v4_do_rcv+0x178/0x240
[80436795] ? tcp_v4_rcv+0x5c5/0x6c0
[8041839d] ? ip_local_deliver_finish+0x10d/0x1e0
[80418074] ? ip_rcv_finish+0x144/0x360
[804187cf] ? ip_rcv+0x23f/0x2e0
[803f5f17] ? netif_receive_skb+0x337/0x480
[803f8c8f] ? process_backlog+0x6f/0xd0
[803f882e] ? net_rx_action+0xce/0x180
[af86] ? :e1000:e1000_set_itr+0x86/0x150
[8022ceba] ? __do_softirq+0x8a/0x120
[8020c14c] ? call_softirq+0x1c/0x30
[8020e005] ? do_softirq+0x35/0x70
[8020e381] ? do_IRQ+0x61/0xc0
[80211920] ? mwait_idle+0x0/0x50
[8020b9a1] ? ret_from_intr+0x0/0xa
EOI  [804234c0] ? tcp_poll+0x0/0x200
[8021195e] ? mwait_idle+0x3e/0x50
[8020a103] ? cpu_idle+0x33/0x60
Code: 0c 13 41 10 11 d0 83 d0 00 48 85 ff 89 44 24 10 c7 44 24 14 08 00 00 00 
74 07 8b 47 04 89 44 24 18 48 8b 46 20 48 89 e2 44 89 c9 48 8b 80 a8 03 00 00 
48 8b b8 30 01 00 00 e8 0c 9f fe  RSP 805c1cd0
CR2: 03a8
Kernel panic - not syncing: Fatal exception
Rebooting in 1 seconds..md: bindsdi2

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ca6b5d3..f7571c1 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -689,6 +689,9 @@ static void tcp_v4_send_ack(struct tcp_timewait_sock *twsk,
 	if (twsk)
 		arg.bound_dev_if = twsk->tw_sk.tw_bound_dev_if;
 
+	if (!skb->dev) { printk("Hey, skb->dev is NULL, this is bad. I'll save you from panic this time ... How did you trigger this anyway?"); return; };
 	ip_send_reply(dev_net(skb->dev)->ipv4.tcp_sock, skb,
 		      &arg, arg.iov[0].iov_len);
 

[Devel] [PATCH 0/9] Bridging and ebtables in netns!

2008-07-31 Thread Alexey Dobriyan
Hi, I'm going to send this once networking people start accepting
new stuff.

Please try and test it on your favourite bridging setup; I'm currently
very limited in what I can test here.

To try ebtables, apply this patch first:


diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 292fa28..c4065b8 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -165,14 +165,6 @@ int nf_hook_slow(int pf, unsigned int hook, struct sk_buff 
*skb,
unsigned int verdict;
int ret = 0;
 
-#ifdef CONFIG_NET_NS
-   struct net *net;
-
-   net = indev == NULL ? dev_net(outdev) : dev_net(indev);
-	if (net != &init_net)
-   return 1;
-#endif
-
/* We may already have this, but read-locks nest anyway */
rcu_read_lock();
 
diff --git a/net/netfilter/nf_sockopt.c b/net/netfilter/nf_sockopt.c
index 0148968..aa01c54 100644
--- a/net/netfilter/nf_sockopt.c
+++ b/net/netfilter/nf_sockopt.c
@@ -65,9 +65,6 @@ static struct nf_sockopt_ops *nf_sockopt_find(struct sock 
*sk, int pf,
 {
struct nf_sockopt_ops *ops;
 
-	if (!net_eq(sock_net(sk), &init_net))
-   return ERR_PTR(-ENOPROTOOPT);
-
 	if (mutex_lock_interruptible(&nf_sockopt_mutex) != 0)
return ERR_PTR(-EINTR);
 



[Devel] [PATCH 1/9] netns bridge: netdevice part

2008-07-31 Thread Alexey Dobriyan
Allow creation of bridges in netns.

Bridge netdevice doesn't cross netns boundaries.
Additions and deletions are done with netdevices in the same netns.

Process notifications in netns, too.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 net/bridge/br_device.c   |3 ++-
 net/bridge/br_if.c   |   11 ++-
 net/bridge/br_ioctl.c|   20 ++--
 net/bridge/br_netlink.c  |   15 +--
 net/bridge/br_notify.c   |3 ---
 net/bridge/br_private.h  |4 ++--
 net/bridge/br_stp_bpdu.c |3 ---
 7 files changed, 25 insertions(+), 34 deletions(-)

--- a/net/bridge/br_device.c
+++ b/net/bridge/br_device.c
@@ -166,5 +166,6 @@ void br_dev_setup(struct net_device *dev)
 	dev->priv_flags = IFF_EBRIDGE;
 
 	dev->features = NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HIGHDMA |
-   NETIF_F_GSO_MASK | NETIF_F_NO_CSUM | NETIF_F_LLTX;
+   NETIF_F_GSO_MASK | NETIF_F_NO_CSUM | NETIF_F_LLTX |
+   NETIF_F_NETNS_LOCAL;
 }
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -168,7 +168,7 @@ static void del_br(struct net_bridge *br)
 	unregister_netdevice(br->dev);
 }
 
-static struct net_device *new_bridge_dev(const char *name)
+static struct net_device *new_bridge_dev(struct net *net, const char *name)
 {
struct net_bridge *br;
struct net_device *dev;
@@ -178,6 +178,7 @@ static struct net_device *new_bridge_dev(const char *name)
 
if (!dev)
return NULL;
+   dev_net_set(dev, net);
 
 	br = netdev_priv(dev);
 	br->dev = dev;
@@ -259,12 +260,12 @@ static struct net_bridge_port *new_nbp(struct net_bridge 
*br,
return p;
 }
 
-int br_add_bridge(const char *name)
+int br_add_bridge(struct net *net, const char *name)
 {
struct net_device *dev;
int ret;
 
-   dev = new_bridge_dev(name);
+   dev = new_bridge_dev(net, name);
if (!dev)
return -ENOMEM;
 
@@ -291,13 +292,13 @@ out_free:
goto out;
 }
 
-int br_del_bridge(const char *name)
+int br_del_bridge(struct net *net, const char *name)
 {
struct net_device *dev;
int ret = 0;
 
rtnl_lock();
-	dev = __dev_get_by_name(&init_net, name);
+   dev = __dev_get_by_name(net, name);
if (dev == NULL)
ret =  -ENXIO;  /* Could not find device */
 
--- a/net/bridge/br_ioctl.c
+++ b/net/bridge/br_ioctl.c
@@ -21,12 +21,12 @@
 #include "br_private.h"
 
 /* called with RTNL */
-static int get_bridge_ifindices(int *indices, int num)
+static int get_bridge_ifindices(struct net *net, int *indices, int num)
 {
struct net_device *dev;
int i = 0;
 
-	for_each_netdev(&init_net, dev) {
+	for_each_netdev(net, dev) {
 		if (i >= num)
 			break;
 		if (dev->priv_flags & IFF_EBRIDGE)
@@ -89,7 +89,7 @@ static int add_del_if(struct net_bridge *br, int ifindex, int 
isadd)
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 
-	dev = dev_get_by_index(&init_net, ifindex);
+	dev = dev_get_by_index(dev_net(br->dev), ifindex);
if (dev == NULL)
return -EINVAL;
 
@@ -309,7 +309,7 @@ static int old_dev_ioctl(struct net_device *dev, struct 
ifreq *rq, int cmd)
return -EOPNOTSUPP;
 }
 
-static int old_deviceless(void __user *uarg)
+static int old_deviceless(struct net *net, void __user *uarg)
 {
unsigned long args[3];
 
@@ -331,7 +331,7 @@ static int old_deviceless(void __user *uarg)
if (indices == NULL)
return -ENOMEM;
 
-   args[2] = get_bridge_ifindices(indices, args[2]);
+   args[2] = get_bridge_ifindices(net, indices, args[2]);
 
ret = copy_to_user((void __user *)args[1], indices, 
args[2]*sizeof(int))
? -EFAULT : args[2];
@@ -354,9 +354,9 @@ static int old_deviceless(void __user *uarg)
buf[IFNAMSIZ-1] = 0;
 
if (args[0] == BRCTL_ADD_BRIDGE)
-   return br_add_bridge(buf);
+   return br_add_bridge(net, buf);
 
-   return br_del_bridge(buf);
+   return br_del_bridge(net, buf);
}
}
 
@@ -368,7 +368,7 @@ int br_ioctl_deviceless_stub(struct net *net, unsigned int 
cmd, void __user *uar
switch (cmd) {
case SIOCGIFBR:
case SIOCSIFBR:
-   return old_deviceless(uarg);
+   return old_deviceless(net, uarg);
 
case SIOCBRADDBR:
case SIOCBRDELBR:
@@ -383,9 +383,9 @@ int br_ioctl_deviceless_stub(struct net *net, unsigned int 
cmd, void __user *uar
 
buf[IFNAMSIZ-1] = 0;
if (cmd == SIOCBRADDBR)
-   return br_add_bridge(buf);
+   return br_add_bridge(net, buf);
 
-   return br_del_bridge(buf);
+   return br_del_bridge(net, buf);
}

[Devel] [PATCH 2/9] netns bridge: remove bridges during netns stop

2008-07-31 Thread Alexey Dobriyan
Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 net/bridge/br.c |   22 --
 net/bridge/br_if.c  |4 ++--
 net/bridge/br_private.h |2 +-
 3 files changed, 19 insertions(+), 9 deletions(-)

--- a/net/bridge/br.c
+++ b/net/bridge/br.c
@@ -28,6 +28,10 @@ static const struct stp_proto br_stp_proto = {
.rcv= br_stp_rcv,
 };
 
+static struct pernet_operations br_net_ops = {
+   .exit   = br_net_exit,
+};
+
 static int __init br_init(void)
 {
int err;
@@ -42,18 +46,22 @@ static int __init br_init(void)
if (err)
goto err_out;
 
-   err = br_netfilter_init();
+	err = register_pernet_subsys(&br_net_ops);
if (err)
goto err_out1;
 
-	err = register_netdevice_notifier(&br_device_notifier);
+   err = br_netfilter_init();
if (err)
goto err_out2;
 
-   err = br_netlink_init();
+	err = register_netdevice_notifier(&br_device_notifier);
if (err)
goto err_out3;
 
+   err = br_netlink_init();
+   if (err)
+   goto err_out4;
+
brioctl_set(br_ioctl_deviceless_stub);
br_handle_frame_hook = br_handle_frame;
 
@@ -61,10 +69,12 @@ static int __init br_init(void)
br_fdb_put_hook = br_fdb_put;
 
return 0;
-err_out3:
+err_out4:
 	unregister_netdevice_notifier(&br_device_notifier);
-err_out2:
+err_out3:
br_netfilter_fini();
+err_out2:
+	unregister_pernet_subsys(&br_net_ops);
 err_out1:
br_fdb_fini();
 err_out:
@@ -80,7 +90,7 @@ static void __exit br_deinit(void)
 	unregister_netdevice_notifier(&br_device_notifier);
brioctl_set(NULL);
 
-   br_cleanup_bridges();
+	unregister_pernet_subsys(&br_net_ops);
 
synchronize_net();
 
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -443,13 +443,13 @@ int br_del_if(struct net_bridge *br, struct net_device 
*dev)
return 0;
 }
 
-void __exit br_cleanup_bridges(void)
+void br_net_exit(struct net *net)
 {
struct net_device *dev;
 
rtnl_lock();
 restart:
-   for_each_netdev(init_net, dev) {
+   for_each_netdev(net, dev) {
if (dev-priv_flags  IFF_EBRIDGE) {
del_br(dev-priv);
goto restart;
--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
@@ -176,7 +176,7 @@ extern void br_flood_forward(struct net_bridge *br, struct 
sk_buff *skb);
 extern void br_port_carrier_check(struct net_bridge_port *p);
 extern int br_add_bridge(struct net *net, const char *name);
 extern int br_del_bridge(struct net *net, const char *name);
-extern void br_cleanup_bridges(void);
+extern void br_net_exit(struct net *net);
 extern int br_add_if(struct net_bridge *br,
  struct net_device *dev);
 extern int br_del_if(struct net_bridge *br,
-- 
1.5.4.5




[Devel] [PATCH 3/9] netns ebtables: per-netns table list

2008-07-31 Thread Alexey Dobriyan
Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 include/net/net_namespace.h |4 
 include/net/netns/bridge.h  |9 +
 net/bridge/netfilter/ebtables.c |8 
 3 files changed, 17 insertions(+), 4 deletions(-)

--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -19,6 +19,7 @@
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 #include net/netns/conntrack.h
 #endif
+#include net/netns/bridge.h
 
 struct proc_dir_entry;
 struct net_device;
@@ -73,6 +74,9 @@ struct net {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
struct netns_ct ct;
 #endif
+#ifdef CONFIG_BRIDGE_NETFILTER
+   struct netns_br br;
+#endif
 #endif
struct net_generic  *gen;
 };
new file mode 100644
--- /dev/null
+++ b/include/net/netns/bridge.h
@@ -0,0 +1,9 @@
+#ifndef __NETNS_BRIDGE_H
+#define __NETNS_BRIDGE_H
+
+#include linux/list.h
+
+struct netns_br {
+   struct list_headebt_tables;
+};
+#endif
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -54,7 +54,6 @@
 
 
 static DEFINE_MUTEX(ebt_mutex);
-static LIST_HEAD(ebt_tables);
 static LIST_HEAD(ebt_targets);
 static LIST_HEAD(ebt_matches);
 static LIST_HEAD(ebt_watchers);
@@ -309,7 +308,7 @@ find_inlist_lock(struct list_head *head, const char *name, 
const char *prefix,
 static inline struct ebt_table *
 find_table_lock(const char *name, int *error, struct mutex *mutex)
 {
-   return find_inlist_lock(ebt_tables, name, ebtable_, error, mutex);
+   return find_inlist_lock(init_net.br.ebt_tables, name, ebtable_, 
error, mutex);
 }
 
 static inline struct ebt_match *
@@ -1209,7 +1208,7 @@ int ebt_register_table(struct ebt_table *table)
if (ret != 0)
goto free_chainstack;
 
-   list_for_each_entry(t, ebt_tables, list) {
+   list_for_each_entry(t, init_net.br.ebt_tables, list) {
if (strcmp(t-name, table-name) == 0) {
ret = -EEXIST;
BUGPRINT(Table name already exists\n);
@@ -1222,7 +1221,7 @@ int ebt_register_table(struct ebt_table *table)
ret = -ENOENT;
goto free_unlock;
}
-   list_add(table-list, ebt_tables);
+   list_add(table-list, init_net.br.ebt_tables);
mutex_unlock(ebt_mutex);
return 0;
 free_unlock:
@@ -1523,6 +1522,7 @@ static int __init ebtables_init(void)
mutex_unlock(ebt_mutex);
if ((ret = nf_register_sockopt(ebt_sockopts))  0)
return ret;
+   INIT_LIST_HEAD(init_net.br.ebt_tables);
 
printk(KERN_INFO Ebtables v2.0 registered\n);
return 0;
-- 
1.5.4.5




[Devel] [PATCH 4/9] netns ebtables: per-netns ebtables itself

2008-07-31 Thread Alexey Dobriyan
Register ebtable in netns.

To do this, duplicate the table at the very beginning of registration. This is
done so we won't add one table to the table list of netns #1 and then overwrite
its list_head when adding it to the table list of netns #2, and so on.

P.S.: The addition of underscored variables is temporary and will go away in a
patch or two, once the corresponding module itself is made netns-ready.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 include/linux/netfilter_bridge/ebtables.h |2 
 net/bridge/netfilter/ebtable_broute.c |   19 +++-
 net/bridge/netfilter/ebtable_filter.c |   17 +++
 net/bridge/netfilter/ebtable_nat.c|   19 
 net/bridge/netfilter/ebtables.c   |   69 +-
 5 files changed, 78 insertions(+), 48 deletions(-)
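
For illustration only, a rough sketch of the duplication idea described in the
changelog above. This is not the actual patch: it assumes a kmemdup()-based
copy and omits the copying of the table data, hooks, and error handling that
the real ebt_register_table() in this series has to deal with.

#include <linux/err.h>
#include <linux/list.h>
#include <linux/slab.h>
#include <net/net_namespace.h>
#include <linux/netfilter_bridge/ebtables.h>

struct ebt_table *ebt_register_table(struct net *net, struct ebt_table *input_table)
{
	struct ebt_table *table;

	/* Each netns gets its own copy, so the list linkage never clashes. */
	table = kmemdup(input_table, sizeof(*table), GFP_KERNEL);
	if (table == NULL)
		return ERR_PTR(-ENOMEM);

	mutex_lock(&ebt_mutex);
	list_add(&table->list, &net->br.ebt_tables);
	mutex_unlock(&ebt_mutex);

	return table;
}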

--- a/include/linux/netfilter_bridge/ebtables.h
+++ b/include/linux/netfilter_bridge/ebtables.h
@@ -286,7 +286,7 @@ struct ebt_table
 
 #define EBT_ALIGN(s) (((s) + (__alignof__(struct ebt_replace)-1))  \
 ~(__alignof__(struct ebt_replace)-1))
-extern int ebt_register_table(struct ebt_table *table);
+extern struct ebt_table *ebt_register_table(struct net *net, struct ebt_table 
*table);
 extern void ebt_unregister_table(struct ebt_table *table);
 extern int ebt_register_match(struct ebt_match *match);
 extern void ebt_unregister_match(struct ebt_match *match);
--- a/net/bridge/netfilter/ebtable_broute.c
+++ b/net/bridge/netfilter/ebtable_broute.c
@@ -41,22 +41,23 @@ static int check(const struct ebt_table_info *info, 
unsigned int valid_hooks)
return 0;
 }
 
-static struct ebt_table broute_table =
+static struct ebt_table __broute_table =
 {
.name   = broute,
.table  = initial_table,
.valid_hooks= 1  NF_BR_BROUTING,
-   .lock   = __RW_LOCK_UNLOCKED(broute_table.lock),
+   .lock   = __RW_LOCK_UNLOCKED(__broute_table.lock),
.check  = check,
.me = THIS_MODULE,
 };
+static struct ebt_table *broute_table;
 
 static int ebt_broute(struct sk_buff *skb)
 {
int ret;
 
ret = ebt_do_table(NF_BR_BROUTING, skb, skb-dev, NULL,
-  broute_table);
+  broute_table);
if (ret == NF_DROP)
return 1; /* route it */
return 0; /* bridge it */
@@ -64,21 +65,19 @@ static int ebt_broute(struct sk_buff *skb)
 
 static int __init ebtable_broute_init(void)
 {
-   int ret;
-
-   ret = ebt_register_table(broute_table);
-   if (ret  0)
-   return ret;
+   broute_table = ebt_register_table(init_net, __broute_table);
+   if (IS_ERR(broute_table))
+   return PTR_ERR(broute_table);
/* see br_input.c */
rcu_assign_pointer(br_should_route_hook, ebt_broute);
-   return ret;
+   return 0;
 }
 
 static void __exit ebtable_broute_fini(void)
 {
rcu_assign_pointer(br_should_route_hook, NULL);
synchronize_net();
-   ebt_unregister_table(broute_table);
+   ebt_unregister_table(broute_table);
 }
 
 module_init(ebtable_broute_init);
--- a/net/bridge/netfilter/ebtable_filter.c
+++ b/net/bridge/netfilter/ebtable_filter.c
@@ -50,21 +50,22 @@ static int check(const struct ebt_table_info *info, 
unsigned int valid_hooks)
return 0;
 }
 
-static struct ebt_table frame_filter =
+static struct ebt_table __frame_filter =
 {
.name   = filter,
.table  = initial_table,
.valid_hooks= FILTER_VALID_HOOKS,
-   .lock   = __RW_LOCK_UNLOCKED(frame_filter.lock),
+   .lock   = __RW_LOCK_UNLOCKED(__frame_filter.lock),
.check  = check,
.me = THIS_MODULE,
 };
+static struct ebt_table *frame_filter;
 
 static unsigned int
 ebt_hook(unsigned int hook, struct sk_buff *skb, const struct net_device *in,
const struct net_device *out, int (*okfn)(struct sk_buff *))
 {
-   return ebt_do_table(hook, skb, in, out, frame_filter);
+   return ebt_do_table(hook, skb, in, out, frame_filter);
 }
 
 static struct nf_hook_ops ebt_ops_filter[] __read_mostly = {
@@ -95,19 +96,19 @@ static int __init ebtable_filter_init(void)
 {
int ret;
 
-   ret = ebt_register_table(frame_filter);
-   if (ret  0)
-   return ret;
+   frame_filter = ebt_register_table(init_net, __frame_filter);
+   if (IS_ERR(frame_filter))
+   return PTR_ERR(frame_filter);
ret = nf_register_hooks(ebt_ops_filter, ARRAY_SIZE(ebt_ops_filter));
if (ret  0)
-   ebt_unregister_table(frame_filter);
+   ebt_unregister_table(frame_filter);
return ret;
 }
 
 static void __exit ebtable_filter_fini(void)
 {
nf_unregister_hooks(ebt_ops_filter, ARRAY_SIZE(ebt_ops_filter));
-   ebt_unregister_table(frame_filter);
+   ebt_unregister_table(frame_filter);
 }
 
 module_init(ebtable_filter_init);
--- a/net/bridge/netfilter/ebtable_nat.c
+++ 

[Devel] [PATCH 5/9] netns ebtables: netns-aware broute table

2008-07-31 Thread Alexey Dobriyan
Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 include/net/netns/bridge.h|1 
 net/bridge/netfilter/ebtable_broute.c |   35 ++
 2 files changed, 28 insertions(+), 8 deletions(-)

--- a/include/net/netns/bridge.h
+++ b/include/net/netns/bridge.h
@@ -5,5 +5,6 @@
 
 struct netns_br {
struct list_headebt_tables;
+   struct ebt_table *broute_table;
 };
 #endif
--- a/net/bridge/netfilter/ebtable_broute.c
+++ b/net/bridge/netfilter/ebtable_broute.c
@@ -41,33 +41,52 @@ static int check(const struct ebt_table_info *info, 
unsigned int valid_hooks)
return 0;
 }
 
-static struct ebt_table __broute_table =
+static struct ebt_table broute_table =
 {
.name   = broute,
.table  = initial_table,
.valid_hooks= 1  NF_BR_BROUTING,
-   .lock   = __RW_LOCK_UNLOCKED(__broute_table.lock),
+   .lock   = __RW_LOCK_UNLOCKED(broute_table.lock),
.check  = check,
.me = THIS_MODULE,
 };
-static struct ebt_table *broute_table;
 
 static int ebt_broute(struct sk_buff *skb)
 {
int ret;
 
ret = ebt_do_table(NF_BR_BROUTING, skb, skb-dev, NULL,
-  broute_table);
+  dev_net(skb-dev)-br.broute_table);
if (ret == NF_DROP)
return 1; /* route it */
return 0; /* bridge it */
 }
 
+static int ebtable_broute_net_init(struct net *net)
+{
+   net-br.broute_table = ebt_register_table(net, broute_table);
+   if (IS_ERR(net-br.broute_table))
+   return PTR_ERR(net-br.broute_table);
+   return 0;
+}
+
+static void ebtable_broute_net_exit(struct net *net)
+{
+   ebt_unregister_table(net-br.broute_table);
+}
+
+static struct pernet_operations ebtable_broute_net_ops = {
+   .init = ebtable_broute_net_init,
+   .exit = ebtable_broute_net_exit,
+};
+
 static int __init ebtable_broute_init(void)
 {
-   broute_table = ebt_register_table(init_net, __broute_table);
-   if (IS_ERR(broute_table))
-   return PTR_ERR(broute_table);
+   int ret;
+
+   ret = register_pernet_subsys(ebtable_broute_net_ops);
+   if (ret  0)
+   return ret;
/* see br_input.c */
rcu_assign_pointer(br_should_route_hook, ebt_broute);
return 0;
@@ -77,7 +96,7 @@ static void __exit ebtable_broute_fini(void)
 {
rcu_assign_pointer(br_should_route_hook, NULL);
synchronize_net();
-   ebt_unregister_table(broute_table);
+   unregister_pernet_subsys(ebtable_broute_net_ops);
 }
 
 module_init(ebtable_broute_init);
-- 
1.5.4.5




[Devel] [PATCH 7/9] netns ebtables: netns-aware nat table

2008-07-31 Thread Alexey Dobriyan
Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 include/net/netns/bridge.h |1 
 net/bridge/netfilter/ebtable_nat.c |   47 +
 2 files changed, 33 insertions(+), 15 deletions(-)

--- a/include/net/netns/bridge.h
+++ b/include/net/netns/bridge.h
@@ -7,5 +7,6 @@ struct netns_br {
struct list_headebt_tables;
struct ebt_table *broute_table;
struct ebt_table *frame_filter;
+   struct ebt_table *frame_nat;
 };
 #endif
--- a/net/bridge/netfilter/ebtable_nat.c
+++ b/net/bridge/netfilter/ebtable_nat.c
@@ -50,48 +50,47 @@ static int check(const struct ebt_table_info *info, 
unsigned int valid_hooks)
return 0;
 }
 
-static struct ebt_table __frame_nat =
+static struct ebt_table frame_nat =
 {
.name   = nat,
.table  = initial_table,
.valid_hooks= NAT_VALID_HOOKS,
-   .lock   = __RW_LOCK_UNLOCKED(__frame_nat.lock),
+   .lock   = __RW_LOCK_UNLOCKED(frame_nat.lock),
.check  = check,
.me = THIS_MODULE,
 };
-static struct ebt_table *frame_nat;
 
 static unsigned int
-ebt_nat_dst(unsigned int hook, struct sk_buff *skb, const struct net_device *in
+ebt_nat_pre_routing(unsigned int hook, struct sk_buff *skb, const struct 
net_device *in
, const struct net_device *out, int (*okfn)(struct sk_buff *))
 {
-   return ebt_do_table(hook, skb, in, out, frame_nat);
+   return ebt_do_table(hook, skb, in, out, dev_net(in)-br.frame_nat);
 }
 
 static unsigned int
-ebt_nat_src(unsigned int hook, struct sk_buff *skb, const struct net_device *in
+ebt_nat_out(unsigned int hook, struct sk_buff *skb, const struct net_device *in
, const struct net_device *out, int (*okfn)(struct sk_buff *))
 {
-   return ebt_do_table(hook, skb, in, out, frame_nat);
+   return ebt_do_table(hook, skb, in, out, dev_net(out)-br.frame_nat);
 }
 
 static struct nf_hook_ops ebt_ops_nat[] __read_mostly = {
{
-   .hook   = ebt_nat_dst,
+   .hook   = ebt_nat_out,
.owner  = THIS_MODULE,
.pf = PF_BRIDGE,
.hooknum= NF_BR_LOCAL_OUT,
.priority   = NF_BR_PRI_NAT_DST_OTHER,
},
{
-   .hook   = ebt_nat_src,
+   .hook   = ebt_nat_out,
.owner  = THIS_MODULE,
.pf = PF_BRIDGE,
.hooknum= NF_BR_POST_ROUTING,
.priority   = NF_BR_PRI_NAT_SRC,
},
{
-   .hook   = ebt_nat_dst,
+   .hook   = ebt_nat_pre_routing,
.owner  = THIS_MODULE,
.pf = PF_BRIDGE,
.hooknum= NF_BR_PRE_ROUTING,
@@ -99,23 +98,41 @@ static struct nf_hook_ops ebt_ops_nat[] __read_mostly = {
},
 };
 
+static int frame_nat_net_init(struct net *net)
+{
+   net-br.frame_nat = ebt_register_table(net, frame_nat);
+   if (IS_ERR(net-br.frame_nat))
+   return PTR_ERR(net-br.frame_nat);
+   return 0;
+}
+
+static void frame_nat_net_exit(struct net *net)
+{
+   ebt_unregister_table(net-br.frame_nat);
+}
+
+static struct pernet_operations frame_nat_net_ops = {
+   .init = frame_nat_net_init,
+   .exit = frame_nat_net_exit,
+};
+
 static int __init ebtable_nat_init(void)
 {
int ret;
 
-   frame_nat = ebt_register_table(init_net, __frame_nat);
-   if (IS_ERR(frame_nat))
-   return PTR_ERR(frame_nat);
+   ret = register_pernet_subsys(frame_nat_net_ops);
+   if (ret  0)
+   return ret;
ret = nf_register_hooks(ebt_ops_nat, ARRAY_SIZE(ebt_ops_nat));
if (ret  0)
-   ebt_unregister_table(frame_nat);
+   unregister_pernet_subsys(frame_nat_net_ops);
return ret;
 }
 
 static void __exit ebtable_nat_fini(void)
 {
nf_unregister_hooks(ebt_ops_nat, ARRAY_SIZE(ebt_ops_nat));
-   ebt_unregister_table(frame_nat);
+   unregister_pernet_subsys(frame_nat_net_ops);
 }
 
 module_init(ebtable_nat_init);
-- 
1.5.4.5




[Devel] [PATCH 6/9] netns ebtables: netns-aware filter table

2008-07-31 Thread Alexey Dobriyan
Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 include/net/netns/bridge.h|1 
 net/bridge/netfilter/ebtable_filter.c |   50 +-
 2 files changed, 38 insertions(+), 13 deletions(-)

--- a/include/net/netns/bridge.h
+++ b/include/net/netns/bridge.h
@@ -6,5 +6,6 @@
 struct netns_br {
struct list_headebt_tables;
struct ebt_table *broute_table;
+   struct ebt_table *frame_filter;
 };
 #endif
--- a/net/bridge/netfilter/ebtable_filter.c
+++ b/net/bridge/netfilter/ebtable_filter.c
@@ -50,41 +50,47 @@ static int check(const struct ebt_table_info *info, 
unsigned int valid_hooks)
return 0;
 }
 
-static struct ebt_table __frame_filter =
+static struct ebt_table frame_filter =
 {
.name   = filter,
.table  = initial_table,
.valid_hooks= FILTER_VALID_HOOKS,
-   .lock   = __RW_LOCK_UNLOCKED(__frame_filter.lock),
+   .lock   = __RW_LOCK_UNLOCKED(frame_filter.lock),
.check  = check,
.me = THIS_MODULE,
 };
-static struct ebt_table *frame_filter;
 
 static unsigned int
-ebt_hook(unsigned int hook, struct sk_buff *skb, const struct net_device *in,
+ebt_in_hook(unsigned int hook, struct sk_buff *skb, const struct net_device 
*in,
const struct net_device *out, int (*okfn)(struct sk_buff *))
 {
-   return ebt_do_table(hook, skb, in, out, frame_filter);
+   return ebt_do_table(hook, skb, in, out, dev_net(in)-br.frame_filter);
+}
+
+static unsigned int
+ebt_out_hook(unsigned int hook, struct sk_buff *skb, const struct net_device 
*in,
+   const struct net_device *out, int (*okfn)(struct sk_buff *))
+{
+   return ebt_do_table(hook, skb, in, out, dev_net(out)-br.frame_filter);
 }
 
 static struct nf_hook_ops ebt_ops_filter[] __read_mostly = {
{
-   .hook   = ebt_hook,
+   .hook   = ebt_in_hook,
.owner  = THIS_MODULE,
.pf = PF_BRIDGE,
.hooknum= NF_BR_LOCAL_IN,
.priority   = NF_BR_PRI_FILTER_BRIDGED,
},
{
-   .hook   = ebt_hook,
+   .hook   = ebt_in_hook,
.owner  = THIS_MODULE,
.pf = PF_BRIDGE,
.hooknum= NF_BR_FORWARD,
.priority   = NF_BR_PRI_FILTER_BRIDGED,
},
{
-   .hook   = ebt_hook,
+   .hook   = ebt_out_hook,
.owner  = THIS_MODULE,
.pf = PF_BRIDGE,
.hooknum= NF_BR_LOCAL_OUT,
@@ -92,23 +98,41 @@ static struct nf_hook_ops ebt_ops_filter[] __read_mostly = {
},
 };
 
+static int frame_filter_net_init(struct net *net)
+{
+   net-br.frame_filter = ebt_register_table(net, frame_filter);
+   if (IS_ERR(net-br.frame_filter))
+   return PTR_ERR(net-br.frame_filter);
+   return 0;
+}
+
+static void frame_filter_net_exit(struct net *net)
+{
+   ebt_unregister_table(net-br.frame_filter);
+}
+
+static struct pernet_operations frame_filter_net_ops = {
+   .init = frame_filter_net_init,
+   .exit = frame_filter_net_exit,
+};
+
 static int __init ebtable_filter_init(void)
 {
int ret;
 
-   frame_filter = ebt_register_table(init_net, __frame_filter);
-   if (IS_ERR(frame_filter))
-   return PTR_ERR(frame_filter);
+   ret = register_pernet_subsys(frame_filter_net_ops);
+   if (ret  0)
+   return ret;
ret = nf_register_hooks(ebt_ops_filter, ARRAY_SIZE(ebt_ops_filter));
if (ret  0)
-   ebt_unregister_table(frame_filter);
+   unregister_pernet_subsys(frame_filter_net_ops);
return ret;
 }
 
 static void __exit ebtable_filter_fini(void)
 {
nf_unregister_hooks(ebt_ops_filter, ARRAY_SIZE(ebt_ops_filter));
-   ebt_unregister_table(frame_filter);
+   unregister_pernet_subsys(frame_filter_net_ops);
 }
 
 module_init(ebtable_filter_init);
-- 
1.5.4.5




[Devel] [PATCH 8/9] netns ebtables: deal with fake netdevices et al

2008-07-31 Thread Alexey Dobriyan
The bridge netfilter code uses a fake netdevice and a fake rtable. Fake means
a static struct net_device, so logically these should be created in the
bridge's portion of struct netns.

But!

Adding a struct net_device __fake_net_device there creates a circular header
dependency, which is a PITA to resolve. I couldn't, so the fake netdevice and
fake rtable are created dynamically instead. :-(

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 include/net/netns/bridge.h |   11 -
 net/bridge/br_netfilter.c  |   88 -
 2 files changed, 71 insertions(+), 28 deletions(-)

--- a/include/net/netns/bridge.h
+++ b/include/net/netns/bridge.h
@@ -3,10 +3,15 @@
 
 #include linux/list.h
 
+struct net_device;
+struct rtable;
+
 struct netns_br {
struct list_headebt_tables;
-   struct ebt_table *broute_table;
-   struct ebt_table *frame_filter;
-   struct ebt_table *frame_nat;
+   struct ebt_table*broute_table;
+   struct ebt_table*frame_filter;
+   struct ebt_table*frame_nat;
+   struct net_device   *__fake_net_device;
+   struct rtable   *__fake_rtable;
 };
 #endif
--- a/net/bridge/br_netfilter.c
+++ b/net/bridge/br_netfilter.c
@@ -109,24 +109,47 @@ static inline __be16 pppoe_proto(const struct sk_buff 
*skb)
  * refragmentation needs it, and the rt_flags entry because
  * ipt_REJECT needs it.  Future netfilter modules might
  * require us to fill additional fields. */
-static struct net_device __fake_net_device = {
-   .hard_header_len= ETH_HLEN,
+static int br_netfilter_net_init(struct net *net)
+{
+   net-br.__fake_net_device = kmalloc(sizeof(struct net_device), 
GFP_KERNEL);
+   if (!net-br.__fake_net_device)
+   return -ENOMEM;
+   net-br.__fake_rtable = kmalloc(sizeof(struct rtable), GFP_KERNEL);
+   if (!net-br.__fake_rtable) {
+   kfree(net-br.__fake_net_device);
+   return -ENOMEM;
+   }
+
+   *net-br.__fake_net_device = (struct net_device) {
+   .hard_header_len= ETH_HLEN,
 #ifdef CONFIG_NET_NS
-   .nd_net = init_net,
+   .nd_net = net,
 #endif
-};
+   };
+   *net-br.__fake_rtable = (struct rtable) {
+   .u = {
+   .dst = {
+   .__refcnt   = ATOMIC_INIT(1),
+   .dev= net-br.__fake_net_device,
+   .path   = net-br.__fake_rtable-u.dst,
+   .metrics= {[RTAX_MTU - 1] = 1500},
+   .flags  = DST_NOXFRM,
+   }
+   },
+   .rt_flags   = 0,
+   };
+   return 0;
+}
 
-static struct rtable __fake_rtable = {
-   .u = {
-   .dst = {
-   .__refcnt   = ATOMIC_INIT(1),
-   .dev= __fake_net_device,
-   .path   = __fake_rtable.u.dst,
-   .metrics= {[RTAX_MTU - 1] = 1500},
-   .flags  = DST_NOXFRM,
-   }
-   },
-   .rt_flags   = 0,
+static void br_netfilter_net_exit(struct net *net)
+{
+   kfree(net-br.__fake_rtable);
+   kfree(net-br.__fake_net_device);
+}
+
+static struct pernet_operations br_netfilter_net_ops = {
+   .init = br_netfilter_net_init,
+   .exit = br_netfilter_net_exit,
 };
 
 static inline struct net_device *bridge_parent(const struct net_device *dev)
@@ -218,6 +241,7 @@ int nf_bridge_copy_header(struct sk_buff *skb)
  * bridge PRE_ROUTING hook. */
 static int br_nf_pre_routing_finish_ipv6(struct sk_buff *skb)
 {
+   struct net *net = dev_net(skb-dev);
struct nf_bridge_info *nf_bridge = skb-nf_bridge;
 
if (nf_bridge-mask  BRNF_PKT_TYPE) {
@@ -226,8 +250,8 @@ static int br_nf_pre_routing_finish_ipv6(struct sk_buff 
*skb)
}
nf_bridge-mask ^= BRNF_NF_BRIDGE_PREROUTING;
 
-   skb-rtable = __fake_rtable;
-   dst_hold(__fake_rtable.u.dst);
+   skb-rtable = net-br.__fake_rtable;
+   dst_hold(net-br.__fake_rtable-u.dst);
 
skb-dev = nf_bridge-physindev;
nf_bridge_push_encap_header(skb);
@@ -323,6 +347,7 @@ static int br_nf_pre_routing_finish_bridge(struct sk_buff 
*skb)
 static int br_nf_pre_routing_finish(struct sk_buff *skb)
 {
struct net_device *dev = skb-dev;
+   struct net *net = dev_net(dev);
struct iphdr *iph = ip_hdr(skb);
struct nf_bridge_info *nf_bridge = skb-nf_bridge;
int err;
@@ -356,7 +381,7 @@ static int br_nf_pre_routing_finish(struct sk_buff *skb)
if (err != -EHOSTUNREACH || !in_dev || 
IN_DEV_FORWARD(in_dev))
goto free_skb;
 
-   if (!ip_route_output_key(init_net, rt, fl)) {
+   

[Devel] [PATCH 9/9] ebtables: cleanup table entries during table unregister

2008-07-31 Thread Alexey Dobriyan
So far a table could be unregistered only during module unload, which in
practice never happened: depending on its entries, the table pinned the module
enough times to prevent unload at all.

Now tables will also be unregistered during netns stop, so prevent module
refcount leaks by cleaning up table entries at table unregister time.

Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
---

 net/bridge/netfilter/ebtables.c |2 ++
 1 file changed, 2 insertions(+)

--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1262,6 +1262,8 @@ void ebt_unregister_table(struct ebt_table *table)
mutex_lock(ebt_mutex);
list_del(table-list);
mutex_unlock(ebt_mutex);
+   EBT_ENTRY_ITERATE(table-private-entries, table-private-entries_size,
+ ebt_cleanup_entry, NULL);
vfree(table-private-entries);
if (table-private-chainstack) {
for_each_possible_cpu(i)
-- 
1.5.4.5




[Devel] [PATCH 0/6] Container Freezer: Reuse Suspend Freezer

2008-07-31 Thread Matt Helsley
This patch series introduces a cgroup subsystem that utilizes the swsusp
freezer to freeze a group of tasks. It's immediately useful for batch job
management scripts. It should also be useful in the future for implementing
container checkpoint/restart.

The freezer subsystem in the container filesystem defines a cgroup file named
freezer.state. Reading freezer.state will return the current state of the
cgroup.  Writing FROZEN to the state file will freeze all tasks in the
cgroup. Subsequently writing RUNNING will unfreeze the tasks in the cgroup. 

* Examples of usage :

   # mkdir /containers/freezer
   # mount -t cgroup -ofreezer freezer  /containers
   # mkdir /containers/0
   # echo $some_pid > /containers/0/tasks

to get status of the freezer subsystem :

   # cat /containers/0/freezer.state
   RUNNING

to freeze all tasks in the container :

   # echo FROZEN > /containers/0/freezer.state
   # cat /containers/0/freezer.state
   FREEZING
   # cat /containers/0/freezer.state
   FROZEN

to unfreeze all tasks in the container :

   # echo RUNNING > /containers/0/freezer.state
   # cat /containers/0/freezer.state
   RUNNING

Andrew, since I hear Rafael doesn't have time to review these (again) at this
time, please consider these patches for -mm.

Cheers,
-Matt Helsley

Changes since v4:
v5:
Split out write_string as a separate patch for easier merging
with trees lacking certain cgroup patches at the time.
Checked use of task alloc lock for races with swsusp freeze/thaw --
looks safe because there are explicit barriers to handle
freeze/thaw races for individual tasks, we explicitly
handle partial group freezing, and partial group thawing
should be resolved without changing swsusp's loop.
Updated the patches to Linus' git tree as of approximately
7/31/2008.
Added Pavel and Serge's Acked-by lines to Acked patches

v4 (Almost all of these changes are confined to patch 3):
Reworked the series to use task_lock() instead of RCU.
Reworked the series to use write_string() and read_seq_string()
cgroup methods.
Fixed the race Paul Menage identified.
Fixed up check_if_frozen() to do more than just test the FROZEN
flag. In some cases tasks could be stopped (T) and marked
FREEZING. When that happens we can safely assume that it
will be frozen immediately upon waking up in the kernel.
Waiting for it to get marked with PF_FROZEN in order to
transition to the FROZEN state would block unnecessarily.
Removed freezer_ prefix from static functions in cgroup_freezer.c.
Simplified STATE_ switch.
Updated the locking comments.

v3:
Ported to 2.6.26-rc5-mm2 with Rafael's freezer patches
Tested on 24 combinations of 3 architectures (x86, x86_64, ppc64)
with 8 different kernel configs varying power management
and cgroup config variables. Each patch builds and boots
in these 24 combinations.
Passes functional testing.

v2 (roughly patches 3 and 5):
Moved the kill file into a separate cgroup subsystem (signal) and
it's own patch.
Changed the name of the file from freezer.freeze to freezer.state.
Switched from taking 1 and 0 as input to the strings FROZEN and 
RUNNING, respectively. This helps keep the interface
human-usable if/when we need to add more states.
Checked that stopped or interrupted is frozen enough
Since try_to_freeze() is called upon wakeup of these tasks
this should be fine. This idea comes from recent changes to
the freezer.
Checked that if (task == current) whilst freezing cgroup we're ok
Fixed bug where -EBUSY would always be returned when freezing
Added code to handle userspace retries for any remaining -EBUSY

-- 


[Devel] [PATCH 5/6] Container Freezer: Prevent frozen tasks or cgroups from changing

2008-07-31 Thread Matt Helsley
Don't let frozen tasks or cgroups change. This means frozen tasks can't
leave their current cgroup for another cgroup. It also means that tasks
cannot be added to or removed from a cgroup in the FROZEN state. We
enforce these rules by checking for frozen tasks and cgroups in the
can_attach() function.

Signed-off-by: Matt Helsley [EMAIL PROTECTED]
---
Changes since v4:
v5:
Checked use of task alloc lock for races with swsusp freeze/thaw --
looks safe because there are explicit barriers to handle
freeze/thaw races for individual tasks, we explicitly
handle partial group freezing, and partial group thawing
should be resolved without changing swsusp's loop. This should
answer Li Zefan's last comment re: races between freeze and
thaw.

 kernel/cgroup_freezer.c |   43 ++-
 1 file changed, 26 insertions(+), 17 deletions(-)

Index: linux-2.6.27-rc1-mm1/kernel/cgroup_freezer.c
===
--- linux-2.6.27-rc1-mm1.orig/kernel/cgroup_freezer.c
+++ linux-2.6.27-rc1-mm1/kernel/cgroup_freezer.c
@@ -90,26 +90,44 @@ static void freezer_destroy(struct cgrou
struct cgroup *cgroup)
 {
kfree(cgroup_freezer(cgroup));
 }
 
+/* Task is frozen or will freeze immediately when next it gets woken */
+static bool is_task_frozen_enough(struct task_struct *task)
+{
+   return frozen(task) ||
+   (task_is_stopped_or_traced(task)  freezing(task));
+}
 
+/*
+ * The call to cgroup_lock() in the freezer.state write method prevents
+ * a write to that file racing against an attach, and hence the
+ * can_attach() result will remain valid until the attach completes.
+ */
 static int freezer_can_attach(struct cgroup_subsys *ss,
  struct cgroup *new_cgroup,
  struct task_struct *task)
 {
struct freezer *freezer;
-   int retval = 0;
+   int retval;
+
+   /* Anything frozen can't move or be moved to/from */
+
+   if (is_task_frozen_enough(task))
+   return -EBUSY;
 
-   /*
-* The call to cgroup_lock() in the freezer.state write method prevents
-* a write to that file racing against an attach, and hence the
-* can_attach() result will remain valid until the attach completes.
-*/
freezer = cgroup_freezer(new_cgroup);
if (freezer-state == STATE_FROZEN)
+   return -EBUSY;
+
+   retval = 0;
+   task_lock(task);
+   freezer = task_freezer(task);
+   if (freezer-state == STATE_FROZEN)
retval = -EBUSY;
+   task_unlock(task);
return retval;
 }
 
 static void freezer_fork(struct cgroup_subsys *ss, struct task_struct *task)
 {
@@ -140,16 +158,11 @@ static void check_if_frozen(struct cgrou
unsigned int nfrozen = 0, ntotal = 0;
 
cgroup_iter_start(cgroup, it);
while ((task = cgroup_iter_next(cgroup, it))) {
ntotal++;
-   /*
-* Task is frozen or will freeze immediately when next it gets
-* woken
-*/
-   if (frozen(task) ||
-   (task_is_stopped_or_traced(task)  freezing(task)))
+   if (is_task_frozen_enough(task))
nfrozen++;
}
 
/*
 * Transition to FROZEN when no new tasks can be added ensures
@@ -196,15 +209,11 @@ static int try_to_freeze_cgroup(struct c
freezer-state = STATE_FREEZING;
cgroup_iter_start(cgroup, it);
while ((task = cgroup_iter_next(cgroup, it))) {
if (!freeze_task(task, true))
continue;
-   if (task_is_stopped_or_traced(task)  freezing(task))
-   /*
-* The freeze flag is set so these tasks will
-* immediately go into the fridge upon waking.
-*/
+   if (is_task_frozen_enough(task))
continue;
if (!freezing(task)  !freezer_should_skip(task))
num_cant_freeze_now++;
}
cgroup_iter_end(cgroup, it);

-- 


[Devel] [PATCH 3/6] Container Freezer: Implement freezer cgroup subsystem

2008-07-31 Thread Matt Helsley
This patch implements a new freezer subsystem in the control groups framework.
It provides a way to stop and resume execution of all tasks in a cgroup by
writing in the cgroup filesystem.

The freezer subsystem in the container filesystem defines a file named
freezer.state. Writing FROZEN to the state file will freeze all tasks in the
cgroup. Subsequently writing RUNNING will unfreeze the tasks in the cgroup.
Reading will return the current state.

* Examples of usage :

   # mkdir /containers/freezer
   # mount -t cgroup -ofreezer freezer  /containers
   # mkdir /containers/0
   # echo $some_pid > /containers/0/tasks

to get status of the freezer subsystem :

   # cat /containers/0/freezer.state
   RUNNING

to freeze all tasks in the container :

   # echo FROZEN > /containers/0/freezer.state
   # cat /containers/0/freezer.state
   FREEZING
   # cat /containers/0/freezer.state
   FROZEN

to unfreeze all tasks in the container :

   # echo RUNNING > /containers/0/freezer.state
   # cat /containers/0/freezer.state
   RUNNING

This is the basic mechanism which should do the right thing for user space
tasks in a simple scenario.

It's important to note that freezing can be incomplete. In that case we return
EBUSY. This means that some tasks in the cgroup are busy doing something that
prevents us from completely freezing the cgroup at this time. After EBUSY,
the cgroup will remain partially frozen -- reflected by freezer.state reporting
FREEZING when read. The state will remain FREEZING until one of these
things happens:

1) Userspace cancels the freezing operation by writing RUNNING to
the freezer.state file
2) Userspace retries the freezing operation by writing FROZEN to
the freezer.state file (writing FREEZING is not legal
and returns EIO)
3) The tasks that blocked the cgroup from entering the FROZEN
state disappear from the cgroup's set of tasks.
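
As an illustration only (not part of this patch), a minimal userspace sketch of
how a management tool might drive this interface: retry while the cgroup still
reports FREEZING, and cancel the freeze if it never completes. The cgroup path
is an assumption taken from the examples above.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define STATE_FILE "/containers/0/freezer.state"

static int write_state(const char *state)
{
	FILE *f = fopen(STATE_FILE, "w");

	if (!f)
		return -1;
	/* The write may fail with EBUSY if some tasks cannot freeze yet. */
	fputs(state, f);
	return fclose(f);
}

int main(void)
{
	char state[16];
	int i;

	for (i = 0; i < 10; i++) {
		FILE *f;

		write_state("FROZEN\n");

		f = fopen(STATE_FILE, "r");
		if (!f)
			return 1;
		if (!fgets(state, sizeof(state), f))
			state[0] = '\0';
		fclose(f);

		if (strncmp(state, "FROZEN", 6) == 0)
			return 0;	/* every task is now frozen */
		sleep(1);		/* still FREEZING, retry */
	}

	write_state("RUNNING\n");	/* give up, cancel the partial freeze */
	return 1;
}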

Signed-off-by: Cedric Le Goater [EMAIL PROTECTED]
Signed-off-by: Matt Helsley [EMAIL PROTECTED]
Acked-by: Serge E. Hallyn [EMAIL PROTECTED]
Tested-by: Matt Helsley [EMAIL PROTECTED]
---
 include/linux/cgroup_freezer.h |   71 
 include/linux/cgroup_subsys.h  |6 
 include/linux/freezer.h|   16 +-
 init/Kconfig   |7 
 kernel/Makefile|1 
 kernel/cgroup_freezer.c|  328 +
 6 files changed, 425 insertions(+), 4 deletions(-)
 create mode 100644 include/linux/cgroup_freezer.h
 create mode 100644 kernel/cgroup_freezer.c

Index: linux-2.6.27-rc1-mm1/include/linux/cgroup_freezer.h
===
--- /dev/null
+++ linux-2.6.27-rc1-mm1/include/linux/cgroup_freezer.h
@@ -0,0 +1,71 @@
+#ifndef _LINUX_CGROUP_FREEZER_H
+#define _LINUX_CGROUP_FREEZER_H
+/*
+ * cgroup_freezer.h -  control group freezer subsystem interface
+ *
+ * Copyright IBM Corporation, 2007
+ *
+ * Author : Cedric Le Goater [EMAIL PROTECTED]
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2.1 of the GNU Lesser General Public License
+ * as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it would be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ */
+
+#include linux/cgroup.h
+
+#ifdef CONFIG_CGROUP_FREEZER
+
+enum freezer_state {
+   STATE_RUNNING = 0,
+   STATE_FREEZING,
+   STATE_FROZEN,
+};
+
+struct freezer {
+   struct cgroup_subsys_state css;
+   enum freezer_state state;
+   spinlock_t lock; /* protects _writes_ to state */
+};
+
+static inline struct freezer *cgroup_freezer(
+   struct cgroup *cgroup)
+{
+   return container_of(
+   cgroup_subsys_state(cgroup, freezer_subsys_id),
+   struct freezer, css);
+}
+
+static inline struct freezer *task_freezer(struct task_struct *task)
+{
+   return container_of(task_subsys_state(task, freezer_subsys_id),
+   struct freezer, css);
+}
+
+static inline int cgroup_frozen(struct task_struct *task)
+{
+   struct freezer *freezer;
+   enum freezer_state state;
+
+   task_lock(task);
+   freezer = task_freezer(task);
+   state = freezer-state;
+   task_unlock(task);
+
+   return state == STATE_FROZEN;
+}
+
+#else /* !CONFIG_CGROUP_FREEZER */
+
+static inline int cgroup_frozen(struct task_struct *task)
+{
+   return 0;
+}
+
+#endif /* !CONFIG_CGROUP_FREEZER */
+
+#endif /* _LINUX_CGROUP_FREEZER_H */
Index: linux-2.6.27-rc1-mm1/include/linux/cgroup_subsys.h
===
--- linux-2.6.27-rc1-mm1.orig/include/linux/cgroup_subsys.h
+++ linux-2.6.27-rc1-mm1/include/linux/cgroup_subsys.h
@@ -50,5 +50,11 @@ SUBSYS(devices)
 #ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR
 

[Devel] [PATCH 6/6] Container Freezer: Use cgroup write_string method

2008-07-31 Thread Matt Helsley
Use the new cgroup write_string method rather than the raw write method
because it better matches the needs of the freezer cgroup subsystem.

Signed-off-by: Matt Helsley [EMAIL PROTECTED]
---
 kernel/cgroup_freezer.c |   21 +
 1 file changed, 5 insertions(+), 16 deletions(-)

Index: linux-2.6.27-rc1-mm1/kernel/cgroup_freezer.c
===
--- linux-2.6.27-rc1-mm1.orig/kernel/cgroup_freezer.c
+++ linux-2.6.27-rc1-mm1/kernel/cgroup_freezer.c
@@ -29,11 +29,10 @@
 static const char *freezer_state_strs[] = {
RUNNING,
FREEZING,
FROZEN,
 };
-#define STATE_MAX_STRLEN 8
 
 /*
  * State diagram (transition labels in parenthesis):
  *
  *  RUNNING -(FROZEN)- FREEZING -(FROZEN)- FROZEN
@@ -278,27 +277,17 @@ out:
spin_unlock_irq(freezer-lock);
 
return retval;
 }
 
-static ssize_t freezer_write(struct cgroup *cgroup,
-struct cftype *cft,
-struct file *file,
-const char __user *userbuf,
-size_t nbytes, loff_t *unused_ppos)
+static int freezer_write(struct cgroup *cgroup,
+struct cftype *cft,
+const char *buffer)
 {
-   char buffer[STATE_MAX_STRLEN + 1];
-   int retval = 0;
+   int retval;
enum freezer_state goal_state;
 
-   if (nbytes = PATH_MAX)
-   return -E2BIG;
-   nbytes = min(sizeof(buffer) - 1, nbytes);
-   if (copy_from_user(buffer, userbuf, nbytes))
-   return -EFAULT;
-   buffer[nbytes + 1] = 0; /* nul-terminate */
-   strstrip(buffer); /* remove any trailing whitespace */
if (strcmp(buffer, freezer_state_strs[STATE_RUNNING]) == 0)
goal_state = STATE_RUNNING;
else if (strcmp(buffer, freezer_state_strs[STATE_FROZEN]) == 0)
goal_state = STATE_FROZEN;
else
@@ -313,11 +302,11 @@ static ssize_t freezer_write(struct cgro
 
 static struct cftype files[] = {
{
.name = state,
.read_seq_string = freezer_read,
-   .write = freezer_write,
+   .write_string = freezer_write,
},
 };
 
 static int freezer_populate(struct cgroup_subsys *ss, struct cgroup *cgroup)
 {

-- 


[Devel] [PATCH 2/6] Container Freezer: Make refrigerator always available

2008-07-31 Thread Matt Helsley
Now that the TIF_FREEZE flag is available in all architectures,
extract the refrigerator() and freeze_task() functions from
kernel/power/process.c and make them available to all.

The refrigerator() can now be used in a control group subsystem
implementing a control group freezer.

Signed-off-by: Cedric Le Goater [EMAIL PROTECTED]
Signed-off-by: Matt Helsley [EMAIL PROTECTED]
Acked-by: Serge E. Hallyn [EMAIL PROTECTED]
Tested-by: Matt Helsley [EMAIL PROTECTED]
---
 include/linux/freezer.h |   24 +
 kernel/Makefile |2 +-
 kernel/freezer.c|  122 +++
 kernel/power/process.c  |  116 
 4 files changed, 136 insertions(+), 128 deletions(-)
 create mode 100644 kernel/freezer.c

diff --git a/include/linux/freezer.h b/include/linux/freezer.h
index deddeed..4081768 100644
--- a/include/linux/freezer.h
+++ b/include/linux/freezer.h
@@ -6,7 +6,6 @@
 #include linux/sched.h
 #include linux/wait.h
 
-#ifdef CONFIG_PM_SLEEP
 /*
  * Check if a process has been frozen
  */
@@ -39,6 +38,11 @@ static inline void clear_freeze_flag(struct task_struct *p)
clear_tsk_thread_flag(p, TIF_FREEZE);
 }
 
+static inline bool should_send_signal(struct task_struct *p)
+{
+   return !(p-flags  PF_FREEZER_NOSIG);
+}
+
 /*
  * Wake up a frozen process
  *
@@ -63,8 +67,6 @@ static inline int thaw_process(struct task_struct *p)
 }
 
 extern void refrigerator(void);
-extern int freeze_processes(void);
-extern void thaw_processes(void);
 
 static inline int try_to_freeze(void)
 {
@@ -75,6 +77,14 @@ static inline int try_to_freeze(void)
return 0;
 }
 
+extern bool freeze_task(struct task_struct *p, bool sig_only);
+extern void cancel_freezing(struct task_struct *p);
+
+#ifdef CONFIG_PM_SLEEP
+
+extern int freeze_processes(void);
+extern void thaw_processes(void);
+
 /*
  * The PF_FREEZER_SKIP flag should be set by a vfork parent right before it
  * calls wait_for_completion(vfork) and reset right after it returns from this
@@ -167,18 +177,10 @@ static inline void set_freezable_with_signal(void)
__retval;   \
 })
 #else /* !CONFIG_PM_SLEEP */
-static inline int frozen(struct task_struct *p) { return 0; }
-static inline int freezing(struct task_struct *p) { return 0; }
-static inline void set_freeze_flag(struct task_struct *p) {}
-static inline void clear_freeze_flag(struct task_struct *p) {}
-static inline int thaw_process(struct task_struct *p) { return 1; }
 
-static inline void refrigerator(void) {}
 static inline int freeze_processes(void) { BUG(); return 0; }
 static inline void thaw_processes(void) {}
 
-static inline int try_to_freeze(void) { return 0; }
-
 static inline void freezer_do_not_count(void) {}
 static inline void freezer_count(void) {}
 static inline int freezer_should_skip(struct task_struct *p) { return 0; }
diff --git a/kernel/Makefile b/kernel/Makefile
index 4e1d7df..9844f47 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -5,7 +5,7 @@
 obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
cpu.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
-   signal.o sys.o kmod.o workqueue.o pid.o \
+   signal.o sys.o kmod.o workqueue.o pid.o freezer.o \
rcupdate.o extable.o params.o posix-timers.o \
kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
diff --git a/kernel/freezer.c b/kernel/freezer.c
new file mode 100644
index 000..cb0931f
--- /dev/null
+++ b/kernel/freezer.c
@@ -0,0 +1,122 @@
+/*
+ * kernel/freezer.c - Function to freeze a process
+ *
+ * Originally from kernel/power/process.c
+ */
+
+#include linux/interrupt.h
+#include linux/suspend.h
+#include linux/module.h
+#include linux/syscalls.h
+#include linux/freezer.h
+
+/*
+ * freezing is complete, mark current process as frozen
+ */
+static inline void frozen_process(void)
+{
+   if (!unlikely(current-flags  PF_NOFREEZE)) {
+   current-flags |= PF_FROZEN;
+   wmb();
+   }
+   clear_freeze_flag(current);
+}
+
+/* Refrigerator is place where frozen processes are stored :-). */
+void refrigerator(void)
+{
+   /* Hmm, should we be allowed to suspend when there are realtime
+  processes around? */
+   long save;
+
+   task_lock(current);
+   if (freezing(current)) {
+   frozen_process();
+   task_unlock(current);
+   } else {
+   task_unlock(current);
+   return;
+   }
+   save = current-state;
+   pr_debug(%s entered refrigerator\n, current-comm);
+
+   spin_lock_irq(current-sighand-siglock);
+   recalc_sigpending(); /* We sent fake signal, clean it up */
+   spin_unlock_irq(current-sighand-siglock);
+
+   for (;;) {
+   

[Devel] [PATCH 4/6] Container Freezer: Skip frozen cgroups during power management resume

2008-07-31 Thread Matt Helsley
When a system is resumed after a suspend, it will also unfreeze
frozen cgroups.

This patch modifies the resume sequence to skip the tasks which
are part of a frozen control group.

Signed-off-by: Cedric Le Goater [EMAIL PROTECTED]
Signed-off-by: Matt Helsley [EMAIL PROTECTED]
Acked-by: Serge E. Hallyn [EMAIL PROTECTED]
Tested-by: Matt Helsley [EMAIL PROTECTED]
---
 kernel/power/process.c |4 
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/kernel/power/process.c b/kernel/power/process.c
index 444cea8..ce9e280 100644
--- a/kernel/power/process.c
+++ b/kernel/power/process.c
@@ -13,6 +13,7 @@
 #include linux/module.h
 #include linux/syscalls.h
 #include linux/freezer.h
+#include linux/cgroup_freezer.h
 
 /* 
  * Timeout for stopping processes
@@ -135,6 +136,9 @@ static void thaw_tasks(bool nosig_only)
if (nosig_only  should_send_signal(p))
continue;
 
+   if (cgroup_frozen(p))
+   continue;
+
thaw_process(p);
} while_each_thread(g, p);
read_unlock(tasklist_lock);
-- 
1.5.3.7

-- 


[Devel] [PATCH 1/6] Container Freezer: Add TIF_FREEZE flag to all architectures

2008-07-31 Thread Matt Helsley
This patch is the first step in making the refrigerator() available
to all architectures, even for those without power management.

The purpose of such a change is to be able to use the refrigerator()
in a new control group subsystem which will implement a control group
freezer.

Signed-off-by: Cedric Le Goater [EMAIL PROTECTED]
Signed-off-by: Matt Helsley [EMAIL PROTECTED]
Acked-by: Pavel Machek [EMAIL PROTECTED]
Acked-by: Serge E. Hallyn [EMAIL PROTECTED]
Tested-by: Matt Helsley [EMAIL PROTECTED]
---
 arch/parisc/include/asm/thread_info.h   |2 ++
 arch/sparc/include/asm/thread_info_32.h |2 ++
 arch/sparc/include/asm/thread_info_64.h |2 ++
 include/asm-alpha/thread_info.h |2 ++
 include/asm-avr32/thread_info.h |1 +
 include/asm-cris/thread_info.h  |2 ++
 include/asm-h8300/thread_info.h |2 ++
 include/asm-m68k/thread_info.h  |1 +
 include/asm-m68knommu/thread_info.h |2 ++
 include/asm-s390/thread_info.h  |2 ++
 include/asm-um/thread_info.h|2 ++
 include/asm-xtensa/thread_info.h|2 ++
 12 files changed, 22 insertions(+)

Index: linux-2.6.27-rc1-mm1/arch/sparc/include/asm/thread_info_32.h
===
--- linux-2.6.27-rc1-mm1.orig/arch/sparc/include/asm/thread_info_32.h
+++ linux-2.6.27-rc1-mm1/arch/sparc/include/asm/thread_info_32.h
@@ -137,10 +137,11 @@ BTFIXUPDEF_CALL(void, free_thread_info, 
 #define TIF_USEDFPU8   /* FPU was used by this task
 * this quantum (SMP) */
 #define TIF_POLLING_NRFLAG 9   /* true if poll_idle() is polling
 * TIF_NEED_RESCHED */
 #define TIF_MEMDIE 10
+#define TIF_FREEZE 19  /* is freezing for suspend */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE (1TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME (1TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING(1TIF_SIGPENDING)
@@ -150,9 +151,10 @@ BTFIXUPDEF_CALL(void, free_thread_info, 
 #define _TIF_POLLING_NRFLAG(1TIF_POLLING_NRFLAG)
 
 #define _TIF_DO_NOTIFY_RESUME_MASK (_TIF_NOTIFY_RESUME | \
 _TIF_SIGPENDING | \
 _TIF_RESTORE_SIGMASK)
+#define TIF_FREEZE (1TIF_FREEZE)
 
 #endif /* __KERNEL__ */
 
 #endif /* _ASM_THREAD_INFO_H */
Index: linux-2.6.27-rc1-mm1/arch/sparc/include/asm/thread_info_64.h
===
--- linux-2.6.27-rc1-mm1.orig/arch/sparc/include/asm/thread_info_64.h
+++ linux-2.6.27-rc1-mm1/arch/sparc/include/asm/thread_info_64.h
@@ -235,10 +235,11 @@ register struct thread_info *current_thr
  *   an immediate value in instructions such as andcc.
  */
 #define TIF_ABI_PENDING12
 #define TIF_MEMDIE 13
 #define TIF_POLLING_NRFLAG 14
+#define TIF_FREEZE 19  /* is freezing for suspend */
 
 #define _TIF_SYSCALL_TRACE (1TIF_SYSCALL_TRACE)
 #define _TIF_NOTIFY_RESUME (1TIF_NOTIFY_RESUME)
 #define _TIF_SIGPENDING(1TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED  (1TIF_NEED_RESCHED)
@@ -247,10 +248,11 @@ register struct thread_info *current_thr
 #define _TIF_32BIT (1TIF_32BIT)
 #define _TIF_SECCOMP   (1TIF_SECCOMP)
 #define _TIF_SYSCALL_AUDIT (1TIF_SYSCALL_AUDIT)
 #define _TIF_ABI_PENDING   (1TIF_ABI_PENDING)
 #define _TIF_POLLING_NRFLAG(1TIF_POLLING_NRFLAG)
+#define _TIF_FREEZE(1TIF_FREEZE)
 
 #define _TIF_USER_WORK_MASK((0xff  TI_FLAG_WSAVED_SHIFT) | \
 _TIF_DO_NOTIFY_RESUME_MASK | \
 _TIF_NEED_RESCHED | _TIF_PERFCTR)
 #define _TIF_DO_NOTIFY_RESUME_MASK (_TIF_NOTIFY_RESUME | _TIF_SIGPENDING)
Index: linux-2.6.27-rc1-mm1/include/asm-alpha/thread_info.h
===
--- linux-2.6.27-rc1-mm1.orig/include/asm-alpha/thread_info.h
+++ linux-2.6.27-rc1-mm1/include/asm-alpha/thread_info.h
@@ -72,16 +72,18 @@ register struct thread_info *__current_t
 #define TIF_UAC_NOPRINT5   /* see sysinfo.h */
 #define TIF_UAC_NOFIX  6
 #define TIF_UAC_SIGBUS 7
 #define TIF_MEMDIE 8
 #define TIF_RESTORE_SIGMASK9   /* restore signal mask in do_signal */
+#define TIF_FREEZE 19  /* is freezing for suspend */
 
 #define _TIF_SYSCALL_TRACE (1TIF_SYSCALL_TRACE)
 #define _TIF_SIGPENDING(1TIF_SIGPENDING)
 #define _TIF_NEED_RESCHED  (1TIF_NEED_RESCHED)
 #define _TIF_POLLING_NRFLAG(1TIF_POLLING_NRFLAG)
 #define _TIF_RESTORE_SIGMASK   (1TIF_RESTORE_SIGMASK)
+#define _TIF_FREEZE(1TIF_FREEZE)
 
 /* Work to do on interrupt/exception return.  */
 #define _TIF_WORK_MASK (_TIF_SIGPENDING | _TIF_NEED_RESCHED)
 
 /* Work 

[Devel] Re: [PATCH 0/6] Container Freezer: Reuse Suspend Freezer

2008-07-31 Thread Matt Helsley

On Thu, 2008-07-31 at 22:06 -0700, Matt Helsley wrote:
> This patch series introduces a cgroup subsystem that utilizes the swsusp
> freezer to freeze a group of tasks. It's immediately useful for batch job
> management scripts. It should also be useful in the future for implementing
> container checkpoint/restart.

<snip>

> Andrew, since I hear Rafael doesn't have time to review these (again) at this
> time, please consider these patches for -mm.

Argh -- that (again) is ripe for misinterpretation! Rafael has
reviewed and commented on these patches often in the past but, from what
I hear, lacks the time to do so now. Hence I'm sending these to -mm.

Sorry about the wording Rafael.

Cheers,
-Matt Helsley
