[Devel] Re: [PATCH -mm 2/3] i/o controller infrastructure

2008-08-04 Thread Andrea Righi
Li Zefan wrote:
 Andrea Righi wrote:
 This is the core io-throttle kernel infrastructure. It creates the basic
 interfaces to cgroups and implements the I/O measurement and throttling
 functions.

 Signed-off-by: Andrea Righi [EMAIL PROTECTED]
 ---
 block/Makefile                |    2 ++
 include/linux/cgroup_subsys.h |    6 ++++++
 init/Kconfig                  |   10 ++++++++++
  3 files changed, 18 insertions(+), 0 deletions(-)

 
 where are block/blk-io-throttle.c and include/linux/blk-io-throttle.h?
 

mmmh.. they should have been here, I mean, in patch 2/3, but it seems
they ended up in patch 1/3 (subject: i/o controller documentation):

Documentation of the block device I/O controller: description, usage,
advantages and design.

Signed-off-by: Andrea Righi [EMAIL PROTECTED]
---
 Documentation/controllers/io-throttle.txt |  312 +
 block/blk-io-throttle.c   |  719 +
 include/linux/blk-io-throttle.h   |   41 ++
 3 files changed, 1072 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/controllers/io-throttle.txt
 create mode 100644 block/blk-io-throttle.c
 create mode 100644 include/linux/blk-io-throttle.h
...

I'm pretty sure I did a:
git-commit -s Documentation/controllers/io-throttle.txt
in my local branch (history confirms this), but anyway, if you think
it's worth it I can fix it and post the patchset again; just let me
know.

Thanks for looking at it!
-Andrea


[Devel] [PATCH 3/7] bio-cgroup: Introduction

2008-08-04 Thread Ryo Tsuruta
With this series of bio-cgroup patches, you can determine the owners of
any type of I/O, and it makes dm-ioband -- an I/O bandwidth controller --
able to control block I/O bandwidth even when it accepts delayed write
requests. Dm-ioband can find the owner cgroup of each request.
Other people working on I/O bandwidth throttling could also use this
functionality to control asynchronous I/O with a little enhancement.
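For instance, a throttling driver could resolve the owner of a bio along
these lines (a minimal sketch built around the get_bio_cgroup_iocontext()
helper added later in this series; the function name my_bio_owner_id is
made up for illustration):

    #include <linux/bio.h>
    #include <linux/biocontrol.h>

    /* Return the ID of the bio cgroup owning a bio, even when the bio
     * is a delayed write issued by a kernel thread such as pdflush. */
    static int my_bio_owner_id(struct bio *bio)
    {
            struct io_context *ioc = get_bio_cgroup_iocontext(bio);
            int id = 0;

            if (ioc) {
                    id = ioc->id;        /* ID of the owning bio cgroup */
                    put_io_context(ioc); /* drop the reference taken above */
            }
            return id;
    }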

You have to apply the dm-ioband v1.4.0 patch before applying this series
of patches.

And you have to select the following config options when compiling the kernel:
  CONFIG_CGROUPS=y
  CONFIG_CGROUP_BIO=y
I also recommend selecting the options for the cgroup memory subsystem,
because that makes it possible to give both some I/O bandwidth and some
memory to a certain cgroup to control delayed write requests; the
processes in the cgroup will then be able to dirty pages only inside the
cgroup, even when the given bandwidth is narrow.
  CONFIG_RESOURCE_COUNTERS=y
  CONFIG_CGROUP_MEM_RES_CTLR=y

This code is based on some parts of the memory subsystem of cgroup,
and I don't think the accuracy and overhead of the subsystem can be ignored
at this time, so we need to keep tuning it up.

 

The following shows how to use dm-ioband with cgroups.
Please assume that you want to make two cgroups, which we call bio cgroups
here, to track down block I/Os and assign them to the ioband device ioband1.

First, mount the bio cgroup filesystem.

 # mount -t cgroup -o bio none /cgroup/bio

Then, make new bio cgroups and put some processes in them.

 # mkdir /cgroup/bio/bgroup1
 # mkdir /cgroup/bio/bgroup2
 # echo 1234 > /cgroup/bio/bgroup1/tasks
 # echo 5678 > /cgroup/bio/bgroup2/tasks

Now, check the ID of each bio cgroup which has just been created.

 # cat /cgroup/bio/bgroup1/bio.id
   1
 # cat /cgroup/bio/bgroup2/bio.id
   2

Finally, attach the cgroups to ioband1 and assign them weights.

 # dmsetup message ioband1 0 type cgroup
 # dmsetup message ioband1 0 attach 1
 # dmsetup message ioband1 0 attach 2
 # dmsetup message ioband1 0 weight 1:30
 # dmsetup message ioband1 0 weight 2:60

You can also make use of the dm-ioband administration tool if you want.
The tool can be found here:
http://people.valinux.co.jp/~kaizuka/dm-ioband/iobandctl/manual.html
You can set up the device with the tool as follows.
In this case, you don't need to know the IDs of the cgroups.

 # iobandctl.py group /dev/mapper/ioband1 cgroup /cgroup/bio/bgroup1:30 /cgroup/bio/bgroup2:60


[Devel] [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts

2008-08-04 Thread Ryo Tsuruta
This patch splits the cgroup memory subsystem into two parts.
One is for tracking pages to find out their owners. The other is
for controlling how much memory should be assigned to each
cgroup.

With this patch, you can use the page tracking mechanism even if
the memory subsystem is off.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -Ndupr linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h 
linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h
--- linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h  2008-08-01 
12:18:28.0 +0900
+++ linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h 2008-08-01 
19:03:21.0 +0900
@@ -20,12 +20,62 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
 
+#include <linux/rcupdate.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+#ifdef CONFIG_CGROUP_PAGE
+/*
+ * We use the lower bit of the page->page_cgroup pointer as a bit spin
+ * lock.  We need to ensure that page->page_cgroup is at least two
+ * byte aligned (based on comments from Nick Piggin).  But since
+ * bit_spin_lock doesn't actually set that lock bit in a non-debug
+ * uniprocessor kernel, we should avoid setting it here too.
+ */
+#define PAGE_CGROUP_LOCK_BIT	0x0
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define PAGE_CGROUP_LOCK	(1 << PAGE_CGROUP_LOCK_BIT)
+#else
+#define PAGE_CGROUP_LOCK	0x0
+#endif
+
+/*
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ */
+struct page_cgroup {
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
+   struct list_head lru;   /* per cgroup LRU list */
+   struct mem_cgroup *mem_cgroup;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+   struct page *page;
+   int flags;
+};
+#define PAGE_CGROUP_FLAG_CACHE (0x1)   /* charged as cache */
+#define PAGE_CGROUP_FLAG_ACTIVE (0x2)  /* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE  (0x4)   /* page is file system backed */
+#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictable */
+
+static inline void lock_page_cgroup(struct page *page)
+{
+	bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static inline int try_lock_page_cgroup(struct page *page)
+{
+	return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static inline void unlock_page_cgroup(struct page *page)
+{
+	bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
 
 #define page_reset_bad_cgroup(page)	((page)->page_cgroup = 0)
 
@@ -34,45 +84,15 @@ extern int mem_cgroup_charge(struct page
gfp_t gfp_mask);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
-extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
 extern void mem_cgroup_uncharge_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
-extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
-
-extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
-   struct list_head *dst,
-   unsigned long *scanned, int order,
-   int mode, struct zone *z,
-   struct mem_cgroup *mem_cont,
-   int active, int file);
-extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
-
-extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
-#define mm_match_cgroup(mm, cgroup)	\
-	((cgroup) == mem_cgroup_from_task((mm)->owner))
 
 extern int
 mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
 extern void mem_cgroup_end_migration(struct page *page);
+extern void page_cgroup_init(void);
 
-/*
- * For memory reclaim.
- */
-extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
-extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
-
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-   int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-   int priority);
-
-extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
-   int priority, enum lru_list lru);
-
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 static inline void page_reset_bad_cgroup(struct page *page)
 {
 }
@@ 

[Devel] [PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples

2008-08-04 Thread Ryo Tsuruta
Here is the documentation of design overview, installation, command
reference and examples.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -uprN linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt 
linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt
--- linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt
1970-01-01 09:00:00.0 +0900
+++ linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt 2008-08-01 
16:44:02.0 +0900
@@ -0,0 +1,937 @@
+ Block I/O bandwidth control: dm-ioband
+
+---
+
+   Table of Contents
+
+   [1]What's dm-ioband all about?
+
+   [2]Differences from the CFQ I/O scheduler
+
+   [3]How dm-ioband works.
+
+   [4]Setup and Installation
+
+   [5]Getting started
+
+   [6]Command Reference
+
+   [7]Examples
+
+What's dm-ioband all about?
+
+ dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+   driver. Several jobs using the same physical device have to share the
+   bandwidth of the device. dm-ioband gives bandwidth to each job according
+   to its weight, which each job can set to its own value.
+
+ A job is a group of processes with the same pid or pgrp or uid or a
+   virtual machine such as KVM or Xen. A job can also be a cgroup by applying
+   the bio-cgroup patch, which can be found at
+   http://people.valinux.co.jp/~ryov/bio-cgroup/.
+
+ +--+ +--+ +--+   +--+ +--+ +--+
+ |cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
+ |  A   | |  B   | |others|   |  X   | |  Y   | |others|
+ +--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+
+ +--V+---V---+V---+   +--V+---V---+V---+
+ | group | group | default|   | group | group | default|  ioband groups
+ |   |   |  group |   |   |   |  group |
+ +---+---++   +---+---++
+ |ioband1 |   |   ioband2  |  ioband devices
+ +---|+   +---|+
+ +---V--+-V+
+ |  |  |
+ |  sdb1|   sdb2   |  physical devices
+ +--+--+
+
+
+   --
+
+Differences from the CFQ I/O scheduler
+
+ Dm-ioband is flexible in how its bandwidth settings can be configured.
+
+ Dm-ioband can work with any type of I/O scheduler, such as the NOOP
+   scheduler, which is often chosen for high-end storage, since it is
+   implemented outside the I/O scheduling layer. It allows both
+   partition-based bandwidth control and job-based --- a group of
+   processes --- control. In addition, it can set a different
+   configuration on each physical device to control its bandwidth.
+
+ Meanwhile, the current implementation of the CFQ scheduler has 8 I/O
+   priority levels, and all jobs whose processes have the same I/O
+   priority share the bandwidth assigned to that level. And I/O priority
+   is an attribute of a process, so it affects all block devices equally.
+
+   --
+
+How dm-ioband works.
+
+ Every ioband device has one ioband group, which by default is called the
+   default group.
+
+ Ioband devices can also have extra ioband groups in them. Each ioband
+   group has a job to support and a weight. Proportional to the weight,
+   dm-ioband gives tokens to the group.
+
+ A group passes on I/O requests that its job issues to the underlying
+   layer so long as it has tokens left, while requests are blocked if
+   there aren't any tokens left in the group. Tokens are refilled once
+   all of the groups that have requests on a given physical device have
+   used up their tokens.
+
+ There are two policies for token consumption. One is that a token is
+   consumed for each I/O request. The other is that a token is consumed
+   for each I/O sector; for example, one 4KB read request (512 bytes * 8
+   sectors) consumes 8 tokens. A user can choose either policy.
+
+ With this approach, a job running on an ioband group with large weight
+   is guaranteed a wide I/O bandwidth.
+
+   --
+
+Setup and Installation
+
+ Build a kernel with these options enabled:
+
+ CONFIG_MD
+ CONFIG_BLK_DEV_DM
+ CONFIG_DM_IOBAND
+
+
+ If compiled as a module, use modprobe to load dm-ioband.
+
+ # make modules
+ # make modules_install
+ # depmod -a
+ # modprobe dm-ioband
+
+
+ The "dmsetup targets" command shows all available device-mapper targets.
+   "ioband" is displayed if 

[Devel] [PATCH 0/7] I/O bandwidth controller and BIO tracking

2008-08-04 Thread Ryo Tsuruta
Hi everyone,

This series of dm-ioband patches now includes the bio tracking mechanism,
which has been posted individually to this mailing list.
This makes it easy for anybody to control I/O bandwidth, even when
the I/O is a delayed-write request.
Have fun!

This series of patches consists of two parts:
  1. dm-ioband
Dm-ioband is an I/O bandwidth controller implemented as a
device-mapper driver, which gives specified bandwidth to each job
running on the same physical device. A job is a group of processes
with the same pid or pgrp or uid or a virtual machine such as KVM
or Xen. A job can also be a cgroup by applying the bio-cgroup patch.
  2. bio-cgroup
    Bio-cgroup is a BIO tracking mechanism, which is implemented on
    the cgroup memory subsystem. With this mechanism, it is possible to
    determine which cgroup each bio belongs to, even when the bio is a
    delayed-write request issued from a kernel thread such as pdflush.

The above two parts have been posted individually to this mailing list
until now; from this time on, we will release them all together.

  [PATCH 1/7] dm-ioband: Patch of device-mapper driver
  [PATCH 2/7] dm-ioband: Documentation of design overview, installation,
 command reference and examples.
  [PATCH 3/7] bio-cgroup: Introduction
  [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
  [PATCH 5/7] bio-cgroup: Remove a lot of #ifdefs
  [PATCH 6/7] bio-cgroup: Implement the bio-cgroup
  [PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband

Please see the following site for more information:
  Linux Block I/O Bandwidth Control Project
  http://people.valinux.co.jp/~ryov/bwctl/

Thanks,
Ryo Tsuruta


[Devel] [PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband

2008-08-04 Thread Ryo Tsuruta
With this patch, dm-ioband can work with the bio cgroup.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -Ndupr linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c 
linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c
--- linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c2008-08-01 
16:53:57.0 +0900
+++ linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c2008-08-01 
19:44:36.0 +0900
@@ -6,6 +6,7 @@
  * This file is released under the GPL.
  */
 #include <linux/bio.h>
+#include <linux/biocontrol.h>
 #include "dm.h"
 #include "dm-bio-list.h"
 #include "dm-ioband.h"
@@ -53,13 +54,13 @@ static int ioband_node(struct bio *bio)
 
 static int ioband_cgroup(struct bio *bio)
 {
-	/*
-	 * This function should return the ID of the cgroup which issued bio.
-	 * The ID of the cgroup which the current process belongs to won't be
-	 * suitable ID for this purpose, since some BIOs will be handled by kernel
-	 * threads like aio or pdflush on behalf of the process requesting the BIOs.
-	 */
-	return 0;	/* not implemented yet */
+	struct io_context *ioc = get_bio_cgroup_iocontext(bio);
+	int id = 0;
+	if (ioc) {
+		id = ioc->id;
+		put_io_context(ioc);
+	}
+	return id;
 }
 
 struct group_type dm_ioband_group_type[] = {


[Devel] [PATCH 6/7] bio-cgroup: Implement the bio-cgroup

2008-08-04 Thread Ryo Tsuruta
This patch implements the bio cgroup on the memory cgroup.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -Ndupr linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c 
linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c
--- linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c2008-07-29 11:40:31.0 
+0900
+++ linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c2008-08-01 19:18:38.0 
+0900
@@ -84,24 +84,28 @@ void exit_io_context(void)
}
 }
 
+void init_io_context(struct io_context *ioc)
+{
+	atomic_set(&ioc->refcount, 1);
+	atomic_set(&ioc->nr_tasks, 1);
+	spin_lock_init(&ioc->lock);
+	ioc->ioprio_changed = 0;
+	ioc->ioprio = 0;
+	ioc->last_waited = jiffies; /* doesn't matter... */
+	ioc->nr_batch_requests = 0; /* because this is 0 */
+	ioc->aic = NULL;
+	INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+	INIT_HLIST_HEAD(&ioc->cic_list);
+	ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
struct io_context *ret;
 
ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-	if (ret) {
-		atomic_set(&ret->refcount, 1);
-		atomic_set(&ret->nr_tasks, 1);
-		spin_lock_init(&ret->lock);
-		ret->ioprio_changed = 0;
-		ret->ioprio = 0;
-		ret->last_waited = jiffies; /* doesn't matter... */
-		ret->nr_batch_requests = 0; /* because this is 0 */
-		ret->aic = NULL;
-		INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-		INIT_HLIST_HEAD(&ret->cic_list);
-		ret->ioc_data = NULL;
-	}
+	if (ret)
+		init_io_context(ret);
 
return ret;
 }
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h 
linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h 1970-01-01 
09:00:00.0 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h 2008-08-01 
19:21:56.0 +0900
@@ -0,0 +1,159 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/memcontrol.h>
+
+#ifndef _LINUX_BIOCONTROL_H
+#define _LINUX_BIOCONTROL_H
+
+#ifdef CONFIG_CGROUP_BIO
+
+struct io_context;
+struct block_device;
+
+struct bio_cgroup {
+   struct cgroup_subsys_state css;
+   int id;
+   struct io_context *io_context;  /* default io_context */
+/* struct radix_tree_root io_context_root; per device io_context */
+   spinlock_t  page_list_lock;
+   struct list_headpage_list;
+};
+
+static inline int bio_cgroup_disabled(void)
+{
+   return bio_cgroup_subsys.disabled;
+}
+
+static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
+{
+   return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
+   struct bio_cgroup, css);
+}
+
+static inline void __bio_cgroup_add_page(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	list_add(&pc->blist, &biog->page_list);
+}
+
+static inline void bio_cgroup_add_page(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	unsigned long flags;
+	spin_lock_irqsave(&biog->page_list_lock, flags);
+	__bio_cgroup_add_page(pc);
+	spin_unlock_irqrestore(&biog->page_list_lock, flags);
+}
+
+static inline void __bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+	list_del_init(&pc->blist);
+}
+
+static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	unsigned long flags;
+	spin_lock_irqsave(&biog->page_list_lock, flags);
+	__bio_cgroup_remove_page(pc);
+	spin_unlock_irqrestore(&biog->page_list_lock, flags);
+}
+
+static inline void get_bio_cgroup(struct bio_cgroup *biog)
+{
+	css_get(&biog->css);
+}
+
+static inline void put_bio_cgroup(struct bio_cgroup *biog)
+{
+	css_put(&biog->css);
+}
+
+static inline void set_bio_cgroup(struct page_cgroup *pc,
+					struct bio_cgroup *biog)
+{
+	pc->bio_cgroup = biog;
+}
+
+static inline void clear_bio_cgroup(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	pc->bio_cgroup = NULL;
+	put_bio_cgroup(biog);
+}
+
+static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
+{
+	struct bio_cgroup *biog = pc->bio_cgroup;
+	css_get(&biog->css);
+	return biog;
+}
+
+/* This should be called in an RCU-protected section. */
+static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
+{
+	struct bio_cgroup *biog;
+	biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
+	get_bio_cgroup(biog);
+	return biog;
+}
+
+extern struct io_context *get_bio_cgroup_iocontext(struct bio *bio);
+
+#else  /* CONFIG_CGROUP_BIO */
+
+struct bio_cgroup;
+
+static inline int 

[Devel] [PATCH 5/7] bio-cgroup: Remove a lot of ifdefs

2008-08-04 Thread Ryo Tsuruta
This patch is for cleaning up the code of the cgroup memory subsystem
to remove a lot of #ifdefs.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -Ndupr linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c 
linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c2008-08-01 19:48:55.0 
+0900
+++ linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c2008-08-01 19:49:38.0 
+0900
@@ -228,6 +228,47 @@ struct mem_cgroup *mem_cgroup_from_task(
struct mem_cgroup, css);
 }
 
+static inline void get_mem_cgroup(struct mem_cgroup *mem)
+{
+	css_get(&mem->css);
+}
+
+static inline void put_mem_cgroup(struct mem_cgroup *mem)
+{
+	css_put(&mem->css);
+}
+
+static inline void set_mem_cgroup(struct page_cgroup *pc,
+					struct mem_cgroup *mem)
+{
+	pc->mem_cgroup = mem;
+}
+
+static inline void clear_mem_cgroup(struct page_cgroup *pc)
+{
+	struct mem_cgroup *mem = pc->mem_cgroup;
+	res_counter_uncharge(&mem->res, PAGE_SIZE);
+	pc->mem_cgroup = NULL;
+	put_mem_cgroup(mem);
+}
+
+static inline struct mem_cgroup *get_mem_page_cgroup(struct page_cgroup *pc)
+{
+	struct mem_cgroup *mem = pc->mem_cgroup;
+	css_get(&mem->css);
+	return mem;
+}
+
+/* This should be called in an RCU-protected section. */
+static inline struct mem_cgroup *mm_get_mem_cgroup(struct mm_struct *mm)
+{
+	struct mem_cgroup *mem;
+
+	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+	get_mem_cgroup(mem);
+	return mem;
+}
+
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
struct page_cgroup *pc)
 {
@@ -297,6 +338,26 @@ static void __mem_cgroup_move_lists(stru
 	list_move(&pc->lru, &mz->lists[lru]);
 }
 
+static inline void mem_cgroup_add_page(struct page_cgroup *pc)
+{
+	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	__mem_cgroup_add_list(mz, pc);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+}
+
+static inline void mem_cgroup_remove_page(struct page_cgroup *pc)
+{
+	struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+	unsigned long flags;
+
+	spin_lock_irqsave(&mz->lru_lock, flags);
+	__mem_cgroup_remove_list(mz, pc);
+	spin_unlock_irqrestore(&mz->lru_lock, flags);
+}
+
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
int ret;
@@ -339,6 +400,36 @@ void mem_cgroup_move_lists(struct page *
unlock_page_cgroup(page);
 }
 
+static inline int mem_cgroup_try_to_allocate(struct mem_cgroup *mem,
+					gfp_t gfp_mask)
+{
+	unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+
+	while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+		if (!(gfp_mask & __GFP_WAIT))
+			return -1;
+
+		if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+			continue;
+
+		/*
+		 * try_to_free_mem_cgroup_pages() might not give us a full
+		 * picture of reclaim. Some pages are reclaimed and might be
+		 * moved to swap cache or just unmapped from the cgroup.
+		 * Check the limit again to see if the reclaim reduced the
+		 * current usage of the cgroup before giving up
+		 */
+		if (res_counter_check_under_limit(&mem->res))
+			continue;
+
+		if (!nr_retries--) {
+			mem_cgroup_out_of_memory(mem, gfp_mask);
+			return -1;
+		}
+	}
+	return 0;
+}
+
 /*
  * Calculate mapped_ratio under memory controller. This will be used in
  * vmscan.c for determining we have to reclaim mapped pages.
@@ -469,15 +560,14 @@ int mem_cgroup_shrink_usage(struct mm_st
return 0;
 
rcu_read_lock();
-	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-	css_get(&mem->css);
+   mem = mm_get_mem_cgroup(mm);
rcu_read_unlock();
 
do {
progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
} while (!progress  --retry);
 
-	css_put(&mem->css);
+   put_mem_cgroup(mem);
if (!retry)
return -ENOMEM;
return 0;
@@ -558,7 +648,7 @@ static int mem_cgroup_force_empty(struct
int ret = -EBUSY;
int node, zid;
 
-	css_get(&mem->css);
+   get_mem_cgroup(mem);
/*
 * page reclaim code (kswapd etc..) will move pages between
 * active_list - inactive_list while we don't take a lock.
@@ -578,7 +668,7 @@ static int mem_cgroup_force_empty(struct
}
ret = 0;
 out:
-	css_put(&mem->css);
+   put_mem_cgroup(mem);
return ret;
 }
 
@@ -873,10 +963,37 @@ 

[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-04 Thread Louis Rilling
On Fri, Aug 01, 2008 at 02:51:57PM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Fri, Aug 01, 2008 at 10:15:26AM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 03:12:32PM -0400, Oren Laadan wrote:

Cut the less interesting (IMHO at least) history to make Dave happier ;)


 Returning 0 in case of a restart is what I called a special handling. You 
 won't
 do this for the other tasks, so this is special. Since userspace must cope 
 with
 it anyway, userspace can be clever enough to avoid using the fd on restart, 
 or
 stupid enough to destroy its checkpoint after restart.

 It's a different special handling :)   In the case of a single task that 
 wants
 to checkpoint itself - there are no other tasks.  In the case of a container -
 there will be only a single task that calls sys_checkpoint(), so only that 
 task
 will either get the CRID or the 0 (or an error). The other tasks will resume
 whatever it was that they were doing (lol, assuming of course restart works).

 So this special handling ends up being a two-liner: setting the return
 value of the syscall for the task that called sys_checkpoint() (well, actually
 it will call sys_restart() to restart, and return from sys_checkpoint() with
 a value of 0 ...).

I knew it, since I actually saw it in the patches you sent last week.


 If you use an FD, you will have to checkpoint that resource as part of the
 checkpoint, and restore it as part of the restart. In doing so you'll need
 to specially handle it, because it has a special meaning. I agree, of course,
 that it is feasible.


 - Userspace makes less errors when managing incremental checkpoints.
 have you implemented this ?  did you experience issues in real life ?  user
 space will need a way to manage all of it anyway in many aspects. This will
 be the last/least of the issues ...

 No it was not implemented, and I'm not going to enter a discussion about the
 weight of arguments whether they are backed by implementations or not. It 
 just
 becomes easier to create a mess with things depending on each other created 
 as
 separate, freely (userspace-decided)-named objects.

 If I were to write a user-space tool to handle this, I would keep each chain
 of checkpoints (from base and on) in a separate subdir, for example. In 
 fact,
 that's how I did it :)

This is intuitive indeed. Checkpoints are already organized in a similar way in
Kerrighed, except that a notion of application (transparent to applications)
replaces the notion of container, and the kernel decides where to put the
checkpoints and how they are named (I'm not saying that this is the best
way though).

 Besides, this scheme begins to sound much more complex than a single file.
 Do you really gain so much from not having multiple files, one per 
 checkpoint ?

 Well, at least you are not limited by the number of open file descriptors
 (assuming that, as you mentioned earlier, you pass an array of previous 
 images
 to compute the next incremental checkpoint).

 You aren't limited by the number of open files. User space could provide an
 array of <CRID, pathname> (or <serial#, pathname>) to the kernel; the kernel
 will access the files as necessary.

But the kernel itself would have to cope with this limit (even if it is
not enforced, just to avoid consuming too many resources), or close and
reopen files when needed...


 Uhh .. hold on:  you need the array of previous checkpoints to _restart_ from
 an incremental checkpoint. You don't care about it when you checkpoint: instead,
 you keep track in memory of (1) what changed (e.g. which pages were touched),
 and (2) where to find unmodified pages in previous checkpoints. You save this
 and (2) where to find unmodified pages in previous checkpoints. You save this
 information with each new checkpoint.  The data structure to describe #2 is
 dynamic and changes with the execution, and easily keeps track of when older
 checkpoint images become irrelevant (because all the pages they hold have been
 overwritten already).

I see. I thought that you also intended to build incremental checkpoints
from previous checkpoints only, because even if this is not fast, this
saves storage space. I agree that if you always keep necessary metadata
in kernel memory, you don't need the previous images. Actually I don't
know of any incremental checkpoint scheme not using such an in-memory
metadata scheme. Which does not imply that other schemes are not relevant
though...
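
As an aside, the in-memory metadata described above could be as simple as
one record per dumped page, as in the sketch below (purely illustrative;
the struct and field names are invented here, not taken from any posted
patch):

    /* One entry per saved page: where the latest copy of its contents
     * lives.  An incremental checkpoint rewrites the entry for every
     * page dirtied since the previous checkpoint; an old image becomes
     * irrelevant once no entry points into it anymore. */
    struct cr_page_loc {
            unsigned long vaddr;   /* virtual address of the page */
            unsigned int  serial;  /* checkpoint image holding the data */
            unsigned long offset;  /* offset of the copy in that image */
    };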



 where:
 - base_fd is a regular file containing the base checkpoint, or -1 if a full
   checkpoint should be done. The checkpoint could actually also live in 
 memory,
   and the kernel should check that it matches the image pointed to by 
 base_fd.
 - out_fd is whatever file/socket/etc. on which we should dump the 
 checkpoint. In
   particular, out_fd can equal base_fd and should point to the beginning 
 of the
   file if it's a regular file.
 Excellent example. What if the checkpoint data is streamed over the network,
 so you cannot rewrite the file after it has been streamed...  Or you will 

[Devel] Too many I/O controller patches

2008-08-04 Thread Dave Hansen
On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
 This series of patches of dm-ioband now includes The bio tracking mechanism,
 which has been posted individually to this mailing list.
 This makes it easy for anybody to control the I/O bandwidth even when
 the I/O is one of delayed-write requests.

During the Containers mini-summit at OLS, it was mentioned that there
are at least *FOUR* of these I/O controllers floating around.  Have you
talked to the other authors?  (I've cc'd at least one of them).

We obviously can't come to any kind of real consensus with people just
tossing the same patches back and forth.

-- Dave



[Devel] Re: Too many I/O controller patches

2008-08-04 Thread Andrea Righi
Dave Hansen wrote:
 On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
 This series of patches of dm-ioband now includes The bio tracking 
 mechanism,
 which has been posted individually to this mailing list.
 This makes it easy for anybody to control the I/O bandwidth even when
 the I/O is one of delayed-write requests.
 
 During the Containers mini-summit at OLS, it was mentioned that there
 are at least *FOUR* of these I/O controllers floating around.  Have you
 talked to the other authors?  (I've cc'd at least one of them).
 
 We obviously can't come to any kind of real consensus with people just
 tossing the same patches back and forth.
 
 -- Dave
 

Dave,

thanks for this email first of all. I've talked with Satoshi (cc-ed)
about his solution "Yet another I/O bandwidth controlling subsystem for
CGroups based on CFQ".

I did some experiments trying to implement minimum bandwidth requirements
for my io-throttle controller, mapping the requirements to CFQ prio and
using Satoshi's controller. But this needs additional work and
testing right now, so I've not posted anything yet, just informed
Satoshi about this.

Unfortunately I've not talked to Ryo yet. I've continued my work using a
quite different approach, because the dm-ioband solution didn't work
with delayed-write requests. Now the bio tracking feature seems really
promising and I would like to do some tests ASAP, and review the patch
as well.

But I'm not yet convinced that limiting the IO writes at the device
mapper layer is the best solution. IMHO it would be better to throttle
applications' writes when they're dirtying pages in the page cache (the
io-throttle way), because when the IO requests arrive at the device
mapper it's too late (we would only have a lot of dirty pages that are
waiting to be flushed to the limited block devices, and maybe this could
lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
(at least for my requirements). Ryo, correct me if I'm wrong or if I've
not understood the dm-ioband approach.

Another thing I prefer is to directly define bandwidth limiting rules,
instead of using priorities/weights (e.g. 10MiB/s for /dev/sda), but
this seems to be in the dm-ioband TODO list, so maybe we can merge the
work I did in io-throttle to define such rules.

Anyway, I still need to look at the dm-ioband and bio-cgroup code in
details, so probably all I said above is totally wrong...

-Andrea


[Devel] Re: Too many I/O controller patches

2008-08-04 Thread Balbir Singh
Dave Hansen wrote:
 On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
 This series of patches of dm-ioband now includes The bio tracking 
 mechanism,
 which has been posted individually to this mailing list.
 This makes it easy for anybody to control the I/O bandwidth even when
 the I/O is one of delayed-write requests.
 
 During the Containers mini-summit at OLS, it was mentioned that there
 are at least *FOUR* of these I/O controllers floating around.  Have you
 talked to the other authors?  (I've cc'd at least one of them).
 
 We obviously can't come to any kind of real consensus with people just
 tossing the same patches back and forth.

Ryo and Andrea - Naveen and Satoshi met up at OLS and discussed their approach.
It would be really nice to see an RFC; I know Andrea did work on this and
compared the approaches.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[Devel] Re: Too many I/O controller patches

2008-08-04 Thread Dave Hansen
On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote:
 But I'm not yet convinced that limiting the IO writes at the device
 mapper layer is the best solution. IMHO it would be better to throttle
 applications' writes when they're dirtying pages in the page cache (the
 io-throttle way), because when the IO requests arrive to the device
 mapper it's too late (we would only have a lot of dirty pages that are
 waiting to be flushed to the limited block devices, and maybe this could
 lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
 (at least for my requirements). Ryo, correct me if I'm wrong or if I've
 not understood the dm-ioband approach.

The avoid-lots-of-page-dirtying problem sounds like a hard one.  But, if
you look at this in combination with the memory controller, they would
make a great team.

The memory controller keeps you from dirtying more than your limit of
pages (and pinning too much memory) even if the dm layer is doing the
throttling and itself can't throttle the memory usage.

I also don't think this is any different from the problems we have in
the regular VM these days.  Right now, people can dirty lots of pages on
devices that are slow.  The only thing dm-ioband would add is
changing how those devices *got* slow. :)

-- Dave



[Devel] Re: memrlimit controller merge to mainline

2008-08-04 Thread Balbir Singh
Hugh Dickins wrote:
[snip]
 
 BUG: unable to handle kernel paging request at 6b6b6b8b
 IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
 *pde =  
 Oops:  [#1] PREEMPT SMP 
 last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
 Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq 
 snd_seq_device thermal ac battery button
 
 Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
 EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
 EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
 EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
 ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
 Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 
  
 7c033ec8  0089 7963215c 9fbb1aa0 9fbb1b28 
 a272f040 
7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 
 78165ce3 
 Call Trace:
  [78161323] ? exit_mmap+0xaf/0x133
  [781226b1] ? mmput+0x4c/0xba
  [78165ce3] ? try_to_unuse+0x20b/0x3f5
  [78371534] ? _spin_unlock+0x22/0x3c
  [7816636a] ? sys_swapoff+0x17b/0x37c
  [78102d95] ? sysenter_past_esp+0x6a/0xa5
  ===
 Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 
 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 
 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b 
 EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hi, Hugh,

I am unable to reproduce the problem, but I do have an initial hypothesis

CPU0                            CPU1
try_to_unuse
task 1 starts exiting           look at mm = task1->mm
..                              increment mm_users
task 1 exits
mm->owner needs to be updated, but
no new owner is found
(mm_users > 1, but no other task
has task->mm = task1->mm)
mm_update_next_owner() leaves

grace period
                                user count drops, call mmput(mm)
task 1 freed
                                dereferencing mm->owner fails

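In code terms, the dereference that would oops is the usual lookup
pattern below (a sketch only, mirroring the mm_get_mem_cgroup() helper
elsewhere in this thread; the function name lookup_mem_cgroup is made up
for illustration). Per the hypothesis, mm->owner still points at task 1
after it has been freed, so the read hits slab poison:

    /* sketch: how mm->owner is consumed by callers today */
    static struct mem_cgroup *lookup_mem_cgroup(struct mm_struct *mm)
    {
            struct mem_cgroup *mem;

            rcu_read_lock();
            /* if mm_update_next_owner() found no new owner, mm->owner
             * may still point at a task freed meanwhile (0x6b poison) */
            mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
            css_get(&mem->css);     /* faults on the stale pointer */
            rcu_read_unlock();
            return mem;
    }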


I do have a potential solution in mind, but I want to make sure my hypothesis is
correct.



-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[Devel] [RFC][PATCH 2/3] checkpoint/restart: x86 support

2008-08-04 Thread Dave Hansen

The original version of Oren's patch contained a good hunk
of #ifdefs.  I've extracted all of those and created a bit
of an API for new architectures to follow.

Leaving Oren's sign-off because this is all still his code,
even though he hasn't seen it mangled like this before.

Signed-off-by: Oren Laadan [EMAIL PROTECTED]
---

 linux-2.6.git-dave/ckpt/Makefile  |1 
 linux-2.6.git-dave/ckpt/checkpoint.c  |7 
 linux-2.6.git-dave/ckpt/ckpt_arch.h   |6 
 linux-2.6.git-dave/ckpt/restart.c |7 
 linux-2.6.git-dave/ckpt/x86.c |  269 ++
 linux-2.6.git-dave/include/asm-x86/ckpt.h |   46 +
 6 files changed, 336 insertions(+)

diff -puN ckpt/checkpoint.c~x86_part ckpt/checkpoint.c
--- linux-2.6.git/ckpt/checkpoint.c~x86_part2008-08-04 13:29:59.0 
-0700
+++ linux-2.6.git-dave/ckpt/checkpoint.c2008-08-04 13:29:59.0 
-0700
@@ -19,6 +19,7 @@
 
 #include "ckpt.h"
 #include "ckpt_hdr.h"
+#include "ckpt_arch.h"
 
 /**
  * cr_get_fname - return pathname of a given file
@@ -183,6 +184,12 @@ static int cr_write_task(struct cr_ctx *
 
ret = cr_write_task_struct(ctx, t);
 	CR_PRINTK("ret (task_struct) %d\n", ret);
+	if (!ret)
+		ret = cr_write_thread(ctx, t);
+	CR_PRINTK("ret (thread) %d\n", ret);
+	if (!ret)
+		ret = cr_write_cpu(ctx, t);
+	CR_PRINTK("ret (cpu) %d\n", ret);
 
return ret;
 }
diff -puN /dev/null ckpt/ckpt_arch.h
--- /dev/null   2007-04-11 11:48:27.0 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-04 13:29:59.0 -0700
@@ -0,0 +1,6 @@
+#include "ckpt.h"
+
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t);
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
+int cr_read_thread(struct cr_ctx *ctx);
+int cr_read_cpu(struct cr_ctx *ctx);
diff -puN ckpt/Makefile~x86_part ckpt/Makefile
--- linux-2.6.git/ckpt/Makefile~x86_part2008-08-04 13:29:59.0 
-0700
+++ linux-2.6.git-dave/ckpt/Makefile2008-08-04 13:29:59.0 -0700
@@ -1 +1,2 @@
 obj-y += sys.o checkpoint.o restart.o
+obj-$(CONFIG_X86) += x86.o
diff -puN ckpt/restart.c~x86_part ckpt/restart.c
--- linux-2.6.git/ckpt/restart.c~x86_part   2008-08-04 13:29:59.0 
-0700
+++ linux-2.6.git-dave/ckpt/restart.c   2008-08-04 13:29:59.0 -0700
@@ -21,6 +21,7 @@
 
 #include "ckpt.h"
 #include "ckpt_hdr.h"
+#include "ckpt_arch.h"
 
 /**
  * cr_hbuf_get - reserve space on the hbuf
@@ -171,6 +172,12 @@ static int cr_read_task(struct cr_ctx *c
 
ret = cr_read_task_struct(ctx);
 	CR_PRINTK("ret (task_struct) %d\n", ret);
+	if (!ret)
+		ret = cr_read_thread(ctx);
+	CR_PRINTK("ret (thread) %d\n", ret);
+	if (!ret)
+		ret = cr_read_cpu(ctx);
+	CR_PRINTK("ret (cpu) %d\n", ret);
 
return ret;
 }
diff -puN /dev/null ckpt/x86.c
--- /dev/null   2007-04-11 11:48:27.0 -0700
+++ linux-2.6.git-dave/ckpt/x86.c   2008-08-04 13:29:59.0 -0700
@@ -0,0 +1,269 @@
+#include <asm/ckpt.h>
+#include <asm/desc.h>
+#include <asm/i387.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/* dump the thread_struct of a given task */
+int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_thread *hh = ctx->tbuf;
+	struct thread_struct *thread;
+	struct desc_struct *desc;
+	int ntls = 0;
+	int n, ret;
+
+	h.type = CR_HDR_THREAD;
+	h.len = sizeof(*hh);
+	h.id = ctx->pid;
+
+	thread = &t->thread;
+
+	/* calculate no. of TLS entries that follow */
+	desc = thread->tls_array;
+	for (n = GDT_ENTRY_TLS_ENTRIES; n > 0; n--, desc++) {
+		if (desc->a || desc->b)
+			ntls++;
+	}
+
+	hh->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
+	hh->sizeof_tls_array = sizeof(thread->tls_array);
+	hh->ntls = ntls;
+
+	if ((ret = cr_write_obj(ctx, &h, hh)) < 0)
+		return ret;
+
+	/* for simplicity dump the entire array, cherry-pick upon restart */
+	ret = cr_kwrite(ctx, thread->tls_array, sizeof(thread->tls_array));
+
+	CR_PRINTK("ntls %d\n", ntls);
+
+	/* IGNORE RESTART BLOCKS FOR NOW ... */
+
+	return ret;
+}
+
+/* dump the cpu state and registers of a given task */
+int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t)
+{
+	struct cr_hdr h;
+	struct cr_hdr_cpu *hh = ctx->tbuf;
+	struct thread_struct *thread;
+	struct thread_info *thread_info;
+	struct pt_regs *regs;
+
+	h.type = CR_HDR_CPU;
+	h.len = sizeof(*hh);
+	h.id = ctx->pid;
+
+	thread = &t->thread;
+	thread_info = task_thread_info(t);
+	regs = task_pt_regs(t);
+
+	hh->bx = regs->bx;
+	hh->cx = regs->cx;
+	hh->dx = regs->dx;
+	hh->si = regs->si;
+	hh->di = regs->di;
+	hh->bp = regs->bp;
+	hh->ax = regs->ax;
+	hh->ds = regs->ds;
+	hh->es = 

[Devel] [RFC][PATCH 3/3] checkpoint/restart: memory management

2008-08-04 Thread Dave Hansen

For each vma, there is a 'struct cr_vma'; if the vma is file-mapped,
it will be followed by the file name.  The cr_vma-npages will tell
how many pages were dumped for this vma.  Then it will be followed
by the actual data: first a dump of the addresses of all dumped
pages (npages entries) followed by a dump of the contents of all
dumped pages (npages pages). Then will come the next vma and so on.

I guess I could also separate out the x86-specific bits here, but
they're pretty small, comparatively.

Signed-off-by: Oren Laadan [EMAIL PROTECTED]
---

 linux-2.6.git-dave/arch/x86/kernel/ldt.c  |2 
 linux-2.6.git-dave/ckpt/Makefile  |2 
 linux-2.6.git-dave/ckpt/ckpt_arch.h   |2 
 linux-2.6.git-dave/ckpt/ckpt_hdr.h|   21 +
 linux-2.6.git-dave/ckpt/ckpt_mem.c|  388 ++
 linux-2.6.git-dave/ckpt/ckpt_mem.h|   32 ++
 linux-2.6.git-dave/ckpt/rstr_mem.c|  354 +++
 linux-2.6.git-dave/ckpt/sys.c |3 
 linux-2.6.git-dave/ckpt/x86.c |   83 ++
 linux-2.6.git-dave/include/asm-x86/ckpt.h |5 
 10 files changed, 890 insertions(+), 2 deletions(-)

diff -puN arch/x86/kernel/ldt.c~memory_part arch/x86/kernel/ldt.c
--- linux-2.6.git/arch/x86/kernel/ldt.c~memory_part 2008-08-04 
13:30:00.0 -0700
+++ linux-2.6.git-dave/arch/x86/kernel/ldt.c2008-08-04 13:30:00.0 
-0700
@@ -183,7 +183,7 @@ static int read_default_ldt(void __user 
return bytecount;
 }
 
-static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
+int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
 {
 	struct mm_struct *mm = current->mm;
struct desc_struct ldt;
diff -puN ckpt/ckpt_arch.h~memory_part ckpt/ckpt_arch.h
--- linux-2.6.git/ckpt/ckpt_arch.h~memory_part  2008-08-04 13:30:00.0 
-0700
+++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-04 13:30:00.0 -0700
@@ -4,3 +4,5 @@ int cr_write_thread(struct cr_ctx *ctx, 
 int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t);
 int cr_read_thread(struct cr_ctx *ctx);
 int cr_read_cpu(struct cr_ctx *ctx);
+int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm);
+int cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm);
diff -puN ckpt/ckpt_hdr.h~memory_part ckpt/ckpt_hdr.h
--- linux-2.6.git/ckpt/ckpt_hdr.h~memory_part   2008-08-04 13:30:00.0 
-0700
+++ linux-2.6.git-dave/ckpt/ckpt_hdr.h  2008-08-04 13:30:00.0 -0700
@@ -67,3 +67,24 @@ struct cr_hdr_task {
 };
 
 
+
+struct cr_hdr_mm {
+   __u32 tag;  /* sharing identifier */
+   __u64 start_code, end_code, start_data, end_data;
+   __u64 start_brk, brk, start_stack;
+   __u64 arg_start, arg_end, env_start, env_end;
+   __s16 map_count;
+};
+
+struct cr_hdr_vma {
+   __u32 how;
+
+   __u64 vm_start;
+   __u64 vm_end;
+   __u64 vm_page_prot;
+   __u64 vm_flags;
+   __u64 vm_pgoff;
+
+   __s16 npages;
+   __s16 namelen;
+};
diff -puN /dev/null ckpt/ckpt_mem.c
--- /dev/null   2007-04-11 11:48:27.0 -0700
+++ linux-2.6.git-dave/ckpt/ckpt_mem.c  2008-08-04 13:30:00.0 -0700
@@ -0,0 +1,388 @@
+/*
+ *  Checkpoint memory contents
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <linux/pagemap.h>
+#include <linux/mm_types.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+#include "ckpt_arch.h"
+#include "ckpt_mem.h"
+
+/*
+ * utilities to alloc, free, and handle 'struct cr_pgarr'
+ * (common to ckpt_mem.c and rstr_mem.c)
+ */
+
+#define CR_ORDER_PGARR  0
+#define CR_PGARR_TOTAL  ((PAGE_SIZE << CR_ORDER_PGARR) / sizeof(void *))
+
+/* release pages referenced by a page-array */
+void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr)
+{
+	int n;
+
+	/* only checkpoint keeps references to pages */
+	if (ctx->flags & CR_CTX_CKPT) {
+		CR_PRINTK("release pages (nused %d)\n", pgarr->nused);
+		for (n = pgarr->nused; n--; )
+			page_cache_release(pgarr->pages[n]);
+	}
+	pgarr->nused = 0;
+	pgarr->nleft = CR_PGARR_TOTAL;
+}
+
+/* release pages referenced by chain of page-arrays */
+void cr_pgarr_release(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr;
+
+	for (pgarr = ctx->pgarr; pgarr; pgarr = pgarr->next)
+		_cr_pgarr_release(ctx, pgarr);
+}
+
+/* free a chain of page-arrays */
+void cr_pgarr_free(struct cr_ctx *ctx)
+{
+	struct cr_pgarr *pgarr, *pgnxt;
+
+	for (pgarr = ctx->pgarr; pgarr; pgarr = pgnxt) {
+		_cr_pgarr_release(ctx, pgarr);
+		free_pages((unsigned long) ctx->pgarr->addrs, CR_ORDER_PGARR);
+		free_pages((unsigned long) ctx->pgarr->pages, 

[Devel] [RFC][PATCH 0/3] broken out c/r patches

2008-08-04 Thread Dave Hansen
I've done a bit of refactoring to Oren's patches.

I wonder if they're in a state that people think we can
share on LKML like Ted suggested.  Thoughts?

--

At the containers mini-conference before OLS, the consensus among
all the stakeholders was that doing checkpoint/restart in the kernel
as much as possible was the best approach.  With this approach, the
kernel will export a relatively opaque 'blob' of data to userspace
which can then be handed to the new kernel at restore time.

This is different from what had been proposed before, which was
that a userspace application would be responsible for collecting
all of this data.  We were also planning on adding lots of new,
little kernel interfaces for all of the things that needed
checkpointing.  This unites those into a single, grand interface.

The 'blob' will contain copies of select portions of kernel
structures such as vmas and mm_structs.  It will also contain
copies of the actual memory that the process uses.  Any changes
in this blob's format between kernel revisions can be handled by
an in-userspace conversion program.

This is a similar approach to virtually all of the commercial
checkpoint/restart products out there, as well as the research
project Zap.

These patches basically serialize internal kernel state and write
it out to a file descriptor.  The checkpoint and restore are done
with two new system calls: sys_checkpoint and sys_restart.

In this incarnation, they can only checkpoint and restore a
single task. The task's address space may consist of only private,
simple vmas - anonymous or file-mapped.
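
For illustration, a task checkpointing itself might drive the new
syscalls roughly as below (a sketch only: the syscall number and the
(pid, fd, flags) signature are assumptions for the example, not settled
ABI; the return convention -- CRID on checkpoint, 0 when resuming from a
restart -- is the one discussed elsewhere in this thread):

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #define __NR_checkpoint 333    /* assumed, not yet assigned */

    int main(void)
    {
            int fd = open("ckpt.img", O_WRONLY | O_CREAT | O_TRUNC, 0600);
            /* assumed signature: checkpoint(pid, fd, flags) */
            long crid = syscall(__NR_checkpoint, getpid(), fd, 0);

            if (crid > 0)
                    printf("checkpoint taken, CRID %ld\n", crid);
            else if (crid == 0)
                    printf("resuming after restart\n");
            else
                    perror("checkpoint");
            return 0;
    }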


[Devel] [RFC][PATCH 1/3] kernel-based checkpoint-restart: general infrastructure

2008-08-04 Thread Dave Hansen

From: Oren Laadan [EMAIL PROTECTED]

This patch adds those interfaces, as well as all of the helpers
needed to easily manage the file format.  

The code is roughly broken out as follows:

ckpt/sys.c - user/kernel data transfer, as well as setting up of the
 checkpoint/restart context (a per-checkpoint data
 structure for housekeeping)
ckpt/checkpoint.c - output wrappers and basic checkpoint handling
ckpt/restart.c - input wrappers and basic restart handling

Patches to add the per-architecture support as well as the actual
work to do the memory checkpoint follow in subsequent patches.

Signed-off-by: Oren Laadan [EMAIL PROTECTED]
---

 linux-2.6.git-dave/Makefile  |2 
 linux-2.6.git-dave/ckpt/Makefile |1 
 linux-2.6.git-dave/ckpt/checkpoint.c |  207 +++
 linux-2.6.git-dave/ckpt/ckpt.h   |   82 
 linux-2.6.git-dave/ckpt/ckpt_hdr.h   |   69 ++
 linux-2.6.git-dave/ckpt/restart.c|  189 
 linux-2.6.git-dave/ckpt/sys.c|  233 +++
 7 files changed, 782 insertions(+), 1 deletion(-)

diff -puN /dev/null ckpt/checkpoint.c
--- /dev/null   2007-04-11 11:48:27.0 -0700
+++ linux-2.6.git-dave/ckpt/checkpoint.c2008-08-04 13:29:55.0 
-0700
@@ -0,0 +1,207 @@
+/*
+ *  Checkpoint logic and helpers
+ *
+ *  Copyright (C) 2008 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/version.h>
+#include <linux/sched.h>
+#include <linux/time.h>
+#include <linux/fs.h>
+#include <linux/file.h>
+#include <linux/dcache.h>
+#include <linux/mount.h>
+#include <asm/ptrace.h>
+
+#include "ckpt.h"
+#include "ckpt_hdr.h"
+
+/**
+ * cr_get_fname - return pathname of a given file
+ * @file: file pointer
+ * @buf: buffer for pathname
+ * @n: buffer length (in) and pathname length (out)
+ *
+ * if the buffer provided by the caller is too small, allocate a new
+ * buffer; caller should call cr_put_pathname() for cleanup
+ */
+char *cr_get_fname(struct path *path, struct path *root, char *buf, int *n)
+{
+   char *fname;
+
+   fname = __d_path(path, root, buf, *n);
+
+	if (IS_ERR(fname) && PTR_ERR(fname) == -ENAMETOOLONG) {
+		if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0)))
+			return ERR_PTR(-ENOMEM);
+		fname = __d_path(path, root, buf, PAGE_SIZE);
+		if (IS_ERR(fname))
+			free_pages((unsigned long) buf, 0);
+	}
+   if (!IS_ERR(fname))
+   *n = (buf + *n - fname);
+
+   return fname;
+}
+
+/**
+ * cr_put_fname - (possibly) cleanup pathname buffer
+ * @buf: original buffer that was given to cr_get_pathname()
+ * @fname: resulting pathname from cr_get_pathname()
+ * @n: length of original buffer
+ */
+void cr_put_fname(char *buf, char *fname, int n)
+{
+	if (fname && (fname < buf || fname >= buf + n))
+		free_pages((unsigned long) buf, 0);
+}
+
+/**
+ * cr_write_obj - write a record described by a cr_hdr
+ * @ctx: checkpoint context
+ * @h: record descriptor
+ * @buf: record buffer
+ */
+int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf)
+{
+   int ret;
+
+	if ((ret = cr_kwrite(ctx, h, sizeof(*h))) < 0)
+		return ret;
+	return cr_kwrite(ctx, buf, h->len);
+}
+
+/**
+ * cr_write_str - write a string record
+ * @ctx: checkpoint context
+ * @str: string buffer
+ * @n: string length
+ */
+int cr_write_str(struct cr_ctx *ctx, char *str, int n)
+{
+   struct cr_hdr h;
+
+   h.type = CR_HDR_STR;
+   h.len = n;
+   h.id = 0;
+
+	return cr_write_obj(ctx, &h, str);
+}
+
+/* write the checkpoint header */
+static int cr_write_hdr(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_head *hh = ctx->tbuf;
+	struct timeval ktv;
+
+	h.type = CR_HDR_HEAD;
+	h.len = sizeof(*hh);
+	h.id = 0;
+
+	do_gettimeofday(&ktv);
+
+	hh->magic = 0x00a2d200;
+	hh->major = (LINUX_VERSION_CODE >> 16) & 0xff;
+	hh->minor = (LINUX_VERSION_CODE >> 8) & 0xff;
+	hh->patch = (LINUX_VERSION_CODE) & 0xff;
+
+	hh->version = 1;
+
+	hh->flags = ctx->flags;
+	hh->time = ktv.tv_sec;
+
+	return cr_write_obj(ctx, &h, hh);
+}
+
+/* write the checkpoint trailer */
+static int cr_write_tail(struct cr_ctx *ctx)
+{
+	struct cr_hdr h;
+	struct cr_hdr_tail *hh = ctx->tbuf;
+
+	h.type = CR_HDR_TAIL;
+	h.len = sizeof(*hh);
+	h.id = 0;
+
+	hh->magic = 0x002d2a00;
+	hh->cksum[0] = hh->cksum[1] = 1;	/* TBD ... */
+
+	return cr_write_obj(ctx, &h, hh);
+}
+
+/* dump the task_struct of a given task */
+static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t)
+{
+   struct cr_hdr h;
+   struct cr_hdr_task *hh = 

[Devel] Re: Too many I/O controller patches

2008-08-04 Thread Andrea Righi
Balbir Singh wrote:
 Dave Hansen wrote:
 On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
 This series of patches of dm-ioband now includes The bio tracking 
 mechanism,
 which has been posted individually to this mailing list.
 This makes it easy for anybody to control the I/O bandwidth even when
 the I/O is one of delayed-write requests.
 During the Containers mini-summit at OLS, it was mentioned that there
 are at least *FOUR* of these I/O controllers floating around.  Have you
 talked to the other authors?  (I've cc'd at least one of them).

 We obviously can't come to any kind of real consensus with people just
 tossing the same patches back and forth.
 
 Ryo and Andrea - Naveen and Satoshi met up at OLS and discussed their 
 approach.
 It would be really nice to see an RFC, I know Andrea did work on this and
 compared the approaches.
 

Yes, I wrote down something about the comparison of priority-based vs.
bandwidth-shaping solutions in terms of performance predictability, and
other considerations, like the one I cited before, about dirty-ratio
throttling in memory, AIO handling, etc.

Something is also reported in the io-throttle documentation:

http://marc.info/?l=linux-kernelm=121780176907686w=2

But OK, I agree with Balbir: I can try to put these things together (in a
better form in particular) and post an RFC together with Ryo.

Ryo, do you have other documentation besides the info reported in the
dm-ioband website?

Thanks,
-Andrea
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: Too many I/O controller patches

2008-08-04 Thread Andrea Righi
Dave Hansen wrote:
 On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote:
 But I'm not yet convinced that limiting the IO writes at the device
 mapper layer is the best solution. IMHO it would be better to throttle
 applications' writes when they're dirtying pages in the page cache (the
 io-throttle way), because when the IO requests arrive to the device
 mapper it's too late (we would only have a lot of dirty pages that are
 waiting to be flushed to the limited block devices, and maybe this could
 lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
 (at least for my requirements). Ryo, correct me if I'm wrong or if I've
 not understood the dm-ioband approach.
 
 The avoid-lots-of-page-dirtying problem sounds like a hard one.  But, if
 you look at this in combination with the memory controller, they would
 make a great team.
 
 The memory controller keeps you from dirtying more than your limit of
 pages (and pinning too much memory) even if the dm layer is doing the
 throttling and itself can't throttle the memory usage.

mmh... but in this way we would just move the OOM inside the cgroup,
that is a nice improvement, but the main problem is not resolved...

A safer approach IMHO is to force the tasks to wait synchronously on
each operation that directly or indirectly generates i/o.

In particular the solution used by the io-throttle controller to limit
the dirty-ratio in memory is to impose a sleep via
schedule_timeout_killable() in balance_dirty_pages() when a generic
process exceeds the limits defined for the belonging cgroup.

Limiting read operations is a lot easier, because reads are always
synchronous with the i/o requests they generate.
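
A minimal sketch of that hook (illustrative only: the helper
cgroup_io_limit_exceeded() and the sleep interval are made up here,
this is not the actual io-throttle code):

	/*
	 * Called from balance_dirty_pages(): put the dirtying task to
	 * sleep while its cgroup is over the configured i/o limit.
	 */
	static void io_throttle_dirty_pages(void)
	{
		while (cgroup_io_limit_exceeded(current))  /* hypothetical */
			schedule_timeout_killable(HZ / 10);
	}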

-Andrea
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: Too many I/O controller patches

2008-08-04 Thread Dave Hansen
On Mon, 2008-08-04 at 22:44 +0200, Andrea Righi wrote:
 Dave Hansen wrote:
  On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote:
  But I'm not yet convinced that limiting the IO writes at the device
  mapper layer is the best solution. IMHO it would be better to throttle
  applications' writes when they're dirtying pages in the page cache (the
  io-throttle way), because when the IO requests arrive to the device
  mapper it's too late (we would only have a lot of dirty pages that are
  waiting to be flushed to the limited block devices, and maybe this could
  lead to OOM conditions). IOW dm-ioband is doing this at the wrong level
  (at least for my requirements). Ryo, correct me if I'm wrong or if I've
  not understood the dm-ioband approach.
  
  The avoid-lots-of-page-dirtying problem sounds like a hard one.  But, if
  you look at this in combination with the memory controller, they would
  make a great team.
  
  The memory controller keeps you from dirtying more than your limit of
  pages (and pinning too much memory) even if the dm layer is doing the
  throttling and itself can't throttle the memory usage.
 
 mmh... but in this way we would just move the OOM inside the cgroup,
 that is a nice improvement, but the main problem is not resolved...
 
 A safer approach IMHO is to force the tasks to wait synchronously on
 each operation that directly or indirectly generates i/o.

Fine in theory, hard in practice. :)

I think the best we can hope for is to keep parity with what happens in
the rest of the kernel.  We already have a problem today with people
mmap()'ing lots of memory and dirtying it all at once.  Adding an i/o
bandwidth controller or a memory controller isn't really going to fix
that.  I think it is outside the scope of the i/o (and memory)
controllers until we solve it generically, first.

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-08-04 Thread Hugh Dickins
On Tue, 5 Aug 2008, Balbir Singh wrote:
 Hugh Dickins wrote:
 [snip]
  
  BUG: unable to handle kernel paging request at 6b6b6b8b
  IP: [<7817078f>] memrlimit_cgroup_uncharge_as+0x18/0x29
  Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
   [<78161323>] ? exit_mmap+0xaf/0x133
   [<781226b1>] ? mmput+0x4c/0xba
   [<78165ce3>] ? try_to_unuse+0x20b/0x3f5
   [<78371534>] ? _spin_unlock+0x22/0x3c
   [<7816636a>] ? sys_swapoff+0x17b/0x37c
   [<78102d95>] ? sysenter_past_esp+0x6a/0xa5
 
 I am unable to reproduce the problem,

Me neither, I've spent many hours trying 2.6.27-rc1-mm1 and then
back to 2.6.26-rc8-mm1.  But I've been SO stupid: saw it originally
on one machine with SLAB_DEBUG=y, have been trying since mostly on
another with SLUB_DEBUG=y, but never thought to boot with
slub_debug=P,task_struct until now.

 but I do have an initial hypothesis
 
  CPU0					CPU1
					try_to_unuse
  task 1 starts exiting			look at mm = task1->mm
  ...					increment mm_users
  task 1 exits
  mm->owner needs to be updated, but
  no new owner is found
  (mm_users > 1, but no other task
  has task->mm == task1->mm)
  mm_update_next_owner() leaves
  
  grace period
					user count drops, call mmput(mm)
  task 1 freed
					dereferencing mm->owner fails

Yes, that looks right to me: seems obvious now.  I don't think your
careful alternation of CPU0/1 events at the end matters: the swapoff
CPU simply dereferences mm->owner after that task has gone.

(That's a shame, I'd always hoped that mm->owner->comm was going to
be good for use in mm messages, even when tearing down the mm.)

 I do have a potential solution in mind, but I want to make sure my
 hypothesis is correct.

It seems wrong that memrlimit_cgroup_uncharge_as should be called
after mm->owner may have been changed, even if it's to something safe.
But I forget the mm/task exit details, surely they're tricky.

By the way, is the ordering in mm_update_next_owner the best?
Would there be less movement if it searched amongst siblings before
it searched amongst children?  Ought it to make a first pass trying
to stay within the same cgroup?

Hugh
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [PATCH 2/6] Container Freezer: Make refrigerator always available

2008-08-04 Thread Matt Helsley

On Sat, 2008-08-02 at 00:53 +0200, Rafael J. Wysocki wrote:
 On Friday, 1 of August 2008, Matt Helsley wrote:
  
  On Fri, 2008-08-01 at 16:27 +0200, Thomas Petazzoni wrote:
   Hi,
   
   Le Thu, 31 Jul 2008 22:07:01 -0700,
Matt Helsley [EMAIL PROTECTED] wrote:
   
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -5,7 +5,7 @@
 obj-y = sched.o fork.o exec_domain.o panic.o printk.o \
cpu.o exit.o itimer.o time.o softirq.o resource.o \
sysctl.o capability.o ptrace.o timer.o user.o \
-   signal.o sys.o kmod.o workqueue.o pid.o \
+   signal.o sys.o kmod.o workqueue.o pid.o freezer.o \
   
   I have the impression that the code in kernel/power/process.c was
   compiled only if CONFIG_PM_SLEEP was set. Now that the code has been
   moved to kernel/freezer.c, it is unconditionnaly compiled in every
   kernel. Is that correct ?
  
   If so, is it possible to put this new feature under some
   CONFIG_SOMETHING option, for people who care about the kernel size ?
  
  How about making it depend on a combination of CONFIG variables?
  Here's an RFC PATCH. Completely untested.
  
  Signed-off-by: Matt Helsley [EMAIL PROTECTED]
 
 Can you please also make the contents of include/linux/freezer.h depend on
 CONFIG_FREEZER instead of CONFIG_PM_SLEEP?

Done.

 Also, I'm not really sure if kernel/power/Kconfig is the right place to define
 CONFIG_FREEZER.
 
 Perhaps we should even move freezer.c from kernel/power to kernel
 and define CONFIG_FREEZER in Kconfig in there.  Andrew, what do you think?

I'll check this weekend for replies and repost the RFC PATCH on Monday
if I don't hear anything. In the meantime I'll be doing some config
build testing with the above changes to make sure it's correct.

Cheers,
-Matt

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC][PATCH 0/6] Enable multiple mounts of devpts

2008-08-04 Thread sukadev

I thought I'd send out the patches I mentioned to H. Peter Anvin
recently, to get some feedback on the general direction. This version
of the patchset ducks the user-space issue, for now.

---

Enable multiple mounts of devpts filesystem so each container can
allocate ptys independently.

To enable multiple mounts (most) devpts interfaces need to know which
instance of devpts is being accessed. This patchset uses the 'struct
inode' of the device being accessed to identify the appropriate devpts
instance. It then uses get_sb_nodev() instead of get_sb_single() to
allow multiple mounts.

PATCH 1/6   Pass-in 'struct inode' to devpts interfaces
PATCH 2/6   Remove 'devpts_root' global
PATCH 3/6   Move 'allocated_ptys' to sb->s_fs_info
PATCH 4/6   Allow mknod of ptmx and tty devices
PATCH 5/6   Allow multiple mounts of devpts
PATCH 6/6   Tweak in init_dev() /dev/tty

If devpts is mounted just once, this patchset should not change any behavior.

If devpts is mounted more than once, then '/dev/ptmx' must be a symlink
to '/dev/pts/ptmx' and in each new devpts mount we must create the
device node '/dev/pts/ptmx' [c, 5:2] by hand.
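
For illustration only, the by-hand setup for one extra mount could look
like this (the container path is just an example):

 # ln -sf /dev/pts/ptmx /dev/ptmx
 # mount -t devpts none /containers/c1/dev/pts
 # mknod /containers/c1/dev/pts/ptmx c 5 2
 # chmod 666 /containers/c1/dev/pts/ptmx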

Have only done some basic testing with multiple mounts and sshd. May not
be bisect-safe.

Appreciate comments on overall approach of my mapping from the inode
to sb->s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(),
and also on the tweak to init_dev() (patch 6).

Todo:
User-space impact of /dev/ptmx symlink - Options are being
discussed on mailing list (new mount option and config token,
new fs name, etc)

Remove even initial kernel mount of devpts ?
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC][PATCH 1/6] Pass-in 'struct inode' to devpts interfaces

2008-08-04 Thread sukadev
From: Sukadev Bhattiprolu [EMAIL PROTECTED]
Subject: [RFC][PATCH 1/6] Pass-in 'struct inode' to devpts interfaces

Pass-in an 'inode' parameter to devpts interfaces.  The parameter
itself will be used in subsequent patches to identify the instance
of devpts mounted.

---
 drivers/char/pty.c|3 ++-
 drivers/char/tty_io.c |   21 +++--
 fs/devpts/inode.c |   10 +-
 include/linux/devpts_fs.h |   34 --
 4 files changed, 42 insertions(+), 26 deletions(-)

Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c
===
--- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 02:07:25.0 
-0700
+++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c  2008-08-04 02:08:15.0 
-0700
@@ -177,7 +177,7 @@ static struct dentry *get_node(int num)
return lookup_one_len(s, root, sprintf(s, "%d", num));
 }
 
-int devpts_new_index(void)
+int devpts_new_index(struct inode *inode)
 {
int index;
int idr_ret;
@@ -205,14 +205,14 @@ retry:
return index;
 }
 
-void devpts_kill_index(int idx)
+void devpts_kill_index(struct inode *inode, int idx)
 {
mutex_lock(allocated_ptys_lock);
idr_remove(allocated_ptys, idx);
mutex_unlock(allocated_ptys_lock);
 }
 
-int devpts_pty_new(struct tty_struct *tty)
+int devpts_pty_new(struct inode *ptmx_inode, struct tty_struct *tty)
 {
int number = tty->index; /* tty layer puts index from 
devpts_new_index() in here */
struct tty_driver *driver = tty->driver;
@@ -245,7 +245,7 @@ int devpts_pty_new(struct tty_struct *tt
return 0;
 }
 
-struct tty_struct *devpts_get_tty(int number)
+struct tty_struct *devpts_get_tty(struct inode *inode, int number)
 {
struct dentry *dentry = get_node(number);
struct tty_struct *tty;
@@ -262,7 +262,7 @@ struct tty_struct *devpts_get_tty(int nu
return tty;
 }
 
-void devpts_pty_kill(int number)
+void devpts_pty_kill(struct inode *inode, int number)
 {
struct dentry *dentry = get_node(number);
 
Index: linux-2.6.26-rc8-mm1/include/linux/devpts_fs.h
===
--- linux-2.6.26-rc8-mm1.orig/include/linux/devpts_fs.h 2008-08-04 
02:07:24.0 -0700
+++ linux-2.6.26-rc8-mm1/include/linux/devpts_fs.h  2008-08-04 
02:07:27.0 -0700
@@ -17,20 +17,34 @@
 
 #ifdef CONFIG_UNIX98_PTYS
 
-int devpts_new_index(void);
-void devpts_kill_index(int idx);
-int devpts_pty_new(struct tty_struct *tty);  /* mknod in devpts */
-struct tty_struct *devpts_get_tty(int number);  /* get tty structure */
-void devpts_pty_kill(int number);   /* unlink */
+int devpts_new_index(struct inode *inode);
+void devpts_kill_index(struct inode *inode, int idx);
+
+/* mknod in devpts */
+int devpts_pty_new(struct inode *inode, struct tty_struct *tty);
+
+/* get tty structure */
+struct tty_struct *devpts_get_tty(struct inode *inode, int number);
+
+/* unlink */
+void devpts_pty_kill(struct inode *inode, int number);
 
 #else
 
 /* Dummy stubs in the no-pty case */
-static inline int devpts_new_index(void) { return -EINVAL; }
-static inline void devpts_kill_index(int idx) { }
-static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; }
-static inline struct tty_struct *devpts_get_tty(int number) { return NULL; }
-static inline void devpts_pty_kill(int number) { }
+static inline int devpts_new_index(struct inode *inode) { return -EINVAL; }
+static inline void devpts_kill_index(struct inode *inode, int idx) { }
+
+static inline int devpts_pty_new(struct inode *inode, struct tty_struct *tty)
+{
+   return -EINVAL;
+}
+
+static inline struct tty_struct *devpts_get_tty(struct inode *inode, int 
number)
+{
+   return NULL;
+}
+static inline void devpts_pty_kill(struct inode *inode, int number) { }
 
 #endif
 
Index: linux-2.6.26-rc8-mm1/drivers/char/pty.c
===
--- linux-2.6.26-rc8-mm1.orig/drivers/char/pty.c2008-08-04 
02:07:24.0 -0700
+++ linux-2.6.26-rc8-mm1/drivers/char/pty.c 2008-08-04 02:07:27.0 
-0700
@@ -59,7 +59,8 @@ static void pty_close(struct tty_struct 
set_bit(TTY_OTHER_CLOSED, &tty->flags);
 #ifdef CONFIG_UNIX98_PTYS
if (tty->driver == ptm_driver)
-   devpts_pty_kill(tty->index);
+   devpts_pty_kill(filp->f_path.dentry->d_inode,
+   tty->index);
 #endif
tty_vhangup(tty->link);
}
Index: linux-2.6.26-rc8-mm1/drivers/char/tty_io.c
===
--- linux-2.6.26-rc8-mm1.orig/drivers/char/tty_io.c 2008-08-04 
02:07:24.0 -0700
+++ linux-2.6.26-rc8-mm1/drivers/char/tty_io.c  2008-08-04 02:07:55.0 
-0700
@@ -2056,7 +2056,7 @@ static void tty_line_name(struct tty_dri
  * relaxed for 

[Devel] [RFC][PATCH 2/6] Remove 'devpts_root' global

2008-08-04 Thread sukadev

From: Sukadev Bhattiprolu [EMAIL PROTECTED]
Subject: [RFC][PATCH 2/6] Remove 'devpts_root' global

Remove the 'devpts_root' global variable and find the root dentry using
the super_block. The super-block itself is found from the device inode,
using a new wrapper, pts_sb_from_inode().

---
 fs/devpts/inode.c |   36 
 1 file changed, 24 insertions(+), 12 deletions(-)

Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c
===
--- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 02:08:15.0 
-0700
+++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c  2008-08-04 02:08:43.0 
-0700
@@ -33,7 +33,14 @@ static DEFINE_IDR(allocated_ptys);
 static DEFINE_MUTEX(allocated_ptys_lock);
 
 static struct vfsmount *devpts_mnt;
-static struct dentry *devpts_root;
+
+static inline struct super_block *pts_sb_from_inode(struct inode *inode)
+{
+   if (inode->i_sb->s_magic == DEVPTS_SUPER_MAGIC)
+   return inode->i_sb;
+
+   return devpts_mnt->mnt_sb;
+}
 
 static struct {
int setuid;
@@ -141,7 +148,7 @@ devpts_fill_super(struct super_block *s,
inode-i_fop = simple_dir_operations;
inode-i_nlink = 2;
 
-   devpts_root = s->s_root = d_alloc_root(inode);
+   s->s_root = d_alloc_root(inode);
if (s->s_root)
return 0;

@@ -169,10 +176,9 @@ static struct file_system_type devpts_fs
  * to the System V naming convention
  */
 
-static struct dentry *get_node(int num)
+static struct dentry *get_node(struct dentry *root, int num)
 {
char s[12];
-   struct dentry *root = devpts_root;
mutex_lock(&root->d_inode->i_mutex);
return lookup_one_len(s, root, sprintf(s, "%d", num));
 }
@@ -218,7 +224,9 @@ int devpts_pty_new(struct inode *ptmx_in
struct tty_driver *driver = tty-driver;
dev_t device = MKDEV(driver->major, driver->minor_start+number);
struct dentry *dentry;
-   struct inode *inode = new_inode(devpts_mnt->mnt_sb);
+   struct super_block *sb = pts_sb_from_inode(ptmx_inode);
+   struct inode *inode = new_inode(sb);
+   struct dentry *root = sb->s_root;
 
/* We're supposed to be given the slave end of a pty */
BUG_ON(driver->type != TTY_DRIVER_TYPE_PTY);
@@ -234,20 +242,22 @@ int devpts_pty_new(struct inode *ptmx_in
init_special_inode(inode, S_IFCHR|config.mode, device);
inode->i_private = tty;
 
-   dentry = get_node(number);
+   dentry = get_node(root, number);
if (!IS_ERR(dentry) && !dentry->d_inode) {
d_instantiate(dentry, inode);
-   fsnotify_create(devpts_root->d_inode, dentry);
+   fsnotify_create(root->d_inode, dentry);
}
 
-   mutex_unlock(&devpts_root->d_inode->i_mutex);
+   mutex_unlock(&root->d_inode->i_mutex);
 
return 0;
 }
 
 struct tty_struct *devpts_get_tty(struct inode *inode, int number)
 {
-   struct dentry *dentry = get_node(number);
+   struct super_block *sb = pts_sb_from_inode(inode);
+   struct dentry *root = sb->s_root;
+   struct dentry *dentry = get_node(root, number);
struct tty_struct *tty;
 
tty = NULL;
@@ -257,14 +267,16 @@ struct tty_struct *devpts_get_tty(struct
dput(dentry);
}
 
-   mutex_unlock(&devpts_root->d_inode->i_mutex);
+   mutex_unlock(&root->d_inode->i_mutex);
 
return tty;
 }
 
 void devpts_pty_kill(struct inode *inode, int number)
 {
-   struct dentry *dentry = get_node(number);
+   struct super_block *sb = pts_sb_from_inode(inode);
+   struct dentry *root = sb->s_root;
+   struct dentry *dentry = get_node(root, number);
 
if (!IS_ERR(dentry)) {
struct inode *inode = dentry->d_inode;
@@ -275,7 +287,7 @@ void devpts_pty_kill(struct inode *inode
}
dput(dentry);
}
-   mutex_unlock(&devpts_root->d_inode->i_mutex);
+   mutex_unlock(&root->d_inode->i_mutex);
 }
 
 static int __init init_devpts_fs(void)
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC][PATCH 4/6]: Allow mknod of ptmx in devpts

2008-08-04 Thread sukadev

From: Sukadev Bhattiprolu [EMAIL PROTECTED]
Subject: [RFC][PATCH 4/6]: Allow mknod of ptmx in devpts

/dev/ptmx is closely tied to the devpts filesystem. An open of /dev/ptmx,
allocates the next pty index and the associated device shows up in the
devpts fs as /dev/pts/n.

With multiple mounts of the devpts filesystem, an open of /dev/ptmx would be
unable to determine which instance of the devpts is being accessed.

One solution for this would be to make /dev/ptmx a symlink to
/dev/pts/ptmx and create the device node, ptmx, in each instance of
devpts.  When /dev/ptmx is opened, we can use the inode of /dev/pts/ptmx
to identify the specific devpts instance.

(This solution has an impact on the 'startup scripts', and that is 
being discussed separately).

This patch merely enables creating the [c, 5:2] (ptmx) device in devpts
filesystem.

TODO:
- Ability to unlink the /dev/pts/ptmx
- Remove traces of '/dev/pts/tty' node

Changelog:
- Earlier version of this patch enabled creating /dev/pts/tty
  as well. As pointed out by Al Viro and H. Peter Anvin, that
  is not really necessary.

---
 fs/devpts/inode.c |   56 +++---
 1 file changed, 53 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c
===
--- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 02:08:50.0 
-0700
+++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c  2008-08-04 17:26:26.0 
-0700
@@ -141,6 +141,56 @@ static void *new_pts_fs_info(void)
 }
 
 
+
+static int devpts_mknod(struct inode *dir, struct dentry *dentry,
+   int mode, dev_t rdev)
+{
+   int inum;
+   struct inode *inode;
+   struct super_block *sb = dir-i_sb;
+
+   if (dentry->d_inode)
+   return -EEXIST;
+
+   if (!S_ISCHR(mode))
+   return -EPERM;
+
+   if (rdev == MKDEV(TTYAUX_MAJOR, 2))
+   inum = 2;
+#if 0
+   else if (rdev == MKDEV(TTYAUX_MAJOR, 0))
+   inum = 3;
+#endif
+   else
+   return -EPERM;
+
+   inode = new_inode(sb);
+   if (!inode)
+   return -ENOMEM;
+
+   inode->i_ino = inum;
+   inode->i_uid = inode->i_gid = 0;
+   inode->i_blocks = 0;
+   inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
+
+   init_special_inode(inode, mode, rdev);
+
+   d_instantiate(dentry, inode);
+   /*
+* Get a reference to the dentry so the device-nodes persist
+* even when there are no active references to them. We use
+* kill_litter_super() to remove this entry when unmounting
+* devpts.
+*/
+   dget(dentry);
+   return 0;
+}
+
+const struct inode_operations devpts_dir_inode_operations = {
+   .lookup = simple_lookup,
+   .mknod  = devpts_mknod,
+};
+
 static int
 devpts_fill_super(struct super_block *s, void *data, int silent)
 {
@@ -164,7 +214,7 @@ devpts_fill_super(struct super_block *s,
inode-i_blocks = 0;
inode-i_uid = inode-i_gid = 0;
inode-i_mode = S_IFDIR | S_IRUGO | S_IXUGO | S_IWUSR;
-   inode->i_op = &simple_dir_inode_operations;
+   inode->i_op = &devpts_dir_inode_operations;
inode-i_fop = simple_dir_operations;
inode-i_nlink = 2;
 
@@ -195,7 +245,7 @@ static void devpts_kill_sb(struct super_
//idr_destroy(&fsi->allocated_ptys);
kfree(fsi);
 
-   kill_anon_super(sb);
+   kill_litter_super(sb);
 }
 
 static struct file_system_type devpts_fs_type = {
@@ -274,7 +324,7 @@ int devpts_pty_new(struct inode *ptmx_in
if (!inode)
return -ENOMEM;
 
-   inode->i_ino = number+2;
+   inode->i_ino = number+4;
inode->i_uid = config.setuid ? config.uid : current->fsuid;
inode->i_gid = config.setgid ? config.gid : current->fsgid;
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC][PATCH 3/6] Move 'allocated_ptys' to sb->s_fs_info

2008-08-04 Thread sukadev

From: Sukadev Bhattiprolu [EMAIL PROTECTED]
Subject: [RFC][PATCH 3/6] Move 'allocated_ptys' to sb->s_fs_info

To enable multiple mounts of devpts, 'allocated_ptys' must be a per-mount
variable rather than a global variable.  This patch moves 'allocated_ptys'
into the super_block's s_fs_info.

---
 fs/devpts/inode.c |   53 ++---
 1 file changed, 46 insertions(+), 7 deletions(-)

Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c
===
--- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 02:08:43.0 
-0700
+++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c  2008-08-04 02:08:50.0 
-0700
@@ -28,8 +28,11 @@
 
 #define DEVPTS_DEFAULT_MODE 0600
 
+struct pts_fs_info {
+   struct idr allocated_ptys;
+};
+
 extern int pty_limit;  /* Config limit on Unix98 ptys */
-static DEFINE_IDR(allocated_ptys);
 static DEFINE_MUTEX(allocated_ptys_lock);
 
 static struct vfsmount *devpts_mnt;
@@ -125,6 +128,19 @@ static const struct super_operations dev
.show_options   = devpts_show_options,
 };
 
+static void *new_pts_fs_info(void)
+{
+   struct pts_fs_info *fsi;
+
+   fsi = kmalloc(sizeof(struct pts_fs_info), GFP_KERNEL);
+   if (fsi) {
+   idr_init(&fsi->allocated_ptys);
+   }
+   printk(KERN_ERR "new_pts_fs_info(): Returning fsi %p\n", fsi);
+   return fsi;
+}
+
+
 static int
 devpts_fill_super(struct super_block *s, void *data, int silent)
 {
@@ -135,10 +151,14 @@ devpts_fill_super(struct super_block *s,
s->s_magic = DEVPTS_SUPER_MAGIC;
s->s_op = &devpts_sops;
s->s_time_gran = 1;
+   s->s_fs_info = new_pts_fs_info();
+
+   if (!s->s_fs_info)
+   goto fail;
 
inode = new_inode(s);
if (!inode)
-   goto fail;
+   goto free_fsi;
inode->i_ino = 1;
inode->i_mtime = inode->i_atime = inode->i_ctime = CURRENT_TIME;
inode->i_blocks = 0;
@@ -154,6 +174,9 @@ devpts_fill_super(struct super_block *s,

printk("devpts: get root dentry failed\n");
iput(inode);
+
+free_fsi:
+   kfree(s->s_fs_info);
 fail:
return -ENOMEM;
 }
@@ -164,11 +187,22 @@ static int devpts_get_sb(struct file_sys
return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
 }
 
+
+static void devpts_kill_sb(struct super_block *sb)
+{
+   struct pts_fs_info *fsi = sb->s_fs_info;  // rcu ?
+
+   //idr_destroy(&fsi->allocated_ptys);
+   kfree(fsi);
+
+   kill_anon_super(sb);
+}
+
 static struct file_system_type devpts_fs_type = {
.owner  = THIS_MODULE,
.name   = devpts,
.get_sb = devpts_get_sb,
-   .kill_sb= kill_anon_super,
+   .kill_sb= devpts_kill_sb,
 };
 
 /*
@@ -187,14 +221,16 @@ int devpts_new_index(struct inode *inode
 {
int index;
int idr_ret;
+   struct super_block *sb = pts_sb_from_inode(inode);
+   struct pts_fs_info *fsi = sb->s_fs_info;  // need rcu ?
 
 retry:
-   if (!idr_pre_get(&allocated_ptys, GFP_KERNEL)) {
+   if (!idr_pre_get(&fsi->allocated_ptys, GFP_KERNEL)) {
return -ENOMEM;
}
 
mutex_lock(&allocated_ptys_lock);
-   idr_ret = idr_get_new(&allocated_ptys, NULL, &index);
+   idr_ret = idr_get_new(&fsi->allocated_ptys, NULL, &index);
if (idr_ret < 0) {
mutex_unlock(&allocated_ptys_lock);
if (idr_ret == -EAGAIN)
@@ -203,7 +239,7 @@ retry:
}
 
if (index >= pty_limit) {
-   idr_remove(&allocated_ptys, index);
+   idr_remove(&fsi->allocated_ptys, index);
mutex_unlock(&allocated_ptys_lock);
return -EIO;
}
@@ -213,8 +249,11 @@ retry:
 
 void devpts_kill_index(struct inode *inode, int idx)
 {
+   struct super_block *sb = pts_sb_from_inode(inode);
+   struct pts_fs_info *fsi = sb->s_fs_info;  // need rcu ?
+
mutex_lock(&allocated_ptys_lock);
-   idr_remove(&allocated_ptys, idx);
+   idr_remove(&fsi->allocated_ptys, idx);
mutex_unlock(&allocated_ptys_lock);
 }
 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC][PATCH 6/6]: /dev/tty tweak in init_dev()

2008-08-04 Thread sukadev

From: Sukadev Bhattiprolu [EMAIL PROTECTED]
Subject: [RFC][PATCH 6/6]: /dev/tty tweak in init_dev()

When opening /dev/tty, __tty_open() finds the tty using get_current_tty().
When __tty_open() calls init_dev(), init_dev() tries to 'find' the tty
again from devpts.  Is that really necessary ?

The problem with asking devpts again is that with multiple mounts, devpts
cannot find the tty without knowing the specific mount instance. We can't
find the mount instance of devpts, since the inode of /dev/tty is in a
different filesystem.

---
 drivers/char/tty_io.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc8-mm1/drivers/char/tty_io.c
===
--- linux-2.6.26-rc8-mm1.orig/drivers/char/tty_io.c 2008-08-04 
17:25:20.0 -0700
+++ linux-2.6.26-rc8-mm1/drivers/char/tty_io.c  2008-08-04 17:26:34.0 
-0700
@@ -2066,7 +2066,10 @@ static int init_dev(struct tty_driver *d
 
/* check whether we're reopening an existing tty */
if (driver->flags & TTY_DRIVER_DEVPTS_MEM) {
-   tty = devpts_get_tty(inode, idx);
+   if (inode->i_rdev == MKDEV(TTYAUX_MAJOR, 0))
+   tty = *ret_tty;
+   else
+   tty = devpts_get_tty(inode, idx);
/*
 * If we don't have a tty here on a slave open, it's because
 * the master already started the close process and there's
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] [RFC][PATCH 5/6] Allow multiple mounts of devpts

2008-08-04 Thread sukadev

From: Sukadev Bhattiprolu [EMAIL PROTECTED]
Subject: [RFC][PATCH 5/6] Allow multiple mounts of devpts

Can we simply enable multiple mounts using get_sb_nodev(), now that we
don't have any pts_namespace/'data' to be saved ?

(quick/dirty - does not prevent multiple mounts of devpts within
a single 'container')

---
 fs/devpts/inode.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c
===
--- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 17:26:26.0 
-0700
+++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c  2008-08-04 17:26:31.0 
-0700
@@ -234,7 +234,7 @@ fail:
 static int devpts_get_sb(struct file_system_type *fs_type,
int flags, const char *dev_name, void *data, struct vfsmount *mnt)
 {
-   return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt);
+   return get_sb_nodev(fs_type, flags, data, devpts_fill_super, mnt);
 }
 
 
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts

2008-08-04 Thread H. Peter Anvin
[EMAIL PROTECTED] wrote:
 
 If devpts is mounted more than once, then '/dev/ptmx' must be a symlink
 to '/dev/pts/ptmx' and in each new devpts mount we must create the
 device node '/dev/pts/ptmx' [c, 5:2] by hand.
 

This should be auto-created.  That also eliminates any need to support 
the mknod system call.

 Appreciate comments on overall approach of my mapping from the inode
 to sb-s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(),
 and also on the tweak to init_dev() (patch 6).
 
 Todo:
   User-space impact of /dev/ptmx symlink - Options are being
   discussed on mailing list (new mount option and config token,
   new fs name, etc)
 
   Remove even initial kernel mount of devpts ?

The initial kernel mount of devpts should be removed, since that 
instance will never be accessible.

-hpa
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts

2008-08-04 Thread sukadev
H. Peter Anvin [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
 If devpts is mounted more than once, then '/dev/ptmx' must be a symlink
 to '/dev/pts/ptmx' and in each new devpts mount we must create the
 device node '/dev/pts/ptmx' [c, 5:2] by hand.

 This should be auto-created.  That also eliminates any need to support the 
 mknod system call.

Ok. But was wondering if we can pass the ptmx symlink burden to the
'container-startup scripts' since they are the ones that need the second
or subsequent mount of devpts.

So, initially and for systems that don't need multiple mounts of devpts,
existing behavior can continue (/dev/ptmx is a node).

Container startup scripts have to anyway remount /dev/pts and mknod
/dev/pts/ptmx. These scripts could additionally check if /dev/ptmx is
a node and make it a symlink. The container script would have to do
this check while it still has access to the first mount of devpts
and mknod in the first devpts mnt.

But then again, the first mount is still special in the kernel.


 Appreciate comments on overall approach of my mapping from the inode
 to sb-s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(),
 and also on the tweak to init_dev() (patch 6).
 Todo:
  User-space impact of /dev/ptmx symlink - Options are being
  discussed on mailing list (new mount option and config token,
  new fs name, etc)
  Remove even initial kernel mount of devpts ?

 The initial kernel mount of devpts should be removed, since that instance 
 will never be accessible.

   -hpa
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts

2008-08-04 Thread H. Peter Anvin
[EMAIL PROTECTED] wrote:
 
 Appreciate comments on overall approach of my mapping from the inode
 to sb-s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(),
 and also on the tweak to init_dev() (patch 6).
 

First of all, thanks for taking this on :)  It's always delightful to 
spout some ideas and have patches appear as a result :)

Once you have the notion of the device nodes tied to a specific devpts 
filesystem, a lot of the operations can be trivialized; for example, the 
whole devpts_get_tty() mechanism can be reduced to:

if (inode->i_sb->s_magic != DEVPTS_SUPER_MAGIC) {
/* do cleanup */
return -ENXIO;
}
tty = inode->i_private;

This is part of what makes this whole approach so desirable: it actually 
allows for some dramatic simplifications of the existing code.

One can even bind special operations to both the ptmx node and slave 
nodes, to bypass most of the character device and tty dispatch.  That 
might require too much hacking at the tty core to be worth it, though.

-hpa
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts

2008-08-04 Thread H. Peter Anvin
[EMAIL PROTECTED] wrote:
 
 Ok. But was wondering if we can pass the ptmx symlink burden to the
 'container-startup sripts' since they are the ones that need the second
 or subsequent mount of devpts.
 
 So, initially and for systems that don't need multiple mounts of devpts,
 existing behavior can continue (/dev/ptmx is a node).
 
 Container startup scripts have to anyway remount /dev/pts and mknod
 /dev/pts/ptmx. These scripts could additionally check if /dev/ptmx is
 a node and make it a symlink. The container script would have to do
 this check while it still has access to the first mount of devpts
 and mknod in the first devpts mnt.
 
 But then again, the first mount is still special in the kernel.
 

You're right, I think we can do this and still retain most of the 
advantages, at least for a transition period.

The idea would be that you'd have a mount option, that if you do not 
specify it, you get a bind to the in-kernel mount; otherwise you get a 
new instance.  ptmx, if not invoked from inside a devpts filesystem, 
would default to the kernel-mounted instance.

Unfortunately I believe that means parsing the mount options in 
devpts_get_sb() to know if we do have the "multi" option, but that isn't 
really all that difficult; it just means breaking the parser out as a 
separate subroutine.
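
A rough sketch of what that could look like (the "multi" option name
and the helper are only illustrative, not a real patch):

	/* return 1 if the hypothetical "multi" mount option was given */
	static int devpts_want_multi(void *data)
	{
		/* a real version would use match_token(), not strstr() */
		return data && strstr((char *)data, "multi");
	}

	static int devpts_get_sb(struct file_system_type *fs_type,
		int flags, const char *dev_name, void *data, struct vfsmount *mnt)
	{
		if (devpts_want_multi(data))
			return get_sb_nodev(fs_type, flags, data,
					devpts_fill_super, mnt);
		return get_sb_single(fs_type, flags, data,
				devpts_fill_super, mnt);
	}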

-hpa

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps

2008-08-04 Thread Oren Laadan


Louis Rilling wrote:
 On Fri, Aug 01, 2008 at 02:51:57PM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Fri, Aug 01, 2008 at 10:15:26AM -0400, Oren Laadan wrote:
 Louis Rilling wrote:
 On Thu, Jul 31, 2008 at 03:12:32PM -0400, Oren Laadan wrote:
 
 Cut the less interesting (IMHO at least) history to make Dave happier ;)
 
 Returning 0 in case of a restart is what I called a special handling. You 
 won't
 do this for the other tasks, so this is special. Since userspace must cope 
 with
 it anyway, userspace can be clever enough to avoid using the fd on restart, 
 or
 stupid enough to destroy its checkpoint after restart.
 It's a different special hanlding :)   In the case of a single task that 
 wants
 to checkpoint itself - there are no other tasks.  In the case of a container 
 -
 there will be only a single task that calls sys_checkpoint(), so only that 
 task
 will either get the CRID or the 0 (or an error). The other tasks will resume
 whatever it was that they were doing (lol, assuming of course restart works).

 So this special handling ends up being a two-liner: setting the return
 value of the syscall for the task that called sys_checkpoint() (well, 
 actually
 it will call sys_restart() to restart, and return from sys_checkpoint() with
 a value of 0 ...).
 
 I knew it, since I actually saw it in the patches you sent last week.
 
 If you use an FD, you will have to checkpoint that resource as part of the
 checkpoint, and restore it as part of the restart. In doing so you'll need
 to specially handle it, because it has a special meaning. I agree, of course,
 that it is feasible.

 
 - Userspace makes less errors when managing incremental checkpoints.
 have you implemented this ?  did you experience issues in real life ?  user
 space will need a way to manage all of it anyway in many aspects. This will
 be the last/least of the issues ...
 No it was not implemented, and I'm not going to enter a discussion about the
 weight of arguments whether they are backed by implementations or not. It 
 just
 becomes easier to create a mess with things depending on each other created 
 as
 separate, freely (userspace-decided)-named objects.
 If I were to write a user-space tool to handle this, I would keep each chain
 of checkpoints (from base and on) in a separate subdir, for example. In 
 fact,
 that's how I did it :)
 
 This is intuitive indeed. Checkpoints are already organized in a similar way 
 in
 Kerrighed, except that a notion of application (transparent to applications)
 replaces the notion of container, and the kernel decides where to put the
 checkpoints and how they are named (I'm not saying that this is the best
 way though).
 
 Besides, this scheme begins to sound much more complex than a single file.
 Do you really gain so much from not having multiple files, one per 
 checkpoint ?
 Well, at least you are not limited by the number of open file descriptors
 (assuming that, as you mentioned earlier, you pass an array of previous 
 images
 to compute the next incremental checkpoint).
 You aren't limited by the number of open files. User space could provide an
 array of <CRID, pathname> (or <serial#, pathname>) entries to the kernel, and
 the kernel will access the files as necessary.
 
 But the kernel itself would have to cope with this limit (even if it is
 not enforced, just to avoid consuming too much resources), or close and
 reopen files when needed...

You got it - close and reopen as needed, with an LRU policy to decide which
open file to close. My experience so far is that you rarely need more than
100 open files.

 
 Uhh .. hold on:  you need the array of previous checkpoints to _restart_ from
 an incremental checkpoint. You don't care about it when you checkpoint: 
 instead,
 you keep track in memory of (1) what changed (e.g. which pages were 
 touched),
 and (2) where to find unmodified pages in previous checkpoints. You save this
 information with each new checkpoint.  The data structure to describe #2 is
 dynamic and changes with the execution, and easily keeps track of when older
 checkpoint images become irrelevant (because all the pages they hold have 
 been
 overwritten already).
 
 I see. I thought that you also intended to build incremental checkpoints
 from previous checkpoints only, because even if this is not fast, this
 saves storage space. I agree that if you always keep necessary metadata
 in kernel memory, you don't need the previous images. Actually I don't
 know any incremental checkpoint scheme not using such in-memory metadata
 scheme. Which does not imply that other schemes are not relevant
 though...
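
As a toy illustration of that kind of in-memory metadata (all names are
invented here, this is not the actual code):

	/* one entry per page of the mm being checkpointed */
	struct page_origin {
		unsigned long pgoff;	/* page index in the address space */
		unsigned int serial;	/* checkpoint image holding the data */
		loff_t offset;		/* location of the page in that image */
	};

At each incremental checkpoint, dirtied pages are dumped and their
entries are repointed to the new image; clean pages keep pointing at
older images, and an old image becomes irrelevant once no entry
references its serial any more.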
 

 where:
 - base_fd is a regular file containing the base checkpoint, or -1 if a 
 full
   checkpoint should be done. The checkpoint could actually also live in 
 memory,
   and the kernel should check that it matches the image pointed to by 
 base_fd.
 - out_fd is whatever file/socket/etc. on which we should dump the 
 checkpoint. In
   particular, out_fd can equal base_fd and should 

[Devel] RE: Too many I/O controller patches

2008-08-04 Thread Satoshi UCHIDA
Hi, Andrea.

I participated in Containers Mini-summit.
And, I talked with Mr. Andrew Morton in The Linux Foundation Japan
Symposium BoF, Japan, July 10th.

Currently, on the ML, several I/O controller patches have been sent, and
each author keeps posting improved versions of his own patch.
Neither we nor the maintainers like this situation.
We wanted to solve this situation at the Mini-summit, but unfortunately, 
no other developers participated.
(I couldn't give an opinion, because my English skill is low.)
Mr. Naveen presented his approach at the Linux Symposium, and we discussed
I/O control for a short time after this presentation.


Mr. Andrew gave me the advice "You should discuss the design more".
And, at the Containers Mini-summit (and Linux Symposium 2008 in Ottawa),
Paul said that what we need first is to decide on the requirements.
So, we must discuss requirements and design.

My requirement is
 * to be able to distribute performance moderately
 (* and to be able to isolate each group (environment)).

I guess (it may be wrong) that
 Naveen's requirement is
   * to be able to handle latency
 (high-priority I/O always takes precedence in handling;
  it is not just given a larger share, as in CFQ)
   * to be able to distribute performance moderately.
 Andrea's requirement is
   * to be able to set and control absolute (direct) performance.
 Ryo's requirement is
   * to be able to distribute performance moderately
   * to be able to set and control I/Os over a flexible range
 (multi-device setups such as LVM).

I think that most solutions control I/O performance proportionally
(by using weight/priority/percentage/etc. rather than absolute values),
because disk I/O performance is not constant and is affected by the
situation (such as the applications, the file (data) balance, and so on).
So, it is difficult to guarantee performance that is set as an
absolute bandwidth.
If devices had constant performance, it would be good to control by
absolute bandwidth.
Guaranteeing it by assuming the lowest ability would also be possible;
however, no one likes to make the resources wasteful.


And, he gave the advice "Can't a framework be made which organizes each
approach, the way the I/O elevator does?".
I am trying to design such a framework (in the elevator layer or the
block layer). Now, I am looking at the other methods again.


I think that the OOM problems are caused by the memory/cache systems.
So, it would be better for the I/O controller to be designed apart from
these problems first, although the latency of the I/O device would be
related.
If these problems can be resolved, the technique should be applied to
normal I/O control as well as to cgroups.

Buffered write I/O is also related to the cache system.
We must consider this problem as part of I/O control.
I don't have a good way to resolve this problem yet.


 I did some experiments trying to implement minimum bandwidth requirements
 for my io-throttle controller, mapping the requirements to CFQ prio and
 using the Satoshi's controller. But this needs additional work and
 testing right now, so I've not posted anything yet, just informed
 Satoshi about this.

I'm very interested in these results.


Thanks,
 Satoshi Uchida.

 -Original Message-
 From: Andrea Righi [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, August 05, 2008 3:23 AM
 To: Dave Hansen
 Cc: Ryo Tsuruta; [EMAIL PROTECTED]; [EMAIL PROTECTED];
 [EMAIL PROTECTED];
 [EMAIL PROTECTED];
 [EMAIL PROTECTED]; [EMAIL PROTECTED]; Satoshi UCHIDA
 Subject: Re: Too many I/O controller patches
 
 Dave Hansen wrote:
  On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
  This series of patches of dm-ioband now includes The bio tracking
 mechanism,
  which has been posted individually to this mailing list.
  This makes it easy for anybody to control the I/O bandwidth even when
  the I/O is one of delayed-write requests.
 
  During the Containers mini-summit at OLS, it was mentioned that there
  are at least *FOUR* of these I/O controllers floating around.  Have you
  talked to the other authors?  (I've cc'd at least one of them).
 
  We obviously can't come to any kind of real consensus with people just
  tossing the same patches back and forth.
 
  -- Dave
 
 
 Dave,
 
 thanks for this email first of all. I've talked with Satoshi (cc-ed)
 about his solution Yet another I/O bandwidth controlling subsystem for
 CGroups based on CFQ.
 
 I did some experiments trying to implement minimum bandwidth requirements
 for my io-throttle controller, mapping the requirements to CFQ prio and
 using the Satoshi's controller. But this needs additional work and
 testing right now, so I've not posted anything yet, just informed
 Satoshi about this.
 
 Unfortunately I've not talked to Ryo yet. I've continued my work using a
 quite different approach, because the dm-ioband solution didn't work
 with delayed-write requests. Now the bio tracking feature seems really
 prosiming and I would like to do some tests ASAP, and review the patch
 as well.
 
 But I'm not yet convinced that limiting the IO writes at the device
 mapper layer is the best solution. 

[Devel] Re: memrlimit controller merge to mainline

2008-08-04 Thread Balbir Singh
Hugh Dickins wrote:
 On Tue, 5 Aug 2008, Balbir Singh wrote:
 Hugh Dickins wrote:
 [snip]
 BUG: unable to handle kernel paging request at 6b6b6b8b
 IP: [<7817078f>] memrlimit_cgroup_uncharge_as+0x18/0x29
 Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
  [<78161323>] ? exit_mmap+0xaf/0x133
  [<781226b1>] ? mmput+0x4c/0xba
  [<78165ce3>] ? try_to_unuse+0x20b/0x3f5
  [<78371534>] ? _spin_unlock+0x22/0x3c
  [<7816636a>] ? sys_swapoff+0x17b/0x37c
  [<78102d95>] ? sysenter_past_esp+0x6a/0xa5
 I am unable to reproduce the problem,
 
 Me neither, I've spent many hours trying 2.6.27-rc1-mm1 and then
 back to 2.6.26-rc8-mm1.  But I've been SO stupid: saw it originally
 on one machine with SLAB_DEBUG=y, have been trying since mostly on
 another with SLUB_DEBUG=y, but never thought to boot with
 slub_debug=P,task_struct until now.
 

Unfortunately, I've not tried on 32 bit and not at all with SLAB_DEBUG=y. I'll
give the latter a trial run and see what I get.

 but I do have an initial hypothesis

 CPU0					CPU1
					try_to_unuse
 task 1 starts exiting			look at mm = task1->mm
 ...					increment mm_users
 task 1 exits
 mm->owner needs to be updated, but
 no new owner is found
 (mm_users > 1, but no other task
 has task->mm == task1->mm)
 mm_update_next_owner() leaves

 grace period
					user count drops, call mmput(mm)
 task 1 freed
					dereferencing mm->owner fails
 
 Yes, that looks right to me: seems obvious now.  I don't think your
 careful alternation of CPU0/1 events at the end matters: the swapoff
 CPU simply dereferences mm->owner after that task has gone.
 
 (That's a shame, I'd always hoped that mm->owner->comm was going to
 be good for use in mm messages, even when tearing down the mm.)
 

The problem we have is that tasks are independent of mm_struct's (in some ways)
and are associated almost like a database associates two entities through keys.

 I do have a potential solution in mind, but I want to make sure my
 hypothesis is correct.
 
 It seems wrong that memrlimit_cgroup_uncharge_as should be called
 after mm->owner may have been changed, even if it's to something safe.
 But I forget the mm/task exit details, surely they're tricky.
 

The fix would be to uncharge when a new owner can no longer be found (I am yet
to code/test it though).
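
Something along these lines, perhaps (a sketch only, with a hypothetical
uncharge helper, not the actual patch):

	/* in mm_update_next_owner(), when no successor task exists */
	if (!new_owner) {
		/* uncharge while mm->owner still points at a live task */
		memrlimit_uncharge_mm(mm);		/* hypothetical */
		rcu_assign_pointer(mm->owner, NULL);
		return;
	}

Callers like swapoff would then have to treat a NULL mm->owner as
"already torn down" instead of dereferencing a stale pointer.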

 By the way, is the ordering in mm_update_next_owner the best?
 Would there be less movement if it searched amongst siblings before
 it searched amongst children?  Ought it to make a first pass trying
 to stay within the same cgroup?

Yes, we need to make a first pass at keeping it in the same cgroup. You might be
right about the sibling optimization.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: Too many I/O controller patches

2008-08-04 Thread Paul Menage
On Mon, Aug 4, 2008 at 1:44 PM, Andrea Righi [EMAIL PROTECTED] wrote:

 A safer approach IMHO is to force the tasks to wait synchronously on
 each operation that directly or indirectly generates i/o.

 In particular the solution used by the io-throttle controller to limit
 the dirty-ratio in memory is to impose a sleep via
 schedule_timeout_killable() in balance_dirty_pages() when a generic
 process exceeds the limits defined for the belonging cgroup.

 Limiting read operations is a lot more easy, because they're always
 synchronized with i/o requests.

I think that you're conflating two issues:

- controlling how much dirty memory a cgroup can have at any given
time (since dirty memory is much harder/slower to reclaim than clean
memory)

- controlling how much effect a cgroup can have on a given I/O device.

By controlling the rate at which a task can generate dirty pages,
you're not really limiting either of these. I think you'd have to set
your I/O limits artificially low to prevent a case of a process
writing a large data file and then doing fsync() on it, which would
then hit the disk with the entire file at once, and blow away any QoS
guarantees for other groups.

As Dave suggested, I think it would make more sense to have your
page-dirtying throttle points hook into the memory controller instead,
and allow the memory controller to track/limit dirty pages for a
cgroup, and potentially do throttling as part of that.

Paul
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel