[Devel] Re: [PATCH -mm 2/3] i/o controller infrastructure
Li Zefan wrote:
> Andrea Righi wrote:
>> This is the core io-throttle kernel infrastructure. It creates the basic
>> interfaces to cgroups and implements the I/O measurement and throttling
>> functions.
>>
>> Signed-off-by: Andrea Righi [EMAIL PROTECTED]
>> ---
>>  block/Makefile                |    2 ++
>>  include/linux/cgroup_subsys.h |    6 ++
>>  init/Kconfig                  |   10 ++
>>  3 files changed, 18 insertions(+), 0 deletions(-)
>
> where is block/blk-io-throttle.c and include/linux/blk-io-throttle.h?

mmmh.. they should have been here, I mean, in patch 2/3, but it seems they ended up in patch 1/3 (subject: i/o controller documentation):

  Documentation of the block device I/O controller: description, usage,
  advantages and design.

  Signed-off-by: Andrea Righi [EMAIL PROTECTED]
  ---
   Documentation/controllers/io-throttle.txt |  312 +
   block/blk-io-throttle.c                   |  719 +
   include/linux/blk-io-throttle.h           |   41 ++
   3 files changed, 1072 insertions(+), 0 deletions(-)
   create mode 100644 Documentation/controllers/io-throttle.txt
   create mode 100644 block/blk-io-throttle.c
   create mode 100644 include/linux/blk-io-throttle.h

...

I'm pretty sure I did a:

  git-commit -s Documentation/controllers/io-throttle.txt

in my local branch (history confirms this), but, anyway, if you think it's worth it I can fix it and post the patchset again, just let me know.

Thanks for looking at it!
-Andrea

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 3/7] bio-cgroup: Introduction
With this series of bio-cgroup patches, you can determine the owner of any type of I/O, and it makes dm-ioband -- an I/O bandwidth controller -- able to control block I/O bandwidth even when it accepts delayed write requests. Dm-ioband can find the owner cgroup of each request. Other people working on I/O bandwidth throttling could also use this functionality to control asynchronous I/O with a little enhancement.

You have to apply the dm-ioband v1.4.0 patch before applying this series of patches. And you have to select the following config options when compiling the kernel:

  CONFIG_CGROUPS=y
  CONFIG_CGROUP_BIO=y

I also recommend selecting the options for the cgroup memory subsystem, because that makes it possible to give some I/O bandwidth and some memory to a certain cgroup to control delayed write requests; the processes in the cgroup will be able to make pages dirty only inside the cgroup even when the given bandwidth is narrow.

  CONFIG_RESOURCE_COUNTERS=y
  CONFIG_CGROUP_MEM_RES_CTLR=y

This code is based on part of the memory subsystem of cgroup, and I don't think the accuracy and overhead of the subsystem can be ignored at this time, so we need to keep tuning it up.

The following shows how to use dm-ioband with cgroups. Assume that you want to make two cgroups, which we call bio cgroups here, to track down block I/Os and assign them to the ioband device ioband1.

First, mount the bio cgroup filesystem:

  # mount -t cgroup -o bio none /cgroup/bio

Then, make new bio cgroups and put some processes in them:

  # mkdir /cgroup/bio/bgroup1
  # mkdir /cgroup/bio/bgroup2
  # echo 1234 > /cgroup/bio/bgroup1/tasks
  # echo 5678 > /cgroup/bio/bgroup2/tasks

Now, check the ID of each bio cgroup which was just created:

  # cat /cgroup/bio/bgroup1/bio.id
  1
  # cat /cgroup/bio/bgroup2/bio.id
  2

Finally, attach the cgroups to ioband1 and assign them weights.
  # dmsetup message ioband1 0 type cgroup
  # dmsetup message ioband1 0 attach 1
  # dmsetup message ioband1 0 attach 2
  # dmsetup message ioband1 0 weight 1:30
  # dmsetup message ioband1 0 weight 2:60

You can also make use of the dm-ioband administration tool if you want. The tool can be found here:

  http://people.valinux.co.jp/~kaizuka/dm-ioband/iobandctl/manual.html

You can set up the device with the tool as follows. In this case, you don't need to know the IDs of the cgroups:

  # iobandctl.py group /dev/mapper/ioband1 cgroup /cgroup/bio/bgroup1:30 /cgroup/bio/bgroup2:60
[Devel] [PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
This patch splits the cgroup memory subsystem into two parts. One is for tracking pages to find out the owners. The other is for controlling how much memory should be assigned to each cgroup. With this patch, you can use the page tracking mechanism even if the memory subsystem is off.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -Ndupr linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h
--- linux-2.6.27-rc1-mm1-ioband/include/linux/memcontrol.h      2008-08-01 12:18:28.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg0/include/linux/memcontrol.h 2008-08-01 19:03:21.000000000 +0900
@@ -20,12 +20,62 @@
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
 
+#include <linux/rcupdate.h>
+#include <linux/mm.h>
+#include <linux/smp.h>
+#include <linux/bit_spinlock.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+#ifdef CONFIG_CGROUP_PAGE
+/*
+ * We use the lower bit of the page->page_cgroup pointer as a bit spin
+ * lock. We need to ensure that page->page_cgroup is at least two
+ * byte aligned (based on comments from Nick Piggin). But since
+ * bit_spin_lock doesn't actually set that lock bit in a non-debug
+ * uniprocessor kernel, we should avoid setting it here too.
+ */
+#define PAGE_CGROUP_LOCK_BIT    0x0
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
+#define PAGE_CGROUP_LOCK        (1 << PAGE_CGROUP_LOCK_BIT)
+#else
+#define PAGE_CGROUP_LOCK        0x0
+#endif
+
+/*
+ * A page_cgroup page is associated with every page descriptor. The
+ * page_cgroup helps us identify information about the cgroup
+ */
+struct page_cgroup {
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+       struct list_head lru;           /* per cgroup LRU list */
+       struct mem_cgroup *mem_cgroup;
+#endif /* CONFIG_CGROUP_MEM_RES_CTLR */
+       struct page *page;
+       int flags;
+};
+#define PAGE_CGROUP_FLAG_CACHE         (0x1)   /* charged as cache */
+#define PAGE_CGROUP_FLAG_ACTIVE        (0x2)   /* page is active in this cgroup */
+#define PAGE_CGROUP_FLAG_FILE          (0x4)   /* page is file system backed */
+#define PAGE_CGROUP_FLAG_UNEVICTABLE   (0x8)   /* page is unevictableable */
+
+static inline void lock_page_cgroup(struct page *page)
+{
+       bit_spin_lock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static inline int try_lock_page_cgroup(struct page *page)
+{
+       return bit_spin_trylock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
+static inline void unlock_page_cgroup(struct page *page)
+{
+       bit_spin_unlock(PAGE_CGROUP_LOCK_BIT, &page->page_cgroup);
+}
+
 #define page_reset_bad_cgroup(page)    ((page)->page_cgroup = 0)
@@ -34,45 +84,15 @@
 extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
                                gfp_t gfp_mask);
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
                                gfp_t gfp_mask);
-extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
 extern void mem_cgroup_uncharge_page(struct page *page);
 extern void mem_cgroup_uncharge_cache_page(struct page *page);
-extern int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask);
-
-extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
-                                       struct list_head *dst,
-                                       unsigned long *scanned, int order,
-                                       int mode, struct zone *z,
-                                       struct mem_cgroup *mem_cont,
-                                       int active, int file);
-extern void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask);
-int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem);
-
-extern struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
-#define mm_match_cgroup(mm, cgroup)    \
-       ((cgroup) == mem_cgroup_from_task((mm)->owner))
 
 extern int mem_cgroup_prepare_migration(struct page *page, struct page *newpage);
 extern void mem_cgroup_end_migration(struct page *page);
+extern void page_cgroup_init(void);
 
-/*
- * For memory reclaim.
- */
-extern int mem_cgroup_calc_mapped_ratio(struct mem_cgroup *mem);
-extern long mem_cgroup_reclaim_imbalance(struct mem_cgroup *mem);
-
-extern int mem_cgroup_get_reclaim_priority(struct mem_cgroup *mem);
-extern void mem_cgroup_note_reclaim_priority(struct mem_cgroup *mem,
-                                                       int priority);
-extern void mem_cgroup_record_reclaim_priority(struct mem_cgroup *mem,
-                                                       int priority);
-
-extern long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, struct zone *zone,
-                                       int priority, enum lru_list lru);
-
-#else /* CONFIG_CGROUP_MEM_RES_CTLR */
+#else /* CONFIG_CGROUP_PAGE */
 static inline void page_reset_bad_cgroup(struct page *page)
 {
 }
@@
[Devel] [PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples
Here is the documentation of design overview, installation, command reference and examples.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -uprN linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt
--- linux-2.6.27-rc1-mm1.orig/Documentation/device-mapper/ioband.txt    1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1/Documentation/device-mapper/ioband.txt 2008-08-01 16:44:02.000000000 +0900
@@ -0,0 +1,937 @@
+                  Block I/O bandwidth control: dm-ioband
+
+           -------------------------------------------------
+
+   Table of Contents
+
+   [1]What's dm-ioband all about?
+   [2]Differences from the CFQ I/O scheduler
+   [3]How dm-ioband works.
+   [4]Setup and Installation
+   [5]Getting started
+   [6]Command Reference
+   [7]Examples
+
+What's dm-ioband all about?
+
+   dm-ioband is an I/O bandwidth controller implemented as a device-mapper
+   driver. Several jobs using the same physical device have to share the
+   bandwidth of the device. dm-ioband gives bandwidth to each job according
+   to its weight, which each job can set its own value to.
+
+   A job is a group of processes with the same pid or pgrp or uid, or a
+   virtual machine such as KVM or Xen. A job can also be a cgroup by
+   applying the bio-cgroup patch, which can be found at
+   http://people.valinux.co.jp/~ryov/bio-cgroup/.
+
+     +------+ +------+ +------+   +------+ +------+ +------+
+     |cgroup| |cgroup| | the  |   | pid  | | pid  | | the  |  jobs
+     |  A   | |  B   | |others|   |  X   | |  Y   | |others|
+     +--|---+ +--|---+ +--|---+   +--|---+ +--|---+ +--|---+
+     +--V--------V--------V---+   +--V--------V--------V---+
+     | group | group | default|   | group | group | default|  ioband groups
+     |       |       | group  |   |       |       | group  |
+     +-------+-------+--------+   +-------+-------+--------+
+     |        ioband1         |   |         ioband2        |  ioband devices
+     +-----------|------------+   +------------|-----------+
+     +-----------V------------+---+------------V-----------+
+     |                        |                            |
+     |          sdb1          |            sdb2            |  physical devices
+     +------------------------+----------------------------+
+
+   --------------------------------------------------------------------
+
+Differences from the CFQ I/O scheduler
+
+   dm-ioband is flexible in how its bandwidth settings can be configured.
+
+   dm-ioband can work with any type of I/O scheduler, such as the NOOP
+   scheduler, which is often chosen for high-end storage, since it is
+   implemented outside the I/O scheduling layer. It allows both
+   partition-based bandwidth control and job-based --- a group of
+   processes --- control. In addition, it can set a different
+   configuration on each physical device to control its bandwidth.
+
+   Meanwhile, the current implementation of the CFQ scheduler has 8 I/O
+   priority levels, and all jobs whose processes have the same I/O
+   priority share the bandwidth assigned to that level between them.
+   Moreover, I/O priority is an attribute of a process, so it affects all
+   block devices equally.
+
+   --------------------------------------------------------------------
+
+How dm-ioband works.
+
+   Every ioband device has one ioband group, which by default is called
+   the default group.
+
+   Ioband devices can also have extra ioband groups in them. Each ioband
+   group has a job to support and a weight. Proportional to the weight,
+   dm-ioband gives tokens to the group.
+
+   A group passes on I/O requests that its job issues to the underlying
+   layer so long as it has tokens left, while requests are blocked if
+   there aren't any tokens left in the group. Tokens are refilled once
+   all of the groups that have requests on a given physical device use up
+   their tokens.
+
+   There are two policies for token consumption. One is that a token is
+   consumed for each I/O request. The other is that a token is consumed
+   for each I/O sector; for example, one read request of 4Kbytes
+   (512bytes * 8 sectors) consumes 8 tokens. A user can choose either
+   policy.
+
+   With this approach, a job running on an ioband group with a large
+   weight is guaranteed a wide I/O bandwidth.
+
+   --------------------------------------------------------------------
+
+Setup and Installation
+
+   Build a kernel with these options enabled:
+
+   CONFIG_MD
+   CONFIG_BLK_DEV_DM
+   CONFIG_DM_IOBAND
+
+   If compiled as a module, use modprobe to load dm-ioband.
+
+   # make modules
+   # make modules_install
+   # depmod -a
+   # modprobe dm-ioband
+
+   The dmsetup targets command shows all available device-mapper targets.
+   ioband is displayed if
[Devel] [PATCH 0/7] I/O bandwidth controller and BIO tracking
Hi everyone,

This series of dm-ioband patches now includes the bio tracking mechanism, which has been posted individually to this mailing list. This makes it easy for anybody to control the I/O bandwidth even when the I/O is one of delayed-write requests. Have fun!

This series of patches consists of two parts:

1. dm-ioband

   Dm-ioband is an I/O bandwidth controller implemented as a device-mapper
   driver, which gives specified bandwidth to each job running on the same
   physical device. A job is a group of processes with the same pid or pgrp
   or uid, or a virtual machine such as KVM or Xen. A job can also be a
   cgroup by applying the bio-cgroup patch.

2. bio-cgroup

   Bio-cgroup is a BIO tracking mechanism implemented on the cgroup memory
   subsystem. With this mechanism, it is possible to determine which cgroup
   each bio belongs to, even when the bio is one of the delayed-write
   requests issued from a kernel thread such as pdflush.

The above two parts have been posted individually to this mailing list until now, but from now on we will release them all together.

[PATCH 1/7] dm-ioband: Patch of device-mapper driver
[PATCH 2/7] dm-ioband: Documentation of design overview, installation, command reference and examples.
[PATCH 3/7] bio-cgroup: Introduction
[PATCH 4/7] bio-cgroup: Split the cgroup memory subsystem into two parts
[PATCH 5/7] bio-cgroup: Remove a lot of #ifdefs
[PATCH 6/7] bio-cgroup: Implement the bio-cgroup
[PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband

Please see the following site for more information:

  Linux Block I/O Bandwidth Control Project
  http://people.valinux.co.jp/~ryov/bwctl/

Thanks,
Ryo Tsuruta
[Devel] [PATCH 7/7] bio-cgroup: Add a cgroup support to dm-ioband
With this patch, dm-ioband can work with the bio cgroup.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -Ndupr linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c
--- linux-2.6.27-rc1-mm1.cg2/drivers/md/dm-ioband-type.c        2008-08-01 16:53:57.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg3/drivers/md/dm-ioband-type.c        2008-08-01 19:44:36.000000000 +0900
@@ -6,6 +6,7 @@
  * This file is released under the GPL.
  */
 #include <linux/bio.h>
+#include <linux/biocontrol.h>
 #include "dm.h"
 #include "dm-bio-list.h"
 #include "dm-ioband.h"
@@ -53,13 +54,13 @@ static int ioband_node(struct bio *bio)
 
 static int ioband_cgroup(struct bio *bio)
 {
-       /*
-        * This function should return the ID of the cgroup which issued bio.
-        * The ID of the cgroup which the current process belongs to won't be
-        * suitable ID for this purpose, since some BIOs will be handled by kernel
-        * threads like aio or pdflush on behalf of the process requesting the BIOs.
-        */
-       return 0; /* not implemented yet */
+       struct io_context *ioc = get_bio_cgroup_iocontext(bio);
+       int id = 0;
+
+       if (ioc) {
+               id = ioc->id;
+               put_io_context(ioc);
+       }
+       return id;
 }
 
 struct group_type dm_ioband_group_type[] = {
[Devel] [PATCH 6/7] bio-cgroup: Implement the bio-cgroup
This patch implements the bio cgroup on the memory cgroup.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -Ndupr linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c
--- linux-2.6.27-rc1-mm1.cg1/block/blk-ioc.c    2008-07-29 11:40:31.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/block/blk-ioc.c    2008-08-01 19:18:38.000000000 +0900
@@ -84,24 +84,28 @@ void exit_io_context(void)
        }
 }
 
+void init_io_context(struct io_context *ioc)
+{
+       atomic_set(&ioc->refcount, 1);
+       atomic_set(&ioc->nr_tasks, 1);
+       spin_lock_init(&ioc->lock);
+       ioc->ioprio_changed = 0;
+       ioc->ioprio = 0;
+       ioc->last_waited = jiffies; /* doesn't matter... */
+       ioc->nr_batch_requests = 0; /* because this is 0 */
+       ioc->aic = NULL;
+       INIT_RADIX_TREE(&ioc->radix_root, GFP_ATOMIC | __GFP_HIGH);
+       INIT_HLIST_HEAD(&ioc->cic_list);
+       ioc->ioc_data = NULL;
+}
+
 struct io_context *alloc_io_context(gfp_t gfp_flags, int node)
 {
        struct io_context *ret;
 
        ret = kmem_cache_alloc_node(iocontext_cachep, gfp_flags, node);
-       if (ret) {
-               atomic_set(&ret->refcount, 1);
-               atomic_set(&ret->nr_tasks, 1);
-               spin_lock_init(&ret->lock);
-               ret->ioprio_changed = 0;
-               ret->ioprio = 0;
-               ret->last_waited = jiffies; /* doesn't matter... */
-               ret->nr_batch_requests = 0; /* because this is 0 */
-               ret->aic = NULL;
-               INIT_RADIX_TREE(&ret->radix_root, GFP_ATOMIC | __GFP_HIGH);
-               INIT_HLIST_HEAD(&ret->cic_list);
-               ret->ioc_data = NULL;
-       }
+       if (ret)
+               init_io_context(ret);
 
        return ret;
 }
diff -Ndupr linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h
--- linux-2.6.27-rc1-mm1.cg1/include/linux/biocontrol.h 1970-01-01 09:00:00.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg2/include/linux/biocontrol.h 2008-08-01 19:21:56.000000000 +0900
@@ -0,0 +1,159 @@
+#include <linux/cgroup.h>
+#include <linux/mm.h>
+#include <linux/memcontrol.h>
+
+#ifndef _LINUX_BIOCONTROL_H
+#define _LINUX_BIOCONTROL_H
+
+#ifdef CONFIG_CGROUP_BIO
+
+struct io_context;
+struct block_device;
+
+struct bio_cgroup {
+       struct cgroup_subsys_state css;
+       int             id;
+       struct io_context *io_context;  /* default io_context */
+/*     struct radix_tree_root io_context_root; per device io_context */
+       spinlock_t      page_list_lock;
+       struct list_head        page_list;
+};
+
+static inline int bio_cgroup_disabled(void)
+{
+       return bio_cgroup_subsys.disabled;
+}
+
+static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
+{
+       return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
+                               struct bio_cgroup, css);
+}
+
+static inline void __bio_cgroup_add_page(struct page_cgroup *pc)
+{
+       struct bio_cgroup *biog = pc->bio_cgroup;
+
+       list_add(&pc->blist, &biog->page_list);
+}
+
+static inline void bio_cgroup_add_page(struct page_cgroup *pc)
+{
+       struct bio_cgroup *biog = pc->bio_cgroup;
+       unsigned long flags;
+
+       spin_lock_irqsave(&biog->page_list_lock, flags);
+       __bio_cgroup_add_page(pc);
+       spin_unlock_irqrestore(&biog->page_list_lock, flags);
+}
+
+static inline void __bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+       list_del_init(&pc->blist);
+}
+
+static inline void bio_cgroup_remove_page(struct page_cgroup *pc)
+{
+       struct bio_cgroup *biog = pc->bio_cgroup;
+       unsigned long flags;
+
+       spin_lock_irqsave(&biog->page_list_lock, flags);
+       __bio_cgroup_remove_page(pc);
+       spin_unlock_irqrestore(&biog->page_list_lock, flags);
+}
+
+static inline void get_bio_cgroup(struct bio_cgroup *biog)
+{
+       css_get(&biog->css);
+}
+
+static inline void put_bio_cgroup(struct bio_cgroup *biog)
+{
+       css_put(&biog->css);
+}
+
+static inline void set_bio_cgroup(struct page_cgroup *pc,
+                                       struct bio_cgroup *biog)
+{
+       pc->bio_cgroup = biog;
+}
+
+static inline void clear_bio_cgroup(struct page_cgroup *pc)
+{
+       struct bio_cgroup *biog = pc->bio_cgroup;
+
+       pc->bio_cgroup = NULL;
+       put_bio_cgroup(biog);
+}
+
+static inline struct bio_cgroup *get_bio_page_cgroup(struct page_cgroup *pc)
+{
+       struct bio_cgroup *biog = pc->bio_cgroup;
+
+       css_get(&biog->css);
+       return biog;
+}
+
+/* This should be called in an RCU-protected section. */
+static inline struct bio_cgroup *mm_get_bio_cgroup(struct mm_struct *mm)
+{
+       struct bio_cgroup *biog;
+
+       biog = bio_cgroup_from_task(rcu_dereference(mm->owner));
+       get_bio_cgroup(biog);
+       return biog;
+}
+
+extern struct io_context *get_bio_cgroup_iocontext(struct bio *bio);
+
+#else /* CONFIG_CGROUP_BIO */
+
+struct bio_cgroup;
+
+static inline int
[Devel] [PATCH 5/7] bio-cgroup: Remove a lot of #ifdefs
This patch is for cleaning up the code of the cgroup memory subsystem to remove a lot of #ifdefs.

Based on 2.6.27-rc1-mm1
Signed-off-by: Ryo Tsuruta [EMAIL PROTECTED]
Signed-off-by: Hirokazu Takahashi [EMAIL PROTECTED]

diff -Ndupr linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c
--- linux-2.6.27-rc1-mm1.cg0/mm/memcontrol.c    2008-08-01 19:48:55.000000000 +0900
+++ linux-2.6.27-rc1-mm1.cg1/mm/memcontrol.c    2008-08-01 19:49:38.000000000 +0900
@@ -228,6 +228,47 @@ struct mem_cgroup *mem_cgroup_from_task(
                                struct mem_cgroup, css);
 }
 
+static inline void get_mem_cgroup(struct mem_cgroup *mem)
+{
+       css_get(&mem->css);
+}
+
+static inline void put_mem_cgroup(struct mem_cgroup *mem)
+{
+       css_put(&mem->css);
+}
+
+static inline void set_mem_cgroup(struct page_cgroup *pc,
+                                       struct mem_cgroup *mem)
+{
+       pc->mem_cgroup = mem;
+}
+
+static inline void clear_mem_cgroup(struct page_cgroup *pc)
+{
+       struct mem_cgroup *mem = pc->mem_cgroup;
+
+       res_counter_uncharge(&mem->res, PAGE_SIZE);
+       pc->mem_cgroup = NULL;
+       put_mem_cgroup(mem);
+}
+
+static inline struct mem_cgroup *get_mem_page_cgroup(struct page_cgroup *pc)
+{
+       struct mem_cgroup *mem = pc->mem_cgroup;
+
+       css_get(&mem->css);
+       return mem;
+}
+
+/* This should be called in an RCU-protected section. */
+static inline struct mem_cgroup *mm_get_mem_cgroup(struct mm_struct *mm)
+{
+       struct mem_cgroup *mem;
+
+       mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
+       get_mem_cgroup(mem);
+       return mem;
+}
+
 static void __mem_cgroup_remove_list(struct mem_cgroup_per_zone *mz,
                        struct page_cgroup *pc)
 {
@@ -297,6 +338,26 @@ static void __mem_cgroup_move_lists(stru
        list_move(&pc->lru, &mz->lists[lru]);
 }
 
+static inline void mem_cgroup_add_page(struct page_cgroup *pc)
+{
+       struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+       unsigned long flags;
+
+       spin_lock_irqsave(&mz->lru_lock, flags);
+       __mem_cgroup_add_list(mz, pc);
+       spin_unlock_irqrestore(&mz->lru_lock, flags);
+}
+
+static inline void mem_cgroup_remove_page(struct page_cgroup *pc)
+{
+       struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);
+       unsigned long flags;
+
+       spin_lock_irqsave(&mz->lru_lock, flags);
+       __mem_cgroup_remove_list(mz, pc);
+       spin_unlock_irqrestore(&mz->lru_lock, flags);
+}
+
 int task_in_mem_cgroup(struct task_struct *task, const struct mem_cgroup *mem)
 {
        int ret;
@@ -339,6 +400,36 @@ void mem_cgroup_move_lists(struct page *
        unlock_page_cgroup(page);
 }
 
+static inline int mem_cgroup_try_to_allocate(struct mem_cgroup *mem,
+                                               gfp_t gfp_mask)
+{
+       unsigned long nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+
+       while (res_counter_charge(&mem->res, PAGE_SIZE)) {
+               if (!(gfp_mask & __GFP_WAIT))
+                       return -1;
+
+               if (try_to_free_mem_cgroup_pages(mem, gfp_mask))
+                       continue;
+
+               /*
+                * try_to_free_mem_cgroup_pages() might not give us a full
+                * picture of reclaim. Some pages are reclaimed and might be
+                * moved to swap cache or just unmapped from the cgroup.
+                * Check the limit again to see if the reclaim reduced the
+                * current usage of the cgroup before giving up
+                */
+               if (res_counter_check_under_limit(&mem->res))
+                       continue;
+
+               if (!nr_retries--) {
+                       mem_cgroup_out_of_memory(mem, gfp_mask);
+                       return -1;
+               }
+       }
+       return 0;
+}
+
 /*
  * Calculate mapped_ratio under memory controller. This will be used in
  * vmscan.c for deteremining we have to reclaim mapped pages.
@@ -469,15 +560,14 @@ int mem_cgroup_shrink_usage(struct mm_st
                return 0;
 
        rcu_read_lock();
-       mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-       css_get(&mem->css);
+       mem = mm_get_mem_cgroup(mm);
        rcu_read_unlock();
 
        do {
                progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
        } while (!progress && --retry);
 
-       css_put(&mem->css);
+       put_mem_cgroup(mem);
        if (!retry)
                return -ENOMEM;
        return 0;
@@ -558,7 +648,7 @@ static int mem_cgroup_force_empty(struct
        int ret = -EBUSY;
        int node, zid;
 
-       css_get(&mem->css);
+       get_mem_cgroup(mem);
        /*
         * page reclaim code (kswapd etc..) will move pages between
         * active_list <-> inactive_list while we don't take a lock.
@@ -578,7 +668,7 @@ static int mem_cgroup_force_empty(struct
        }
        ret = 0;
 out:
-       css_put(&mem->css);
+       put_mem_cgroup(mem);
        return ret;
 }
 
@@ -873,10 +963,37 @@
[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps
On Fri, Aug 01, 2008 at 02:51:57PM -0400, Oren Laadan wrote:
> Louis Rilling wrote:
>> On Fri, Aug 01, 2008 at 10:15:26AM -0400, Oren Laadan wrote:
>>> Louis Rilling wrote:
>>>> On Thu, Jul 31, 2008 at 03:12:32PM -0400, Oren Laadan wrote:

[Cut the less interesting (IMHO at least) history to make Dave happier ;)]

>> Returning 0 in case of a restart is what I called a special handling.
>> You won't do this for the other tasks, so this is special. Since
>> userspace must cope with it anyway, userspace can be clever enough to
>> avoid using the fd on restart, or stupid enough to destroy its
>> checkpoint after restart. It's a different special handling :)
>
> In the case of a single task that wants to checkpoint itself - there are
> no other tasks. In the case of a container - there will be only a single
> task that calls sys_checkpoint(), so only that task will either get the
> CRID or the 0 (or an error). The other tasks will resume whatever it was
> that they were doing (lol, assuming of course restart works). So this
> special handling ends up being a two-liner: setting the return value of
> the syscall for the task that called sys_checkpoint() (well, actually it
> will call sys_restart() to restart, and return from sys_checkpoint()
> with a value of 0 ...).

I knew it, since I actually saw it in the patches you sent last week. If you use an FD, you will have to checkpoint that resource as part of the checkpoint, and restore it as part of the restart. In doing so you'll need to specially handle it, because it has a special meaning. I agree, of course, that it is feasible.

>> - Userspace makes fewer errors when managing incremental checkpoints.
>
> have you implemented this? did you experience issues in real life? user
> space will need a way to manage all of it anyway in many aspects. This
> will be the last/least of the issues ...

No it was not implemented, and I'm not going to enter a discussion about the weight of arguments whether they are backed by implementations or not.

>> It just becomes easier to create a mess with things depending on each
>> other created as separate, freely (userspace-decided)-named objects.
>
> If I were to write a user-space tool to handle this, I would keep each
> chain of checkpoints (from base and on) in a separate subdir, for
> example. In fact, that's how I did it :)

This is intuitive indeed. Checkpoints are already organized in a similar way in Kerrighed, except that a notion of application (transparent to applications) replaces the notion of container, and the kernel decides where to put the checkpoints and how they are named (I'm not saying that this is the best way though).

>> Besides, this scheme begins to sound much more complex than a single
>> file. Do you really gain so much from not having multiple files, one
>> per checkpoint? Well, at least you are not limited by the number of
>> open file descriptors (assuming that, as you mentioned earlier, you
>> pass an array of previous images to compute the next incremental
>> checkpoint).
>
> You aren't limited by the number of open files. User space could provide
> an array of CRID, pathname (or serial#, pathname) to the kernel, the
> kernel will access the files as necessary.

But the kernel itself would have to cope with this limit (even if it is not enforced, just to avoid consuming too much resources), or close and reopen files when needed...

> Uhh .. hold on: you need the array of previous checkpoints to _restart_
> from an incremental checkpoint. You don't care about it when you
> checkpoint: instead, you keep track in memory of (1) what changed (e.g.
> which pages were touched), and (2) where to find unmodified pages in
> previous checkpoints. You save this information with each new
> checkpoint. The data structure to describe #2 is dynamic and changes
> with the execution, and easily keeps track of when older checkpoint
> images become irrelevant (because all the pages they hold have been
> overwritten already).

I see.
I thought that you also intended to build incremental checkpoints from previous checkpoints only, because even if this is not fast, this saves storage space. I agree that if you always keep the necessary metadata in kernel memory, you don't need the previous images. Actually I don't know of any incremental checkpoint scheme not using such an in-memory metadata scheme. Which does not imply that other schemes are not relevant though...

>> where:
>> - base_fd is a regular file containing the base checkpoint, or -1 if a
>>   full checkpoint should be done. The checkpoint could actually also
>>   live in memory, and the kernel should check that it matches the image
>>   pointed to by base_fd.
>> - out_fd is whatever file/socket/etc. on which we should dump the
>>   checkpoint. In particular, out_fd can equal base_fd and should point
>>   to the beginning of the file if it's a regular file.
>
> Excellent example. What if the checkpoint data is streamed over the
> network; so you cannot rewrite the file after it has been streamed...
> Or you will
[Devel] Too many I/O controller patches
On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote:
> This series of patches of dm-ioband now includes the bio tracking
> mechanism, which has been posted individually to this mailing list.
> This makes it easy for anybody to control the I/O bandwidth even when
> the I/O is one of delayed-write requests.

During the Containers mini-summit at OLS, it was mentioned that there are at least *FOUR* of these I/O controllers floating around. Have you talked to the other authors? (I've cc'd at least one of them).

We obviously can't come to any kind of real consensus with people just tossing the same patches back and forth.

-- Dave

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: Too many I/O controller patches
Dave Hansen wrote: On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote: This series of patches of dm-ioband now includes The bio tracking mechanism, which has been posted individually to this mailing list. This makes it easy for anybody to control the I/O bandwidth even when the I/O is one of delayed-write requests. During the Containers mini-summit at OLS, it was mentioned that there are at least *FOUR* of these I/O controllers floating around. Have you talked to the other authors? (I've cc'd at least one of them). We obviously can't come to any kind of real consensus with people just tossing the same patches back and forth. -- Dave Dave, thanks for this email first of all. I've talked with Satoshi (cc-ed) about his solution Yet another I/O bandwidth controlling subsystem for CGroups based on CFQ. I did some experiments trying to implement minimum bandwidth requirements for my io-throttle controller, mapping the requirements to CFQ prio and using the Satoshi's controller. But this needs additional work and testing right now, so I've not posted anything yet, just informed Satoshi about this. Unfortunately I've not talked to Ryo yet. I've continued my work using a quite different approach, because the dm-ioband solution didn't work with delayed-write requests. Now the bio tracking feature seems really prosiming and I would like to do some tests ASAP, and review the patch as well. But I'm not yet convinced that limiting the IO writes at the device mapper layer is the best solution. IMHO it would be better to throttle applications' writes when they're dirtying pages in the page cache (the io-throttle way), because when the IO requests arrive to the device mapper it's too late (we would only have a lot of dirty pages that are waiting to be flushed to the limited block devices, and maybe this could lead to OOM conditions). IOW dm-ioband is doing this at the wrong level (at least for my requirements). 
Ryo, correct me if I'm wrong or if I've not understood the dm-ioband approach. Another thing I'd prefer is to directly define bandwidth limiting rules instead of using priorities/weights (e.g. 10MiB/s for /dev/sda), but this seems to be on the dm-ioband TODO list, so maybe we can merge the work I did in io-throttle to define such rules. Anyway, I still need to look at the dm-ioband and bio-cgroup code in detail, so probably all I said above is totally wrong... -Andrea
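[Editor's note: the absolute-bandwidth rule Andrea describes ("10MiB/s for /dev/sda") amounts to token-bucket accounting. The userspace sketch below is purely illustrative — the struct and function names are invented and do not match the real io-throttle code — but it shows the accounting idea: each cgroup accumulates byte credit over time, each I/O consumes it, and a task that overdraws is told how long to sleep instead of being allowed to dirty more pages.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-cgroup token bucket; names are illustrative only. */
struct iot_bucket {
	uint64_t rate_bps;   /* allowed bytes per second */
	uint64_t credit;     /* bytes currently available */
	uint64_t last_us;    /* last refill time, microseconds */
};

/* Refill credit for elapsed time, capped at one second's worth of traffic. */
static void iot_refill(struct iot_bucket *b, uint64_t now_us)
{
	uint64_t elapsed = now_us - b->last_us;

	b->credit += b->rate_bps * elapsed / 1000000;
	if (b->credit > b->rate_bps)   /* burst cap: 1s of traffic */
		b->credit = b->rate_bps;
	b->last_us = now_us;
}

/*
 * Charge 'bytes' against the bucket.  Returns 0 if the I/O may proceed
 * immediately, otherwise the number of microseconds the task should
 * sleep before retrying -- this is the point where a kernel
 * implementation would block the writer.
 */
static uint64_t iot_charge(struct iot_bucket *b, uint64_t bytes,
			   uint64_t now_us)
{
	iot_refill(b, now_us);
	if (bytes <= b->credit) {
		b->credit -= bytes;
		return 0;
	}
	return (bytes - b->credit) * 1000000 / b->rate_bps;
}
```

For example, a cgroup limited to 10 MiB/s that has been idle for a second can write 1 MiB immediately, but a 16 MiB write right afterwards is told to wait for the missing 7 MiB of credit.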
[Devel] Re: Too many I/O controller patches
Dave Hansen wrote: On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote: This series of patches of dm-ioband now includes the bio tracking mechanism, which has been posted individually to this mailing list. This makes it easy for anybody to control the I/O bandwidth even when the I/O is one of delayed-write requests. During the Containers mini-summit at OLS, it was mentioned that there are at least *FOUR* of these I/O controllers floating around. Have you talked to the other authors? (I've cc'd at least one of them). We obviously can't come to any kind of real consensus with people just tossing the same patches back and forth. Ryo and Andrea - Naveen and Satoshi met up at OLS and discussed their approach. It would be really nice to see an RFC; I know Andrea did work on this and compared the approaches. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
[Devel] Re: Too many I/O controller patches
On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote: But I'm not yet convinced that limiting the I/O writes at the device mapper layer is the best solution. IMHO it would be better to throttle applications' writes when they're dirtying pages in the page cache (the io-throttle way), because when the I/O requests arrive at the device mapper it's too late (we would only have a lot of dirty pages that are waiting to be flushed to the limited block devices, and maybe this could lead to OOM conditions). IOW dm-ioband is doing this at the wrong level (at least for my requirements). Ryo, correct me if I'm wrong or if I've not understood the dm-ioband approach. The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if you look at this in combination with the memory controller, they would make a great team. The memory controller keeps you from dirtying more than your limit of pages (and pinning too much memory) even if the dm layer is doing the throttling and itself can't throttle the memory usage. I also don't think this is any different from the problems we have in the regular VM these days. Right now, people can dirty lots of pages on devices that are slow. The only thing dm-ioband would add is changing how those devices *got* slow. :) -- Dave
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote: [snip]
BUG: unable to handle kernel paging request at 6b6b6b8b
IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
*pde = Oops: [#1] PREEMPT SMP
last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device thermal ac battery button
Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 7c033ec8 0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3
Call Trace:
[78161323] ? exit_mmap+0xaf/0x133
[781226b1] ? mmput+0x4c/0xba
[78165ce3] ? try_to_unuse+0x20b/0x3f5
[78371534] ? _spin_unlock+0x22/0x3c
[7816636a] ? sys_swapoff+0x17b/0x37c
[78102d95] ? sysenter_past_esp+0x6a/0xa5
===
Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b
EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hi Hugh, I am unable to reproduce the problem, but I do have an initial hypothesis:

    CPU0                                    CPU1
                                            task 1 starts exiting
    try_to_unuse():
      look at mm = task1->mm
      ..
      increment mm_users
                                            task 1 exits
                                            mm->owner needs to be updated, but
                                            no new owner is found (mm_users > 1,
                                            but no other task has
                                            task->mm == task1->mm)
                                            mm_update_next_owner() leaves
                     <grace period>
      user count drops, call mmput(mm)
                                            task 1 freed
      dereferencing mm->owner fails

I do have a potential solution in mind, but I want to make sure my hypothesis is correct.
-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
[Devel] [RFC][PATCH 2/3] checkpoint/restart: x86 support
The original version of Oren's patch contained a good hunk of #ifdefs. I've extracted all of those and created a bit of an API for new architectures to follow. Leaving Oren's sign-off because this is all still his code, even though he hasn't seen it mangled like this before. Signed-off-by: Oren Laadan [EMAIL PROTECTED] --- linux-2.6.git-dave/ckpt/Makefile |1 linux-2.6.git-dave/ckpt/checkpoint.c |7 linux-2.6.git-dave/ckpt/ckpt_arch.h |6 linux-2.6.git-dave/ckpt/restart.c |7 linux-2.6.git-dave/ckpt/x86.c | 269 ++ linux-2.6.git-dave/include/asm-x86/ckpt.h | 46 + 6 files changed, 336 insertions(+) diff -puN ckpt/checkpoint.c~x86_part ckpt/checkpoint.c --- linux-2.6.git/ckpt/checkpoint.c~x86_part2008-08-04 13:29:59.0 -0700 +++ linux-2.6.git-dave/ckpt/checkpoint.c2008-08-04 13:29:59.0 -0700 @@ -19,6 +19,7 @@ #include ckpt.h #include ckpt_hdr.h +#include ckpt_arch.h /** * cr_get_fname - return pathname of a given file @@ -183,6 +184,12 @@ static int cr_write_task(struct cr_ctx * ret = cr_write_task_struct(ctx, t); CR_PRINTK(ret (task_struct) %d\n, ret); + if (!ret) + ret = cr_write_thread(ctx, t); + CR_PRINTK(ret (thread) %d\n, ret); + if (!ret) + ret = cr_write_cpu(ctx, t); + CR_PRINTK(ret (cpu) %d\n, ret); return ret; } diff -puN /dev/null ckpt/ckpt_arch.h --- /dev/null 2007-04-11 11:48:27.0 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-04 13:29:59.0 -0700 @@ -0,0 +1,6 @@ +#include ckpt.h + +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t); +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t); +int cr_read_thread(struct cr_ctx *ctx); +int cr_read_cpu(struct cr_ctx *ctx); diff -puN ckpt/Makefile~x86_part ckpt/Makefile --- linux-2.6.git/ckpt/Makefile~x86_part2008-08-04 13:29:59.0 -0700 +++ linux-2.6.git-dave/ckpt/Makefile2008-08-04 13:29:59.0 -0700 @@ -1 +1,2 @@ obj-y += sys.o checkpoint.o restart.o +obj-$(CONFIG_X86) += x86.o diff -puN ckpt/restart.c~x86_part ckpt/restart.c --- linux-2.6.git/ckpt/restart.c~x86_part 2008-08-04 13:29:59.0 -0700 
+++ linux-2.6.git-dave/ckpt/restart.c 2008-08-04 13:29:59.0 -0700 @@ -21,6 +21,7 @@ #include ckpt.h #include ckpt_hdr.h +#include ckpt_arch.h /** * cr_hbuf_get - reserve space on the hbuf @@ -171,6 +172,12 @@ static int cr_read_task(struct cr_ctx *c ret = cr_read_task_struct(ctx); CR_PRINTK(ret (task_struct) %d\n, ret); + if (!ret) + ret = cr_read_thread(ctx); + CR_PRINTK(ret (thread) %d\n, ret); + if (!ret) + ret = cr_read_cpu(ctx); + CR_PRINTK(ret (cpu) %d\n, ret); return ret; } diff -puN /dev/null ckpt/x86.c --- /dev/null 2007-04-11 11:48:27.0 -0700 +++ linux-2.6.git-dave/ckpt/x86.c 2008-08-04 13:29:59.0 -0700 @@ -0,0 +1,269 @@ +#include asm/ckpt.h +#include asm/desc.h +#include asm/i387.h + +#include ckpt.h +#include ckpt_hdr.h + +/* dump the thread_struct of a given task */ +int cr_write_thread(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr h; + struct cr_hdr_thread *hh = ctx-tbuf; + struct thread_struct *thread; + struct desc_struct *desc; + int ntls = 0; + int n, ret; + + h.type = CR_HDR_THREAD; + h.len = sizeof(*hh); + h.id = ctx-pid; + + thread = t-thread; + + /* calculate no. of TLS entries that follow */ + desc = thread-tls_array; + for (n = GDT_ENTRY_TLS_ENTRIES; n 0; n--, desc++) { + if (desc-a || desc-b) + ntls++; + } + + hh-gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES; + hh-sizeof_tls_array = sizeof(thread-tls_array); + hh-ntls = ntls; + + if ((ret = cr_write_obj(ctx, h, hh)) 0) + return ret; + + /* for simplicity dump the entire array, cherry-pick upon restart */ + ret = cr_kwrite(ctx, thread-tls_array, sizeof(thread-tls_array)); + + CR_PRINTK(ntls %d\n, ntls); + + /* IGNORE RESTART BLOCKS FOR NOW ... 
*/ + + return ret; +} + +/* dump the cpu state and registers of a given task */ +int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr h; + struct cr_hdr_cpu *hh = ctx-tbuf; + struct thread_struct *thread; + struct thread_info *thread_info; + struct pt_regs *regs; + + h.type = CR_HDR_CPU; + h.len = sizeof(*hh); + h.id = ctx-pid; + + thread = t-thread; + thread_info = task_thread_info(t); + regs = task_pt_regs(t); + + hh-bx = regs-bx; + hh-cx = regs-cx; + hh-dx = regs-dx; + hh-si = regs-si; + hh-di = regs-di; + hh-bp = regs-bp; + hh-ax = regs-ax; + hh-ds = regs-ds; + hh-es =
[Devel] [RFC][PATCH 3/3] checkpoint/restart: memory management
For each vma, there is a 'struct cr_vma'; if the vma is file-mapped, it will be followed by the file name. The cr_vma-npages will tell how many pages were dumped for this vma. Then it will be followed by the actual data: first a dump of the addresses of all dumped pages (npages entries) followed by a dump of the contents of all dumped pages (npages pages). Then will come the next vma and so on. I guess I could also separate out the x86-specific bits here, but they're pretty small, comparatively. Signed-off-by: Oren Laadan [EMAIL PROTECTED] --- linux-2.6.git-dave/arch/x86/kernel/ldt.c |2 linux-2.6.git-dave/ckpt/Makefile |2 linux-2.6.git-dave/ckpt/ckpt_arch.h |2 linux-2.6.git-dave/ckpt/ckpt_hdr.h| 21 + linux-2.6.git-dave/ckpt/ckpt_mem.c| 388 ++ linux-2.6.git-dave/ckpt/ckpt_mem.h| 32 ++ linux-2.6.git-dave/ckpt/rstr_mem.c| 354 +++ linux-2.6.git-dave/ckpt/sys.c |3 linux-2.6.git-dave/ckpt/x86.c | 83 ++ linux-2.6.git-dave/include/asm-x86/ckpt.h |5 10 files changed, 890 insertions(+), 2 deletions(-) diff -puN arch/x86/kernel/ldt.c~memory_part arch/x86/kernel/ldt.c --- linux-2.6.git/arch/x86/kernel/ldt.c~memory_part 2008-08-04 13:30:00.0 -0700 +++ linux-2.6.git-dave/arch/x86/kernel/ldt.c2008-08-04 13:30:00.0 -0700 @@ -183,7 +183,7 @@ static int read_default_ldt(void __user return bytecount; } -static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode) +int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode) { struct mm_struct *mm = current-mm; struct desc_struct ldt; diff -puN ckpt/ckpt_arch.h~memory_part ckpt/ckpt_arch.h --- linux-2.6.git/ckpt/ckpt_arch.h~memory_part 2008-08-04 13:30:00.0 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_arch.h 2008-08-04 13:30:00.0 -0700 @@ -4,3 +4,5 @@ int cr_write_thread(struct cr_ctx *ctx, int cr_write_cpu(struct cr_ctx *ctx, struct task_struct *t); int cr_read_thread(struct cr_ctx *ctx); int cr_read_cpu(struct cr_ctx *ctx); +int cr_write_mm_context(struct cr_ctx *ctx, struct mm_struct *mm); +int 
cr_read_mm_context(struct cr_ctx *ctx, struct mm_struct *mm); diff -puN ckpt/ckpt_hdr.h~memory_part ckpt/ckpt_hdr.h --- linux-2.6.git/ckpt/ckpt_hdr.h~memory_part 2008-08-04 13:30:00.0 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_hdr.h 2008-08-04 13:30:00.0 -0700 @@ -67,3 +67,24 @@ struct cr_hdr_task { }; + +struct cr_hdr_mm { + __u32 tag; /* sharing identifier */ + __u64 start_code, end_code, start_data, end_data; + __u64 start_brk, brk, start_stack; + __u64 arg_start, arg_end, env_start, env_end; + __s16 map_count; +}; + +struct cr_hdr_vma { + __u32 how; + + __u64 vm_start; + __u64 vm_end; + __u64 vm_page_prot; + __u64 vm_flags; + __u64 vm_pgoff; + + __s16 npages; + __s16 namelen; +}; diff -puN /dev/null ckpt/ckpt_mem.c --- /dev/null 2007-04-11 11:48:27.0 -0700 +++ linux-2.6.git-dave/ckpt/ckpt_mem.c 2008-08-04 13:30:00.0 -0700 @@ -0,0 +1,388 @@ +/* + * Checkpoint memory contents + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +#include linux/sched.h +#include linux/slab.h +#include linux/file.h +#include linux/pagemap.h +#include linux/mm_types.h + +#include ckpt.h +#include ckpt_hdr.h +#include ckpt_arch.h +#include ckpt_mem.h + +/* + * utilities to alloc, free, and handle 'struct cr_pgarr' + * (common to ckpt_mem.c and rstr_mem.c) + */ + +#define CR_ORDER_PGARR 0 +#define CR_PGARR_TOTAL ((PAGE_SIZE CR_ORDER_PGARR) / sizeof(void *)) + +/* release pages referenced by a page-array */ +void _cr_pgarr_release(struct cr_ctx *ctx, struct cr_pgarr *pgarr) +{ + int n; + + /* only checkpoint keeps references to pages */ + if (ctx-flags CR_CTX_CKPT) { + CR_PRINTK(release pages (nused %d)\n, pgarr-nused); + for (n = pgarr-nused; n--; ) + page_cache_release(pgarr-pages[n]); + } + pgarr-nused = 0; + pgarr-nleft = CR_PGARR_TOTAL; +} + +/* release pages referenced by chain of page-arrays */ +void cr_pgarr_release(struct cr_ctx *ctx) +{ + struct cr_pgarr *pgarr; + + for (pgarr = ctx-pgarr; pgarr; pgarr = pgarr-next) + _cr_pgarr_release(ctx, pgarr); +} + +/* free a chain of page-arrays */ +void cr_pgarr_free(struct cr_ctx *ctx) +{ + struct cr_pgarr *pgarr, *pgnxt; + + for (pgarr = ctx-pgarr; pgarr; pgarr = pgnxt) { + _cr_pgarr_release(ctx, pgarr); + free_pages((unsigned long) ctx-pgarr-addrs, CR_ORDER_PGARR); + free_pages((unsigned long) ctx-pgarr-pages,
[Devel] [RFC][PATCH 0/3] broken out c/r patches
I've done a bit of refactoring to Oren's patches. I wonder if they're in a state that people think we can share on LKML as Ted suggested. Thoughts? -- At the containers mini-conference before OLS, the consensus among all the stakeholders was that doing checkpoint/restart in the kernel as much as possible was the best approach. With this approach, the kernel will export a relatively opaque 'blob' of data to userspace which can then be handed to the new kernel at restore time. This is different from what had been proposed before, which was that a userspace application would be responsible for collecting all of this data. We were also planning on adding lots of new, little kernel interfaces for all of the things that needed checkpointing. This unites those into a single, grand interface. The 'blob' will contain copies of select portions of kernel structures such as vmas and mm_structs. It will also contain copies of the actual memory that the process uses. Any changes in this blob's format between kernel revisions can be handled by an in-userspace conversion program. This is a similar approach to virtually all of the commercial checkpoint/restart products out there, as well as the research project Zap. These patches basically serialize internal kernel state and write it out to a file descriptor. The checkpoint and restore are done with two new system calls: sys_checkpoint and sys_restart. In this incarnation, they can only checkpoint and restore a single task. The task's address space may consist of only private, simple vmas - anonymous or file-mapped.
[Devel] [RFC][PATCH 1/3] kernel-based checkpoint-restart: general infrastructure
From: Oren Laadan [EMAIL PROTECTED] This patch adds those interfaces, as well as all of the helpers needed to easily manage the file format. The code is roughly broken out as follows: ckpt/sys.c - user/kernel data transfer, as well as setting up of the checkpoint/restart context (a per-checkpoint data structure for housekeeping) ckpt/checkpoint.c - output wrappers and basic checkpoint handling ckpt/restart.c - input wrappers and basic restart handling Patches to add the per-architecture support as well as the actual work to do the memory checkpoint follow in subsequent patches. Signed-off-by: Oren Laadan [EMAIL PROTECTED] --- linux-2.6.git-dave/Makefile |2 linux-2.6.git-dave/ckpt/Makefile |1 linux-2.6.git-dave/ckpt/checkpoint.c | 207 +++ linux-2.6.git-dave/ckpt/ckpt.h | 82 linux-2.6.git-dave/ckpt/ckpt_hdr.h | 69 ++ linux-2.6.git-dave/ckpt/restart.c| 189 linux-2.6.git-dave/ckpt/sys.c| 233 +++ 7 files changed, 782 insertions(+), 1 deletion(-) diff -puN /dev/null ckpt/checkpoint.c --- /dev/null 2007-04-11 11:48:27.0 -0700 +++ linux-2.6.git-dave/ckpt/checkpoint.c2008-08-04 13:29:55.0 -0700 @@ -0,0 +1,207 @@ +/* + * Checkpoint logic and helpers + * + * Copyright (C) 2008 Oren Laadan + * + * This file is subject to the terms and conditions of the GNU General Public + * License. See the file COPYING in the main directory of the Linux + * distribution for more details. 
+ */ + +#include linux/version.h +#include linux/sched.h +#include linux/time.h +#include linux/fs.h +#include linux/file.h +#include linux/dcache.h +#include linux/mount.h +#include asm/ptrace.h + +#include ckpt.h +#include ckpt_hdr.h + +/** + * cr_get_fname - return pathname of a given file + * @file: file pointer + * @buf: buffer for pathname + * @n: buffer length (in) and pathname length (out) + * + * if the buffer provivded by the caller is too small, allocate a new + * buffer; caller should call cr_put_pathname() for cleanup + */ +char *cr_get_fname(struct path *path, struct path *root, char *buf, int *n) +{ + char *fname; + + fname = __d_path(path, root, buf, *n); + + if (IS_ERR(fname) PTR_ERR(fname) == -ENAMETOOLONG) { +if (!(buf = (char *) __get_free_pages(GFP_KERNEL, 0))) +return ERR_PTR(-ENOMEM); + fname = __d_path(path, root, buf, PAGE_SIZE); + if (IS_ERR(fname)) + free_pages((unsigned long) buf, 0); + } + if (!IS_ERR(fname)) + *n = (buf + *n - fname); + + return fname; +} + +/** + * cr_put_fname - (possibly) cleanup pathname buffer + * @buf: original buffer that was given to cr_get_pathname() + * @fname: resulting pathname from cr_get_pathname() + * @n: length of original buffer + */ +void cr_put_fname(char *buf, char *fname, int n) +{ + if (fname (fname buf || fname = buf + n)) + free_pages((unsigned long) buf, 0); +} + +/** + * cr_write_obj - write a record described by a cr_hdr + * @ctx: checkpoint context + * @h: record descriptor + * @buf: record buffer + */ +int cr_write_obj(struct cr_ctx *ctx, struct cr_hdr *h, void *buf) +{ + int ret; + + if ((ret = cr_kwrite(ctx, h, sizeof(*h))) 0) + return ret; + return cr_kwrite(ctx, buf, h-len); +} + +/** + * cr_write_str - write a string record + * @ctx: checkpoint context + * @str: string buffer + * @n: string length + */ +int cr_write_str(struct cr_ctx *ctx, char *str, int n) +{ + struct cr_hdr h; + + h.type = CR_HDR_STR; + h.len = n; + h.id = 0; + + return cr_write_obj(ctx, h, str); +} + +/* write the 
checkpoint header */ +static int cr_write_hdr(struct cr_ctx *ctx) +{ + struct cr_hdr h; + struct cr_hdr_head *hh = ctx-tbuf; + struct timeval ktv; + + h.type = CR_HDR_HEAD; + h.len = sizeof(*hh); + h.id = 0; + + do_gettimeofday(ktv); + + hh-magic = 0x00a2d200; + hh-major = (LINUX_VERSION_CODE 16) 0xff; + hh-minor = (LINUX_VERSION_CODE 8) 0xff; + hh-patch = (LINUX_VERSION_CODE) 0xff; + + hh-version = 1; + + hh-flags = ctx-flags; + hh-time = ktv.tv_sec; + + return cr_write_obj(ctx, h, hh); +} + +/* write the checkpoint trailer */ +static int cr_write_tail(struct cr_ctx *ctx) +{ + struct cr_hdr h; + struct cr_hdr_tail *hh = ctx-tbuf; + + h.type = CR_HDR_TAIL; + h.len = sizeof(*hh); + h.id = 0; + + hh-magic = 0x002d2a00; + hh-cksum[0] = hh-cksum[1] = 1;/* TBD ... */ + + return cr_write_obj(ctx, h, hh); +} + +/* dump the task_struct of a given task */ +static int cr_write_task_struct(struct cr_ctx *ctx, struct task_struct *t) +{ + struct cr_hdr h; + struct cr_hdr_task *hh =
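[Editor's note: the record format used by cr_write_obj() in the patch above — a small fixed header (type, len, id) followed by the payload — can be exercised entirely in userspace. The sketch below is hypothetical: the header field names come from the quoted patch, but their widths are not visible there, so the types chosen here are an assumption; put_rec()/get_rec() are invented stand-ins for cr_kwrite()-style helpers.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Illustrative record header, modelled loosely on the cr_hdr in the
 * patch above (type, len, id).  The real field widths are not shown in
 * the quoted hunk; these are assumed.
 */
struct hdr {
	uint16_t type;
	uint16_t len;
	uint32_t id;
};

/* Append one (header, payload) record to a memory buffer. */
static size_t put_rec(char *buf, size_t off, const struct hdr *h,
		      const void *payload)
{
	memcpy(buf + off, h, sizeof(*h));
	memcpy(buf + off + sizeof(*h), payload, h->len);
	return off + sizeof(*h) + h->len;
}

/* Read back the record at 'off'; returns the offset of the next record. */
static size_t get_rec(const char *buf, size_t off, struct hdr *h,
		      void *payload)
{
	memcpy(h, buf + off, sizeof(*h));
	memcpy(payload, buf + off + sizeof(*h), h->len);
	return off + sizeof(*h) + h->len;
}
```

Because every record is self-describing, a reader can walk the blob record by record without knowing the payload layouts in advance — which is what lets an out-of-kernel conversion tool cope with format changes between kernel revisions.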
[Devel] Re: Too many I/O controller patches
Balbir Singh wrote: Dave Hansen wrote: On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote: This series of patches of dm-ioband now includes the bio tracking mechanism, which has been posted individually to this mailing list. This makes it easy for anybody to control the I/O bandwidth even when the I/O is one of delayed-write requests. During the Containers mini-summit at OLS, it was mentioned that there are at least *FOUR* of these I/O controllers floating around. Have you talked to the other authors? (I've cc'd at least one of them). We obviously can't come to any kind of real consensus with people just tossing the same patches back and forth. Ryo and Andrea - Naveen and Satoshi met up at OLS and discussed their approach. It would be really nice to see an RFC; I know Andrea did work on this and compared the approaches. Yes, I wrote down something about the comparison of priority-based vs bandwidth-shaping solutions in terms of performance predictability, and other considerations, like the one I cited before, about dirty-ratio throttling in memory, AIO handling, etc. Something is also reported in the io-throttle documentation: http://marc.info/?l=linux-kernel&m=121780176907686&w=2 But OK, I agree with Balbir: I can try to put the things together (in a better form in particular) and try to post an RFC together with Ryo. Ryo, do you have other documentation besides the info reported on the dm-ioband website? Thanks, -Andrea
[Devel] Re: Too many I/O controller patches
Dave Hansen wrote: On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote: But I'm not yet convinced that limiting the I/O writes at the device mapper layer is the best solution. IMHO it would be better to throttle applications' writes when they're dirtying pages in the page cache (the io-throttle way), because when the I/O requests arrive at the device mapper it's too late (we would only have a lot of dirty pages that are waiting to be flushed to the limited block devices, and maybe this could lead to OOM conditions). IOW dm-ioband is doing this at the wrong level (at least for my requirements). Ryo, correct me if I'm wrong or if I've not understood the dm-ioband approach. The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if you look at this in combination with the memory controller, they would make a great team. The memory controller keeps you from dirtying more than your limit of pages (and pinning too much memory) even if the dm layer is doing the throttling and itself can't throttle the memory usage. mmh... but in this way we would just move the OOM inside the cgroup; that is a nice improvement, but the main problem is not resolved... A safer approach IMHO is to force the tasks to wait synchronously on each operation that directly or indirectly generates I/O. In particular, the solution used by the io-throttle controller to limit the dirty ratio in memory is to impose a sleep via schedule_timeout_killable() in balance_dirty_pages() when a generic process exceeds the limits defined for the belonging cgroup. Limiting read operations is a lot easier, because they're always synchronized with I/O requests. -Andrea
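[Editor's note: the per-cgroup dirty-ratio check Andrea describes — put the dirtier to sleep in balance_dirty_pages() once its cgroup is over its limit — reduces to a simple accounting decision. This userspace model is hypothetical (the struct and function names are invented, and the real check lives in the kernel's writeback path); it shows only the decision, with the killable sleep replaced by an error return the caller would retry after.]

```c
#include <assert.h>

/* Hypothetical per-cgroup dirty accounting; names are illustrative only. */
struct dirty_cg {
	unsigned long dirty_pages;   /* pages currently dirty in the cgroup */
	unsigned long limit_pages;   /* cgroup dirty limit */
};

/* Returns nonzero if dirtying 'new_pages' more would exceed the limit. */
static int must_throttle(const struct dirty_cg *cg, unsigned long new_pages)
{
	return cg->dirty_pages + new_pages > cg->limit_pages;
}

/*
 * Account pages only while below the limit.  In the kernel the -1 path
 * would instead sleep via schedule_timeout_killable() and retry, so the
 * writer is blocked synchronously rather than allowed to dirty on.
 */
static int try_dirty(struct dirty_cg *cg, unsigned long new_pages)
{
	if (must_throttle(cg, new_pages))
		return -1;
	cg->dirty_pages += new_pages;
	return 0;
}
```

The point of doing this at dirty time rather than at the device mapper is visible in the model: an over-limit cgroup never accumulates unaccounted dirty pages in the first place.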
[Devel] Re: Too many I/O controller patches
On Mon, 2008-08-04 at 22:44 +0200, Andrea Righi wrote: Dave Hansen wrote: On Mon, 2008-08-04 at 20:22 +0200, Andrea Righi wrote: But I'm not yet convinced that limiting the I/O writes at the device mapper layer is the best solution. IMHO it would be better to throttle applications' writes when they're dirtying pages in the page cache (the io-throttle way), because when the I/O requests arrive at the device mapper it's too late (we would only have a lot of dirty pages that are waiting to be flushed to the limited block devices, and maybe this could lead to OOM conditions). IOW dm-ioband is doing this at the wrong level (at least for my requirements). Ryo, correct me if I'm wrong or if I've not understood the dm-ioband approach. The avoid-lots-of-page-dirtying problem sounds like a hard one. But, if you look at this in combination with the memory controller, they would make a great team. The memory controller keeps you from dirtying more than your limit of pages (and pinning too much memory) even if the dm layer is doing the throttling and itself can't throttle the memory usage. mmh... but in this way we would just move the OOM inside the cgroup, that is a nice improvement, but the main problem is not resolved... A safer approach IMHO is to force the tasks to wait synchronously on each operation that directly or indirectly generates i/o. Fine in theory, hard in practice. :) I think the best we can hope for is to keep parity with what happens in the rest of the kernel. We already have a problem today with people mmap()'ing lots of memory and dirtying it all at once. Adding an I/O bandwidth controller or a memory controller isn't really going to fix that. I think it is outside the scope of the I/O (and memory) controllers until we solve it generically, first. -- Dave
[Devel] Re: memrlimit controller merge to mainline
On Tue, 5 Aug 2008, Balbir Singh wrote: Hugh Dickins wrote: [snip]
BUG: unable to handle kernel paging request at 6b6b6b8b
IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
[78161323] ? exit_mmap+0xaf/0x133
[781226b1] ? mmput+0x4c/0xba
[78165ce3] ? try_to_unuse+0x20b/0x3f5
[78371534] ? _spin_unlock+0x22/0x3c
[7816636a] ? sys_swapoff+0x17b/0x37c
[78102d95] ? sysenter_past_esp+0x6a/0xa5

I am unable to reproduce the problem,

Me neither; I've spent many hours trying 2.6.27-rc1-mm1 and then back to 2.6.26-rc8-mm1. But I've been SO stupid: saw it originally on one machine with SLAB_DEBUG=y, have been trying since mostly on another with SLUB_DEBUG=y, but never thought to boot with slub_debug=P,task_struct until now.

but I do have an initial hypothesis

    CPU0                                    CPU1
                                            task 1 starts exiting
    try_to_unuse():
      look at mm = task1->mm
      ..
      increment mm_users
                                            task 1 exits
                                            mm->owner needs to be updated, but
                                            no new owner is found (mm_users > 1,
                                            but no other task has
                                            task->mm == task1->mm)
                                            mm_update_next_owner() leaves
                     <grace period>
      user count drops, call mmput(mm)
                                            task 1 freed
      dereferencing mm->owner fails

Yes, that looks right to me: seems obvious now. I don't think your careful alternation of CPU0/1 events at the end matters: the swapoff CPU simply dereferences mm->owner after that task has gone. (That's a shame, I'd always hoped that mm->owner->comm was going to be good for use in mm messages, even when tearing down the mm.)

I do have a potential solution in mind, but I want to make sure my hypothesis is correct.

It seems wrong that memrlimit_cgroup_uncharge_as should be called after mm->owner may have been changed, even if it's to something safe. But I forget the mm/task exit details; surely they're tricky. By the way, is the ordering in mm_update_next_owner the best? Would there be less movement if it searched amongst siblings before it searched amongst children? Ought it to make a first pass trying to stay within the same cgroup?
Hugh
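[Editor's note: one way to see the race above is that try_to_unuse() pins the mm (via mm_users) but nothing pins mm->owner itself. The userspace model below sketches one possible shape of a fix — the reader takes its own reference on the owner before dereferencing it, so the exit path can clear the pointer without the task being freed underneath the reader. All names here are invented stand-ins (get_owner()/put_task() for get_task_struct()/put_task_struct()); a real kernel fix would additionally need RCU or locking to make the pointer load itself safe, which this single-threaded model cannot show.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy refcounted task, standing in for task_struct. */
struct task {
	int usage;   /* like task_struct usage count */
	int freed;   /* set when the last reference is dropped */
};

struct mm {
	struct task *owner;   /* may be cleared once the owner exits */
};

/* Take a reference on the owner; NULL if there is no owner anymore. */
static struct task *get_owner(struct mm *mm)
{
	struct task *t = mm->owner;

	if (t == NULL)
		return NULL;
	t->usage++;   /* kernel analogue: get_task_struct() under RCU */
	return t;
}

static void put_task(struct task *t)
{
	if (--t->usage == 0)
		t->freed = 1;   /* stand-in for the task really going away */
}

/* Exit path: clear the owner pointer, then drop the mm's reference. */
static void owner_exits(struct mm *mm)
{
	struct task *t = mm->owner;

	mm->owner = NULL;
	put_task(t);
}
```

The ordering is the whole point: because the swapoff-side reader holds its own reference across the dereference, the exiting task cannot be freed out from under it, which is exactly what the 6b6b6b6b (slab-poison) oops shows happening today.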
[Devel] Re: [PATCH 2/6] Container Freezer: Make refrigerator always available
On Sat, 2008-08-02 at 00:53 +0200, Rafael J. Wysocki wrote: On Friday, 1 of August 2008, Matt Helsley wrote: On Fri, 2008-08-01 at 16:27 +0200, Thomas Petazzoni wrote: Hi, On Thu, 31 Jul 2008 22:07:01 -0700, Matt Helsley [EMAIL PROTECTED] wrote: --- a/kernel/Makefile +++ b/kernel/Makefile @@ -5,7 +5,7 @@ obj-y = sched.o fork.o exec_domain.o panic.o printk.o \ cpu.o exit.o itimer.o time.o softirq.o resource.o \ sysctl.o capability.o ptrace.o timer.o user.o \ - signal.o sys.o kmod.o workqueue.o pid.o \ + signal.o sys.o kmod.o workqueue.o pid.o freezer.o \ I have the impression that the code in kernel/power/process.c was compiled only if CONFIG_PM_SLEEP was set. Now that the code has been moved to kernel/freezer.c, it is unconditionally compiled into every kernel. Is that correct? If so, is it possible to put this new feature under some CONFIG_SOMETHING option, for people who care about kernel size? How about making it depend on a combination of CONFIG variables? Here's an RFC PATCH. Completely untested. Signed-off-by: Matt Helsley [EMAIL PROTECTED] Can you please also make the contents of include/linux/freezer.h depend on CONFIG_FREEZER instead of CONFIG_PM_SLEEP? Done. Also, I'm not really sure if kernel/power/Kconfig is the right place to define CONFIG_FREEZER. Perhaps we should even move freezer.c from kernel/power to kernel and define CONFIG_FREEZER in the Kconfig there. Andrew, what do you think? I'll check this weekend for replies and repost the RFC PATCH on Monday if I don't hear anything. In the meantime I'll be doing some config build testing with the above changes to make sure it's correct. Cheers, -Matt
[Devel] [RFC][PATCH 0/6] Enable multiple mounts of devpts
I thought I would send out the patches I mentioned to H. Peter Anvin recently to get some feedback on the general direction. This version of the patchset ducks the user-space issue, for now. --- Enable multiple mounts of the devpts filesystem so each container can allocate ptys independently. To enable multiple mounts, (most) devpts interfaces need to know which instance of devpts is being accessed. This patchset uses the 'struct inode' of the device being accessed to identify the appropriate devpts instance. It then uses get_sb_nodev() instead of get_sb_single() to allow multiple mounts. PATCH 1/6 Pass-in 'struct inode' to devpts interfaces PATCH 2/6 Remove 'devpts_root' global PATCH 3/6 Move 'allocated_ptys' to sb->s_fs_info PATCH 4/6 Allow mknod of ptmx and tty devices PATCH 5/6 Allow multiple mounts of devpts PATCH 6/6 Tweak in init_dev() /dev/tty If devpts is mounted just once, this patchset should not change any behavior. If devpts is mounted more than once, then '/dev/ptmx' must be a symlink to '/dev/pts/ptmx', and in each new devpts mount we must create the device node '/dev/pts/ptmx' [c, 5:2] by hand. Have only done some basic testing with multiple mounts and sshd. May not be bisect-safe. Appreciate comments on the overall approach of mapping from the inode to sb->s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(), and also on the tweak to init_dev() (patch 6). Todo: User-space impact of the /dev/ptmx symlink - options are being discussed on the mailing list (new mount option and config token, new fs name, etc). Remove even the initial kernel mount of devpts?
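[Editor's note: independent of how many devpts instances are mounted, userspace allocates ptys through the ptmx node with the standard POSIX sequence below. This is plain userspace code, not part of the patchset; the point of the series is that opening a container's own /dev/pts/ptmx must allocate an index from that mount's 'allocated_ptys' pool, so this exact sequence keeps working unchanged inside each container.]

```c
#define _XOPEN_SOURCE 600
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Standard pty-master allocation via ptmx.  Returns the master fd and
 * fills 'slave' with the slave path (e.g. /dev/pts/3), or returns -1.
 */
static int open_master(char *slave, size_t n)
{
	char *s;
	int mfd = posix_openpt(O_RDWR | O_NOCTTY);

	if (mfd < 0)
		return -1;
	/* make the slave node usable before anyone opens it */
	if (grantpt(mfd) < 0 || unlockpt(mfd) < 0) {
		close(mfd);
		return -1;
	}
	s = ptsname(mfd);   /* slave path within the same devpts mount */
	if (s == NULL) {
		close(mfd);
		return -1;
	}
	strncpy(slave, s, n - 1);
	slave[n - 1] = '\0';
	return mfd;
}
```

With multiple devpts mounts, two containers can both end up with a "/dev/pts/0" — the index is now meaningful only relative to the mount the master was opened through, which is why the interfaces in patch 1/6 need the inode to find the right instance.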
[Devel] [RFC][PATCH 1/6] Pass-in 'struct inode' to devpts interfaces
From: Sukadev Bhattiprolu [EMAIL PROTECTED] Subject: [RFC][PATCH 1/6] Pass-in 'struct inode' to devpts interfaces Pass-in an 'inode' parameter to devpts interfaces. The parameter itself will be used in subsequent patches to identify the instance of devpts mounted. --- drivers/char/pty.c|3 ++- drivers/char/tty_io.c | 21 +++-- fs/devpts/inode.c | 10 +- include/linux/devpts_fs.h | 34 -- 4 files changed, 42 insertions(+), 26 deletions(-) Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c === --- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 02:07:25.0 -0700 +++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c 2008-08-04 02:08:15.0 -0700 @@ -177,7 +177,7 @@ static struct dentry *get_node(int num) return lookup_one_len(s, root, sprintf(s, %d, num)); } -int devpts_new_index(void) +int devpts_new_index(struct inode *inode) { int index; int idr_ret; @@ -205,14 +205,14 @@ retry: return index; } -void devpts_kill_index(int idx) +void devpts_kill_index(struct inode *inode, int idx) { mutex_lock(allocated_ptys_lock); idr_remove(allocated_ptys, idx); mutex_unlock(allocated_ptys_lock); } -int devpts_pty_new(struct tty_struct *tty) +int devpts_pty_new(struct inode *ptmx_inode, struct tty_struct *tty) { int number = tty-index; /* tty layer puts index from devpts_new_index() in here */ struct tty_driver *driver = tty-driver; @@ -245,7 +245,7 @@ int devpts_pty_new(struct tty_struct *tt return 0; } -struct tty_struct *devpts_get_tty(int number) +struct tty_struct *devpts_get_tty(struct inode *inode, int number) { struct dentry *dentry = get_node(number); struct tty_struct *tty; @@ -262,7 +262,7 @@ struct tty_struct *devpts_get_tty(int nu return tty; } -void devpts_pty_kill(int number) +void devpts_pty_kill(struct inode *inode, int number) { struct dentry *dentry = get_node(number); Index: linux-2.6.26-rc8-mm1/include/linux/devpts_fs.h === --- linux-2.6.26-rc8-mm1.orig/include/linux/devpts_fs.h 2008-08-04 02:07:24.0 -0700 +++ linux-2.6.26-rc8-mm1/include/linux/devpts_fs.h 2008-08-04 
02:07:27.0 -0700 @@ -17,20 +17,34 @@ #ifdef CONFIG_UNIX98_PTYS -int devpts_new_index(void); -void devpts_kill_index(int idx); -int devpts_pty_new(struct tty_struct *tty); /* mknod in devpts */ -struct tty_struct *devpts_get_tty(int number); /* get tty structure */ -void devpts_pty_kill(int number); /* unlink */ +int devpts_new_index(struct inode *inode); +void devpts_kill_index(struct inode *inode, int idx); + +/* mknod in devpts */ +int devpts_pty_new(struct inode *inode, struct tty_struct *tty); + +/* get tty structure */ +struct tty_struct *devpts_get_tty(struct inode *inode, int number); + +/* unlink */ +void devpts_pty_kill(struct inode *inode, int number); #else /* Dummy stubs in the no-pty case */ -static inline int devpts_new_index(void) { return -EINVAL; } -static inline void devpts_kill_index(int idx) { } -static inline int devpts_pty_new(struct tty_struct *tty) { return -EINVAL; } -static inline struct tty_struct *devpts_get_tty(int number) { return NULL; } -static inline void devpts_pty_kill(int number) { } +static inline int devpts_new_index(struct inode *inode) { return -EINVAL; } +static inline void devpts_kill_index(struct inode *inode, int idx) { } + +static inline int devpts_pty_new(struct inode *inode, struct tty_struct *tty) +{ + return -EINVAL; +} + +static inline struct tty_struct *devpts_get_tty(struct inode *inode, int number) +{ + return NULL; +} +static inline void devpts_pty_kill(struct inode *inode, int number) { } #endif Index: linux-2.6.26-rc8-mm1/drivers/char/pty.c === --- linux-2.6.26-rc8-mm1.orig/drivers/char/pty.c2008-08-04 02:07:24.0 -0700 +++ linux-2.6.26-rc8-mm1/drivers/char/pty.c 2008-08-04 02:07:27.0 -0700 @@ -59,7 +59,8 @@ static void pty_close(struct tty_struct set_bit(TTY_OTHER_CLOSED, &tty->flags); #ifdef CONFIG_UNIX98_PTYS if (tty->driver == ptm_driver) - devpts_pty_kill(tty->index); + devpts_pty_kill(filp->f_path.dentry->d_inode, + tty->index); #endif tty_vhangup(tty->link); } Index: linux-2.6.26-rc8-mm1/drivers/char/tty_io.c === 
--- linux-2.6.26-rc8-mm1.orig/drivers/char/tty_io.c 2008-08-04 02:07:24.0 -0700 +++ linux-2.6.26-rc8-mm1/drivers/char/tty_io.c 2008-08-04 02:07:55.0 -0700 @@ -2056,7 +2056,7 @@ static void tty_line_name(struct tty_dri * relaxed for
[Devel] [RFC][PATCH 2/6] Remove 'devpts_root' global
From: Sukadev Bhattiprolu [EMAIL PROTECTED] Subject: [RFC][PATCH 2/6] Remove 'devpts_root' global Remove the 'devpts_root' global variable and find the root dentry using the super_block. The super-block itself is found from the device inode, using a new wrapper, pts_sb_from_inode(). --- fs/devpts/inode.c | 36 1 file changed, 24 insertions(+), 12 deletions(-) Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c === --- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 02:08:15.0 -0700 +++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c 2008-08-04 02:08:43.0 -0700 @@ -33,7 +33,14 @@ static DEFINE_IDR(allocated_ptys); static DEFINE_MUTEX(allocated_ptys_lock); static struct vfsmount *devpts_mnt; -static struct dentry *devpts_root; + +static inline struct super_block *pts_sb_from_inode(struct inode *inode) +{ + if (inode-i_sb-s_magic == DEVPTS_SUPER_MAGIC) + return inode-i_sb; + + return devpts_mnt-mnt_sb; +} static struct { int setuid; @@ -141,7 +148,7 @@ devpts_fill_super(struct super_block *s, inode-i_fop = simple_dir_operations; inode-i_nlink = 2; - devpts_root = s-s_root = d_alloc_root(inode); + s-s_root = d_alloc_root(inode); if (s-s_root) return 0; @@ -169,10 +176,9 @@ static struct file_system_type devpts_fs * to the System V naming convention */ -static struct dentry *get_node(int num) +static struct dentry *get_node(struct dentry *root, int num) { char s[12]; - struct dentry *root = devpts_root; mutex_lock(root-d_inode-i_mutex); return lookup_one_len(s, root, sprintf(s, %d, num)); } @@ -218,7 +224,9 @@ int devpts_pty_new(struct inode *ptmx_in struct tty_driver *driver = tty-driver; dev_t device = MKDEV(driver-major, driver-minor_start+number); struct dentry *dentry; - struct inode *inode = new_inode(devpts_mnt-mnt_sb); + struct super_block *sb = pts_sb_from_inode(ptmx_inode); + struct inode *inode = new_inode(sb); + struct dentry *root = sb-s_root; /* We're supposed to be given the slave end of a pty */ BUG_ON(driver-type != TTY_DRIVER_TYPE_PTY); @@ -234,20 +242,22 @@ 
int devpts_pty_new(struct inode *ptmx_in init_special_inode(inode, S_IFCHR|config.mode, device); inode-i_private = tty; - dentry = get_node(number); + dentry = get_node(root, number); if (!IS_ERR(dentry) !dentry-d_inode) { d_instantiate(dentry, inode); - fsnotify_create(devpts_root-d_inode, dentry); + fsnotify_create(root-d_inode, dentry); } - mutex_unlock(devpts_root-d_inode-i_mutex); + mutex_unlock(root-d_inode-i_mutex); return 0; } struct tty_struct *devpts_get_tty(struct inode *inode, int number) { - struct dentry *dentry = get_node(number); + struct super_block *sb = pts_sb_from_inode(inode); + struct dentry *root = sb-s_root; + struct dentry *dentry = get_node(root, number); struct tty_struct *tty; tty = NULL; @@ -257,14 +267,16 @@ struct tty_struct *devpts_get_tty(struct dput(dentry); } - mutex_unlock(devpts_root-d_inode-i_mutex); + mutex_unlock(root-d_inode-i_mutex); return tty; } void devpts_pty_kill(struct inode *inode, int number) { - struct dentry *dentry = get_node(number); + struct super_block *sb = pts_sb_from_inode(inode); + struct dentry *root = sb-s_root; + struct dentry *dentry = get_node(root, number); if (!IS_ERR(dentry)) { struct inode *inode = dentry-d_inode; @@ -275,7 +287,7 @@ void devpts_pty_kill(struct inode *inode } dput(dentry); } - mutex_unlock(devpts_root-d_inode-i_mutex); + mutex_unlock(root-d_inode-i_mutex); } static int __init init_devpts_fs(void) ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [RFC][PATCH 4/6]: Allow mknod of ptmx in devpts
From: Sukadev Bhattiprolu [EMAIL PROTECTED] Subject: [RFC][PATCH 4/6]: Allow mknod of ptmx in devpts /dev/ptmx is closely tied to the devpts filesystem. An open of /dev/ptmx allocates the next pty index and the associated device shows up in the devpts fs as /dev/pts/n. With multiple mounts of the devpts filesystem, an open of /dev/ptmx would be unable to determine which instance of devpts is being accessed. One solution for this would be to make /dev/ptmx a symlink to /dev/pts/ptmx and create the device node, ptmx, in each instance of devpts. When /dev/ptmx is opened, we can use the inode of /dev/pts/ptmx to identify the specific devpts instance. (This solution has an impact on the 'startup scripts', and that is being discussed separately). This patch merely enables creating the [c, 5:2] (ptmx) device in the devpts filesystem. TODO: - Ability to unlink the /dev/pts/ptmx - Remove traces of '/dev/pts/tty' node Changelog: - Earlier version of this patch enabled creating /dev/pts/tty as well. As pointed out by Al Viro and H. Peter Anvin, that is not really necessary. 
--- fs/devpts/inode.c | 56 +++--- 1 file changed, 53 insertions(+), 3 deletions(-) Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c === --- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 02:08:50.0 -0700 +++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c 2008-08-04 17:26:26.0 -0700 @@ -141,6 +141,56 @@ static void *new_pts_fs_info(void) } + +static int devpts_mknod(struct inode *dir, struct dentry *dentry, + int mode, dev_t rdev) +{ + int inum; + struct inode *inode; + struct super_block *sb = dir-i_sb; + + if (dentry-d_inode) + return -EEXIST; + + if (!S_ISCHR(mode)) + return -EPERM; + + if (rdev == MKDEV(TTYAUX_MAJOR, 2)) + inum = 2; +#if 0 + else if (rdev == MKDEV(TTYAUX_MAJOR, 0)) + inum = 3; +#endif + else + return -EPERM; + + inode = new_inode(sb); + if (!inode) + return -ENOMEM; + + inode-i_ino = inum; + inode-i_uid = inode-i_gid = 0; + inode-i_blocks = 0; + inode-i_mtime = inode-i_atime = inode-i_ctime = CURRENT_TIME; + + init_special_inode(inode, mode, rdev); + + d_instantiate(dentry, inode); + /* +* Get a reference to the dentry so the device-nodes persist +* even when there are no active references to them. We use +* kill_litter_super() to remove this entry when unmounting +* devpts. 
+*/ + dget(dentry); + return 0; +} + +const struct inode_operations devpts_dir_inode_operations = { + .lookup = simple_lookup, + .mknod = devpts_mknod, +}; + static int devpts_fill_super(struct super_block *s, void *data, int silent) { @@ -164,7 +214,7 @@ devpts_fill_super(struct super_block *s, inode-i_blocks = 0; inode-i_uid = inode-i_gid = 0; inode-i_mode = S_IFDIR | S_IRUGO | S_IXUGO | S_IWUSR; - inode-i_op = simple_dir_inode_operations; + inode-i_op = devpts_dir_inode_operations; inode-i_fop = simple_dir_operations; inode-i_nlink = 2; @@ -195,7 +245,7 @@ static void devpts_kill_sb(struct super_ //idr_destroy(fsi-allocated_ptys); kfree(fsi); - kill_anon_super(sb); + kill_litter_super(sb); } static struct file_system_type devpts_fs_type = { @@ -274,7 +324,7 @@ int devpts_pty_new(struct inode *ptmx_in if (!inode) return -ENOMEM; - inode-i_ino = number+2; + inode-i_ino = number+4; inode-i_uid = config.setuid ? config.uid : current-fsuid; inode-i_gid = config.setgid ? config.gid : current-fsgid; inode-i_mtime = inode-i_atime = inode-i_ctime = CURRENT_TIME; ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [RFC][PATCH 3/6] Move 'allocated_ptys' to sb->s_fs_info
From: Sukadev Bhattiprolu [EMAIL PROTECTED] Subject: [RFC][PATCH 3/6] Move 'allocated_ptys' to sb-s_s_fs_info To enable multiple mounts of devpts, 'allocated_ptys' must be a per-mount variable rather than a global variable. This patch moves 'allocated_ptys' into the super_block's s_fs_info. --- fs/devpts/inode.c | 53 ++--- 1 file changed, 46 insertions(+), 7 deletions(-) Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c === --- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 02:08:43.0 -0700 +++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c 2008-08-04 02:08:50.0 -0700 @@ -28,8 +28,11 @@ #define DEVPTS_DEFAULT_MODE 0600 +struct pts_fs_info { + struct idr allocated_ptys; +}; + extern int pty_limit; /* Config limit on Unix98 ptys */ -static DEFINE_IDR(allocated_ptys); static DEFINE_MUTEX(allocated_ptys_lock); static struct vfsmount *devpts_mnt; @@ -125,6 +128,19 @@ static const struct super_operations dev .show_options = devpts_show_options, }; +static void *new_pts_fs_info(void) +{ + struct pts_fs_info *fsi; + + fsi = kmalloc(sizeof(struct pts_fs_info), GFP_KERNEL); + if (fsi) { + idr_init(fsi-allocated_ptys); + } + printk(KERN_ERR new_pts_fs_info(): Returning fsi %p\n, fsi); + return fsi; +} + + static int devpts_fill_super(struct super_block *s, void *data, int silent) { @@ -135,10 +151,14 @@ devpts_fill_super(struct super_block *s, s-s_magic = DEVPTS_SUPER_MAGIC; s-s_op = devpts_sops; s-s_time_gran = 1; + s-s_fs_info = new_pts_fs_info(); + + if (!s-s_fs_info) + goto fail; inode = new_inode(s); if (!inode) - goto fail; + goto free_fsi; inode-i_ino = 1; inode-i_mtime = inode-i_atime = inode-i_ctime = CURRENT_TIME; inode-i_blocks = 0; @@ -154,6 +174,9 @@ devpts_fill_super(struct super_block *s, printk(devpts: get root dentry failed\n); iput(inode); + +free_fsi: + kfree(s-s_fs_info); fail: return -ENOMEM; } @@ -164,11 +187,22 @@ static int devpts_get_sb(struct file_sys return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt); } + +static void 
devpts_kill_sb(struct super_block *sb) +{ + struct pts_fs_info *fsi = sb-s_fs_info;// rcu ? + + //idr_destroy(fsi-allocated_ptys); + kfree(fsi); + + kill_anon_super(sb); +} + static struct file_system_type devpts_fs_type = { .owner = THIS_MODULE, .name = devpts, .get_sb = devpts_get_sb, - .kill_sb= kill_anon_super, + .kill_sb= devpts_kill_sb, }; /* @@ -187,14 +221,16 @@ int devpts_new_index(struct inode *inode { int index; int idr_ret; + struct super_block *sb = pts_sb_from_inode(inode); + struct pts_fs_info *fsi = sb-s_fs_info;// need rcu ? retry: - if (!idr_pre_get(allocated_ptys, GFP_KERNEL)) { + if (!idr_pre_get(fsi-allocated_ptys, GFP_KERNEL)) { return -ENOMEM; } mutex_lock(allocated_ptys_lock); - idr_ret = idr_get_new(allocated_ptys, NULL, index); + idr_ret = idr_get_new(fsi-allocated_ptys, NULL, index); if (idr_ret 0) { mutex_unlock(allocated_ptys_lock); if (idr_ret == -EAGAIN) @@ -203,7 +239,7 @@ retry: } if (index = pty_limit) { - idr_remove(allocated_ptys, index); + idr_remove(fsi-allocated_ptys, index); mutex_unlock(allocated_ptys_lock); return -EIO; } @@ -213,8 +249,11 @@ retry: void devpts_kill_index(struct inode *inode, int idx) { + struct super_block *sb = pts_sb_from_inode(inode); + struct pts_fs_info *fsi = sb-s_fs_info;// need rcu ? + mutex_lock(allocated_ptys_lock); - idr_remove(allocated_ptys, idx); + idr_remove(fsi-allocated_ptys, idx); mutex_unlock(allocated_ptys_lock); } ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
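The core of the patch above is that the pty index allocator becomes per-superblock state instead of a single global idr, so each devpts mount numbers its ptys independently. As a rough userspace model of that ownership change (illustrative names; a fixed array stands in for the kernel's idr, and there is no locking), the behavior looks like this:

```c
#include <assert.h>
#include <stdlib.h>

#define PTY_LIMIT 4 /* stand-in for the global pty_limit sysctl */

/* Userspace model of the per-superblock state from PATCH 3/6.
 * The kernel uses an idr; a fixed "used" array is enough to show the idea. */
struct pts_fs_info {
	int used[PTY_LIMIT]; /* model of fsi->allocated_ptys */
};

struct super_block {
	struct pts_fs_info *s_fs_info;
};

/* Model of devpts_new_index(): indices are now allocated per mount,
 * so the lowest free index is found in this mount's table only. */
static int devpts_new_index_model(struct super_block *sb)
{
	struct pts_fs_info *fsi = sb->s_fs_info;

	for (int i = 0; i < PTY_LIMIT; i++) {
		if (!fsi->used[i]) {
			fsi->used[i] = 1;
			return i;
		}
	}
	return -1; /* the kernel returns -EIO past pty_limit */
}

/* Model of devpts_kill_index(): release an index back to this mount. */
static void devpts_kill_index_model(struct super_block *sb, int idx)
{
	sb->s_fs_info->used[idx] = 0;
}

/* Model of devpts_fill_super() allocating a fresh pts_fs_info per mount. */
static struct super_block *new_mount(void)
{
	struct super_block *sb = calloc(1, sizeof(*sb));
	sb->s_fs_info = calloc(1, sizeof(struct pts_fs_info));
	return sb;
}
```

The point of the model: two mounts can both hand out index 0, which is exactly what independent per-container pty namespaces require.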
[Devel] [RFC][PATCH 6/6]: /dev/tty tweak in init_dev()
From: Sukadev Bhattiprolu [EMAIL PROTECTED] Subject: [RFC][PATCH 6/6]: /dev/tty tweak in init_dev() When opening /dev/tty, __tty_open() finds the tty using get_current_tty(). When __tty_open() calls init_dev(), init_dev() tries to 'find' the tty again from devpts. Is that really necessary? The problem with asking devpts again is that with multiple mounts, devpts cannot find the tty without knowing the specific mount instance. We can't find the mount instance of devpts, since the inode of /dev/tty is in a different filesystem. --- drivers/char/tty_io.c | 5 +++- 1 file changed, 4 insertions(+), 1 deletion(-) Index: linux-2.6.26-rc8-mm1/drivers/char/tty_io.c === --- linux-2.6.26-rc8-mm1.orig/drivers/char/tty_io.c 2008-08-04 17:25:20.0 -0700 +++ linux-2.6.26-rc8-mm1/drivers/char/tty_io.c 2008-08-04 17:26:34.0 -0700 @@ -2066,7 +2066,10 @@ static int init_dev(struct tty_driver *d /* check whether we're reopening an existing tty */ if (driver->flags & TTY_DRIVER_DEVPTS_MEM) { - tty = devpts_get_tty(inode, idx); + if (inode->i_rdev == MKDEV(TTYAUX_MAJOR, 0)) + tty = *ret_tty; + else + tty = devpts_get_tty(inode, idx); /* * If we don't have a tty here on a slave open, it's because * the master already started the close process and there's
[Devel] [RFC][PATCH 5/6] Allow multiple mounts of devpts
From: Sukadev Bhattiprolu [EMAIL PROTECTED] Subject: [RFC][PATCH 5/6] Allow multiple mounts of devpts Can we simply enable multiple mounts using get_sb_nodev(), now that we don't have any pts_namespace/'data' to be saved? (quick/dirty - does not prevent multiple mounts of devpts within a single 'container') --- fs/devpts/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6.26-rc8-mm1/fs/devpts/inode.c === --- linux-2.6.26-rc8-mm1.orig/fs/devpts/inode.c 2008-08-04 17:26:26.0 -0700 +++ linux-2.6.26-rc8-mm1/fs/devpts/inode.c 2008-08-04 17:26:31.0 -0700 @@ -234,7 +234,7 @@ fail: static int devpts_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, struct vfsmount *mnt) { - return get_sb_single(fs_type, flags, data, devpts_fill_super, mnt); + return get_sb_nodev(fs_type, flags, data, devpts_fill_super, mnt); }
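This one-line switch is what actually permits multiple instances: get_sb_single() hands every mount the same superblock, while get_sb_nodev() allocates a fresh superblock (and hence a fresh pts_fs_info) per mount. A toy userspace model of the difference (hypothetical names, no VFS involved):

```c
#include <assert.h>
#include <stdlib.h>

struct super_block { int id; };

static struct super_block *singleton; /* what get_sb_single() keeps reusing */
static int next_id;

/* Model of get_sb_single(): every mount shares one superblock, so all
 * mounts of devpts see the same pty namespace. */
static struct super_block *get_sb_single_model(void)
{
	if (!singleton) {
		singleton = calloc(1, sizeof(*singleton));
		singleton->id = next_id++;
	}
	return singleton;
}

/* Model of get_sb_nodev(): each mount gets its own superblock, so each
 * container's devpts mount gets private state. */
static struct super_block *get_sb_nodev_model(void)
{
	struct super_block *sb = calloc(1, sizeof(*sb));
	sb->id = next_id++;
	return sb;
}
```

This also shows why the patch is marked quick/dirty: nothing in get_sb_nodev() itself stops two mounts inside one container from getting two unrelated instances.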
[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts
[EMAIL PROTECTED] wrote: If devpts is mounted more than once, then '/dev/ptmx' must be a symlink to '/dev/pts/ptmx' and in each new devpts mount we must create the device node '/dev/pts/ptmx' [c, 5:2] by hand. This should be auto-created. That also eliminates any need to support the mknod system call. Appreciate comments on the overall approach of my mapping from the inode to sb->s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(), and also on the tweak to init_dev() (patch 6). Todo: User-space impact of the /dev/ptmx symlink - Options are being discussed on the mailing list (new mount option and config token, new fs name, etc) Remove even the initial kernel mount of devpts? The initial kernel mount of devpts should be removed, since that instance will never be accessible. -hpa
[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts
H. Peter Anvin [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: If devpts is mounted more than once, then '/dev/ptmx' must be a symlink to '/dev/pts/ptmx' and in each new devpts mount we must create the device node '/dev/pts/ptmx' [c, 5:2] by hand. This should be auto-created. That also eliminates any need to support the mknod system call. Ok. But I was wondering if we can pass the ptmx symlink burden to the 'container-startup scripts' since they are the ones that need the second or subsequent mount of devpts. So, initially and for systems that don't need multiple mounts of devpts, existing behavior can continue (/dev/ptmx is a node). Container startup scripts have to remount /dev/pts and mknod /dev/pts/ptmx anyway. These scripts could additionally check if /dev/ptmx is a node and make it a symlink. The container script would have to do this check while it still has access to the first mount of devpts and mknod in the first devpts mnt. But then again, the first mount is still special in the kernel. Appreciate comments on the overall approach of my mapping from the inode to sb->s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(), and also on the tweak to init_dev() (patch 6). Todo: User-space impact of the /dev/ptmx symlink - Options are being discussed on the mailing list (new mount option and config token, new fs name, etc) Remove even the initial kernel mount of devpts? The initial kernel mount of devpts should be removed, since that instance will never be accessible. -hpa
[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts
[EMAIL PROTECTED] wrote: Appreciate comments on the overall approach of my mapping from the inode to sb->s_fs_info to allocated_ptys and the hacky use of get_sb_nodev(), and also on the tweak to init_dev() (patch 6). First of all, thanks for taking this on :) It's always delightful to spout some ideas and have patches appear as a result :) Once you have the notion of the device nodes tied to a specific devpts filesystem, a lot of the operations can be trivialized; for example, the whole devpts_get_tty() mechanism can be reduced to: if (inode->i_sb->s_magic != DEVPTS_SUPER_MAGIC) { /* do cleanup */ return -ENXIO; } tty = inode->i_private; This is part of what makes this whole approach so desirable: it actually allows for some dramatic simplifications of the existing code. One can even bind special operations to both the ptmx node and slave nodes, to bypass most of the character device and tty dispatch. That might require too much hacking at the tty core to be worth it, though. -hpa
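The trivialized lookup hpa sketches can be modeled in userspace as follows. The magic constant and the i_private idea mirror the mail (patch 1/6 already stores the tty in inode->i_private for slave nodes); the struct definitions here are cut down to the minimum and are purely illustrative:

```c
#include <assert.h>
#include <stddef.h>

#define DEVPTS_SUPER_MAGIC 0x1cd1 /* value from the real devpts code */

struct super_block { unsigned long s_magic; };
struct tty_struct { int index; };

struct inode {
	struct super_block *i_sb;
	void *i_private; /* for devpts nodes: points at the slave tty */
};

/* Model of the reduced devpts_get_tty(): no dentry walk, no get_node() -
 * just check that the inode really belongs to a devpts superblock and
 * read the tty straight out of i_private. */
static struct tty_struct *devpts_get_tty_model(struct inode *inode)
{
	if (inode->i_sb->s_magic != DEVPTS_SUPER_MAGIC)
		return NULL; /* the kernel version would fail with -ENXIO */
	return inode->i_private;
}
```

The simplification works precisely because, with per-mount superblocks, the inode alone identifies both the devpts instance and the tty.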
[Devel] Re: [RFC][PATCH 0/6] Enable multiple mounts of devpts
[EMAIL PROTECTED] wrote: Ok. But I was wondering if we can pass the ptmx symlink burden to the 'container-startup scripts' since they are the ones that need the second or subsequent mount of devpts. So, initially and for systems that don't need multiple mounts of devpts, existing behavior can continue (/dev/ptmx is a node). Container startup scripts have to remount /dev/pts and mknod /dev/pts/ptmx anyway. These scripts could additionally check if /dev/ptmx is a node and make it a symlink. The container script would have to do this check while it still has access to the first mount of devpts and mknod in the first devpts mnt. But then again, the first mount is still special in the kernel. You're right, I think we can do this and still retain most of the advantages, at least for a transition period. The idea would be that you'd have a mount option such that, if you do not specify it, you get a bind to the in-kernel mount; otherwise you get a new instance. ptmx, if not invoked from inside a devpts filesystem, would default to the kernel-mounted instance. Unfortunately I believe that means parsing the mount options in devpts_get_sb() to know if we have the multi option, but that isn't really all that difficult; it just means breaking the parser out as a separate subroutine. -hpa
[Devel] Re: [RFC][PATCH 2/2] CR: handle a single task with private memory maps
Louis Rilling wrote: On Fri, Aug 01, 2008 at 02:51:57PM -0400, Oren Laadan wrote: Louis Rilling wrote: On Fri, Aug 01, 2008 at 10:15:26AM -0400, Oren Laadan wrote: Louis Rilling wrote: On Thu, Jul 31, 2008 at 03:12:32PM -0400, Oren Laadan wrote: Cut the less interesting (IMHO at least) history to make Dave happier ;) Returning 0 in case of a restart is what I called a special handling. You won't do this for the other tasks, so this is special. Since userspace must cope with it anyway, userspace can be clever enough to avoid using the fd on restart, or stupid enough to destroy its checkpoint after restart. It's a different special handling :) In the case of a single task that wants to checkpoint itself - there are no other tasks. In the case of a container - there will be only a single task that calls sys_checkpoint(), so only that task will either get the CRID or the 0 (or an error). The other tasks will resume whatever it was that they were doing (lol, assuming of course restart works). So this special handling ends up being a two-liner: setting the return value of the syscall for the task that called sys_checkpoint() (well, actually it will call sys_restart() to restart, and return from sys_checkpoint() with a value of 0 ...). I knew it, since I actually saw it in the patches you sent last week. If you use an FD, you will have to checkpoint that resource as part of the checkpoint, and restore it as part of the restart. In doing so you'll need to specially handle it, because it has a special meaning. I agree, of course, that it is feasible. - Userspace makes fewer errors when managing incremental checkpoints. Have you implemented this? Did you experience issues in real life? User space will need a way to manage all of it anyway in many aspects. This will be the last/least of the issues ... No it was not implemented, and I'm not going to enter a discussion about the weight of arguments whether they are backed by implementations or not. 
It just becomes easier to create a mess with things depending on each other created as separate, freely (userspace-decided)-named objects. If I were to write a user-space tool to handle this, I would keep each chain of checkpoints (from base and on) in a separate subdir, for example. In fact, that's how I did it :) This is intuitive indeed. Checkpoints are already organized in a similar way in Kerrighed, except that a notion of application (transparent to applications) replaces the notion of container, and the kernel decides where to put the checkpoints and how they are named (I'm not saying that this is the best way though). Besides, this scheme begins to sound much more complex than a single file. Do you really gain so much from not having multiple files, one per checkpoint ? Well, at least you are not limited by the number of open file descriptors (assuming that, as you mentioned earlier, you pass an array of previous images to compute the next incremental checkpoint). You aren't limited by the number of open file. User space could provide an array of CRID, pathname (or serial#, pathname) to the kernel, the kernel will access the files as necessary. But the kernel itself would have to cope with this limit (even if it is not enforced, just to avoid consuming too much resources), or close and reopen files when needed... You got - close and reopen as needed with LRU policy to decide which open file to close. My experience so far is that you rarely need more than 100 open files. Uhh .. hold on: you need the array of previous checkpoint to _restart_ from an incremental checkpoint. You don't care about it when you checkpoint: instead, you keep track in memory of (1) what changed (e.g. which pages where touched), and (2) where to find unmodified pages in previous checkpoints. You save this information with each new checkpoint. 
The data structure to describe #2 is dynamic and changes with the execution, and easily keeps track of when older checkpoint images become irrelevant (because all the pages they hold have been overwritten already). I see. I thought that you also intended to build incremental checkpoints from previous checkpoints only, because even if this is not fast, this saves storage space. I agree that if you always keep necessary metadata in kernel memory, you don't need the previous images. Actually I don't know any incremental checkpoint scheme not using such in-memory metadata scheme. Which does not imply that other schemes are not relevant though... where: - base_fd is a regular file containing the base checkpoint, or -1 if a full checkpoint should be done. The checkpoint could actually also live in memory, and the kernel should check that it matches the image pointed to by base_fd. - out_fd is whatever file/socket/etc. on which we should dump the checkpoint. In particular, out_fd can equal base_fd and should
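The in-memory metadata Oren describes - per page, (1) was it dirtied since the last checkpoint and (2) which older image holds its latest saved copy - can be modeled with a toy userspace sketch. None of these names come from the actual patches; arrays stand in for whatever dynamic structure the real implementation would use:

```c
#include <assert.h>

#define NPAGES 8 /* toy address space of 8 pages */

/* Illustrative model of incremental-checkpoint metadata. */
struct ckpt_meta {
	int next_serial;      /* serial number the next image will get    */
	int dirty[NPAGES];    /* touched since the previous checkpoint?   */
	int saved_in[NPAGES]; /* image holding each page's latest copy    */
};

static void ckpt_init(struct ckpt_meta *m)
{
	m->next_serial = 1;
	for (int i = 0; i < NPAGES; i++) {
		m->dirty[i] = 1; /* every page must go into the base image */
		m->saved_in[i] = 0;
	}
}

/* Called from (a model of) the write-protect fault path. */
static void page_touched(struct ckpt_meta *m, int pfn)
{
	m->dirty[pfn] = 1;
}

/* Take one checkpoint: write only dirty pages into the new image;
 * clean pages keep pointing at whichever older image already holds
 * them.  Returns the number of pages written to the new image. */
static int do_checkpoint(struct ckpt_meta *m)
{
	int written = 0;

	for (int i = 0; i < NPAGES; i++) {
		if (m->dirty[i]) {
			m->saved_in[i] = m->next_serial;
			m->dirty[i] = 0;
			written++;
		}
	}
	m->next_serial++;
	return written;
}
```

The saved_in[] table is also what makes garbage collection natural: once no page points at an old image any more, that image has become irrelevant and can be deleted, exactly as described above.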
[Devel] RE: Too many I/O controller patches
Hi, Andrea. I participated in the Containers Mini-summit, and I talked with Mr. Andrew Morton at The Linux Foundation Japan Symposium BoF in Japan on July 10th. Currently, several I/O controller patches have been sent to the ML, and each author keeps sending improved versions of his own patch. Neither we nor the maintainers like this situation. We wanted to resolve it at the Mini-summit, but unfortunately no other developers participated. (I couldn't give an opinion, because my English skill is low.) Mr. Naveen presented his approach at the Linux Symposium, and we discussed I/O control for a short time after this presentation. Mr. Andrew advised me that we should discuss the design more. And, at the Containers Mini-summit (and Linux Symposium 2008 in Ottawa), Paul said that what we need first is to decide on the requirements. So, we must discuss requirements and design.

My requirements are:
* to be able to distribute performance moderately
(* to be able to isolate each group (environment))

I guess (it may be wrong) Naveen's requirements are:
* to be able to handle latency (high-priority I/O always takes precedence in handling; priority is not only given by share, as in CFQ)
* to be able to distribute performance moderately

Andrea's requirement is:
* to be able to set and control by absolute (direct) performance

Ryo's requirements are:
* to be able to distribute performance moderately
* to be able to set and control I/Os over a flexible range (multiple devices, such as LVM)

I think that most solutions control I/O performance moderately (using weight/priority/percentage/etc. rather than absolute values) because disk I/O performance is not constant and is affected by the situation (the application, file (data) layout, and so on). So, it is difficult to guarantee performance specified as an absolute bandwidth. If devices had constant performance, it would be good to control by absolute bandwidth. And, if only a low level of performance had to be guaranteed, it would be possible. 
However, no one likes to waste resources. And he gave the advice, "Can't a framework be made which organizes each approach, such as the I/O elevator?" I am trying to design such a framework (in the elevator layer or the block layer). Now I am looking at the other methods again. I think that the OOM problems are caused by the memory/cache systems. So, it would be better for the I/O controller to be designed apart from these problems first, although the latency of the I/O device is related. If these problems can be resolved, the technique should be applied to normal I/O control as well as cgroups. Buffered write I/O is also related to the cache system; we must consider this problem as part of I/O control. I don't have a good way to resolve these problems. I did some experiments trying to implement minimum bandwidth requirements for my io-throttle controller, mapping the requirements to CFQ prio and using Satoshi's controller. But this needs additional work and testing right now, so I've not posted anything yet, just informed Satoshi about this. I'm very interested in these results. Thanks, Satoshi Uchida. -Original Message- From: Andrea Righi [mailto:[EMAIL PROTECTED] Sent: Tuesday, August 05, 2008 3:23 AM To: Dave Hansen Cc: Ryo Tsuruta; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Satoshi UCHIDA Subject: Re: Too many I/O controller patches Dave Hansen wrote: On Mon, 2008-08-04 at 17:51 +0900, Ryo Tsuruta wrote: This series of patches of dm-ioband now includes the bio tracking mechanism, which has been posted individually to this mailing list. This makes it easy for anybody to control the I/O bandwidth even when the I/O is one of delayed-write requests. During the Containers mini-summit at OLS, it was mentioned that there are at least *FOUR* of these I/O controllers floating around. Have you talked to the other authors? (I've cc'd at least one of them). 
> We obviously can't come to any kind of real consensus with people just
> tossing the same patches back and forth.
>
> -- Dave

Dave, thanks for this email first of all. I've talked with Satoshi (cc'd) about his solution, "Yet another I/O bandwidth controlling subsystem for CGroups based on CFQ." I did some experiments trying to implement minimum bandwidth requirements for my io-throttle controller, mapping the requirements to CFQ prio and using Satoshi's controller. But this needs additional work and testing right now, so I've not posted anything yet, just informed Satoshi about it.

Unfortunately I've not talked to Ryo yet. I've continued my work using a quite different approach, because the dm-ioband solution didn't work with delayed-write requests. Now the bio tracking feature seems really promising, and I would like to do some tests ASAP and review the patch as well. But I'm not yet convinced that limiting the I/O writes at the device-mapper layer is the best solution.
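The bio-tracking idea referred to above charges a delayed write to the cgroup that dirtied the page, not to the flusher thread that later submits the I/O. A rough userspace sketch of the idea (hypothetical model, not the actual bio-cgroup patch; `Page`, `mark_dirty`, and the group names are invented for illustration):

```python
# Hypothetical sketch of the bio-tracking idea: remember which cgroup
# dirtied each page, so writeback I/O submitted later by a kernel
# thread is billed to the original dirtier.

class Page:
    def __init__(self):
        self.dirty = False
        self.bio_cgroup = None   # owner recorded when the page is dirtied

def mark_dirty(page, current_cgroup):
    """Runs in the context of the task writing to the page cache."""
    page.dirty = True
    page.bio_cgroup = current_cgroup

def submit_writeback(page, submitter_cgroup, charges):
    """Runs later, typically from a flusher thread (e.g. pdflush).
    Charge the recorded dirtier, falling back to the submitter."""
    owner = page.bio_cgroup or submitter_cgroup
    charges[owner] = charges.get(owner, 0) + 1
    page.dirty = False

charges = {}
page = Page()
mark_dirty(page, "bgroup1")                 # user task in bgroup1 dirties it
submit_writeback(page, "pdflush", charges)  # kernel thread writes it back
# charges == {"bgroup1": 1}: the I/O is billed to the dirtier, not pdflush
```

Without the recorded owner, every delayed write would appear to come from the flusher thread's cgroup, which is exactly why dm-ioband alone could not handle delayed-write requests.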
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote:
> On Tue, 5 Aug 2008, Balbir Singh wrote:
>> Hugh Dickins wrote:
>> [snip]
>>> BUG: unable to handle kernel paging request at 6b6b6b8b
>>> IP: [<7817078f>] memrlimit_cgroup_uncharge_as+0x18/0x29
>>> Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
>>>  [<78161323>] ? exit_mmap+0xaf/0x133
>>>  [<781226b1>] ? mmput+0x4c/0xba
>>>  [<78165ce3>] ? try_to_unuse+0x20b/0x3f5
>>>  [<78371534>] ? _spin_unlock+0x22/0x3c
>>>  [<7816636a>] ? sys_swapoff+0x17b/0x37c
>>>  [<78102d95>] ? sysenter_past_esp+0x6a/0xa5
>>
>> I am unable to reproduce the problem,
>
> Me neither, I've spent many hours trying 2.6.27-rc1-mm1 and then back
> to 2.6.26-rc8-mm1. But I've been SO stupid: saw it originally on one
> machine with SLAB_DEBUG=y, have been trying since mostly on another
> with SLUB_DEBUG=y, but never thought to boot with
> slub_debug=P,task_struct until now.

Unfortunately, I've not tried on 32 bit and not at all with SLAB_DEBUG=y. I'll give the latter a trial run and see what I get.

>> but I do have an initial hypothesis
>>
>> CPU0                                CPU1
>> try_to_unuse
>>                                     task1 starts exiting
>> look at mm = task1->mm
>> ..
>> increment mm_users
>>                                     task1 exits
>>                                     mm->owner needs to be updated, but
>>                                     no new owner is found (mm_users > 1,
>>                                     but no other task has
>>                                     task->mm == task1->mm)
>>                                     mm_update_next_owner() leaves
>> grace period
>> user count drops, call mmput(mm)
>>                                     task1 freed
>> dereferencing mm->owner fails
>
> Yes, that looks right to me: seems obvious now. I don't think your
> careful alternation of CPU0/1 events at the end matters: the swapoff
> CPU simply dereferences mm->owner after that task has gone. (That's a
> shame, I'd always hoped that mm->owner->comm was going to be good for
> use in mm messages, even when tearing down the mm.)

The problem we have is that tasks are independent of mm_structs (in some ways) and are associated almost like a database associates two entities through keys. I do have a potential solution in mind, but I want to make sure my hypothesis is correct.
> It seems wrong that memrlimit_cgroup_uncharge_as should be called
> after mm->owner may have been changed, even if it's to something safe.
> But I forget the mm/task exit details, surely they're tricky.

The fix would be to uncharge when a new owner can no longer be found (I am yet to code/test it though).

> By the way, is the ordering in mm_update_next_owner the best? Would
> there be less movement if it searched amongst siblings before it
> searched amongst children? Ought it to make a first pass trying to
> stay within the same cgroup?

Yes, we need to make a first pass at keeping it in the same cgroup. You might be right about the sibling optimization.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
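Balbir's proposed fix (uncharge as soon as no new owner can be found, so that nothing later needs to dereference a freed owner) might look roughly like this toy model (hypothetical Python, not the actual kernel patch; `Mm`, `Task`, and the `charged` field are invented for illustration):

```python
# Hypothetical sketch of the proposed fix: when the owning task exits,
# either migrate ownership to another user of the mm, or uncharge the
# accounting immediately while the data is still valid.

class Task:
    def __init__(self, mm=None):
        self.mm = mm

class Mm:
    def __init__(self, owner, charged):
        self.owner = owner        # the task charged for this mm
        self.charged = charged    # bytes accounted against owner's cgroup

def mm_update_next_owner(mm, other_tasks):
    """On owner exit: hand the mm to another task that uses it.
    If none exists, uncharge *now* and clear mm.owner, so a later
    mmput() from e.g. swapoff cannot dereference a freed task."""
    for t in other_tasks:
        if t.mm is mm:
            mm.owner = t          # ownership migrates, charge stays
            return
    mm.charged = 0                # uncharge at exit time...
    mm.owner = None               # ...and make stale dereferences impossible

# The swapoff race: no other task shares the mm, so the charge is
# dropped here rather than in a later memrlimit_cgroup_uncharge_as().
mm = Mm(owner="task1", charged=4096)
mm_update_next_owner(mm, [])
```

In the real kernel the uncharge would have to happen before the owner task is freed, which is exactly the window the CPU0/CPU1 timeline above shows being violated.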
[Devel] Re: Too many I/O controller patches
On Mon, Aug 4, 2008 at 1:44 PM, Andrea Righi [EMAIL PROTECTED] wrote:
> A safer approach IMHO is to force the tasks to wait synchronously on
> each operation that directly or indirectly generates I/O. In
> particular, the solution used by the io-throttle controller to limit
> the dirty ratio in memory is to impose a sleep via
> schedule_timeout_killable() in balance_dirty_pages() when a generic
> process exceeds the limits defined for the cgroup it belongs to.
> Limiting read operations is a lot easier, because they're always
> synchronous with I/O requests.

I think that you're conflating two issues:

- controlling how much dirty memory a cgroup can have at any given time
  (since dirty memory is much harder/slower to reclaim than clean memory)

- controlling how much effect a cgroup can have on a given I/O device.

By controlling the rate at which a task can generate dirty pages, you're not really limiting either of these. I think you'd have to set your I/O limits artificially low to prevent a case of a process writing a large data file and then doing fsync() on it, which would then hit the disk with the entire file at once and blow away any QoS guarantees for other groups.

As Dave suggested, I think it would make more sense to have your page-dirtying throttle points hook into the memory controller instead, and allow the memory controller to track/limit dirty pages for a cgroup, and potentially do throttling as part of that.

Paul
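The throttling scheme Andrea describes in the quoted paragraph reduces to simple budget arithmetic: a writer that gets ahead of its cgroup's bandwidth must sleep off the difference. A toy userspace model (hypothetical code; in the real controller, `schedule_timeout_killable()` in `balance_dirty_pages()` plays the role that a `time.sleep()` would play here):

```python
import time

# Toy userspace model of the io-throttle sleep: charge each write
# against a per-cgroup byte budget and compute how long the writer
# must sleep to stay at or under its configured bandwidth.

class Throttle:
    def __init__(self, limit_bytes_per_sec):
        self.limit = limit_bytes_per_sec
        self.start = time.monotonic()
        self.written = 0

    def account(self, nbytes):
        """Charge nbytes; return the sleep (in seconds) needed so that
        written/elapsed never exceeds the configured limit."""
        self.written += nbytes
        elapsed = time.monotonic() - self.start
        return max(0.0, self.written / self.limit - elapsed)

cg = Throttle(limit_bytes_per_sec=1024 * 1024)  # 1 MB/s for this cgroup
delay = cg.account(512 * 1024)  # a 512 KB burst right away
# delay is ~0.5: the task should sleep about half a second here, which
# is where the real controller would call schedule_timeout_killable()
```

Paul's fsync() objection is visible even in this model: the sleep is computed when pages are dirtied, so a task can stay under its dirtying budget and still dump its whole file on the device at once when the writeback actually happens.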