[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Tue, Feb 22, 2011 at 07:03:58PM -0500, Vivek Goyal wrote:
> > I think we should accept to have an inode granularity. We could redesign
> > the writeback code to work per-cgroup / per-page, etc., but that would
> > add a huge overhead. The limit of inode granularity could be an
> > acceptable tradeoff; cgroups usually work on different files, well..
> > except when databases come into play (ouch!).
>
> Agreed. Granularity at the per-inode level might be acceptable in many
> cases. Again, I am worried about a faster group getting stuck behind a
> slower group.
>
> I am wondering if we are trying to solve the problem of ASYNC write
> throttling at the wrong layer. Should ASYNC IO be throttled before we
> allow the task to write to the page cache? The way we throttle the
> process based on dirty ratio, can we just check for throttle limits there
> too, or something like that? (I think that's what you had done in your
> initial throttling controller implementation?)

Right. This is exactly the same approach I've used in my old throttling
controller: throttle sync READs and WRITEs at the block layer, and async
WRITEs when the task is dirtying memory pages.

This is probably the simplest way to resolve the problem of a faster group
getting blocked by a slower group, but the controller will be a little bit
more leaky, because writeback IO will never be throttled and we'll see some
limited IO spikes during writeback.

However, this is IMHO always a better solution than the current
implementation, which is affected by that kind of priority inversion
problem.

I can try to add this logic to the current blk-throttle controller if you
think it is worth testing.

-Andrea
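To make the split concrete, here is a minimal sketch (not the actual patch)
of what throttling async WRITEs at page-dirty time could look like. The
blkcg_dirty_throttle structure and the blkcg_throttle_dirty() hook are
invented for illustration, locking is omitted, and a simple one-second
accounting window is assumed:

/*
 * Hypothetical sketch: charge async WRITEs against a per-cgroup budget
 * at page-dirty time instead of at bio submission time. Locking and
 * statistics are omitted for brevity.
 */
#include <linux/types.h>
#include <linux/jiffies.h>
#include <linux/sched.h>

struct blkcg_dirty_throttle {
	u64 bps_limit;			/* allowed bytes/sec, 0 = unlimited */
	u64 bytes_dirtied;		/* bytes charged in current window */
	unsigned long window_start;	/* jiffies at start of the window */
};

/* Would be called from the page-dirtying path, near balance_dirty_pages(). */
static void blkcg_throttle_dirty(struct blkcg_dirty_throttle *tg, size_t bytes)
{
	if (!tg->bps_limit)
		return;

	if (time_after(jiffies, tg->window_start + HZ)) {
		/* New window: reset the budget. */
		tg->window_start = jiffies;
		tg->bytes_dirtied = 0;
	}
	tg->bytes_dirtied += bytes;

	/* Over budget: sleep out the rest of the window (process context). */
	while (tg->bytes_dirtied > tg->bps_limit &&
	       time_before(jiffies, tg->window_start + HZ))
		schedule_timeout_killable(tg->window_start + HZ - jiffies);
}

Since only the dirtier sleeps, the flusher threads are never blocked here,
which is what avoids the priority inversion discussed above.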
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
> > Agreed. Granularity at the per-inode level might be acceptable in many
> > cases. Again, I am worried about a faster group getting stuck behind a
> > slower group.
> >
> > [...] Should ASYNC IO be throttled before we allow the task to write to
> > the page cache? [...]
>
> Right. This is exactly the same approach I've used in my old throttling
> controller: throttle sync READs and WRITEs at the block layer, and async
> WRITEs when the task is dirtying memory pages.
>
> This is probably the simplest way to resolve the problem of a faster
> group getting blocked by a slower group, but the controller will be a
> little bit more leaky, because writeback IO will never be throttled and
> we'll see some limited IO spikes during writeback.

Yes, writeback will not be throttled. I am not sure how big a problem that
is.

- We have controlled the input rate, so that should help a bit.

- Maybe one can put some high limit on the root cgroup in the blkio
  throttle controller to limit the overall WRITE rate of the system.

- For SATA disks, try to use CFQ, which can try to minimize the impact of
  WRITEs. It will at least provide a consistent bandwidth experience to
  the application.

> However, this is IMHO always a better solution than the current
> implementation, which is affected by that kind of priority inversion
> problem.
>
> I can try to add this logic to the current blk-throttle controller if
> you think it is worth testing.

At this point of time I have a few concerns with this approach.

- Configuration issues. Asking the user to plan for SYNC and ASYNC IO
  separately is inconvenient. One has to know the nature of the workload.

- Most likely we will come up with global limits (at least to begin with),
  and not per-device limits. That can lead to contention on one single
  lock and scalability issues on big systems.

Having said that, this approach should reduce kernel complexity a lot. So
if we can do some intelligent locking to limit the overhead, then it will
boil down to reduced complexity in the kernel vs ease of use for the user.
I guess at this point of time I am inclined towards keeping it simple in
the kernel.

A couple of people have asked me about backup jobs running at night: they
want to reduce the IO bandwidth of these jobs to limit the impact on the
latency of other jobs. I guess this approach will definitely solve that
issue.

IMHO, it might be worth trying this approach and seeing how well it works.
It might not solve all the problems, but it can be helpful in many
situations.

I feel that for proportional bandwidth division, implementing ASYNC
control in CFQ makes sense, because even if things get serialized in
higher layers the consequences are not very bad, as it is a
work-conserving algorithm. But for throttling, serialization will lead to
bad consequences.

Maybe one can think of new files in the blkio controller to limit async IO
per group at page dirty time.

blkio.throttle.async.write_bps_limit
blkio.throttle.async.write_iops_limit

Thanks
Vivek
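For illustration only, one possible wiring for the two proposed files,
using the cgroup read_u64/write_u64 callbacks of this era.
cgroup_to_blkio_cgroup() is the existing blk-cgroup helper; the
async_write_bps_limit field on struct blkio_cgroup is hypothetical. The
"blkio." prefix is added by the cgroup core from the subsystem name:

static u64 blkiocg_async_wr_bps_read(struct cgroup *cgrp, struct cftype *cft)
{
	return cgroup_to_blkio_cgroup(cgrp)->async_write_bps_limit;
}

static int blkiocg_async_wr_bps_write(struct cgroup *cgrp, struct cftype *cft,
				      u64 val)
{
	/* No validation in this sketch; a real patch would sanity-check val. */
	cgroup_to_blkio_cgroup(cgrp)->async_write_bps_limit = val;
	return 0;
}

static struct cftype blkio_async_files[] = {
	{
		.name = "throttle.async.write_bps_limit",
		.read_u64 = blkiocg_async_wr_bps_read,
		.write_u64 = blkiocg_async_wr_bps_write,
	},
	/* throttle.async.write_iops_limit would follow the same pattern. */
};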
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
> [...]
> Yes, writeback will not be throttled. I am not sure how big a problem
> that is.
>
> - We have controlled the input rate, so that should help a bit.
>
> - Maybe one can put some high limit on the root cgroup in the blkio
>   throttle controller to limit the overall WRITE rate of the system.
>
> - For SATA disks, try to use CFQ, which can try to minimize the impact
>   of WRITEs. It will at least provide a consistent bandwidth experience
>   to the application.

Right.

> [...]
> Having said that, this approach should reduce kernel complexity a lot.
> So if we can do some intelligent locking to limit the overhead, then it
> will boil down to reduced complexity in the kernel vs ease of use for
> the user. I guess at this point of time I am inclined towards keeping it
> simple in the kernel.

BTW, with this approach we can probably even get rid of the page tracking
stuff for now. If we don't consider swap IO, any other IO operation from
our point of view will happen directly from process context (writes in
memory + sync reads from the block device).

However, I'm sure we'll need the page tracking for the blkio controller
sooner or later. This is important information, and the proportional
bandwidth controller can take advantage of it as well.

> A couple of people have asked me about backup jobs running at night
> [...] IMHO, it might be worth trying this approach and seeing how well
> it works. It might not solve all the problems, but it can be helpful in
> many situations.

Agreed. This could be a good tradeoff for a lot of common cases.

> I feel that for proportional bandwidth division, implementing ASYNC
> control in CFQ makes sense, because even if things get serialized in
> higher layers the consequences are not very bad, as it is a
> work-conserving algorithm. But for throttling, serialization will lead
> to bad consequences.

Agreed.

> Maybe one can think of new files in the blkio controller to limit async
> IO per group at page dirty time.
>
> blkio.throttle.async.write_bps_limit
> blkio.throttle.async.write_iops_limit

OK, I'll try to add the async throttling logic and use this interface.

-Andrea
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote:
> [...]
> BTW, with this approach we can probably even get rid of the page
> tracking stuff for now.

Agreed.

> If we don't consider swap IO, any other IO operation from our point of
> view will happen directly from process context (writes in memory + sync
> reads from the block device).

Why do we need to account for swap IO? The application never asked for
swap IO. It is the kernel's decision to move some pages to swap to free up
memory. What's the point in charging those pages to the application's
group and throttling accordingly?

> However, I'm sure we'll need the page tracking for the blkio controller
> sooner or later. This is important information, and the proportional
> bandwidth controller can take advantage of it as well.

Yes, page tracking will be needed for CFQ proportional bandwidth ASYNC
write support. But until we implement the memory cgroup dirty ratio and
figure out a way to make the writeback logic cgroup aware, the page
tracking stuff is not really useful.

> > A couple of people have asked me about backup jobs running at night
> > [...] I guess this approach will definitely solve that issue.
>
> Agreed. This could be a good tradeoff for a lot of common cases.
>
> [...]
> OK, I'll try to add the async throttling logic and use this interface.

Cool, I would like to play with it a bit once the patches are ready.

Thanks
Vivek
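For reference, the page tracking idea under discussion boils down to
something like the following sketch (not the code from the patch set):
record which blkio cgroup dirtied a page so the flusher can later charge
the writeback IO to it. lookup_page_cgroup() is the real page_cgroup API;
the blkio_id field and the task_blkio_css_id() helper are invented for
illustration:

#include <linux/page_cgroup.h>

/* Called when a task dirties a page. */
static void blkio_track_dirtier(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	if (pc)
		pc->blkio_id = task_blkio_css_id(current);	/* assumed */
}

/* Called at bio submission time to resolve the owner of the page. */
static unsigned short blkio_page_owner(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);

	return pc ? pc->blkio_id : 0;	/* 0 means the root cgroup */
}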
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Wed, 23 Feb 2011 19:10:33 -0500 Vivek Goyal vgo...@redhat.com wrote:
> [...]
> Why do we need to account for swap IO? The application never asked for
> swap IO. It is the kernel's decision to move some pages to swap to free
> up memory. What's the point in charging those pages to the application's
> group and throttling accordingly?

I think swap I/O should be controlled by memcg's dirty_ratio. But, IIRC,
the NEC guys had a requirement for this... I think some enterprise
customers may want to throttle the whole speed of swapout I/O (not
swapin), so they may be glad if they can throttle the I/O against a disk
partition, or all I/O tagged as 'swapio', rather than some cgroup name.

But I'm afraid slow swapout may consume much of the dirty_ratio and make
things worse ;)

> > However, I'm sure we'll need the page tracking for the blkio
> > controller sooner or later. This is important information, and the
> > proportional bandwidth controller can take advantage of it as well.
>
> Yes, page tracking will be needed for CFQ proportional bandwidth ASYNC
> write support. But until we implement the memory cgroup dirty ratio and
> figure out a way to make the writeback logic cgroup aware, the page
> tracking stuff is not really useful.

I think Greg Thelen is now preparing patches for dirty_ratio.

Thanks,
-Kame
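A sketch of how "all I/O tagged as 'swapio'" could be recognized at the
block layer: a WRITE bio whose pages sit in the swap cache is swapout IO.
bio_data_dir() and PageSwapCache() are real APIs of this era; what is done
with the result (a per-partition or global swapout limit) is left open:

#include <linux/types.h>
#include <linux/bio.h>
#include <linux/page-flags.h>

static bool bio_is_swapout(struct bio *bio)
{
	/* Swapout writeback bios carry swap-cache pages in their bvecs. */
	return bio_data_dir(bio) == WRITE && bio->bi_vcnt &&
	       PageSwapCache(bio->bi_io_vec[0].bv_page);
}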
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Wed, Feb 23, 2011 at 4:40 PM, KAMEZAWA Hiroyuki
kamezawa.hir...@jp.fujitsu.com wrote:
> [...]
> But I'm afraid slow swapout may consume much of the dirty_ratio and make
> things worse ;)
>
> [...]
> I think Greg Thelen is now preparing patches for dirty_ratio.
>
> Thanks,
> -Kame

Correct. I am working on the memcg dirty_ratio patches against the latest
mmotm memcg. I am running some test cases which should be complete
tomorrow. Once testing is complete, I will send the patches for review.
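In rough terms, a memcg dirty_ratio check in the balance_dirty_pages()
path could look like the sketch below (this is not Greg's actual patches,
and all three helpers are hypothetical):

/*
 * Hypothetical memcg dirty_ratio check: compare the memcg's dirty pages
 * against a percentage of its memory limit, mirroring the global vm
 * dirty_ratio semantics.
 */
static int mem_cgroup_over_dirty_limit(struct mem_cgroup *memcg)
{
	unsigned long dirty = mem_cgroup_nr_dirty(memcg);	/* assumed */
	unsigned long limit = mem_cgroup_limit_pages(memcg);	/* assumed */
	unsigned int ratio = mem_cgroup_dirty_ratio(memcg);	/* assumed */

	return dirty > limit * ratio / 100;
}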
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
* Andrea Righi ari...@develer.com [2011-02-22 18:12:51]:

> Currently the blkio.throttle controller only supports synchronous IO
> requests. This means that we always look at the current task to identify
> the owner of each IO request.
>
> However, dirty pages in the page cache can be written to disk
> asynchronously by the per-bdi flusher kernel threads, or by any other
> thread in the system, according to the writeback policy. For this reason
> the real writes to the underlying block devices may occur in a different
> IO context than the task that originally generated the dirty pages
> involved in the IO operation. This makes the tracking and throttling of
> writeback IO more complicated than synchronous IO from the blkio
> controller's perspective. The same concept is also valid for anonymous
> pages involved in IO operations (swap).
>
> This patch set allows tracking the cgroup that originally dirtied each
> page cache page and each anonymous page, and passes this information to
> the blk-throttle controller. This information can be used to provide a
> better service level differentiation of buffered writes and swap IO
> between different cgroups.
>
> Testcase:
> - create a cgroup with a 1MiB/s write limit:
>
>   # mount -t cgroup -o blkio none /mnt/cgroup
>   # mkdir /mnt/cgroup/foo
>   # echo 8:0 $((1024 * 1024)) > /mnt/cgroup/foo/blkio.throttle.write_bps_device
>
> - move a task into the cgroup and run a dd to generate some writeback IO
>
> Results:
> - 2.6.38-rc6 vanilla:
>   $ cat /proc/$$/cgroup
>   1:blkio:/foo
>   $ dd if=/dev/zero of=zero bs=1M count=1024
>   $ dstat -df
>   --dsk/sda--
>    read  writ
>       0   19M
>       0   19M
>       0    0
>       0    0
>       0   19M
>   ...
>
> - 2.6.38-rc6 + blk-throttle writeback IO control:
>   $ cat /proc/$$/cgroup
>   1:blkio:/foo
>   $ dd if=/dev/zero of=zero bs=1M count=1024
>   $ dstat -df
>   --dsk/sda--
>    read  writ
>       0  1024
>       0  1024
>       0  1024
>       0  1024
>       0  1024
>   ...

Thanks for looking into this, further review follows.

-- 
Three Cheers,
Balbir
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Tue, Feb 22, 2011 at 06:12:51PM +0100, Andrea Righi wrote:
> Currently the blkio.throttle controller only supports synchronous IO
> requests. [...]
>
> This patch set allows tracking the cgroup that originally dirtied each
> page cache page and each anonymous page, and passes this information to
> the blk-throttle controller. This information can be used to provide a
> better service level differentiation of buffered writes and swap IO
> between different cgroups.

Hi Andrea,

Thanks for the patches. Before I look deeper into the patches, I had a few
general queries/thoughts.

- So this requires the memory controller to be enabled. Does it also
  require these to be co-mounted?

- Currently in throttling there is no limit on the number of bios queued
  per group. I think this is not necessarily a very good idea, because if
  throttling limits are low, we will build very long bio queues. So some
  AIO process can queue up lots of bios and consume lots of memory without
  getting blocked. I am sure there will be other side effects too. One of
  the side effects I noticed is that if an AIO process queues up too much
  IO, and I want to kill it now, it just hangs there for a really, really
  long time (waiting for all the throttled IO to complete).

  So I was thinking of implementing either a per-group limit or a
  per-io-context limit, after which the process will be put to sleep
  (something like the request descriptor mechanism).

  If that's the case, then comes the question of what to do about kernel
  threads. Should they be blocked or not? If these are blocked, then a
  fast group will also be indirectly throttled behind a slow group. If
  they are not, then we still have the problem of too many bios queued in
  the throttling layer.

- What to do about other kernel threads, like kjournald, which are doing
  IO on behalf of all the filesystem users? If data is also journalled,
  then I think again everything gets serialized and a faster group gets
  backlogged behind a slower one.

- Two processes doing IO to the same file: the slower group will throttle
  IO for the faster group also (flushing is per inode).

I am not sure what other common operations by kernel threads can make IO
serialized.

Thanks
Vivek

> Testcase:
> [...]
>
> TODO
> - lots of testing
>
> Any feedback is welcome.
>
> -Andrea
>
> [PATCH 1/5] blk-cgroup: move blk-cgroup.h in include/linux/blk-cgroup.h
> [PATCH 2/5] blk-cgroup: introduce task_to_blkio_cgroup()
> [PATCH 3/5] page_cgroup: make page tracking available for blkio
> [PATCH 4/5] blk-throttle: track buffered and anonymous pages
> [PATCH 5/5] blk-throttle: buffered and anonymous page tracking instrumentation
>
>  block/Kconfig               |   2 +
>  block/blk-cgroup.c          |  15 ++-
>  block/blk-cgroup.h          | 335 --
>  block/blk-throttle.c        |  89 +++-
>  block/cfq.h                 |   2 +-
>  fs/buffer.c                 |   1 +
>  include/linux/blk-cgroup.h  | 341 +++
>  include/linux/blkdev.h      |  26 +++-
>  include/linux/memcontrol.h  |   6 +
>  include/linux/mmzone.h      |   4 +-
>  include/linux/page_cgroup.h |  33 -
>  init/Kconfig                |   4 +
>  mm/Makefile                 |   3 +-
>  mm/bounce.c                 |   1 +
>  mm/filemap.c                |   1 +
>  mm/memcontrol.c             |   6 +
>  mm/memory.c                 |   5 +
>  mm/page-writeback.c         |   1 +
>  mm/page_cgroup.c            | 129
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Tue, Feb 22, 2011 at 02:34:03PM -0500, Vivek Goyal wrote:
> On Tue, Feb 22, 2011 at 06:12:51PM +0100, Andrea Righi wrote:
> > [...]
>
> Hi Andrea,
>
> Thanks for the patches. Before I look deeper into the patches, I had a
> few general queries/thoughts.
>
> - So this requires the memory controller to be enabled. Does it also
>   require these to be co-mounted?

No and no. The blkio controller enables and uses the page_cgroup
functionality, but it doesn't depend on the memory controller. It
automatically selects CONFIG_MM_OWNER and CONFIG_PAGE_TRACKING (the latter
added in PATCH 3/5), and this is sufficient to make page_cgroup usable
from any generic controller.

> - Currently in throttling there is no limit on the number of bios queued
>   per group. [...] So I was thinking of implementing either a per-group
>   limit or a per-io-context limit, after which the process will be put
>   to sleep (something like the request descriptor mechanism).

The io context limit seems a better solution for now. We can also expect
some help from the memory controller: if we have a per-cgroup dirty memory
limit in the future, the maximum amount of bios queued will be
automatically limited by this functionality.

> If that's the case, then comes the question of what to do about kernel
> threads. Should they be blocked or not? If these are blocked, then a
> fast group will also be indirectly throttled behind a slow group. If
> they are not, then we still have the problem of too many bios queued in
> the throttling layer.

I think kernel threads should never be forced to sleep, to avoid the
classic priority inversion problem and creating a potential DoS in the
system. Also for this part a per-cgroup dirty memory limit could help a
lot, because a cgroup would never exceed its quota of dirty memory, so it
would not be able to submit more than a certain amount of bios
(corresponding to the dirty memory limit).

> - What to do about other kernel threads, like kjournald, which are doing
>   IO on behalf of all the filesystem users? If data is also journalled,
>   then I think again everything gets serialized and a faster group gets
>   backlogged behind a slower one.

This is the most critical issue IMHO. The blkio controller would need some
help from the filesystems to understand which IO requests can be throttled
and which cannot. At the moment critical IO requests (by critical I mean
requests that are a dependency for other requests) and non-critical
requests are mixed together in a way that throttling a single request may
stop a lot of other requests in the system, and at the block layer it's
not possible to retrieve such information.

I don't have a solution for this right now, except looking at each
filesystem implementation and trying to understand how to pass this
information to the block layer.

> - Two processes doing IO to the same file: the slower group will
>   throttle IO for the faster group also (flushing is per inode).

I think we should accept to have an inode granularity. We could redesign
the writeback code to work per-cgroup / per-page, etc., but that would add
a huge overhead. The limit of inode granularity could be an acceptable
tradeoff; cgroups usually work on different files, well.. except when
databases come into play (ouch!).

Thanks,
-Andrea
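To make the per-io-context idea concrete, a sketch (names invented, not
from the patch set) of capping the number of throttled bios a context may
have queued, with a killable sleep so that killing a backlogged AIO
process does not hang. Per the discussion above, kernel threads such as
the flushers would bypass the charge to avoid priority inversion:

#include <linux/wait.h>
#include <linux/atomic.h>

#define THROTL_MAX_QUEUED_BIOS	128

struct ioc_throtl {
	atomic_t nr_queued;		/* bios held back for this context */
	wait_queue_head_t wait;		/* submitters sleep here when full */
};

/* Called before queueing a bio in the throttle layer (process context). */
static int ioc_throtl_charge(struct ioc_throtl *it)
{
	int ret = wait_event_killable(it->wait,
		atomic_read(&it->nr_queued) < THROTL_MAX_QUEUED_BIOS);

	if (ret)
		return ret;	/* task was killed while waiting */
	atomic_inc(&it->nr_queued);
	return 0;
}

/* Called when a throttled bio is finally dispatched. */
static void ioc_throtl_release(struct ioc_throtl *it)
{
	if (atomic_dec_return(&it->nr_queued) < THROTL_MAX_QUEUED_BIOS)
		wake_up(&it->wait);
}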
[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control
On Tue, Feb 22, 2011 at 11:41:41PM +0100, Andrea Righi wrote:
> [...]
> The io context limit seems a better solution for now. We can also expect
> some help from the memory controller: if we have a per-cgroup dirty
> memory limit in the future, the maximum amount of bios queued will be
> automatically limited by this functionality.
>
> [...]
> I think kernel threads should never be forced to sleep, to avoid the
> classic priority inversion problem and creating a potential DoS in the
> system. Also for this part a per-cgroup dirty memory limit could help a
> lot, because a cgroup would never exceed its quota of dirty memory, so
> it would not be able to submit more than a certain amount of bios
> (corresponding to the dirty memory limit).

A per memory cgroup dirty ratio should help a bit. But with intentional
throttling we always run the risk of faster groups getting stuck behind
slower groups.

Even in the case of buffered WRITEs, are you able to run two buffered
WRITE streams in two groups and throttle them to their respective rates?
It might be interesting to run that and see what happens. Practically, I
feel we shall have to run this with the per-cgroup memory dirty ratio, and
hence co-mount with the memory controller.

> > - What to do about other kernel threads, like kjournald, which are
> >   doing IO on behalf of all the filesystem users? If data is also
> >   journalled, then I think again everything gets serialized and a
> >   faster group gets backlogged behind a slower one.
>
> This is the most critical issue IMHO. The blkio controller would need
> some help from the filesystems to understand which IO requests can be
> throttled and which cannot. At the moment critical IO requests (by
> critical I mean requests that are a dependency for other requests) and
> non-critical requests are mixed together in a way that throttling a
> single request may stop a lot of other requests in the system, and at
> the block layer it's not possible to retrieve such information.
>
> I don't have a solution for this right now, except looking at each
> filesystem implementation and trying to understand how to pass this
> information to the block layer.