[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Andrea Righi
On Tue, Feb 22, 2011 at 07:03:58PM -0500, Vivek Goyal wrote:
  I think we should accept inode granularity. We could redesign
  the writeback code to work per-cgroup / per-page, etc., but that would
  add a huge overhead. Inode granularity could be an
  acceptable tradeoff: cgroups are usually supposed to work on different files,
  well.. except when databases come into play (ouch!).
 
 Agreed. Granularity at the per-inode level might be acceptable in many
 cases. Again, I am worried about a faster group getting stuck behind a slower
 group.
 
 I am wondering if we are trying to solve the problem of ASYNC write throttling
 at the wrong layer. Should ASYNC IO be throttled before we allow the task to write
 to the page cache? The way we throttle the process based on dirty ratio, can we
 just check for throttle limits there as well, or something like that? (I think
 that's what you had done in your initial throttling controller
 implementation?)

Right. This is exactly the same approach I've used in my old throttling
controller: throttle sync READs and WRITEs at the block layer and async
WRITEs when the task is dirtying memory pages.
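(To make the idea concrete, here is a rough, self-contained sketch of the
dirty-time check I have in mind; the types and names below are simplified
stand-ins, not the actual blk-throttle code. Sync READs and WRITEs would keep
going through the existing block-layer check, while this runs only when a task
dirties pages, so the flusher threads are never put to sleep.)

    /* Simplified sketch: not the real blk-throttle structures. */
    struct throttle_group {
            unsigned long long async_write_bps;  /* bytes/sec limit, 0 = unlimited */
            unsigned long long bytes_dirtied;    /* accounted in the current slice */
            unsigned long long slice_start_ns;   /* start of the current slice */
    };

    /*
     * Called from the page-dirtying path (conceptually next to the dirty
     * ratio check), never from writeback. Returns how long the dirtying
     * task should sleep, in nanoseconds, to stay within its async limit.
     */
    static unsigned long long async_write_delay(struct throttle_group *tg,
                                                unsigned long long bytes,
                                                unsigned long long now_ns)
    {
            unsigned long long allowed;

            if (!tg->async_write_bps)
                    return 0;
            tg->bytes_dirtied += bytes;
            allowed = (now_ns - tg->slice_start_ns) * tg->async_write_bps
                      / 1000000000ULL;
            if (tg->bytes_dirtied <= allowed)
                    return 0;
            /* excess over the budget, converted back into sleep time */
            return (tg->bytes_dirtied - allowed) * 1000000000ULL
                   / tg->async_write_bps;
    }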

This is probably the simplest way to resolve the problem of a faster group
getting blocked by a slower group, but the controller will be a little bit
more leaky, because the writeback IO will never be throttled and we'll
see some limited IO spikes during writeback. However, this is still
a better solution IMHO compared to the current implementation, which is
affected by that kind of priority inversion problem.

I can try to add this logic to the current blk-throttle controller if
you think it is worth testing.

-Andrea


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Vivek Goyal
  Agreed. Granularity at the per-inode level might be acceptable in many
  cases. Again, I am worried about a faster group getting stuck behind a slower
  group.
  
  I am wondering if we are trying to solve the problem of ASYNC write throttling
  at the wrong layer. Should ASYNC IO be throttled before we allow the task to write
  to the page cache? The way we throttle the process based on dirty ratio, can we
  just check for throttle limits there as well, or something like that? (I think
  that's what you had done in your initial throttling controller
  implementation?)
 
 Right. This is exactly the same approach I've used in my old throttling
 controller: throttle sync READs and WRITEs at the block layer and async
 WRITEs when the task is dirtying memory pages.
 
 This is probably the simplest way to resolve the problem of a faster group
 getting blocked by a slower group, but the controller will be a little bit
 more leaky, because the writeback IO will never be throttled and we'll
 see some limited IO spikes during writeback.

Yes, writeback will not be throttled. Not sure how big a problem that is.

- We have controlled the input rate. So that should help a bit.
- Maybe one can put some high limit on the root cgroup in the blkio throttle
  controller to limit the overall WRITE rate of the system.
- For SATA disks, try to use CFQ, which can try to minimize the impact of
  WRITEs.

It will at least provide a consistent bandwidth experience to applications.

However, this is still
 a better solution IMHO compared to the current implementation, which is
 affected by that kind of priority inversion problem.
 
 I can try to add this logic to the current blk-throttle controller if
 you think it is worth testing.

At this point I have a few concerns with this approach.

- Configuration issues. Asking the user to plan for SYNC and ASYNC IO
  separately is inconvenient. One has to know the nature of the workload.

- Most likely we will come up with global limits (at least to begin with),
  and not per-device limits. That can lead to contention on one single
  lock and scalability issues on big systems.

Having said that, this approach should reduce the kernel complexity a lot.
So if we can do some intelligent locking to limit the overhead, then it
will boil down to reduced complexity in the kernel vs ease of use for the user. I
guess at this point I am inclined towards keeping it simple in the
kernel.

A couple of people have told me that they have backup jobs running at night
and want to reduce the IO bandwidth of these jobs to limit the impact
on the latency of other jobs; I guess this approach will definitely solve
that issue.

IMHO, it might be worth trying this approach and seeing how well it works. It
might not solve all the problems but can be helpful in many situations.

I feel that for proportional bandwidth division, implementing ASYNC
control at the CFQ level makes sense, because even if things get serialized in
higher layers, the consequences are not very bad, as it is a work-conserving
algorithm. But for throttling, serialization will lead to bad consequences.

Maybe one can think of new files in the blkio controller to limit async IO
per group at page dirty time.

blkio.throttle.async.write_bps_limit
blkio.throttle.async.write_iops_limit
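(To illustrate how the two knobs could be combined at page-dirty time, here is
a simplified, self-contained sketch with made-up types; only the two file names
above come from the proposal, everything else is hypothetical. The stricter of
the bps and iops limits decides how long the dirtying task sleeps.)

    /* Hypothetical per-group async limits mirroring the proposed files. */
    struct async_limits {
            unsigned long long write_bps_limit;   /* bytes per second, 0 = no limit */
            unsigned long long write_iops_limit;  /* IOs per second, 0 = no limit */
    };

    /* Delay (ns) to charge a task that dirtied 'bytes' in 'nr_ios' operations. */
    static unsigned long long async_dirty_delay(const struct async_limits *l,
                                                unsigned long long bytes,
                                                unsigned long long nr_ios)
    {
            unsigned long long d_bps = 0, d_iops = 0;

            if (l->write_bps_limit)
                    d_bps = bytes * 1000000000ULL / l->write_bps_limit;
            if (l->write_iops_limit)
                    d_iops = nr_ios * 1000000000ULL / l->write_iops_limit;
            /* honour the stricter of the two limits */
            return d_bps > d_iops ? d_bps : d_iops;
    }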

Thanks
Vivek


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Andrea Righi
On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
   Agreed. Granularity at the per-inode level might be acceptable in many
   cases. Again, I am worried about a faster group getting stuck behind a slower
   group.
   
   I am wondering if we are trying to solve the problem of ASYNC write throttling
   at the wrong layer. Should ASYNC IO be throttled before we allow the task to
   write to the page cache? The way we throttle the process based on dirty ratio, can we
   just check for throttle limits there as well, or something like that? (I think
   that's what you had done in your initial throttling controller
   implementation?)
  
  Right. This is exactly the same approach I've used in my old throttling
  controller: throttle sync READs and WRITEs at the block layer and async
  WRITEs when the task is dirtying memory pages.
  
  This is probably the simplest way to resolve the problem of a faster group
  getting blocked by a slower group, but the controller will be a little bit
  more leaky, because the writeback IO will never be throttled and we'll
  see some limited IO spikes during writeback.
 
 Yes, writeback will not be throttled. Not sure how big a problem that is.
 
 - We have controlled the input rate. So that should help a bit.
 - Maybe one can put some high limit on the root cgroup in the blkio throttle
   controller to limit the overall WRITE rate of the system.
 - For SATA disks, try to use CFQ, which can try to minimize the impact of
   WRITEs.
 
 It will at least provide a consistent bandwidth experience to applications.

Right.

 
 However, this is still
  a better solution IMHO compared to the current implementation, which is
  affected by that kind of priority inversion problem.
  
  I can try to add this logic to the current blk-throttle controller if
  you think it is worth testing.
 
 At this point I have a few concerns with this approach.
 
 - Configuration issues. Asking the user to plan for SYNC and ASYNC IO
   separately is inconvenient. One has to know the nature of the workload.
 
 - Most likely we will come up with global limits (at least to begin with),
   and not per-device limits. That can lead to contention on one single
   lock and scalability issues on big systems.
 
 Having said that, this approach should reduce the kernel complexity a lot.
 So if we can do some intelligent locking to limit the overhead, then it
 will boil down to reduced complexity in the kernel vs ease of use for the user. I
 guess at this point I am inclined towards keeping it simple in the
 kernel.
 

BTW, with this approach probably we can even get rid of the page
tracking stuff for now. If we don't consider the swap IO, any other IO
operation from our point of view will happen directly from process
context (writes in memory + sync reads from the block device).

However, I'm sure we'll need the page tracking also for the blkio
controller sooner or later. This is important information, and the
proportional bandwidth controller can also take advantage of it.

 
 A couple of people have told me that they have backup jobs running at night
 and want to reduce the IO bandwidth of these jobs to limit the impact
 on the latency of other jobs; I guess this approach will definitely solve
 that issue.
 
 IMHO, it might be worth trying this approach and seeing how well it works. It
 might not solve all the problems but can be helpful in many situations.

Agreed. This could be a good tradeoff for a lot of common cases.

 
 I feel that for proportional bandwidth division, implementing ASYNC
 control at the CFQ level makes sense, because even if things get serialized in
 higher layers, the consequences are not very bad, as it is a work-conserving
 algorithm. But for throttling, serialization will lead to bad consequences.

Agreed.

 
 Maybe one can think of new files in the blkio controller to limit async IO
 per group at page dirty time.
 
 blkio.throttle.async.write_bps_limit
 blkio.throttle.async.write_iops_limit

OK, I'll try to add the async throttling logic and use this interface.

-Andrea


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Vivek Goyal
On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote:
 On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
Agreed. Granularity at the per-inode level might be acceptable in many
cases. Again, I am worried about a faster group getting stuck behind a slower
group.

I am wondering if we are trying to solve the problem of ASYNC write throttling
at the wrong layer. Should ASYNC IO be throttled before we allow the task to
write to the page cache? The way we throttle the process based on dirty ratio, can we
just check for throttle limits there as well, or something like that? (I think
that's what you had done in your initial throttling controller
implementation?)
   
   Right. This is exactly the same approach I've used in my old throttling
   controller: throttle sync READs and WRITEs at the block layer and async
   WRITEs when the task is dirtying memory pages.
   
   This is probably the simplest way to resolve the problem of a faster group
   getting blocked by a slower group, but the controller will be a little bit
   more leaky, because the writeback IO will never be throttled and we'll
   see some limited IO spikes during writeback.
  
  Yes, writeback will not be throttled. Not sure how big a problem that is.
  
  - We have controlled the input rate. So that should help a bit.
  - Maybe one can put some high limit on the root cgroup in the blkio throttle
    controller to limit the overall WRITE rate of the system.
  - For SATA disks, try to use CFQ, which can try to minimize the impact of
    WRITEs.
  
  It will at least provide a consistent bandwidth experience to applications.
 
 Right.
 
  
  However, this is still
   a better solution IMHO compared to the current implementation, which is
   affected by that kind of priority inversion problem.
   
   I can try to add this logic to the current blk-throttle controller if
   you think it is worth testing.
  
  At this point I have a few concerns with this approach.
  
  - Configuration issues. Asking the user to plan for SYNC and ASYNC IO
    separately is inconvenient. One has to know the nature of the workload.
  
  - Most likely we will come up with global limits (at least to begin with),
    and not per-device limits. That can lead to contention on one single
    lock and scalability issues on big systems.
  
  Having said that, this approach should reduce the kernel complexity a lot.
  So if we can do some intelligent locking to limit the overhead, then it
  will boil down to reduced complexity in the kernel vs ease of use for the user. I
  guess at this point I am inclined towards keeping it simple in the
  kernel.
  
 
 BTW, with this approach probably we can even get rid of the page
 tracking stuff for now.

Agreed.

 If we don't consider the swap IO, any other IO
 operation from our point of view will happen directly from process
 context (writes in memory + sync reads from the block device).

Why do we need to account for swap IO? The application never asked for swap
IO. It is the kernel's decision to move some pages to swap to free up some
memory. What's the point in charging those pages to the application's group
and throttling accordingly?

 
 However, I'm sure we'll need the page tracking also for the blkio
 controller sooner or later. This is important information, and the
 proportional bandwidth controller can also take advantage of it.

Yes, page tracking will be needed for CFQ proportional bandwidth ASYNC
write support. But until we implement the memory cgroup dirty ratio and
figure out a way to make the writeback logic cgroup aware, I think the
page tracking stuff is not really useful.

  
  A couple of people have told me that they have backup jobs running at night
  and want to reduce the IO bandwidth of these jobs to limit the impact
  on the latency of other jobs; I guess this approach will definitely solve
  that issue.
  
  IMHO, it might be worth trying this approach and seeing how well it works. It
  might not solve all the problems but can be helpful in many situations.
 
 Agreed. This could be a good tradeoff for a lot of common cases.
 
  
  I feel that for proportional bandwidth division, implementing ASYNC
  control at the CFQ level makes sense, because even if things get serialized in
  higher layers, the consequences are not very bad, as it is a work-conserving
  algorithm. But for throttling, serialization will lead to bad consequences.
 
 Agreed.
 
  
  Maybe one can think of new files in the blkio controller to limit async IO
  per group at page dirty time.
  
  blkio.throttle.async.write_bps_limit
  blkio.throttle.async.write_iops_limit
 
 OK, I'll try to add the async throttling logic and use this interface.

Cool, I would like to play with it a bit once patches are ready.

Thanks
Vivek

[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread KAMEZAWA Hiroyuki
On Wed, 23 Feb 2011 19:10:33 -0500
Vivek Goyal vgo...@redhat.com wrote:

 On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote:
  On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
 Agreed. Granularity at the per-inode level might be acceptable in many
 cases. Again, I am worried about a faster group getting stuck behind a slower
 group.
 
 I am wondering if we are trying to solve the problem of ASYNC write throttling
 at the wrong layer. Should ASYNC IO be throttled before we allow the task to
 write to the page cache? The way we throttle the process based on dirty ratio, can we
 just check for throttle limits there as well, or something like that? (I think
 that's what you had done in your initial throttling controller
 implementation?)

Right. This is exactly the same approach I've used in my old throttling
controller: throttle sync READs and WRITEs at the block layer and async
WRITEs when the task is dirtying memory pages.

This is probably the simplest way to resolve the problem of a faster group
getting blocked by a slower group, but the controller will be a little bit
more leaky, because the writeback IO will never be throttled and we'll
see some limited IO spikes during writeback.
   
   Yes, writeback will not be throttled. Not sure how big a problem that is.
   
   - We have controlled the input rate. So that should help a bit.
   - Maybe one can put some high limit on the root cgroup in the blkio throttle
     controller to limit the overall WRITE rate of the system.
   - For SATA disks, try to use CFQ, which can try to minimize the impact of
     WRITEs.
   
   It will at least provide a consistent bandwidth experience to applications.
  
  Right.
  
   
   However, this is still
    a better solution IMHO compared to the current implementation, which is
    affected by that kind of priority inversion problem.
    
    I can try to add this logic to the current blk-throttle controller if
    you think it is worth testing.
   
   At this point I have a few concerns with this approach.
   
   - Configuration issues. Asking the user to plan for SYNC and ASYNC IO
     separately is inconvenient. One has to know the nature of the workload.
   
   - Most likely we will come up with global limits (at least to begin with),
     and not per-device limits. That can lead to contention on one single
     lock and scalability issues on big systems.
   
   Having said that, this approach should reduce the kernel complexity a lot.
   So if we can do some intelligent locking to limit the overhead, then it
   will boil down to reduced complexity in the kernel vs ease of use for the user. I
   guess at this point I am inclined towards keeping it simple in the
   kernel.
   
  
  BTW, with this approach probably we can even get rid of the page
  tracking stuff for now.
 
 Agreed.
 
  If we don't consider the swap IO, any other IO
  operation from our point of view will happen directly from process
  context (writes in memory + sync reads from the block device).
 
 Why do we need to account for swap IO? The application never asked for swap
 IO. It is the kernel's decision to move some pages to swap to free up some
 memory. What's the point in charging those pages to the application's group
 and throttling accordingly?
 

I think swap I/O should be controlled by memcg's dirty_ratio.
But, IIRC, an NEC guy had a requirement for this...

I think some enterprise customers may want to throttle the whole speed of
swapout I/O (not swapin)... so they may be glad if they can limit/throttle
the I/O against a disk partition, or all I/O tagged as 'swapio', rather than
some cgroup name.

But I'm afraid slow swapout may consume much of the dirty_ratio and make things
worse ;)



  
  However, I'm sure we'll need the page tracking also for the blkio
  controller sooner or later. This is important information, and the
  proportional bandwidth controller can also take advantage of it.
 
 Yes, page tracking will be needed for CFQ proportional bandwidth ASYNC
 write support. But until we implement the memory cgroup dirty ratio and
 figure out a way to make the writeback logic cgroup aware, I think the
 page tracking stuff is not really useful.
 

I think Greg Thelen is now preparing patches for dirty_ratio.

Thanks,
-Kame



[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Greg Thelen
On Wed, Feb 23, 2011 at 4:40 PM, KAMEZAWA Hiroyuki
kamezawa.hir...@jp.fujitsu.com wrote:
 On Wed, 23 Feb 2011 19:10:33 -0500
 Vivek Goyal vgo...@redhat.com wrote:

 On Thu, Feb 24, 2011 at 12:14:11AM +0100, Andrea Righi wrote:
  On Wed, Feb 23, 2011 at 10:23:54AM -0500, Vivek Goyal wrote:
 Agreed. Granularity at the per-inode level might be acceptable in many
 cases. Again, I am worried about a faster group getting stuck behind a slower
 group.

 I am wondering if we are trying to solve the problem of ASYNC write throttling
 at the wrong layer. Should ASYNC IO be throttled before we allow the task to
 write to the page cache? The way we throttle the process based on dirty ratio, can we
 just check for throttle limits there as well, or something like that? (I think
 that's what you had done in your initial throttling controller
 implementation?)
   
Right. This is exactly the same approach I've used in my old throttling
controller: throttle sync READs and WRITEs at the block layer and async
WRITEs when the task is dirtying memory pages.
   
This is probably the simplest way to resolve the problem of a faster group
getting blocked by a slower group, but the controller will be a little bit
more leaky, because the writeback IO will never be throttled and we'll
see some limited IO spikes during writeback.
  
   Yes, writeback will not be throttled. Not sure how big a problem that is.
  
   - We have controlled the input rate. So that should help a bit.
   - Maybe one can put some high limit on the root cgroup in the blkio throttle
     controller to limit the overall WRITE rate of the system.
   - For SATA disks, try to use CFQ, which can try to minimize the impact of
     WRITEs.
  
   It will at least provide a consistent bandwidth experience to applications.
 
  Right.
 
  
   However, this is still
    a better solution IMHO compared to the current implementation, which is
    affected by that kind of priority inversion problem.
   
    I can try to add this logic to the current blk-throttle controller if
    you think it is worth testing.
  
   At this point I have a few concerns with this approach.
  
   - Configuration issues. Asking the user to plan for SYNC and ASYNC IO
     separately is inconvenient. One has to know the nature of the workload.
  
   - Most likely we will come up with global limits (at least to begin with),
     and not per-device limits. That can lead to contention on one single
     lock and scalability issues on big systems.
  
   Having said that, this approach should reduce the kernel complexity a lot.
   So if we can do some intelligent locking to limit the overhead, then it
   will boil down to reduced complexity in the kernel vs ease of use for the user. I
   guess at this point I am inclined towards keeping it simple in the
   kernel.
  
 
  BTW, with this approach probably we can even get rid of the page
  tracking stuff for now.

 Agreed.

  If we don't consider the swap IO, any other IO
  operation from our point of view will happen directly from process
  context (writes in memory + sync reads from the block device).

 Why do we need to account for swap IO? The application never asked for swap
 IO. It is the kernel's decision to move some pages to swap to free up some
 memory. What's the point in charging those pages to the application's group
 and throttling accordingly?


 I think swap I/O should be controlled by memcg's dirty_ratio.
 But, IIRC, an NEC guy had a requirement for this...

 I think some enterprise customers may want to throttle the whole speed of
 swapout I/O (not swapin)... so they may be glad if they can limit/throttle
 the I/O against a disk partition, or all I/O tagged as 'swapio', rather than
 some cgroup name.

 But I'm afraid slow swapout may consume much of the dirty_ratio and make things
 worse ;)



 
  However, I'm sure we'll need the page tracking also for the blkio
  controller sooner or later. This is important information, and the
  proportional bandwidth controller can also take advantage of it.

 Yes, page tracking will be needed for CFQ proportional bandwidth ASYNC
 write support. But until we implement the memory cgroup dirty ratio and
 figure out a way to make the writeback logic cgroup aware, I think the
 page tracking stuff is not really useful.


 I think Greg Thelen is now preparing patches for dirty_ratio.

 Thanks,
 -Kame



Correct. I am working on the memcg dirty_ratio patches with the latest
mmotm memcg. I am running some test cases which should be complete
tomorrow. Once testing is complete, I will send the patches for
review.


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-23 Thread Balbir Singh
* Andrea Righi ari...@develer.com [2011-02-22 18:12:51]:

 Currently the blkio.throttle controller only supports synchronous IO requests.
 This means that we always look at the current task to identify the owner of
 each IO request.
 
 However, dirty pages in the page cache can be written to disk asynchronously by
 the per-bdi flusher kernel threads or by any other thread in the system,
 according to the writeback policy.
 
 For this reason the real writes to the underlying block devices may
 occur in a different IO context than the task that originally
 generated the dirty pages involved in the IO operation. This makes the
 tracking and throttling of writeback IO more complicated than that of
 synchronous IO from the blkio controller's perspective.
 
 The same concept is also valid for anonymous pages involved in IO operations
 (swap).
 
 This patch set allows tracking the cgroup that originally dirtied each page in
 the page cache and each anonymous page, and passes this information to the
 blk-throttle controller. This information can be used to provide better service
 level differentiation of buffered writes and swap IO between different cgroups.
 
 Testcase
 
 - create a cgroup with 1MiB/s write limit:
   # mount -t cgroup -o blkio none /mnt/cgroup
   # mkdir /mnt/cgroup/foo
   # echo 8:0 $((1024 * 1024)) > /mnt/cgroup/foo/blkio.throttle.write_bps_device
 
 - move a task into the cgroup and run a dd to generate some writeback IO
 
 Results:
   - 2.6.38-rc6 vanilla:
   $ cat /proc/$$/cgroup
   1:blkio:/foo
   $ dd if=/dev/zero of=zero bs=1M count=1024 &
   $ dstat -df
   --dsk/sda--
    read  writ
       0    19M
       0    19M
       0     0
       0     0
       0    19M
   ...
 
   - 2.6.38-rc6 + blk-throttle writeback IO control:
   $ cat /proc/$$/cgroup
   1:blkio:/foo
   $ dd if=/dev/zero of=zero bs=1M count=1024 &
   $ dstat -df
   --dsk/sda--
    read  writ
       0  1024
       0  1024
       0  1024
       0  1024
       0  1024
   ...
 

Thanks for looking into this, further review follows.

-- 
Three Cheers,
Balbir


[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-22 Thread Vivek Goyal
On Tue, Feb 22, 2011 at 06:12:51PM +0100, Andrea Righi wrote:
 Currently the blkio.throttle controller only supports synchronous IO requests.
 This means that we always look at the current task to identify the owner of
 each IO request.
 
 However, dirty pages in the page cache can be written to disk asynchronously by
 the per-bdi flusher kernel threads or by any other thread in the system,
 according to the writeback policy.
 
 For this reason the real writes to the underlying block devices may
 occur in a different IO context than the task that originally
 generated the dirty pages involved in the IO operation. This makes the
 tracking and throttling of writeback IO more complicated than that of
 synchronous IO from the blkio controller's perspective.
 
 The same concept is also valid for anonymous pages involved in IO operations
 (swap).
 
 This patch set allows tracking the cgroup that originally dirtied each page in
 the page cache and each anonymous page, and passes this information to the
 blk-throttle controller. This information can be used to provide better service
 level differentiation of buffered writes and swap IO between different cgroups.
 

Hi Andrea,

Thanks for the patches. Before I look deeper into the patches, I had a few
general queries/thoughts.

- So this requires the memory controller to be enabled. Does it also require
  these to be co-mounted?

- Currently in throttling there is no limit on the number of bios queued
  per group. I think this is not necessarily a very good idea, because
  if throttling limits are low, we will build very long bio queues. So
  some AIO process can queue up lots of bios and consume lots of memory
  without getting blocked. I am sure there will be other side effects
  too. One of the side effects I noticed is that if an AIO process
  queues up too much IO, and if I want to kill it now, it just hangs
  there for a really, really long time (waiting for all the throttled IO
  to complete).

  So I was thinking of implementing either a per-group limit or a per-io-context
  limit, and after that the process will be put to sleep (something
  like the request descriptor mechanism).

  If that's the case, then comes the question of what to do about kernel
  threads. Should they be blocked or not? If these are blocked, then a
  fast group will also be indirectly throttled behind a slow group. If
  they are not, then we still have the problem of too many bios queued
  in the throttling layer.

- What to do about other kernel threads like kjournald, which are doing
  IO on behalf of all the filesystem users? If data is also journalled,
  then I think again everything gets serialized and a faster group gets
  backlogged behind a slower one.

- Two processes doing IO to the same file: a slower group will throttle
  IO for the faster group also (flushing is per inode).

I am not sure what other common operations by kernel threads can
serialize IO in this way.

Thanks
Vivek 
  

 Testcase
 
 - create a cgroup with 1MiB/s write limit:
   # mount -t cgroup -o blkio none /mnt/cgroup
   # mkdir /mnt/cgroup/foo
   # echo 8:0 $((1024 * 1024)) > /mnt/cgroup/foo/blkio.throttle.write_bps_device
 
 - move a task into the cgroup and run a dd to generate some writeback IO
 
 Results:
   - 2.6.38-rc6 vanilla:
   $ cat /proc/$$/cgroup
   1:blkio:/foo
   $ dd if=/dev/zero of=zero bs=1M count=1024 &
   $ dstat -df
   --dsk/sda--
    read  writ
       0    19M
       0    19M
       0     0
       0     0
       0    19M
   ...
 
   - 2.6.38-rc6 + blk-throttle writeback IO control:
   $ cat /proc/$$/cgroup
   1:blkio:/foo
   $ dd if=/dev/zero of=zero bs=1M count=1024 &
   $ dstat -df
   --dsk/sda--
    read  writ
       0  1024
       0  1024
       0  1024
       0  1024
       0  1024
   ...
 
 TODO
 
  - lots of testing
 
 Any feedback is welcome.
 -Andrea
 
 [PATCH 1/5] blk-cgroup: move blk-cgroup.h in include/linux/blk-cgroup.h
 [PATCH 2/5] blk-cgroup: introduce task_to_blkio_cgroup()
 [PATCH 3/5] page_cgroup: make page tracking available for blkio
 [PATCH 4/5] blk-throttle: track buffered and anonymous pages
 [PATCH 5/5] blk-throttle: buffered and anonymous page tracking instrumentation
 
  block/Kconfig   |2 +
  block/blk-cgroup.c  |   15 ++-
  block/blk-cgroup.h  |  335 --
  block/blk-throttle.c|   89 +++-
  block/cfq.h |2 +-
  fs/buffer.c |1 +
 include/linux/blk-cgroup.h  |  341 +++
  include/linux/blkdev.h  |   26 +++-
  include/linux/memcontrol.h  |6 +
  include/linux/mmzone.h  |4 +-
  include/linux/page_cgroup.h |   33 -
  init/Kconfig|4 +
  mm/Makefile |3 +-
  mm/bounce.c |1 +
  mm/filemap.c|1 +
  mm/memcontrol.c |6 +
  mm/memory.c |5 +
  mm/page-writeback.c |1 +
  mm/page_cgroup.c|  129 

[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-22 Thread Andrea Righi
On Tue, Feb 22, 2011 at 02:34:03PM -0500, Vivek Goyal wrote:
 On Tue, Feb 22, 2011 at 06:12:51PM +0100, Andrea Righi wrote:
  Currently the blkio.throttle controller only supports synchronous IO requests.
  This means that we always look at the current task to identify the owner of
  each IO request.
  
  However, dirty pages in the page cache can be written to disk asynchronously by
  the per-bdi flusher kernel threads or by any other thread in the system,
  according to the writeback policy.
  
  For this reason the real writes to the underlying block devices may
  occur in a different IO context than the task that originally
  generated the dirty pages involved in the IO operation. This makes the
  tracking and throttling of writeback IO more complicated than that of
  synchronous IO from the blkio controller's perspective.
  
  The same concept is also valid for anonymous pages involved in IO operations
  (swap).
  
  This patch set allows tracking the cgroup that originally dirtied each page in
  the page cache and each anonymous page, and passes this information to the
  blk-throttle controller. This information can be used to provide better service
  level differentiation of buffered writes and swap IO between different cgroups.
  
 
 Hi Andrea,
 
 Thanks for the patches. Before I look deeper into the patches, I had a few
 general queries/thoughts.
 
 - So this requires the memory controller to be enabled. Does it also require
   these to be co-mounted?

No and no. The blkio controller enables and uses the page_cgroup
functionality, but it doesn't depend on the memory controller. It
automatically selects CONFIG_MM_OWNER and CONFIG_PAGE_TRACKING (the latter
added in PATCH 3/5), and this is sufficient to make page_cgroup
usable from any generic controller.
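(As a rough illustration of what the tracking gives us; the types below are
simplified stand-ins, the real code is in the page_cgroup and blk-throttle
patches of this series. The id of the cgroup that dirtied a page is recorded
from process context at dirty time and read back when writeback finally
submits the IO, so the bio can be charged to the right group even though it is
submitted by a flusher thread.)

    /* Simplified stand-in for the per-page tracking information. */
    struct page_track {
            unsigned short blkio_id;   /* id of the cgroup that dirtied the page */
    };

    /* Called from the process context that dirties the page. */
    static void track_dirtier(struct page_track *pt, unsigned short blkio_id)
    {
            pt->blkio_id = blkio_id;
    }

    /* Called at bio submission time, possibly from a flusher thread. */
    static unsigned short page_dirtier(const struct page_track *pt)
    {
            return pt->blkio_id;   /* charge the bio to this group's limits */
    }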

 
 - Currently in throttling there is no limit on the number of bios queued
   per group. I think this is not necessarily a very good idea, because
   if throttling limits are low, we will build very long bio queues. So
   some AIO process can queue up lots of bios and consume lots of memory
   without getting blocked. I am sure there will be other side effects
   too. One of the side effects I noticed is that if an AIO process
   queues up too much IO, and if I want to kill it now, it just hangs
   there for a really, really long time (waiting for all the throttled IO
   to complete).
 
   So I was thinking of implementing either a per-group limit or a per-io-context
   limit, and after that the process will be put to sleep (something
   like the request descriptor mechanism).

The io context limit seems a better solution for now. We can also expect
some help from the memory controller: if we have a per-cgroup dirty memory
limit in the future, the maximum number of queued bios will be
automatically limited by this functionality.
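(Something along these lines, I guess; a self-contained sketch with stand-in
types, not the real io_context or request-descriptor code. Each context counts
its queued throttled bios, the submitter is made to wait once a maximum is
reached, and completions release the budget.)

    /* Simplified stand-in for a per-process IO context. */
    struct io_ctx {
            unsigned int nr_queued;   /* throttled bios currently queued */
            unsigned int max_queued;  /* cap, in the spirit of nr_requests */
    };

    /* Returns 1 if one more throttled bio may be queued, 0 if the caller must wait. */
    static int may_queue_throttled(struct io_ctx *ioc)
    {
            if (ioc->nr_queued >= ioc->max_queued)
                    return 0;         /* put the submitter to sleep, not the flusher */
            ioc->nr_queued++;
            return 1;
    }

    /* Called when a throttled bio completes. */
    static void throttled_bio_done(struct io_ctx *ioc)
    {
            if (ioc->nr_queued)
                    ioc->nr_queued--;
    }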

 
   If that's the case, then comes the question of what to do about kernel
   threads. Should they be blocked or not? If these are blocked, then a
   fast group will also be indirectly throttled behind a slow group. If
   they are not, then we still have the problem of too many bios queued
   in the throttling layer.

I think kernel threads should never be forced to sleep, to avoid the
classic priority inversion problem and potential DoS in the
system.

Also for this part the per-cgroup dirty memory limit could help a lot,
because a cgroup will never exceed its quota of dirty memory, so it
will not be able to submit more than a certain amount of bios
(corresponding to the dirty memory limit).

 
 - What to do about other kernel threads like kjournald, which are doing
   IO on behalf of all the filesystem users? If data is also journalled,
   then I think again everything gets serialized and a faster group gets
   backlogged behind a slower one.

This is the most critical issue IMHO.

The blkio controller would need some help from the filesystems to
understand which IO requests can be throttled and which cannot. At the
moment critical IO requests (by critical I mean those that are a dependency
for other requests) and non-critical requests are mixed together in such a
way that throttling a single request may stop a lot of other requests in
the system, and at the block layer it's not possible to retrieve such
information.

I don't have a solution for this right now, except looking at each
filesystem implementation and trying to understand how to pass this
information to the block layer.

 
 - Two processes doing IO to the same file: a slower group will throttle
   IO for the faster group also (flushing is per inode).
 

I think we should accept inode granularity. We could redesign
the writeback code to work per-cgroup / per-page, etc., but that would
add a huge overhead. Inode granularity could be an
acceptable tradeoff: cgroups are usually supposed to work on different files,
well.. except when databases come into play (ouch!).

Thanks,
-Andrea

[Devel] Re: [PATCH 0/5] blk-throttle: writeback and swap IO control

2011-02-22 Thread Vivek Goyal
On Tue, Feb 22, 2011 at 11:41:41PM +0100, Andrea Righi wrote:
 On Tue, Feb 22, 2011 at 02:34:03PM -0500, Vivek Goyal wrote:
  On Tue, Feb 22, 2011 at 06:12:51PM +0100, Andrea Righi wrote:
   Currently the blkio.throttle controller only supports synchronous IO requests.
   This means that we always look at the current task to identify the owner of
   each IO request.
   
   However, dirty pages in the page cache can be written to disk asynchronously by
   the per-bdi flusher kernel threads or by any other thread in the system,
   according to the writeback policy.
   
   For this reason the real writes to the underlying block devices may
   occur in a different IO context than the task that originally
   generated the dirty pages involved in the IO operation. This makes the
   tracking and throttling of writeback IO more complicated than that of
   synchronous IO from the blkio controller's perspective.
   
   The same concept is also valid for anonymous pages involved in IO operations
   (swap).
   
   This patch set allows tracking the cgroup that originally dirtied each page in
   the page cache and each anonymous page, and passes this information to the
   blk-throttle controller. This information can be used to provide better service
   level differentiation of buffered writes and swap IO between different cgroups.
   
  
  Hi Andrea,
  
  Thanks for the patches. Before I look deeper into the patches, I had a few
  general queries/thoughts.
  
  - So this requires the memory controller to be enabled. Does it also require
    these to be co-mounted?
 
 No and no. The blkio controller enables and uses the page_cgroup
 functionality, but it doesn't depend on the memory controller. It
 automatically selects CONFIG_MM_OWNER and CONFIG_PAGE_TRACKING (the latter
 added in PATCH 3/5), and this is sufficient to make page_cgroup
 usable from any generic controller.
 
  
  - Currently in throttling there is no limit on the number of bios queued
    per group. I think this is not necessarily a very good idea, because
    if throttling limits are low, we will build very long bio queues. So
    some AIO process can queue up lots of bios and consume lots of memory
    without getting blocked. I am sure there will be other side effects
    too. One of the side effects I noticed is that if an AIO process
    queues up too much IO, and if I want to kill it now, it just hangs
    there for a really, really long time (waiting for all the throttled IO
    to complete).
  
    So I was thinking of implementing either a per-group limit or a per-io-context
    limit, and after that the process will be put to sleep (something
    like the request descriptor mechanism).
 
 The io context limit seems a better solution for now. We can also expect
 some help from the memory controller: if we have a per-cgroup dirty memory
 limit in the future, the maximum number of queued bios will be
 automatically limited by this functionality.
 
  
   If that's the case, then comes the question of what to do about kernel
    threads. Should they be blocked or not? If these are blocked, then a
    fast group will also be indirectly throttled behind a slow group. If
    they are not, then we still have the problem of too many bios queued
    in the throttling layer.
 
 I think kernel threads should never be forced to sleep, to avoid the
 classic priority inversion problem and potential DoS in the
 system.
 
 Also for this part the per-cgroup dirty memory limit could help a lot,
 because a cgroup will never exceed its quota of dirty memory, so it
 will not be able to submit more than a certain amount of bios
 (corresponding to the dirty memory limit).

Per-memory-cgroup dirty ratio should help a bit. But with intentional
throttling we always run the risk of faster groups getting stuck behind
slower groups.

Even in the case of buffered WRITES, are you able to run two buffered
WRITE streams in two groups and throttle them to their respective rates?
It might be interesting to run that and see what happens.

Practically, I feel we shall have to run this with a per-cgroup memory
dirty ratio, and hence co-mount with the memory controller.

 
  
  - What to do about other kernel threads like kjournald, which are doing
    IO on behalf of all the filesystem users? If data is also journalled,
    then I think again everything gets serialized and a faster group gets
    backlogged behind a slower one.
 
 This is the most critical issue IMHO.
 
 The blkio controller would need some help from the filesystems to
 understand which IO requests can be throttled and which cannot. At the
 moment critical IO requests (by critical I mean those that are a dependency
 for other requests) and non-critical requests are mixed together in such a
 way that throttling a single request may stop a lot of other requests in
 the system, and at the block layer it's not possible to retrieve such
 information.
 
 I don't have a solution for this right now, except looking at each
 filesystem implementation and trying to