[Devel] Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces

Greg Thelen Tue, 19 Oct 2010 21:02:51 -0700

KAMEZAWA Hiroyuki <[email protected]> writes:

> On Tue, 19 Oct 2010 14:00:58 -0700
> Greg Thelen <[email protected]> wrote:
>
>> Daisuke Nishimura <[email protected]> writes:
>> 
>> > On Mon, 18 Oct 2010 17:39:35 -0700
>> > Greg Thelen <[email protected]> wrote:
>> >
>> >> Document cgroup dirty memory interfaces and statistics.
>> >> 
>> >> Signed-off-by: Andrea Righi <[email protected]>
>> >> Signed-off-by: Greg Thelen <[email protected]>
>> >> ---
>> >> 
>> >> Changelog since v1:
>> >> - Renamed "nfs"/"total_nfs" to "nfs_unstable"/"total_nfs_unstable" in per cgroup
>> >>   memory.stat to match /proc/meminfo.
>> >> 
>> >> - Allow [kKmMgG] suffixes for newly created dirty limit value cgroupfs files.
>> >> 
>> >> - Describe a situation where a cgroup can exceed its dirty limit.
>> >> 
>> >>  Documentation/cgroups/memory.txt |   60 ++++++++++++++++++++++++++++++++++++++
>> >>  1 files changed, 60 insertions(+), 0 deletions(-)
>> >> 
>> >> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> >> index 7781857..02bbd6f 100644
>> >> --- a/Documentation/cgroups/memory.txt
>> >> +++ b/Documentation/cgroups/memory.txt
>> >> @@ -385,6 +385,10 @@ mapped_file  - # of bytes of mapped file (includes tmpfs/shmem)
>> >>  pgpgin           - # of pages paged in (equivalent to # of charging events).
>> >>  pgpgout          - # of pages paged out (equivalent to # of uncharging events).
>> >>  swap             - # of bytes of swap usage
>> >> +dirty            - # of bytes that are waiting to get written back to the disk.
>> >> +writeback        - # of bytes that are actively being written back to the disk.
>> >> +nfs_unstable     - # of bytes sent to the NFS server, but not yet committed to
>> >> +         the actual storage.
>> >>  inactive_anon    - # of bytes of anonymous memory and swap cache memory on
>> >>           LRU list.
>> >>  active_anon      - # of bytes of anonymous and swap cache memory on active
>> >
>> > Shouldn't we add description of "total_dirty/writeback/nfs_unstable" too ?
>> > Seeing [5/11], it will be shown in memory.stat.
>> 
>> Good catch.  See patch (below).
>> 
>> >> @@ -453,6 +457,62 @@ memory under it will be reclaimed.
>> >>  You can reset failcnt by writing 0 to failcnt file.
>> >>  # echo 0 > .../memory.failcnt
>> >>  
>> >> +5.5 dirty memory
>> >> +
>> >> +Control the maximum amount of dirty pages a cgroup can have at any given time.
>> >> +
>> >> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>> >> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>> >> +not be able to consume more than their designated share of dirty pages and will
>> >> +be forced to perform write-out if they cross that limit.
>> >> +
>> >> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
>> >> +is possible to configure a limit to trigger both a direct writeback or a
>> >> +background writeback performed by per-bdi flusher threads.  The root cgroup
>> >> +memory.dirty_* control files are read-only and match the contents of
>> >> +the /proc/sys/vm/dirty_* files.
>> >> +
>> >> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
>> >> +
>> >> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
>> >> +  cgroup memory) at which a process generating dirty pages will itself start
>> >> +  writing out dirty data.
>> >> +
>> >> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
>> >> +  in the cgroup at which a process generating dirty pages will start itself
>> >> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
>> >> +  that value is kilo, mega or gigabytes.
>> >> +
>> >> +  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
>> >> +  Only one of them may be specified at a time.  When one is written it is
>> >> +  immediately taken into account to evaluate the dirty memory limits and the
>> >> +  other appears as 0 when read.
>> >> +
>> >> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
>> >> +  (expressed as a percentage of cgroup memory) at which background writeback
>> >> +  kernel threads will start writing out dirty data.
>> >> +
>> >> +- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
>> >> +  in bytes) in the cgroup at which background writeback kernel threads will
>> >> +  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
>> >> +  indicate that value is kilo, mega or gigabytes.
>> >> +
>> >> +  Note: memory.dirty_background_limit_in_bytes is the counterpart of
>> >> +  memory.dirty_background_ratio.  Only one of them may be specified at a time.
>> >> +  When one is written it is immediately taken into account to evaluate the dirty
>> >> +  memory limits and the other appears as 0 when read.
>> >> +
>> >> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
>> >> +because of the principle that the first cgroup to touch a page is charged for
>> >> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
>> >> +counted to the originally charged cgroup.
>> >> +
>> >> +Example: If page is allocated by a cgroup A task, then the page is charged to
>> >> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
>> >> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
>> >> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
>> >> +over its dirty limit without throttling the dirtying cgroup B task.
>> >> +
>> >>  6. Hierarchy support
>> >>  
>> >>  The memory controller supports a deep hierarchy and hierarchical accounting.
>> >> -- 
>> >> 1.7.1
>> >> 
>> > Can you clarify whether we can limit the "total" dirty pages under hierarchy
>> > in use_hierarchy==1 case ?
>> > If we can, I think it would be better to note it in this documentation.
>> >
>> >
>> > Thanks,
>> > Daisuke Nishimura.
>> 
>> Here is a second version of this -v3 doc patch:
>> 
>> Author: Greg Thelen <[email protected]>
>> Date:   Sat Apr 10 15:34:28 2010 -0700
>> 
>>     memcg: document cgroup dirty memory interfaces
>>     
>>     Document cgroup dirty memory interfaces and statistics.
>>     
>>     Signed-off-by: Andrea Righi <[email protected]>
>>     Signed-off-by: Greg Thelen <[email protected]>
>> 
>
> nitpicks. and again, why you always drop Acks ?


I dropped acks because the patch changed and I did not want to assume
that it was still acceptable.  Is this incorrect protocol?

>> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
>> index 7781857..8bf6d3b 100644
>> --- a/Documentation/cgroups/memory.txt
>> +++ b/Documentation/cgroups/memory.txt
>> @@ -385,6 +385,10 @@ mapped_file     - # of bytes of mapped file (includes tmpfs/shmem)
>>  pgpgin              - # of pages paged in (equivalent to # of charging events).
>>  pgpgout             - # of pages paged out (equivalent to # of uncharging events).
>>  swap                - # of bytes of swap usage
>> +dirty               - # of bytes that are waiting to get written back to the disk.
>
> extra tab ?

There is no extra tab here.  It's a display artifact.  When the patch is
applied the columns line up.

>> +writeback   - # of bytes that are actively being written back to the disk.
>> +nfs_unstable        - # of bytes sent to the NFS server, but not yet committed to
>> +            the actual storage.
>>  inactive_anon       - # of bytes of anonymous memory and swap cache memory on
>>              LRU list.
>>  active_anon - # of bytes of anonymous and swap cache memory on active
>> @@ -406,6 +410,9 @@ total_mapped_file        - sum of all children's "cache"
>>  total_pgpgin                - sum of all children's "pgpgin"
>>  total_pgpgout               - sum of all children's "pgpgout"
>>  total_swap          - sum of all children's "swap"
>> +total_dirty         - sum of all children's "dirty"
>> +total_writeback             - sum of all children's "writeback"
>
> here, too.

There is no extra tab here.  It's a display artifact.  When the patch is
applied the columns line up.

>> +total_nfs_unstable  - sum of all children's "nfs_unstable"
>>  total_inactive_anon - sum of all children's "inactive_anon"
>>  total_active_anon   - sum of all children's "active_anon"
>>  total_inactive_file - sum of all children's "inactive_file"
>> @@ -453,6 +460,71 @@ memory under it will be reclaimed.
>>  You can reset failcnt by writing 0 to failcnt file.
>>  # echo 0 > .../memory.failcnt
>>  
>> +5.5 dirty memory
>> +
>> +Control the maximum amount of dirty pages a cgroup can have at any given time.
>> +
>> +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
>> +page cache used by a cgroup.  So, in case of multiple cgroup writers, they will
>> +not be able to consume more than their designated share of dirty pages and will
>> +be forced to perform write-out if they cross that limit.
>> +
>> +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.  It
>> +is possible to configure a limit to trigger both a direct writeback or a
>> +background writeback performed by per-bdi flusher threads.  The root cgroup
>> +memory.dirty_* control files are read-only and match the contents of
>> +the /proc/sys/vm/dirty_* files.
>> +
>> +Per-cgroup dirty limits can be set using the following files in the cgroupfs:
>> +
>> +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of
>> +  cgroup memory) at which a process generating dirty pages will itself start
>> +  writing out dirty data.
>> +
>> +- memory.dirty_limit_in_bytes: the amount of dirty memory (expressed in bytes)
>> +  in the cgroup at which a process generating dirty pages will start itself
>> +  writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to indicate
>> +  that value is kilo, mega or gigabytes.
>> +
>> +  Note: memory.dirty_limit_in_bytes is the counterpart of memory.dirty_ratio.
>> +  Only one of them may be specified at a time.  When one is written it is
>> +  immediately taken into account to evaluate the dirty memory limits and the
>> +  other appears as 0 when read.
>> +
>> +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup
>> +  (expressed as a percentage of cgroup memory) at which background writeback
>> +  kernel threads will start writing out dirty data.
>> +
>> +- memory.dirty_background_limit_in_bytes: the amount of dirty memory (expressed
>> +  in bytes) in the cgroup at which background writeback kernel threads will
>> +  start writing out dirty data.  Suffix (k, K, m, M, g, or G) can be used to
>> +  indicate that value is kilo, mega or gigabytes.
>> +
>> +  Note: memory.dirty_background_limit_in_bytes is the counterpart of
>> +  memory.dirty_background_ratio.  Only one of them may be specified at a time.
>> +  When one is written it is immediately taken into account to evaluate the dirty
>> +  memory limits and the other appears as 0 when read.
>> +
>> +A cgroup may contain more dirty memory than its dirty limit.  This is possible
>> +because of the principle that the first cgroup to touch a page is charged for
>> +it.  Subsequent page counting events (dirty, writeback, nfs_unstable) are also
>> +counted to the originally charged cgroup.
>> +
>> +Example: If page is allocated by a cgroup A task, then the page is charged to
>> +cgroup A.  If the page is later dirtied by a task in cgroup B, then the cgroup A
>> +dirty count will be incremented.  If cgroup A is over its dirty limit but cgroup
>> +B is not, then dirtying a cgroup A page from a cgroup B task may push cgroup A
>> +over its dirty limit without throttling the dirtying cgroup B task.
>> +
>> +When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
>> +
>> +When use_hierarchy=1, a parent cgroup increasing its dirty memory usage will
>> +compare its total_dirty memory (which includes sum of all child cgroup dirty
>> +memory) to its dirty limits.  This keeps a parent from explicitly exceeding its
>> +dirty limits.  However, a child cgroup can increase its dirty usage without
>> +considering the parent's dirty limits.  Thus the parent's total_dirty can exceed
>> +the parent's dirty limits as a child dirties pages.
>
> Hmm. In short, dirty_ratio in use_hierarchy=1 doesn't work as a user
> expects.  Is this a spec. or a current implementation ?

This limitation is due to the current implementation.  I agree that it
is not perfect.  We could extend the page-writeback.c changes, PATCH
11/11 ( http://marc.info/?l=linux-mm&m=128744907030215 ), to also check
the dirty limit of each parent in the memcg hierarchy.  This would walk
up the tree until root or a cgroup with use_hierarchy=0 is found.
Alternatively, we could provide this functionality in a later patch
series.  The changes to page-writeback.c may be significant.
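
As a rough illustration of that walk, here is a userspace Python model (not
kernel code).  The Memcg class, its fields, and both helper functions are
hypothetical names chosen for this sketch.  It assumes a child is accounted
to its parent only when the parent has use_hierarchy=1, and it uses plain
byte counts instead of pages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memcg:
    """Toy stand-in for a memory cgroup; every field is illustrative."""
    name: str
    dirty_limit: int                  # dirty limit in bytes
    parent: Optional["Memcg"] = None
    use_hierarchy: bool = True
    total_dirty: int = 0              # own dirty bytes plus accounted children


def hierarchy_parent(node: Memcg) -> Optional[Memcg]:
    """Return the parent only if it accounts its children (assumption)."""
    parent = node.parent
    if parent is not None and parent.use_hierarchy:
        return parent
    return None


def account_dirty(memcg: Memcg, nbytes: int) -> None:
    """Add nbytes of dirty memory to memcg and to each accounting ancestor."""
    node = memcg
    while node is not None:
        node.total_dirty += nbytes
        node = hierarchy_parent(node)


def over_dirty_limit(memcg: Memcg) -> Optional[Memcg]:
    """Walk from memcg towards the root and return the first cgroup whose
    total_dirty exceeds its own limit, or None if all levels are within
    their limits.  The walk ends at the root or when the next parent has
    use_hierarchy=0."""
    node = memcg
    while node is not None:
        if node.total_dirty > node.dirty_limit:
            return node
        node = hierarchy_parent(node)
    return None
```

With a parent A and children B and C, each limited to 10 bytes, dirtying 10
bytes in B and 10 in C leaves B and C at their limits while the walk from B
reports A as over its limit.  The cost is one extra check per level of the
hierarchy on every dirtying event, which is why the page-writeback.c change
may be significant.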

> I think as follows.
>  - add a limitation as "At setting children's dirty_ratio, it must be
>    below parent's.  If it exceeds parent's dirty_ratio, EINVAL is
>    returned."
>
> Could you modify setting memory.dirty_ratio code ?

I assume we are only talking about the use_hierarchy=1 case.  What if
the parent ratio is changed?  If we want to ensure that child ratios are
never larger than the parent's, then the code must check every child
cgroup to ensure that each child ratio is <= the new parent ratio.
Correct?
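
A small Python model of such a setter, covering both directions, might look
like the following.  This is only a sketch under assumptions: RatioCgroup
and set_dirty_ratio() are hypothetical names, and rejecting a parent write
that would undercut a child is just one possible policy (clamping the
children would be another).

```python
import errno


class RatioCgroup:
    """Toy cgroup that tracks only a dirty ratio (percent) and its tree links."""

    def __init__(self, name, dirty_ratio, parent=None):
        self.name = name
        self.dirty_ratio = dirty_ratio
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def set_dirty_ratio(cg, new_ratio):
    """Return 0 on success, or -EINVAL if the write would let any child
    ratio exceed its parent's ratio."""
    if not 0 <= new_ratio <= 100:
        return -errno.EINVAL
    # Raising a child: it must stay at or below its parent.
    if cg.parent is not None and new_ratio > cg.parent.dirty_ratio:
        return -errno.EINVAL
    # Lowering a parent: every direct child must still fit.  Deeper
    # descendants are already <= their own parent, so they fit too.
    if any(child.dirty_ratio > new_ratio for child in cg.children):
        return -errno.EINVAL
    cg.dirty_ratio = new_ratio
    return 0
```

In this model, lowering a parent below an existing child ratio fails until
the child has been lowered first.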

Even if we manage to prevent all child ratios from exceeding parent
ratios, we still have the problem that the sum of the child ratios may
exceed the parent's.  Example:
         A (10%)
   B (10%)   C (10%)

There would be nothing to prevent the A, B and C dirty ratios from all
being set to 10% as shown.  The current implementation would allow B and
C to each reach 10%, thereby pushing A to 20%.  We could require that
the child dirty limits together fit within the parent dirty limit, so
(B+C<=A).  This would allow for:

        A (10%)
   B (7%)   C (3%)

If we had this 10/7/3 limiting code, which statically partitions dirty
memory usage, then we would not need to walk up the memcg tree
checking each parent.  This is nice because it only complicates the
setting of dirty limits, which is not part of the performance path.
However, static partitioning has limitations.  If the system has a
dirty ratio of 50% and we create 100 cgroups with equal dirty limits,
the dirty limit for each memcg would be 0.5%.
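
The static partitioning rule and the arithmetic above can be sketched in a
few lines of Python.  Both function names are hypothetical and the model
only deals with ratios in percent, so treat it as an illustration of the
constraint rather than a proposed implementation.

```python
def can_set_child_ratio(parent_ratio, sibling_ratios, new_ratio):
    """True if new_ratio plus the siblings' ratios still fits within the
    parent's ratio, i.e. the sum of all children stays <= the parent."""
    return new_ratio + sum(sibling_ratios) <= parent_ratio


def equal_share(parent_ratio, nr_children):
    """Ratio each child gets if the parent's ratio is split evenly."""
    return parent_ratio / nr_children
```

For the trees above, can_set_child_ratio(10, [7], 3) accepts the 10/7/3
layout while can_set_child_ratio(10, [10], 10) rejects the 10/10/10 one, and
equal_share(50, 100) gives the 0.5% per-memcg limit.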

> Then, parent's dirty_ratio will never exceed its own. (If I
> understand correctly.)
>
> "memory.dirty_limit_in_bytes" will be a bit more complicated, but I
> think you can.
>
>
> Thanks,
> -Kame

KAMEZAWA Hiroyuki <[email protected]> writes:
> I'd like to consider a patch.  Please mention that "use_hierarchy=1
> case depends on implementation." for now.

I will clarify the current implementation behavior in the documentation.
A later patch series can change the use_hierarchy=1 behavior.


KAMEZAWA Hiroyuki <[email protected]> writes:
> BTW, how about supporting dirty_limit_in_bytes when use_hierarchy=0 or
> leave it as broken when use_hierarchy=1 ?  It seems we can only
> support dirty_ratio when hierarchy is used.

I am not sure what you mean here.  Are you suggesting that we prohibit
usage of dirty limits/ratios when use_hierarchy=1?  This is appealing
because it does not expose the user to unexpected behavior.  Only the
well supported case would be configurable.
_______________________________________________
Containers mailing list
[email protected]
https://lists.linux-foundation.org/mailman/listinfo/containers

_______________________________________________
Devel mailing list
[email protected]
https://openvz.org/mailman/listinfo/devel
