> On Oct. 31, 2013, 10:18 p.m., Ian Downes wrote:
> > src/slave/cgroups_isolator.cpp, line 782
> > <https://reviews.apache.org/r/15015/diff/1/?file=372971#file372971line782>
> >
> >     Why don't we correct the naming while we're at it:
> >     
> >     total_memory      <- memory.usage_in_bytes, this is the primary memory 
> > statistic
> >     
> >     # and then the breakdown: 
> >     total_rss         <- total_rss from memory.stat
> >     total_cached      <- total_cache from memory.stat
> >     total_mapped_file <- total_mapped_file from memory.stat
> >     
> >     We probably don't need to include the total_ prefix?
> >     
> >     An executor's memory limit then matches with total_memory and, only if 
> > necessary, one can look at the breakdown.
> 
> Eric Biederman wrote:
>     I don't follow.
>     
>     We absolutely need the total_ prefix when reading the fields from the 
> kernel.  And it is the kernel's memory.stat file where the naming is 
> deceptive, and I can't fix that with a mesos patch.
>     
>     I did what I can do with a mesos patch, which is to name the fields in 
> ResourceStatistics and in the HTTP interface in a non-deceptive way.
>     
>     We absolutely do not want to even consider passing the kernel names 
> straight through, as those names are wrong.  The borked total_rss field name 
> in particular is completely misleading.
>     
>     The current mapping implemented by this patch is:
>     
>     mem_rss_bytes  <- memory.usage_in_bytes
>     mem_anon_bytes <- memory.stat:total_rss
>     mem_file_bytes <- memory.stat:total_cache
>     mem_mapped_file_bytes <- memory.stat:total_mapped_file
>     
>     I would include mem_mapped_anon_bytes but by definition that is exactly 
> the same as mem_anon_bytes.
>     
>     I can see renaming mem_rss_bytes to something like mem_usage_bytes; 
> however, that is going to break existing users of mesos, and breaking existing 
> users is a bad idea.  And resident set size and usage are synonyms, so I don't 
> see any pressing need for a rename there.
>     
>     
>
> 
> Ian Downes wrote:
>     Sorry, I meant we shouldn't expose them as total_*, but you're not doing 
> that, so please disregard.
>     
>     With the patch's renaming we'd have
>     cgroups: mem_rss_bytes is not rss but the total
>     process: mem_rss_bytes is rss and not the total
>     
>     So running the same executor under different isolators would report 
> (potentially) very different numbers?

The defining question is: How much memory must be available for a job to run 
successfully and at full performance?

When this question is considered in theory, it is stated as: what is the minimum 
resident set size for a job to run successfully at full performance?

The resident set size is how much of the job is in memory at one time, with the 
rest of the job on disk.


With my three patches applied the cgroups mem_rss_bytes is exactly the job's 
current rss.  The other numbers can help estimate the minimum rss, but what we 
measure is in fact how much of the job is in memory at the current instant.  
That is the rss.
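
To make that concrete, here is a minimal sketch of where the cgroup number comes 
from.  This is not the isolator's actual code; the function name is made up, and 
the real isolator resolves the cgroup path per executor.

#include <fstream>
#include <string>

// Hedged sketch: memory.usage_in_bytes charges both anonymous pages and page
// cache to the cgroup, so it measures the rss of the job as a whole.
long long usageInBytes(const std::string& cgroup)
{
  long long bytes = 0;
  std::ifstream in(cgroup + "/memory.usage_in_bytes");
  in >> bytes;
  return bytes;
}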


The rss of a process and the rss of a job are subtly different.  When 
measuring a single process, the rss is just a count of the resident pages in 
the address space of that process.  For a job's rss, pages that are mapped into 
two different processes are counted only once, and pages used for other 
purposes, like file I/O, are also counted.


The process isolator does not accurately compute the rss of a job.  It reports 
as mem_rss_bytes the simple sum of the rss of every process, which means it 
overcounts pages that are mapped into multiple processes and does not count 
cached files at all.  Depending on how different jobs use memory, the process 
isolator can be wildly inaccurate.
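
For comparison, here is a rough sketch of what that summed-per-process 
measurement amounts to: reading the resident page count from /proc/<pid>/statm 
for every process in the job.  This is only an illustration, not the actual 
process isolator code.

#include <fstream>
#include <string>
#include <vector>
#include <sys/types.h>
#include <unistd.h>

// Hedged sketch: pages shared between processes are counted once per process,
// and the page cache used on the job's behalf is not counted at all.
long long summedRssBytes(const std::vector<pid_t>& pids)
{
  long long pages = 0;
  for (pid_t pid : pids) {
    std::ifstream statm("/proc/" + std::to_string(pid) + "/statm");
    long long size = 0, resident = 0;
    if (statm >> size >> resident) {
      pages += resident;  // The second field of statm is resident pages.
    }
  }
  return pages * sysconf(_SC_PAGESIZE);
}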


For a job consisting of a single process that does minimal file I/O, the process 
isolator will report a number that is close to how much memory is being used to 
run the process.

So for a job of one process that does minimal file I/O, the two numbers reported 
by the cgroup isolator and the process isolator should be comparable.

Approximations to the minimal rss computed using the cgroup isolator should 
work for sizing memory anywhere.

Approximations to the minimal rss computed from the process isolator will only 
approach being accurate if the jobs do minimal file I/O and have very few file 
pages (think text) shared between processes.


For sizing, the practical challenge with the cgroup isolator is that if a job is 
over-provisioned it may still use all of the memory up to its memory limit to 
cache files, because caching old file data is typically better than leaving the 
memory sitting empty.
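
One way a framework could use the new breakdown to size such a job is to treat 
the unmapped page cache as reclaimable.  A hedged illustration; the heuristic 
and the function name are mine, not part of the patch:

// Anonymous memory plus mapped files need to stay resident for the job to run
// at full performance; the rest of mem_file_bytes is unmapped page cache that
// the kernel can reclaim if the limit is lowered.
long long estimatedMinimumBytes(long long mem_anon_bytes,
                                long long mem_mapped_file_bytes)
{
  return mem_anon_bytes + mem_mapped_file_bytes;
}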

The process isolator can be so wildly inaccurate in the worst case that it is 
hard to reason about.  In particular, the worst case for the process isolator 
looks something like:

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(void) {
  // Allocate and touch 1 GB of memory so the pages are mapped into our
  // address space; every forked child then shares the same pages.
  char *ptr = calloc(1, 1024 * 1024 * 1024);
  memset(ptr, 1, 1024 * 1024 * 1024);
  for (;;) {
    fork();
  }
}

In which case there will be a large number of processes all with a 1 GB rss, 
but each using only a few additional kilobytes of memory.

Does that help put these numbers into perspective?


- Eric


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15015/#review27967
-----------------------------------------------------------


On Oct. 29, 2013, 6:53 a.m., Eric Biederman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15015/
> -----------------------------------------------------------
> 
> (Updated Oct. 29, 2013, 6:53 a.m.)
> 
> 
> Review request for mesos, Ben Mahler, Ian Downes, and Vinod Kone.
> 
> 
> Bugs: Mesos-758
>     https://issues.apache.org/jira/browse/Mesos-758
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This reports a basic breakdown of memory usage so jobs whose page
> cache expands to their configured hard limit will have information
> that allows them to size their jobs.
> 
> This does not include all fields from memory.stat, nor even all total_
> fields from memory.stat, but rather the fields I have reviewed, seen a
> need for, and have a reasonable expectation will report useful
> information and work as designed.
> 
> Quite a few of the fields appear only useful for kernel level debugging.
> 
> In particular the fields:
> memory.stat:total_cache         which reports the amount of file-backed memory in bytes
> memory.stat:total_rss           which reports the amount of anonymous memory in bytes
> memory.stat:total_mapped_file   which reports the amount of mapped file memory in bytes
> 
> Reading memory.stat is not quite atomic, and neither is computing the
> values, as even in the simplest cases the values are sums of per-CPU
> counters of the various statistics.  However, for the most part the
> numbers are a reasonable snapshot of the state of the cgroup at a
> particular instant.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto fe1d82b 
>   src/slave/cgroups_isolator.cpp fc36f1f 
>   src/slave/monitor.cpp 9cb6256 
>   src/tests/monitor_tests.cpp 3d3f8af 
> 
> Diff: https://reviews.apache.org/r/15015/diff/
> 
> 
> Testing
> -------
> 
> make and make check pass
> The code is a trivial extension of what already exists.
> 
> 
> Thanks,
> 
> Eric Biederman
> 
>
