> On Oct. 31, 2013, 10:18 p.m., Ian Downes wrote:
> > src/slave/cgroups_isolator.cpp, line 782
> > <https://reviews.apache.org/r/15015/diff/1/?file=372971#file372971line782>
> >
> >     Why don't we correct the naming while we're at it:
> >     
> >     total_memory      <- memory.usage_in_bytes, this is the primary memory statistic
> >     
> >     # and then the breakdown: 
> >     total_rss         <- total_rss from memory.stat
> >     total_cached      <- total_cache from memory.stat
> >     total_mapped_file <- total_mapped_file from memory.stat
> >     
> >     We probably don't need to include the total_ prefix?
> >     
> >     An executor's memory limit then matches with total_memory and, only if necessary, one can look at the breakdown.
> 
> Eric Biederman wrote:
>     I don't follow.
>     
>     We absolutely need the total_ prefix when reading the fields from the kernel.  And it is the kernel's memory.stat file where the naming is deceptive; I can't fix that with a mesos patch.
>     
>     I did do what I can do with a mesos patch, which is to name the fields in ResourceStatistics and in the HTTP interface in a non-deceptive way.
>     
>     We absolutely do not even want to consider passing the kernel names straight through, as those names are wrong, especially the borked total_rss field name, which is totally misleading.
>     
>     The current mapping implemented by this patch is:
>     
>     mem_rss_bytes  <- memory.usage_in_bytes
>     mem_anon_bytes <- memory.stat:total_rss
>     mem_file_bytes <- memory.stat:total_cache
>     mem_mapped_file_bytes <- memory.stat:total_mapped_file
>     
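>     Concretely, producing those four numbers comes down to reading two control files, roughly like the sketch below (illustrative only, not the actual cgroups_isolator.cpp code; the cgroup path is just an example):
>
>     // Illustrative sketch only -- not the actual Mesos code.
>     #include <cstdint>
>     #include <fstream>
>     #include <iostream>
>     #include <map>
>     #include <string>
>
>     int main() {
>       // Example path: the root memory cgroup; the slave would use the
>       // executor's cgroup instead.
>       const std::string cgroup = "/sys/fs/cgroup/memory";
>
>       // memory.usage_in_bytes is the total charged to the cgroup,
>       // anonymous pages and page cache alike.
>       std::uint64_t usage = 0;
>       std::ifstream usageFile(cgroup + "/memory.usage_in_bytes");
>       usageFile >> usage;
>
>       // memory.stat is a sequence of "name value" lines; the total_*
>       // entries include any descendant cgroups.
>       std::map<std::string, std::uint64_t> stat;
>       std::ifstream statFile(cgroup + "/memory.stat");
>       std::string name;
>       std::uint64_t value;
>       while (statFile >> name >> value) {
>         stat[name] = value;
>       }
>
>       std::cout << "mem_rss_bytes         " << usage << "\n"
>                 << "mem_anon_bytes        " << stat["total_rss"] << "\n"
>                 << "mem_file_bytes        " << stat["total_cache"] << "\n"
>                 << "mem_mapped_file_bytes " << stat["total_mapped_file"] << "\n";
>       return 0;
>     }
>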
>     I would include mem_mapped_anon_bytes but by definition that is exactly 
> the same as mem_anon_bytes.
>     
>     I can see renaming mem_rss_bytes to something like mem_usage_bytes, however that is going to break existing users of mesos, and breaking existing users is a bad idea.  Resident set size and usage are synonyms, so I don't see any pressing need for a rename there.
>     
>     
>
> 
> Ian Downes wrote:
>     Sorry, I meant we shouldn't expose them as total_*, but you're not doing that, so please disregard.
>     
>     With the patch's renaming we'd have
>     cgroups: mem_rss_bytes is not rss but the total
>     process: mem_rss_bytes is rss and not the total
>     
>     So running the same executor under different isolators would report 
> (potentially) very different numbers?
> 
> Eric Biederman wrote:
>     The defining question is: How much memory must be available for a job to 
> run successfully and at full performance?
>     
>     When this question is considered in theory, it is stated as: what is the minimum resident set size for a job to run successfully at full performance?
>     
>     The resident set size is how much of the job is in memory at one time, with the rest of the job being on disk.
>     
>     
>     With my three patches applied, the cgroup isolator's mem_rss_bytes is exactly the job's current rss.  The other numbers can help guess at the minimum rss, but what we measure is in fact how much of the job is in memory at the current instant.  That is the rss.
>     
>     
>     The rss of a process and the rss of a job are subtly different.  When measuring a single process, the rss is just a count of the resident pages in the address space of the process.  For the rss of a job, pages that are mapped into two different processes are counted only once, and pages that are used for other purposes, like I/O, are also counted.
>     
>     
>     The process isolator does not accurately compute the rss of a job.  It reports as mem_rss_bytes the simple sum of the rss of every process, which means it over-counts pages that are mapped into multiple processes and does not count cached files at all.  Depending on how different jobs use memory, the process isolator can be wildly inaccurate.
>     
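>     Roughly, that summation looks like the following sketch (illustrative only, not the actual Mesos code; the pid list would come from walking the executor's process tree):
>
>     // Illustrative sketch only -- not the actual process isolator code.
>     #include <sys/types.h>
>
>     #include <cstdint>
>     #include <fstream>
>     #include <sstream>
>     #include <string>
>     #include <vector>
>
>     // VmRSS (in bytes) of a single process, read from /proc/<pid>/status.
>     std::uint64_t processRss(pid_t pid) {
>       std::ifstream in("/proc/" + std::to_string(pid) + "/status");
>       std::string line;
>       while (std::getline(in, line)) {
>         if (line.compare(0, 6, "VmRSS:") == 0) {
>           std::istringstream fields(line.substr(6));
>           std::uint64_t kb = 0;
>           fields >> kb;  // VmRSS is reported in kB.
>           return kb * 1024;
>         }
>       }
>       return 0;
>     }
>
>     // The "job rss" is just the sum over the executor's processes: pages
>     // shared between processes are counted once per process, and page
>     // cache that none of them has mapped is not counted at all.
>     std::uint64_t jobRss(const std::vector<pid_t>& pids) {
>       std::uint64_t total = 0;
>       for (pid_t pid : pids) {
>         total += processRss(pid);
>       }
>       return total;
>     }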
>     
>     For a job consisting of a single process that does minimal file I/O the 
> process isolator will report a number that is close to how much memory is 
> being used to run the process.
>     
>     If you are comparing a job of one process that does minimal file I/O, the two numbers reported by the cgroup isolator and the process isolator should be comparable.
>     
>     Approximations to the minimal rss computed using the cgroup isolator should work for sizing memory anywhere.
>     
>     Approximations to the minimal rss computed from the process isolator will only approach being accurate if the jobs do minimal file I/O and have very few file pages (think text) shared between processes.
>     
>     
>     For sizing, the practical challenge with the cgroup isolator is that if a job is over-provisioned it may still use all of the memory up to its memory limit to cache files, because caching old file data is typically better than leaving the memory sitting empty.
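>
>     For example, something like the sketch below keeps its own rss tiny while the cgroup's usage climbs toward its limit (illustrative only; the file path is made up):
>
>     // Illustrative sketch only: the process's own rss stays small, but
>     // every page it reads stays in the page cache and is charged to the
>     // cgroup.
>     #include <fstream>
>     #include <vector>
>
>     int main() {
>       std::ifstream in("/path/to/some/large/file");  // made-up path
>       std::vector<char> buffer(1 << 20);             // 1 MiB read buffer
>       while (in.read(buffer.data(), buffer.size())) {
>         // Nothing is retained here; the data lives only in the kernel's
>         // page cache, yet the cgroup is charged for it until there is
>         // memory pressure or the hard limit forces reclaim.
>       }
>       return 0;
>     }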
>     
>     The process isolator can be so wildly inaccurate in the worst case that it is hard to think about.  In particular, the worst case for the process isolator looks something like:
>     
>     #include <stdlib.h>
>     #include <string.h>
>     #include <unistd.h>
>
>     int main(void) {
>       // Allocate 1 GB and touch every page so it is resident in our address space.
>       char* ptr = calloc(1, 1024 * 1024 * 1024);
>       memset(ptr, 1, 1024 * 1024 * 1024);
>       for (;;) {
>         fork();
>       }
>     }
>     
>     In which case there will be a large number of processes, all with a 1 GB rss, but each only using a few additional kilobytes.
>     
>     Does that help put these numbers into perspective?
>

Hey Eric, do you have advice on how the process isolator could be fixed to 
accurately report memory?


- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15015/#review27967
-----------------------------------------------------------


On Oct. 29, 2013, 6:53 a.m., Eric Biederman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15015/
> -----------------------------------------------------------
> 
> (Updated Oct. 29, 2013, 6:53 a.m.)
> 
> 
> Review request for mesos, Ben Mahler, Ian Downes, and Vinod Kone.
> 
> 
> Bugs: Mesos-758
>     https://issues.apache.org/jira/browse/Mesos-758
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This reports a basic breakdown of memory usage, so that jobs whose page
> cache expands to their configured hard limit have the information needed
> to size them correctly.
> 
> This does not include all fields from memory.stat, nor even all of the
> total_ fields, but rather the fields I have reviewed, seen a need for,
> and have a reasonable expectation will report useful information and
> work as designed.
> 
> Quite a few of the remaining fields appear useful only for kernel-level debugging.
> 
> In particular, the included fields are:
> memory.stat:total_cache         which reports the amount of file-backed memory in bytes
> memory.stat:total_rss           which reports the amount of anonymous memory in bytes
> memory.stat:total_mapped_file   which reports the amount of mapped-file memory in bytes
> 
> Reading memory.stat is not quite atomic, and neither is computing the
> values, as even in the simplest cases the values are sums of per-cpu
> counters of the various statistics.  However, for the most part the
> numbers are a reasonable snapshot of the state of the cgroup at a
> particular instant.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto fe1d82b 
>   src/slave/cgroups_isolator.cpp fc36f1f 
>   src/slave/monitor.cpp 9cb6256 
>   src/tests/monitor_tests.cpp 3d3f8af 
> 
> Diff: https://reviews.apache.org/r/15015/diff/
> 
> 
> Testing
> -------
> 
> make and make check pass.
> The code is a trivial extension of what already exists.
> 
> 
> Thanks,
> 
> Eric Biederman
> 
>
