> On Oct. 31, 2013, 10:18 p.m., Ian Downes wrote:
> > src/slave/cgroups_isolator.cpp, line 782
> > <https://reviews.apache.org/r/15015/diff/1/?file=372971#file372971line782>
> >
> > Why don't we correct the naming while we're at it:
> >
> > total_memory <- memory.usage_in_bytes, this is the primary memory statistic
> >
> > # and then the breakdown:
> > total_rss <- total_rss from memory.stat
> > total_cached <- total_cache from memory.stat
> > total_mapped_file <- total_mapped_file from memory.stat
> >
> > We probably don't need to include the total_ prefix?
> >
> > An executor's memory limit then matches with total_memory and, only if necessary, one can look at the breakdown.
>
> Eric Biederman wrote:
> I don't follow.
>
> We absolutely need the total_ prefix when reading the fields from the kernel. And it is the kernel's memory.stat file where the naming is deceptive; I can't fix that with a mesos patch.
>
> What I did do with a mesos patch is name the fields in ResourceStatistics and in the http interface in a non-deceptive way.
>
> We absolutely do not even want to consider passing the kernel names straight through, as those names are wrong. Especially the borked total_rss field name, as that is totally wrong.
>
> The current mapping implemented by this patch is:
>
> mem_rss_bytes         <- memory.usage_in_bytes
> mem_anon_bytes        <- memory.stat:total_rss
> mem_file_bytes        <- memory.stat:total_cache
> mem_mapped_file_bytes <- memory.stat:total_mapped_file
>
> I would include mem_mapped_anon_bytes, but by definition that is exactly the same as mem_anon_bytes.
>
> I can see renaming mem_rss_bytes to something like mem_usage_bytes, however that is going to break existing users of mesos, and breaking existing users is a bad idea. And resident set size and usage are synonyms, so I don't see any pressing need for a rename there.
>
> Ian Downes wrote:
> Sorry, I meant we shouldn't expose them as total_*, but you're not doing that, so please disregard.
>
> With the patch's renaming we'd have:
> cgroups: mem_rss_bytes is not rss but the total
> process: mem_rss_bytes is rss and not the total
>
> So running the same executor under different isolators would report (potentially) very different numbers?
>
> Eric Biederman wrote:
> The defining question is: how much memory must be available for a job to run successfully and at full performance?
>
> Considered in theory, the question becomes: what is the minimum resident set size for a job to run successfully at full performance? The resident set size is how much of the job is in memory at one time, with the rest of the job on disk.
>
> With my three patches applied, the cgroups mem_rss_bytes is exactly the job's current rss. The other numbers can help guess at the minimum rss, but what we measure is in fact how much of the job is in memory at the current instant. That is the rss.
>
> The rss of a process and the rss of a job are subtly different. When measuring a single process, the rss is just a count of the number of resident pages in the address space of the process. For the rss of a job, pages that are mapped into two different processes are counted only once, and pages that are used for other purposes, like file I/O, are also counted.
>
> The process isolator does not accurately compute the rss of a job. The process isolator reports as mem_rss_bytes the simple sum of the rss of every process.
> Which means the process isolator overcounts pages that are mapped into multiple processes and does not count cached files. Depending on how different jobs use memory, the process isolator can be wildly inaccurate.
>
> For a job consisting of a single process that does minimal file I/O, the process isolator will report a number that is close to how much memory is being used to run the process.
>
> If you are comparing a job of one process that does minimal file I/O, the two numbers reported by the cgroup isolator and the process isolator should be comparable.
>
> Approximations to the minimal rss computed using the cgroup isolator should work for sizing memory anywhere.
>
> Approximations to the minimal rss computed from the process isolator will only approach being accurate if the jobs do minimal file I/O and have very few file pages (think text) shared between processes.
>
> For sizing, the practical challenge with the cgroup isolator is that if a job is over-provisioned it may still use all of the memory up to its memory limit to cache files, because caching old file data is typically better than leaving the memory sitting empty.
>
> The process isolator can be so wildly inaccurate in the worst case that it is hard to think about. In particular, the worst case for the process isolator looks something like:
>
> // Allocate and touch 1 GB of memory so the pages are mapped into our address space.
> ptr = calloc(1, 1024*1024*1024);
> for (;;) {
>     fork();
> }
>
> In which case there will be a large number of processes, all with a 1GB rss, but each only using a few additional kilobytes per process.
>
> Does that help put these numbers into perspective?
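
For reference, below is a minimal, self-contained sketch of the field mapping described in the thread above: reading memory.usage_in_bytes and the total_* counters out of memory.stat. It assumes a cgroup-v1 memory hierarchy mounted at /sys/fs/cgroup/memory and uses a made-up executor cgroup path; the helper functions are illustrative and are not the actual cgroups_isolator.cpp code.

    // Sketch only: cgroup-v1 memory accounting for one (illustrative) cgroup.
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>

    // Read a single integer from a cgroup control file (e.g. memory.usage_in_bytes).
    uint64_t readValue(const std::string& path)
    {
      std::ifstream in(path);
      uint64_t value = 0;
      in >> value;
      return value;
    }

    // Read one named counter from memory.stat, a list of "<name> <value>" lines.
    uint64_t readStat(const std::string& path, const std::string& field)
    {
      std::ifstream in(path);
      std::string name;
      uint64_t value = 0;
      while (in >> name >> value) {
        if (name == field) {
          return value;
        }
      }
      return 0;
    }

    int main()
    {
      // Illustrative cgroup path, not something Mesos creates by this name.
      const std::string cgroup = "/sys/fs/cgroup/memory/mesos/executor";

      // The mapping from the patch:
      uint64_t mem_rss_bytes         = readValue(cgroup + "/memory.usage_in_bytes");
      uint64_t mem_anon_bytes        = readStat(cgroup + "/memory.stat", "total_rss");
      uint64_t mem_file_bytes        = readStat(cgroup + "/memory.stat", "total_cache");
      uint64_t mem_mapped_file_bytes = readStat(cgroup + "/memory.stat", "total_mapped_file");

      std::cout << "mem_rss_bytes         " << mem_rss_bytes << "\n"
                << "mem_anon_bytes        " << mem_anon_bytes << "\n"
                << "mem_file_bytes        " << mem_file_bytes << "\n"
                << "mem_mapped_file_bytes " << mem_mapped_file_bytes << std::endl;
      return 0;
    }

The total_* counters are the hierarchical versions of the plain cache/rss/mapped_file counters: they include the cgroup's descendants as well as the cgroup itself, which is why they are the ones read here.
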
Hey Eric, do you have advice on how the process isolator could be fixed to accurately report memory?

- Ben


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/15015/#review27967
-----------------------------------------------------------


On Oct. 29, 2013, 6:53 a.m., Eric Biederman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/15015/
> -----------------------------------------------------------
> 
> (Updated Oct. 29, 2013, 6:53 a.m.)
> 
> 
> Review request for mesos, Ben Mahler, Ian Downes, and Vinod Kone.
> 
> 
> Bugs: Mesos-758
>     https://issues.apache.org/jira/browse/Mesos-758
> 
> 
> Repository: mesos-git
> 
> 
> Description
> -------
> 
> This reports a basic breakdown of memory usage so that jobs whose page
> cache expands to their configured hard limit will have the information
> that allows them to size their jobs.
> 
> This does not include all fields from memory.stat, and not even all
> total_ fields from memory.stat, but rather the fields I have reviewed,
> seen a need for, and have a reasonable expectation will report useful
> information and work as designed. Quite a few of the remaining fields
> appear useful only for kernel-level debugging.
> 
> In particular, the fields are:
> memory.stat:total_cache, which reports the amount of file-backed
> memory in bytes
> memory.stat:total_rss, which reports the amount of anonymous memory
> in bytes
> memory.stat:total_mapped_file, which reports the amount of mapped-file
> memory in bytes
> 
> Reading memory.stat is not quite atomic, and neither is computing the
> values: even in the simplest cases the values are sums of per-CPU
> counters of the various statistics. For the most part, however, the
> numbers are a reasonable snapshot of the state of the cgroup at a
> particular instant.
> 
> 
> Diffs
> -----
> 
>   include/mesos/mesos.proto fe1d82b 
>   src/slave/cgroups_isolator.cpp fc36f1f 
>   src/slave/monitor.cpp 9cb6256 
>   src/tests/monitor_tests.cpp 3d3f8af 
> 
> Diff: https://reviews.apache.org/r/15015/diff/
> 
> 
> Testing
> -------
> 
> make and make check pass.
> The code is a trivial extension of what already exists.
> 
> 
> Thanks,
> 
> Eric Biederman
> 
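
To make the contrast with the cgroup numbers concrete, here is a minimal sketch of the kind of accounting the process isolator's approach amounts to: summing the resident-pages field from /proc/<pid>/statm across the job's processes. The pid list and function names are illustrative only and not the actual Mesos code. Shared pages are counted once per process that maps them, and page cache used on the job's behalf is never counted, which is exactly the inaccuracy discussed in the thread above.

    // Sketch only: per-process RSS summation in the style of the process isolator.
    #include <sys/types.h>
    #include <unistd.h>

    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    uint64_t sumProcessRss(const std::vector<pid_t>& pids)
    {
      const uint64_t pageSize = static_cast<uint64_t>(sysconf(_SC_PAGESIZE));
      uint64_t total = 0;

      for (pid_t pid : pids) {
        // The second field of /proc/<pid>/statm is the resident set size in pages.
        std::ifstream in("/proc/" + std::to_string(pid) + "/statm");
        uint64_t size = 0;
        uint64_t resident = 0;
        if (in >> size >> resident) {
          total += resident * pageSize;
        }
      }

      return total;
    }

    int main()
    {
      // Illustrative only: measure just this process. A real isolator would
      // walk every pid in the executor's process tree.
      std::cout << sumProcessRss({getpid()}) << " bytes" << std::endl;
      return 0;
    }

Running this against the fork worst case quoted above makes the failure mode visible: every child reports the full 1GB mapping as resident, so the sum grows with the number of processes even though almost no additional memory is in use.
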
