Greetings,

erahm recently wrote a nice blog post with measurements showing the overhead of enabling multiple content processes:
http://www.erahm.org/2016/02/11/memory-usage-of-firefox-with-e10s-enabled/

The overhead is high -- 8 content processes *doubles* our physical memory usage -- which limits the possibility of increasing the number of content processes beyond a small number.

Now I've done some follow-up measurements to find out what is causing the per-content-process overhead. I did this by measuring memory usage with four trivial web pages open, first with a single content process and then with four content processes, and then taking the diff of the two sets of memory reports. (about:memory's diff algorithm normalizes PIDs in memory reports as "NNN", so multiple content processes naturally get collapsed together, which in this case is exactly what we want.) I call this the "small processes" measurement. If we divide the memory usage increase by 3 (the increase in the number of content processes) we get a rough measure of the minimum per-content-process overhead.

I then did a similar thing but with four more complex web pages (Gmail, Google Docs, TreeHerder, Bugzilla). I call this the "large processes" measurement.

-----------------------------------------------------------------------------
LINUX (64-bit), small processes
-----------------------------------------------------------------------------

Some top-level numbers from the "small processes" diff are as follows.
> 68.54 MB (100.0%) -- explicit
> ├──33.54 MB (48.94%) ++ js-non-window
> │  ├──22.97 MB (33.52%) -- zones/zone(0xNNN)
> │  │  ├──18.54 MB (27.05%) ++ (92 tiny)
> │  │  ├───1.94 MB (02.84%) ── unused-gc-things [12]
> │  │  ├───1.71 MB (02.49%) ++ strings/string(<non-notable strings>)
> │  │  └───0.78 MB (01.14%) ++ object-groups
> │  ├───6.97 MB (10.17%) -- runtime
> │  │  ├──3.72 MB (05.42%) ── script-data [4]
> │  │  ├──1.34 MB (01.95%) -- gc
> │  │  │  ├──1.00 MB (01.46%) ── nursery-committed [4]
> │  │  │  └──0.34 MB (00.49%) ++ (3 tiny)
> │  │  ├──1.05 MB (01.54%) ── atoms-table [4]
> │  │  └──0.86 MB (01.26%) ++ (7 tiny)
> │  └───3.60 MB (05.25%) -- gc-heap
> │     ├──3.00 MB (04.38%) ── unused-chunks [4]
> │     └──0.60 MB (00.87%) ++ (2 tiny)
> ├──13.58 MB (19.82%) ── heap-unclassified
> ├──11.51 MB (16.79%) ++ heap-overhead
> │  ├───7.64 MB (11.15%) ── page-cache [4]
> │  ├───3.03 MB (04.42%) ── bin-unused [4]
> │  └───0.84 MB (01.22%) ── bookkeeping [4]
> ├───2.84 MB (04.14%) ── xpti-working-set [4]
> ├───2.05 MB (03.00%) ++ layout
> ├───1.33 MB (01.95%) ++ (10 tiny)
> ├───1.09 MB (01.58%) ── preferences [4]
> ├───1.02 MB (01.49%) ++ xpconnect
> ├───0.80 MB (01.17%) ++ atom-tables
> └───0.77 MB (01.13%) ++ xpcom
>
> 48.36 MB (100.0%) -- heap-committed
> ├──36.86 MB (76.21%) ── allocated [4]
> └──11.51 MB (23.79%) ── overhead [4]
>
> 33.54 MB (100.0%) -- js-main-runtime
> ├──17.76 MB (52.94%) ++ compartments
> ├───6.97 MB (20.78%) ── runtime [4]
> ├───5.22 MB (15.55%) ++ zones
> └───3.60 MB (10.73%) ++ gc-heap
>
> 261 (100.0%) -- js-main-runtime-compartments
> ├──255 (97.70%) ++ system
> └────6 (02.30%) ++ user
>
> 310.06 MB ── resident [4]
> 114.39 MB ── resident-unique [4]

The "[4]" annotations just indicate that these measurements are all repeated four times in the second case, due to the four content processes.

Among the internal measurements, "explicit" increases by 69 MiB, which indicates a 23 MiB overhead per content process.
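The per-process arithmetic used throughout this post can be sketched as follows (a toy helper for illustration, not part of about:memory; the figures come from the diff above):

```python
# Per-content-process overhead, estimated by diffing memory reports from a
# 1-process session and a 4-process session, then dividing by the increase
# in the number of content processes (3).
def per_process_overhead(diff_mib, extra_processes=3):
    return diff_mib / extra_processes

explicit_diff = 68.54          # MiB: "explicit" increase in the diff above
resident_unique_diff = 114.39  # MiB: "resident-unique" increase

print(round(per_process_overhead(explicit_diff)))         # -> 23
print(round(per_process_overhead(resident_unique_diff)))  # -> 38
```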
As for the OS measurements, "resident" is not a good metric here because it quadruple-counts any memory shared between the processes. "resident-unique" shouldn't suffer from that problem, and it suggests a 38 MiB overhead per content process.

The 15 MiB gap between these two surprised me. The only thing that would account for that difference is unshared (non-read-only) static data: vtables, lookup tables that contain pointers, etc. I've started digging and it actually seems plausible. Some of this data is in our own code, and some is in external libraries that we rely on. Some small improvements are possible, but there's an incredibly long tail, so it's unlikely to improve a lot. See bug 1254777 for more details.

Digging into the "explicit" numbers some more:

- The "js-non-window" memory (11 MiB per process) is all system JS code and data, mostly modules in resource://gre/modules/. We create about 85 JS system compartments for these. A fraction of this is per-compartment overhead, which might be avoidable by merging them into a single compartment (see bug 1186409). (B2G did something similar a long time ago and saw big improvements.) Even if we can fix that, it's still a lot of JS code. We can lazily import JSMs; I wonder if we are failing to do that as much as we could, i.e. are all these modules really needed at start-up? It would be great if we could instrument the module-loading code in some way that answers this question.

- "heap-unclassified" memory is 4.5 MiB per process. I've analyzed this with DMD and it is mostly GTK and glib memory that we can't measure in our memory reporters. I haven't investigated closely to see if any of it could be avoided.

- "heap-overhead" is 4 MiB per process. I've looked at this closely. The numbers tend to be noisy.

  - "page-cache" is pages that jemalloc holds onto for fast recycling. It is capped at 4 MiB per process, and we can reduce that cap with a jemalloc configuration change, though this may make allocation slightly slower.
- "bin-unused" is fragmentation in smaller allocations and very hard to reduce. - "bookkeeping" is jemalloc's internal data structures and very hard to reduce. - Then there's the not-so-long tail of things less than 1 MiB per process. Some of these may be shrinkable with effort, or made shareable between processes with effort. (E.g. I reduced xpti-working-set by 216 KiB per process in bug 1249174, and I've heard that making it shared was considered for B2G but never implemented.) It's getting into diminishing returns, though. ----------------------------------------------------------------------------- LINUX (64-bit), large processes ----------------------------------------------------------------------------- > 115.98 MB (100.0%) -- explicit > ├───66.80 MB (57.60%) -- js-non-window > │ ├──39.31 MB (33.90%) -- runtime > │ │ ├──32.69 MB (28.19%) -- gc > │ │ │ ├──32.00 MB (27.59%) ── nursery-committed [4] > │ │ │ └───0.69 MB (00.59%) ++ (3 tiny) > │ │ ├───4.01 MB (03.46%) ── script-data [4] > │ │ ├───1.80 MB (01.56%) ── atoms-table [4] > │ │ └───0.80 MB (00.69%) ++ (9 tiny) > │ ├──24.04 MB (20.73%) -- zones/zone(0xNNN) > │ │ ├──19.59 MB (16.90%) ++ (98 tiny) > │ │ ├───2.35 MB (02.03%) ++ strings > │ │ └───2.10 MB (01.81%) ── unused-gc-things [12] > │ └───3.45 MB (02.97%) -- gc-heap > │ ├──3.00 MB (02.59%) ── unused-chunks [4] > │ └──0.45 MB (00.38%) ++ (2 tiny) > ├───19.93 MB (17.19%) -- heap-overhead > │ ├──11.53 MB (09.94%) ── bin-unused [4] > │ ├───6.96 MB (06.00%) ── page-cache [4] > │ └───1.44 MB (01.24%) ── bookkeeping [4] > ├───15.44 MB (13.31%) ── heap-unclassified > ├────3.16 MB (02.73%) ++ window-objects > ├────4.40 MB (03.80%) ++ (12 tiny) > ├────2.84 MB (02.45%) ── xpti-working-set [4] > ├────2.24 MB (01.93%) ++ layout > └────1.17 MB (01.01%) ++ xpconnect > > 362.36 MB ── resident [4] > 157.92 MB ── resident-unique [4] The "explicit" overhead is now 39 MiB per process, and for "resident-unique" it's 53 MiB per process. 
The gap between the two is 14 MiB, similar to before, which is additional evidence that static data accounts for the gap.

Both of those overheads are about 16 MiB higher than in the "small processes" case. It's mostly JS, especially "nursery-committed" -- it looks like all four content processes have 8 MiB nurseries. I know that on B2G we allowed much smaller nurseries (256 KiB?), so shrinking the nursery as we increase the number of content processes might also be wise. Other than JS, "heap-overhead" is a bit higher, and most of the other buckets are relatively stable.

-----------------------------------------------------------------------------
WINDOWS (32-bit), small processes
-----------------------------------------------------------------------------

> 47.79 MB (100.0%) -- explicit
> ├──25.14 MB (52.60%) -- js-non-window
> │  ├──15.30 MB (32.02%) -- zones/zone(0xNNN)
> │  │  ├──12.36 MB (25.85%) ++ (94 tiny)
> │  │  ├───1.61 MB (03.37%) -- strings/string(<non-notable strings>)
> │  │  │  ├──1.22 MB (02.55%) -- gc-heap
> │  │  │  │  ├──1.22 MB (02.55%) ── latin1 [8]
> │  │  │  │  └──0.00 MB (00.00%) ── two-byte [4]
> │  │  │  └──0.39 MB (00.82%) ── malloc-heap/latin1 [8]
> │  │  └───1.34 MB (02.80%) ── unused-gc-things [12]
> │  ├──10.49 MB (21.96%) -- runtime
> │  │  ├───5.32 MB (11.13%) -- gc
> │  │  │  ├──5.00 MB (10.46%) ── nursery-committed [4]
> │  │  │  └──0.32 MB (00.67%) ++ (3 tiny)
> │  │  ├───3.36 MB (07.02%) ── script-data [4]
> │  │  ├───1.00 MB (02.10%) ── atoms-table [4]
> │  │  ├───0.52 MB (01.09%) ++ script-sources
> │  │  └───0.30 MB (00.62%) ++ (6 tiny)
> │  └──-0.66 MB (-1.37%) ++ gc-heap
> ├──11.47 MB (23.99%) -- heap-overhead
> │  ├───8.51 MB (17.80%) ── page-cache [4]
> │  ├───2.59 MB (05.42%) ── bin-unused [4]
> │  └───0.37 MB (00.77%) ── bookkeeping [4]
> ├───4.43 MB (09.26%) ── heap-unclassified
> ├───2.27 MB (04.76%) ── xpti-working-set [4]
> ├───1.29 MB (02.70%) ++ layout
> ├───0.81 MB (01.69%) ++ (10 tiny)
> ├───0.81 MB (01.69%) ── preferences [4]
> ├───0.56 MB (01.18%) ++ xpcom
> ├───0.53 MB (01.11%) ++ xpconnect
> └───0.49 MB (01.02%) ++ atom-tables
>
> 33.35 MB (100.0%) -- heap-committed
> ├──21.89 MB (65.62%) ── allocated [4]
> └──11.47 MB (34.38%) ── overhead [4]
>
> 25.14 MB (100.0%) -- js-main-runtime
> ├──11.57 MB (46.05%) ++ compartments
> ├──10.49 MB (41.74%) ── runtime [4]
> ├───3.73 MB (14.82%) ++ zones
> └──-0.66 MB (-2.61%) ++ gc-heap
>
> 264 (100.0%) -- js-main-runtime-compartments
> ├──258 (97.73%) ++ system
> └────6 (02.27%) ++ user
>
> 21.89 MB ── heap-allocated [4]
> 151.89 MB ── private [4]
> 222.57 MB ── resident [4]
> 119.76 MB ── resident-unique [4]

The numbers are lower here than for Linux because this is a 32-bit build and so pointers are smaller, but the same basic patterns apply. Differences of note:

- The difference between "explicit" and "resident-unique" per process is 25 MiB, as opposed to the 15 MiB we saw on Linux. I don't know why. Does Windows have some inherent per-process memory cost that is higher than Linux's?

- "private" and "resident-unique" are significantly different. Not sure what to make of that.

- "heap-unclassified" is a lot lower.

The "large processes" numbers for Windows don't show much of interest beyond what we've already seen.
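Using the same divide-by-3 arithmetic, the unexplained gap between "explicit" and "resident-unique" can be tabulated for the two platforms measured so far (a sketch; the rounding here lands within a MiB of the figures quoted in the text):

```python
# Per-process gap between "resident-unique" and "explicit" overhead in the
# small-processes case. The diff values are divided by 3, the increase in
# the number of content processes; the gap is attributed mostly to unshared
# static data plus any inherent per-process OS cost.
small_process_diffs_mib = {
    # platform: (explicit diff, resident-unique diff)
    "Linux 64-bit":   (68.54, 114.39),
    "Windows 32-bit": (47.79, 119.76),
}

for platform, (explicit, resident_unique) in small_process_diffs_mib.items():
    gap = (resident_unique - explicit) / 3
    print(f"{platform}: ~{gap:.0f} MiB/process unexplained")
```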
-----------------------------------------------------------------------------
MAC (64-bit), small processes
-----------------------------------------------------------------------------

> 64.31 MB (100.0%) -- explicit
> ├──33.40 MB (51.93%) -- js-non-window
> │  ├──23.00 MB (35.76%) -- zones/zone(0xNNN)
> │  │  ├──16.53 MB (25.70%) ++ (90 tiny)
> │  │  ├───1.97 MB (03.07%) ── unused-gc-things [12]
> │  │  ├───1.71 MB (02.66%) ++ strings/string(<non-notable strings>)
> │  │  ├───0.78 MB (01.22%) ++ object-groups
> │  │  ├───0.67 MB (01.05%) ++ compartment([System Principal], Addon-SDK (from: resource://gre/modules/commonjs/toolkit/loader.js:249))
> │  │  ├───0.67 MB (01.04%) ++ compartment(moz-nullprincipal:{NNNNNNNN-NNNN-NNNN-NNNN-NNNNNNNNNNNN}, XPConnect Compilation Compartment)
> │  │  └───0.66 MB (01.03%) ++ compartment([System Principal], resource://gre/modules/commonjs/toolkit/loader.js)
> │  ├───6.97 MB (10.84%) -- runtime
> │  │  ├──3.72 MB (05.78%) ── script-data [4]
> │  │  ├──1.34 MB (02.08%) ++ gc
> │  │  ├──1.05 MB (01.64%) ── atoms-table [4]
> │  │  └──0.86 MB (01.34%) ++ (7 tiny)
> │  └───3.43 MB (05.34%) ++ gc-heap
> ├──10.92 MB (16.98%) ── heap-unclassified
> ├──10.06 MB (15.65%) -- heap-overhead
> │  ├───6.38 MB (09.92%) ── page-cache [4]
> │  ├───2.89 MB (04.50%) ── bin-unused [4]
> │  └───0.79 MB (01.22%) ── bookkeeping [4]
> ├───2.84 MB (04.41%) ── xpti-working-set [4]
> ├───2.01 MB (03.12%) ++ layout
> ├───1.37 MB (02.14%) ++ (9 tiny)
> ├───1.12 MB (01.74%) ── preferences [4]
> ├───1.02 MB (01.59%) ++ xpconnect
> ├───0.81 MB (01.25%) ++ atom-tables
> └───0.77 MB (01.20%) ++ xpcom
>
> 44.27 MB (100.0%) -- heap-committed
> ├──34.21 MB (77.27%) ── allocated [4]
> └──10.06 MB (22.73%) ── overhead [4]
>
> 33.40 MB (100.0%) -- js-main-runtime
> ├──17.76 MB (53.16%) ++ compartments
> ├───6.97 MB (20.87%) ── runtime [4]
> ├───5.24 MB (15.69%) ++ zones
> └───3.43 MB (10.28%) ++ gc-heap
>
> 261 (100.0%) -- js-main-runtime-compartments
> ├──255 (97.70%) ++ system
> └────6 (02.30%) ++ user
>
> 282.06 MB ── resident [4]
> 147.94 MB ── resident-unique [4]

The difference between "explicit" and "resident-unique" per process is 27 MiB, as opposed to the 15 MiB we saw on Linux and the 25 MiB we saw on Windows. Again, I don't know why. Other than that, the numbers are quite similar to the Linux numbers.

-----------------------------------------------------------------------------
Conclusion
-----------------------------------------------------------------------------

The overhead per content process is significant. I can see scope for moderate improvements, but I'm having trouble seeing how big improvements can be made. Without big improvements, scaling the number of content processes beyond 4 (*maybe* 8) won't be possible.

- JS overhead is the biggest factor. We execute a lot of JS code just starting up each content process -- can that be reduced? We should also consider a smaller nursery size limit for content processes.

- Heap overhead is significant. Reducing the page-cache size could save a couple of MiB per process. Improvements beyond that are hard. Turning on jemalloc4 *might* help a bit, but I wouldn't bank on it, and there are other complications with that.

- Static data is a big chunk. It's hard to make much of a dent there because it has a *very* long tail.

- The remaining buckets are a lot smaller.

I'm happy to give copies of the raw data files to anyone who wants to look at them in more detail.

Nick

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform