Greetings,

erahm recently wrote a nice blog post with measurements showing the overhead
of enabling multiple content processes:

http://www.erahm.org/2016/02/11/memory-usage-of-firefox-with-e10s-enabled/

The overhead is high -- 8 content processes *doubles* our physical memory
usage -- which limits the possibility of increasing the number of content
processes beyond a small number. Now I've done some follow-up
measurements to find out what is causing the per-content-process overhead.

I did this by measuring memory usage with four trivial web pages open, first
with a single content process, then with four content processes, and then
diffing the content process measurements from the two runs. (about:memory's
diff algorithm normalizes PIDs in memory reports as "NNN" so multiple content
processes naturally get collapsed together, which in this case is exactly
what we want.) I call this the "small processes" measurement.
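
To illustrate why that normalization collapses the content processes, here is
a minimal Python sketch. It is not about:memory's actual code: the regexes,
the normalize() and collapse() helpers, and the sample report paths are all
made up for illustration.

```python
import re
from collections import defaultdict


def normalize(label):
    """Replace PIDs and hex addresses with "NNN" (illustrative regexes only)."""
    label = re.sub(r"pid \d+", "pid NNN", label)
    label = re.sub(r"0x[0-9a-fA-F]+", "0xNNN", label)
    return label


def collapse(reports):
    """Sum the sizes of all reports whose normalized labels are identical."""
    totals = defaultdict(float)
    for label, mib in reports:
        totals[normalize(label)] += mib
    return dict(totals)


# Hypothetical reports from two content processes.
reports = [
    ("content (pid 4101)/explicit/js-non-window/zones/zone(0x7f3a10)", 5.5),
    ("content (pid 4102)/explicit/js-non-window/zones/zone(0x7f9b20)", 6.0),
]
collapsed = collapse(reports)
print(collapsed)
```

Once the PIDs and addresses are gone, both reports share one label, so their
sizes are summed into a single entry.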

If we divide the memory usage increase by 3 (the increase in the number of
content processes) we get a rough measure of the minimum per-content-process
overhead.
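
As a small sketch of that arithmetic in Python (the function name and its
default arguments are mine, not part of any tool), using the Linux "explicit"
diff of 68.54 MiB reported further down:

```python
def per_process_overhead(diff_mib, procs_before=1, procs_after=4):
    """Divide a memory diff by the number of content processes that were added."""
    added = procs_after - procs_before
    if added <= 0:
        raise ValueError("procs_after must be greater than procs_before")
    return diff_mib / added


# "explicit" diff for the Linux "small processes" measurement.
explicit_small = per_process_overhead(68.54)
print(f"explicit: {explicit_small:.2f} MiB per content process")
```

This is only a lower bound on the real overhead, since the four pages are
trivial.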

I then did a similar thing but with four more complex web pages (Gmail,
Google Docs, TreeHerder, Bugzilla). I call this the "large processes"
measurement.

-----------------------------------------------------------------------------
LINUX (64-bit), small processes
-----------------------------------------------------------------------------

Some top-level numbers from the "small processes" diff are as follows.

> 68.54 MB (100.0%) -- explicit
> ├──33.54 MB (48.94%) -- js-non-window
> │  ├──22.97 MB (33.52%) -- zones/zone(0xNNN)
> │  │  ├──18.54 MB (27.05%) ++ (92 tiny)
> │  │  ├───1.94 MB (02.84%) ── unused-gc-things [12]
> │  │  ├───1.71 MB (02.49%) ++ strings/string(<non-notable strings>)
> │  │  └───0.78 MB (01.14%) ++ object-groups
> │  ├───6.97 MB (10.17%) -- runtime
> │  │   ├──3.72 MB (05.42%) ── script-data [4]
> │  │   ├──1.34 MB (01.95%) -- gc
> │  │   │  ├──1.00 MB (01.46%) ── nursery-committed [4]
> │  │   │  └──0.34 MB (00.49%) ++ (3 tiny)
> │  │   ├──1.05 MB (01.54%) ── atoms-table [4]
> │  │   └──0.86 MB (01.26%) ++ (7 tiny)
> │  └───3.60 MB (05.25%) -- gc-heap
> │      ├──3.00 MB (04.38%) ── unused-chunks [4]
> │      └──0.60 MB (00.87%) ++ (2 tiny)
> ├──13.58 MB (19.82%) ── heap-unclassified
> ├──11.51 MB (16.79%) -- heap-overhead
> │  ├───7.64 MB (11.15%) ── page-cache [4]
> │  ├───3.03 MB (04.42%) ── bin-unused [4]
> │  └───0.84 MB (01.22%) ── bookkeeping [4]
> ├───2.84 MB (04.14%) ── xpti-working-set [4]
> ├───2.05 MB (03.00%) ++ layout
> ├───1.33 MB (01.95%) ++ (10 tiny)
> ├───1.09 MB (01.58%) ── preferences [4]
> ├───1.02 MB (01.49%) ++ xpconnect
> ├───0.80 MB (01.17%) ++ atom-tables
> └───0.77 MB (01.13%) ++ xpcom
>
> 48.36 MB (100.0%) -- heap-committed
> ├──36.86 MB (76.21%) ── allocated [4]
> └──11.51 MB (23.79%) ── overhead [4]
>
> 33.54 MB (100.0%) -- js-main-runtime
> ├──17.76 MB (52.94%) ++ compartments
> ├───6.97 MB (20.78%) ── runtime [4]
> ├───5.22 MB (15.55%) ++ zones
> └───3.60 MB (10.73%) ++ gc-heap
>
> 261 (100.0%) -- js-main-runtime-compartments
> ├──255 (97.70%) ++ system
> └────6 (02.30%) ++ user
>
>   310.06 MB ── resident [4]
>   114.39 MB ── resident-unique [4]

The "[4]" annotations just indicate that these measurements are all repeated
four times in the second case, due to the four content processes.

Among the internal measurements, "explicit" increases by 69 MiB, which
indicates a 23 MiB overhead per content process.

As for the OS measurements, "resident" is not a good metric here because it
will quadruple-count any memory shared between processes. "resident-unique"
shouldn't suffer from that problem, and it suggests a 38 MiB overhead.

The 15 MiB gap between these two surprised me. The only thing that would
account for that difference is unshared (non-read-only) static data,
including vtables, lookup tables that contain pointers, etc. I've started
digging and it actually seems plausible. Some of this data will be in our own
code, and some is in external libraries that we rely on. Some small
improvements are possible but there's an incredibly long tail and so it's
unlikely to improve a lot. See bug 1254777 for more details.
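
For reference, the gap falls out of the Linux "small processes" totals like
this. It is a rough Python sketch; the variable names are mine and the inputs
are the "explicit" and "resident-unique" figures from the tree above.

```python
ADDED_PROCESSES = 3  # four content processes minus the original one

explicit_diff_mib = 68.54          # internal measurement
resident_unique_diff_mib = 114.39  # OS measurement

explicit_pp = explicit_diff_mib / ADDED_PROCESSES
unique_pp = resident_unique_diff_mib / ADDED_PROCESSES
gap_pp = unique_pp - explicit_pp  # memory the reporters do not account for

print(f"explicit:        {explicit_pp:.1f} MiB per process")
print(f"resident-unique: {unique_pp:.1f} MiB per process")
print(f"gap:             {gap_pp:.1f} MiB per process")
```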

Digging into the "explicit" numbers some more:

- The "js-non-window" memory (11 MiB per process) is all system JS code and
  data, mostly modules in resource://gre/modules/. We create about 85 JS
  system compartments for these. A fraction of this is per-compartment
  overhead, which might be avoidable by merging them into a single
  compartment (see bug 1186409). (B2G did something similar a long time ago
  and saw big improvements.)

  Even if we can fix that, it's just a lot of JS code. We can lazily import
  JSMs; I wonder if we are failing to do that as much as we could, i.e. are
  all these modules really needed at start-up? It would be great if we
  could instrument module-loading code in some way that answers this
  question.

- "heap-unclassified" memory is 4.5 MiB per process. I've analyzed this with
  DMD and this is mostly GTK and glib memory that we can't measure in our
  memory reporters. I haven't investigated closely to see if any of this
  could be avoided.

- "heap-overhead" is 4 MiB per process. I've looked at this closely.
  The numbers tend to be noisy.

  - "page-cache" is pages that jemalloc holds onto for fast recycling. It is
    capped at 4 MiB per process and we can reduce that with a jemalloc
    configuration, though this may make allocation slightly slower.

  - "bin-unused" is fragmentation in smaller allocations and very hard to
    reduce.

  - "bookkeeping" is jemalloc's internal data structures and very hard to
    reduce.

- Then there's the not-so-long tail of things less than 1 MiB per process.
  Some of these may be shrinkable with effort, or made shareable between
  processes with effort. (E.g. I reduced xpti-working-set by 216 KiB per
  process in bug 1249174, and I've heard that making it shared was
  considered for B2G but never implemented.) It's getting into diminishing
  returns, though.
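
The split between the big buckets and the tail can be sketched in Python as
follows. The dictionary just restates the top-level "explicit" entries from
the Linux "small processes" diff; the 1 MiB threshold is the one used above,
and the variable names are mine.

```python
ADDED_PROCESSES = 3

# Top-level "explicit" buckets from the Linux "small processes" diff, in MiB.
TOP_LEVEL_MIB = {
    "js-non-window": 33.54,
    "heap-unclassified": 13.58,
    "heap-overhead": 11.51,
    "xpti-working-set": 2.84,
    "layout": 2.05,
    "(10 tiny)": 1.33,
    "preferences": 1.09,
    "xpconnect": 1.02,
    "atom-tables": 0.80,
    "xpcom": 0.77,
}

per_process = {name: mib / ADDED_PROCESSES for name, mib in TOP_LEVEL_MIB.items()}
major = {name: mib for name, mib in per_process.items() if mib >= 1.0}
tail = {name: mib for name, mib in per_process.items() if mib < 1.0}
tail_total = sum(tail.values())

for name, mib in major.items():
    print(f"{name:18s} {mib:5.2f} MiB per process")
print(f"{'tail (' + str(len(tail)) + ' buckets)':18s} {tail_total:5.2f} MiB per process")
```

Even if every tail bucket were eliminated entirely, the saving would be a
little over 3 MiB per process, which is why the tail is less interesting than
the first three buckets.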

-----------------------------------------------------------------------------
LINUX (64-bit), large processes
-----------------------------------------------------------------------------

> 115.98 MB (100.0%) -- explicit
> ├───66.80 MB (57.60%) -- js-non-window
> │   ├──39.31 MB (33.90%) -- runtime
> │   │  ├──32.69 MB (28.19%) -- gc
> │   │  │  ├──32.00 MB (27.59%) ── nursery-committed [4]
> │   │  │  └───0.69 MB (00.59%) ++ (3 tiny)
> │   │  ├───4.01 MB (03.46%) ── script-data [4]
> │   │  ├───1.80 MB (01.56%) ── atoms-table [4]
> │   │  └───0.80 MB (00.69%) ++ (9 tiny)
> │   ├──24.04 MB (20.73%) -- zones/zone(0xNNN)
> │   │  ├──19.59 MB (16.90%) ++ (98 tiny)
> │   │  ├───2.35 MB (02.03%) ++ strings
> │   │  └───2.10 MB (01.81%) ── unused-gc-things [12]
> │   └───3.45 MB (02.97%) -- gc-heap
> │       ├──3.00 MB (02.59%) ── unused-chunks [4]
> │       └──0.45 MB (00.38%) ++ (2 tiny)
> ├───19.93 MB (17.19%) -- heap-overhead
> │   ├──11.53 MB (09.94%) ── bin-unused [4]
> │   ├───6.96 MB (06.00%) ── page-cache [4]
> │   └───1.44 MB (01.24%) ── bookkeeping [4]
> ├───15.44 MB (13.31%) ── heap-unclassified
> ├────3.16 MB (02.73%) ++ window-objects
> ├────4.40 MB (03.80%) ++ (12 tiny)
> ├────2.84 MB (02.45%) ── xpti-working-set [4]
> ├────2.24 MB (01.93%) ++ layout
> └────1.17 MB (01.01%) ++ xpconnect
>
>   362.36 MB ── resident [4]
>   157.92 MB ── resident-unique [4]

The "explicit" overhead is now 39 MiB per process, and for "resident-unique"
it's 53 MiB per process. The gap between the two is 14 MiB, similar to
before, so that's additional evidence that static data accounts for the gap.

Both of those overheads are about 16 MiB higher than in the "small
processes" case. It's mostly JS, esp. "nursery-committed" -- it looks like
all four content processes have 8 MiB nurseries. I know for B2G we allow much
smaller nurseries (256 KiB?) so maybe shrinking it down as we increase
content processes would also be wise.
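
To get a feel for what a smaller nursery would buy, here is a rough Python
estimate. It assumes every content process commits its full nursery, and it
uses the 256 KiB figure only because that is my uncertain recollection of
B2G's setting; the helper name is made up.

```python
OBSERVED_NURSERY_KIB = 8 * 1024  # 8 MiB per content process, as seen above
B2G_GUESS_KIB = 256              # uncertain recollection, not a verified value


def nursery_total_mib(num_processes, nursery_kib):
    """Total committed nursery memory if every process commits a full nursery."""
    return num_processes * nursery_kib / 1024


for n in (4, 8):
    observed = nursery_total_mib(n, OBSERVED_NURSERY_KIB)
    small = nursery_total_mib(n, B2G_GUESS_KIB)
    print(f"{n} processes: {observed:.0f} MiB at 8 MiB each, "
          f"{small:.0f} MiB at 256 KiB each")
```

This ignores the performance cost of a smaller nursery (more frequent minor
GCs), which would need to be measured separately.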

Other than JS, "heap-overhead" is a bit higher, and most of the other
buckets are relatively stable.

-----------------------------------------------------------------------------
WINDOWS (32-bit), small processes
-----------------------------------------------------------------------------

> 47.79 MB (100.0%) -- explicit
> ├──25.14 MB (52.60%) -- js-non-window
> │  ├──15.30 MB (32.02%) -- zones/zone(0xNNN)
> │  │  ├──12.36 MB (25.85%) ++ (94 tiny)
> │  │  ├───1.61 MB (03.37%) -- strings/string(<non-notable strings>)
> │  │  │   ├──1.22 MB (02.55%) -- gc-heap
> │  │  │   │  ├──1.22 MB (02.55%) ── latin1 [8]
> │  │  │   │  └──0.00 MB (00.00%) ── two-byte [4]
> │  │  │   └──0.39 MB (00.82%) ── malloc-heap/latin1 [8]
> │  │  └───1.34 MB (02.80%) ── unused-gc-things [12]
> │  ├──10.49 MB (21.96%) -- runtime
> │  │  ├───5.32 MB (11.13%) -- gc
> │  │  │   ├──5.00 MB (10.46%) ── nursery-committed [4]
> │  │  │   └──0.32 MB (00.67%) ++ (3 tiny)
> │  │  ├───3.36 MB (07.02%) ── script-data [4]
> │  │  ├───1.00 MB (02.10%) ── atoms-table [4]
> │  │  ├───0.52 MB (01.09%) ++ script-sources
> │  │  └───0.30 MB (00.62%) ++ (6 tiny)
> │  └──-0.66 MB (-1.37%) ++ gc-heap
> ├──11.47 MB (23.99%) -- heap-overhead
> │  ├───8.51 MB (17.80%) ── page-cache [4]
> │  ├───2.59 MB (05.42%) ── bin-unused [4]
> │  └───0.37 MB (00.77%) ── bookkeeping [4]
> ├───4.43 MB (09.26%) ── heap-unclassified
> ├───2.27 MB (04.76%) ── xpti-working-set [4]
> ├───1.29 MB (02.70%) ++ layout
> ├───0.81 MB (01.69%) ++ (10 tiny)
> ├───0.81 MB (01.69%) ── preferences [4]
> ├───0.56 MB (01.18%) ++ xpcom
> ├───0.53 MB (01.11%) ++ xpconnect
> └───0.49 MB (01.02%) ++ atom-tables
>
> 33.35 MB (100.0%) -- heap-committed
> ├──21.89 MB (65.62%) ── allocated [4]
> └──11.47 MB (34.38%) ── overhead [4]
>
> 25.14 MB (100.0%) -- js-main-runtime
> ├──11.57 MB (46.05%) ++ compartments
> ├──10.49 MB (41.74%) ── runtime [4]
> ├───3.73 MB (14.82%) ++ zones
> └──-0.66 MB (-2.61%) ++ gc-heap
>
> 264 (100.0%) -- js-main-runtime-compartments
> ├──258 (97.73%) ++ system
> └────6 (02.27%) ++ user
>
>    21.89 MB ── heap-allocated [4]
>   151.89 MB ── private [4]
>   222.57 MB ── resident [4]
>   119.76 MB ── resident-unique [4]

The numbers are lower here than for Linux because it's 32-bit and so
pointers are smaller, but the same basic patterns apply. Differences of note:

- The difference between "explicit" and "resident-unique" per process is 25
  MiB, as opposed to the 15 MiB we saw on Linux. I don't know why. Does
  Windows have some inherent per-process memory cost higher than Linux?

- "private" and "resident-unique" are significantly different. Not sure what
  to make of that.

- "heap-unclassified" is a lot lower.

The "large processes" numbers for Windows don't show much interesting beyond
what we've already seen.

-----------------------------------------------------------------------------
MAC (64-bit), small processes
-----------------------------------------------------------------------------

> 64.31 MB (100.0%) -- explicit
> ├──33.40 MB (51.93%) -- js-non-window
> │  ├──23.00 MB (35.76%) -- zones/zone(0xNNN)
> │  │  ├──16.53 MB (25.70%) ++ (90 tiny)
> │  │  ├───1.97 MB (03.07%) ── unused-gc-things [12]
> │  │  ├───1.71 MB (02.66%) ++ strings/string(<non-notable strings>)
> │  │  ├───0.78 MB (01.22%) ++ object-groups
> │  │  ├───0.67 MB (01.05%) ++ compartment([System Principal], Addon-SDK (from: resource://gre/modules/commonjs/toolkit/loader.js:249))
> │  │  ├───0.67 MB (01.04%) ++ compartment(moz-nullprincipal:{NNNNNNNN-NNNN-NNNN-NNNN-NNNNNNNNNNNN}, XPConnect Compilation Compartment)
> │  │  └───0.66 MB (01.03%) ++ compartment([System Principal], resource://gre/modules/commonjs/toolkit/loader.js)
> │  ├───6.97 MB (10.84%) -- runtime
> │  │   ├──3.72 MB (05.78%) ── script-data [4]
> │  │   ├──1.34 MB (02.08%) ++ gc
> │  │   ├──1.05 MB (01.64%) ── atoms-table [4]
> │  │   └──0.86 MB (01.34%) ++ (7 tiny)
> │  └───3.43 MB (05.34%) ++ gc-heap
> ├──10.92 MB (16.98%) ── heap-unclassified
> ├──10.06 MB (15.65%) -- heap-overhead
> │  ├───6.38 MB (09.92%) ── page-cache [4]
> │  ├───2.89 MB (04.50%) ── bin-unused [4]
> │  └───0.79 MB (01.22%) ── bookkeeping [4]
> ├───2.84 MB (04.41%) ── xpti-working-set [4]
> ├───2.01 MB (03.12%) ++ layout
> ├───1.37 MB (02.14%) ++ (9 tiny)
> ├───1.12 MB (01.74%) ── preferences [4]
> ├───1.02 MB (01.59%) ++ xpconnect
> ├───0.81 MB (01.25%) ++ atom-tables
> └───0.77 MB (01.20%) ++ xpcom
>
> 44.27 MB (100.0%) -- heap-committed
> ├──34.21 MB (77.27%) ── allocated [4]
> └──10.06 MB (22.73%) ── overhead [4]
>
> 33.40 MB (100.0%) -- js-main-runtime
> ├──17.76 MB (53.16%) ++ compartments
> ├───6.97 MB (20.87%) ── runtime [4]
> ├───5.24 MB (15.69%) ++ zones
> └───3.43 MB (10.28%) ++ gc-heap
>
> 261 (100.0%) -- js-main-runtime-compartments
> ├──255 (97.70%) ++ system
> └────6 (02.30%) ++ user
>
>   282.06 MB ── resident [4]
>   147.94 MB ── resident-unique [4]

The difference between "explicit" and "resident-unique" per process is 27
MiB, as opposed to the 15 MiB we saw on Linux and 25 MiB we saw on Windows.
Again, I don't know why.

Other than that, the numbers are quite similar to the Linux numbers.

-----------------------------------------------------------------------------
Conclusion
-----------------------------------------------------------------------------

The overhead per content process is significant. I can see scope for
moderate improvements, but I'm having trouble seeing how big improvements can
be made. Without big improvements, scaling the number of content processes
beyond 4 (*maybe* 8) won't be possible.

- JS overhead is the biggest factor. We execute a lot of JS code just
  starting up for each content process -- can that be reduced? We should also
  consider a smaller nursery size limit for content processes.

- Heap overhead is significant. Reducing the page-cache size could save a
  couple of MiBs. Improvements beyond that are hard. Turning on jemalloc4
  *might* help a bit, but I wouldn't bank on it, and there are other
  complications with that.

- Static data is a big chunk. It's hard to make much of a dent there because
  it has a *very* long tail.

- The remaining buckets are a lot smaller.
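
To put the scaling concern in rough numbers, here is a back-of-the-envelope
Python projection. It assumes memory grows linearly with the number of content
processes, which is a simplification, and it uses the per-process
"resident-unique" overheads measured on 64-bit Linux above (38 MiB for small
processes, 53 MiB for large ones). The function name is mine.

```python
# Per-process "resident-unique" overheads on 64-bit Linux, in MiB.
OVERHEAD_MIB = {"small": 38, "large": 53}


def extra_memory_mib(num_content_processes, per_process_mib):
    """Extra memory relative to one content process, assuming linear growth."""
    if num_content_processes < 1:
        raise ValueError("need at least one content process")
    return (num_content_processes - 1) * per_process_mib


for n in (2, 4, 8, 16):
    small = extra_memory_mib(n, OVERHEAD_MIB["small"])
    large = extra_memory_mib(n, OVERHEAD_MIB["large"])
    print(f"{n:2d} content processes: +{small} MiB (small) to +{large} MiB (large)")
```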

I'm happy to give copies of the raw data files to anyone who wants to look
at them in more detail.

Nick