Hello #jsapi.
I wanted to report on some of the JS-centric results of the Google
Suites work week in Taipei, and hopefully get more eyes on these problems.
First, some background: Sean and I met up with about two dozen other
people involved in the Google Suites performance improvement effort. We
had the Hasal people from Taipei, a bunch of project manager type folks,
and a scattering of developers from across Gecko. The GSuites effort is
considered extremely high priority, for good reasons that I won't go
into here. In Taipei, we looked at lots of profiles, worked on internal
tools, argued over what to test and how to measure, and ate stinky tofu.
High level findings:
- we have a cluster of test cases that run around 20% slower than
Chrome, with some quite a bit worse than that (think 40-60%, with
outliers at 2x)
- most of the existing test cases appear to be JS bound
- some other areas showed up here and there, eg layout and GC
- most of our current test cases boil down to document load times
- we don't have much visibility into what actual users are hitting
- we still don't have a great way to say exactly what parts of an
application are slower in Firefox than in Chrome
- results derived only from the Gecko Profiler / Cleopatra are not
always trustworthy or useful
- the more profilers, the better; we seem to find slightly different
things in each
Details:
[most tests measure load time]
We have test cases that do things like load up a Google Slides document,
go to the end, and add a slide. Almost all of the time (4-6sec on
typical machines) is in loading the document. This is true for the
majority of the test cases we are looking at. We are not yet splitting
out the time for each operation (to the extent that that is even
possible; GC, for example, is going to leak into later operations, and
cached results "leak" time into earlier operations because the initial
operation will be expensive even though the amortized cost may be low).
This is problematic because we expect that actual users' complaints are
probably more about interacting with these apps, not the initial load.
There are exceptions. In bug 1269695, the Hasal test is heavily
dependent on document load, but when I profile it I do not time the
initial load. Once it is loaded and the spinner has stopped, I time how
long it takes to Ctrl-End to the end of the document. Bug 1330539 is
looking at I-cache misses when scrolling through a document. Bug 1330252
is from piling up key events when scrolling through slides via the keyboard.
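As an aside, here is a minimal sketch (in page JS, runnable from the
devtools console) of how one could bracket a single operation with the
User Timing API instead of measuring the whole load. The spinner
selector and the synthesized key event are assumptions on my part; the
real Hasal harness drives the browser from the outside.

    // Time one operation, from triggering it until the app reports it is done.
    function timeOperation(label, runOperation, isDone) {
      return new Promise(resolve => {
        performance.mark(label + "-start");
        runOperation();
        // Poll until the app says it is done (e.g. its busy spinner goes away).
        const poll = setInterval(() => {
          if (isDone()) {
            clearInterval(poll);
            performance.mark(label + "-end");
            performance.measure(label, label + "-start", label + "-end");
            resolve(performance.getEntriesByName(label, "measure")[0].duration);
          }
        }, 16);
      });
    }

    // Hypothetical usage: time Ctrl-End separately from the initial load.
    timeOperation(
      "ctrl-end",
      () => document.body.dispatchEvent(
        new KeyboardEvent("keydown", { key: "End", ctrlKey: true, bubbles: true })),
      () => !document.querySelector(".docs-busy-indicator")  // assumed spinner class
    ).then(ms => console.log("Ctrl-End took " + ms + "ms"));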
[actual user problems]
Our Hasal test cases indeed show slowdowns relative to Chrome, but we
don't know whether the slowdowns for those test cases are related to the
actual issues people are running into and complaining about. We are
sending out surveys to try to figure this out, and adding additional
telemetry to infer it from the field. The most promising pointers I've
come across are multiple users sharing a document, and gradual perf
degradation over time. Neither scenario is in our automated tests.
[what is slower in Fx vs Chrome]
In general with these pages (apps), we only know that it takes some
number like 38% longer in Firefox than in Chrome. We still do not have a
good way of breaking that down into eg "JS runs for 750ms more, layout
is a little faster, GC is about the same". We would like to get to that
level; we have a way to categorize time, and so does Chrome, but both
have issues and are not directly comparable. Going further, we would
like to be able to say that a specific script took X ms longer in Fx
than in Chrome, but we don't have tooling for that either. We discussed
using a proxy to instrument the JS for this purpose, and I may work on
that. We can also look at other aggregate metrics for the whole load, eg
cycles per instruction or cache miss rates for the whole test, but
that'll never be more than a clue as to what specifically to work on.
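To sketch the proxy idea: the proxy would rewrite each script response
so that its top-level evaluation is bracketed with User Timing marks,
which both Firefox's and Chrome's devtools will display. This is purely
illustrative (no such tooling exists yet), and it only captures the
initial evaluation of each script, not callbacks that run later.

    // What the rewriting proxy could substitute for each script body.
    function wrapScript(url, source) {
      return `
        performance.mark(${JSON.stringify(url + ":start")});
        try {
          ${source}
        } finally {
          performance.mark(${JSON.stringify(url + ":end")});
          performance.measure(${JSON.stringify(url)},
                              ${JSON.stringify(url + ":start")},
                              ${JSON.stringify(url + ":end")});
        }
      `;
    }

Note that wrapping a script body in a try block can perturb top-level
declaration semantics, so a real version would have to be more careful
than this.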
[Unreliability/features of profiler]
We identified a number of ways in which the Gecko Profiler can mislead
or omit useful information:
- symbolication is unreliable and racy
- symbolication is buggy (meta bug 1307215 for this and the above)
- we cannot walk the native stacks of many of the builds in users' hands
- the sampling overhead is dependent on what is being sampled, which
biases the results (eg bug 1332489, bug 1330532)
- similarly, some activities during profiling take way too long (bug
1330576 and dependencies)
- bailouts seem to be annotated with script names that do not match
script names in the samples (bug 1329921)
- the profiler associates some things, like scripts, using non-unique
keys that conflate multiple things (eg bug 1329924)
Also, native profilers have various features that are missing from
Cleopatra/the Gecko Profiler, and using them has revealed differing results. (And note
that I'm not much of an expert here; I don't know enough about the
various profilers on the various platforms, nor exactly what turned out
to be useful for the things we looked at.)
- perf counter data, eg cycles per retired instruction
- data for the kernel or other processes
- butterfly view, a la Zoom on linux (or gprof, for that matter --
it's where you see all immediate callers and all immediate callees of a
given function, rather than a tree view based on one or the other)
- charge-to-caller (sometimes profiles get obscured by having little
pieces of time whittled away into leaf functions like memset or
whatever. It is handy to be able to merge this time into the caller. At
least, I *think* this is what I've heard a few people request.)
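For the charge-to-caller case, if I've understood the request correctly,
it amounts to a post-processing pass over sampled stacks, roughly like
this (the sample format here is made up; real profiles would need
converting first):

    // Fold self time for "boring" leaf frames into the nearest interesting caller.
    const MERGE_INTO_CALLER = new Set(["memset", "memcpy", "malloc", "free"]);

    function selfTimeByFrame(samples) {
      const selfTime = new Map();
      for (const { stack, weight } of samples) {  // stack is outermost..leaf
        // Walk inward from the leaf until we reach a frame we want to charge.
        let i = stack.length - 1;
        while (i > 0 && MERGE_INTO_CALLER.has(stack[i])) {
          i--;
        }
        const frame = stack[i];
        selfTime.set(frame, (selfTime.get(frame) || 0) + weight);
      }
      return selfTime;
    }

    // e.g. a 1ms sample with stack ["main", "AppendElements", "memcpy"] gets
    // charged to AppendElements rather than memcpy.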
The other profilers people used, or that I've heard recent mention of
people using, include:
- perf
- Zoom (perf frontend)
- Instruments
- VTune
- Windows Concurrency Visualizer (I don't think anyone has used this
seriously yet)
Oh, and for looking into Chrome performance (to compare with Firefox), I
am a little suspicious of their categorizations of time because you only
get those with the Timeline tool, which has an overhead of something
like 5-7x on the test case I was looking at. (So a 6 second document
load took longer than 30 seconds.) *Maybe* they're able to adjust for
the tracing overhead.
[JS performance]
The general impression is that we are spending more time executing JS
code than Chrome is. For the most part, this is not determined directly,
but rather we look at tests where we are substantially slower than
Chrome, then examine a profile to see where most of our time is going.
Sometimes I'll look deeper at the percentage of time broken down by
various categories (script execution, GC, layout, etc.), to compare the
percentage of our time spent running scripts vs Chrome's, but see above
why I'm not willing to fully trust Chrome's percentage breakdowns. I
have also used our and Chrome's devtools to narrow in on specific
executions of scripts (just by eyeballing the basic pattern and
correlating the timelines), and from that I have seen instances where we
spend much more time executing a script (and everything under it) than
Chrome does.
We've really only come up with two general high-level explanations for
why we might be slower.
(1) We are getting bad IPC counts for at least one test. IIRC we're at
something like 10, and Chrome is around 2. Except the latest info says
that that's mostly due to interference from graphics code.
(2) Google Suites code is the output of the Closure Compiler, which produces
functions with poor type locality, if you'll forgive me for coining a
term. (As in, lots of different code with lots of different types flows
into the leaf functions, so compiled code for leaves ends up
megamorphic, but wouldn't be if we magically inlined everything.)
I don't know how to prove or disprove either of these hypotheses. And
even if they do end up explaining a lot, I'm guessing there will be a
lot of smaller reasons that'll add up as well.
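To make (2) a bit more concrete, here's a toy example of the pattern as
I understand it (not actual GSuites code, obviously): many object shapes
funnel into one shared leaf function, so the property-access IC in the
leaf sees dozens of shapes and goes megamorphic, even though each call
site would be monomorphic if everything were inlined.

    function getId(o) {        // shared leaf, compiled once for all callers
      return o.id;             // this access sees every shape below
    }

    // Objects that all have .id but otherwise different property sets, so each
    // gets a distinct shape; roughly what many distinct Closure-compiled record
    // types flowing into one small helper looks like.
    const objs = [];
    for (let i = 0; i < 50; i++) {
      const o = { id: i };
      o["extra" + i] = i;      // unique property name => unique shape
      objs.push(o);
    }

    let sum = 0;
    for (let i = 0; i < 1e6; i++) {
      sum += getId(objs[i % objs.length]);  // o.id stays megamorphic here
    }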
Typical examples:
https://clptr.io/2jqJ0BD is a non-e10s run for the document from bug
1269695. It has 50% of the total time in stuff related to script. 19% is
running ion code, 14% running baseline code, which is actually pretty
good for the things I have looked at. (Contrast with an Octane run
https://clptr.io/2jqRUPk where 40% of the total time is in ion, 16% in
baseline, 18% in GC!)
Bug 1326346 is a gslides example that ehsan has analyzed, finding some
reflow but mostly script.
https://clptr.io/2jqGWJX shows 31% of the time spent in script: 17%
executing baseline, 5% executing ion, and to my eye, not very many total
bailouts. (This is from loading
https://docs.google.com/presentation/d/10QdoQTau98IeEp862ChQ7vG0eR3_mJZABsuduXGK1VE/edit#slide=id.g115bc93893_11_5
).
What the JS team could provide:
- Understand what various patterns mean. What are tell-tale signs that
we are doing something dumb/fixable? And which can we detect
automatically? nbp has some ideas in bug 1184569. Additional
perspectives would be helpful.
- In general, a better way to triage and diagnose performance problems
in single page apps, a way that does not require having an analogous
shell test case.
- Revive VTune integration (sstangl is working on it)
- Push column numbers through to the profiler to be able to identify
specific scripts in minified code (bug 785922?)
- Annotate more things with markers, eg bug 1329923, so we can get some
of the advantages of tracing with our sampling. (eg generate a timeline
of samples for a given script, with all compiles/bailouts/collections
annotated in the timeline)
- Produce bailout markers in a structured format (bug 1329921)
- A good description of the overall flow of the engine, as it relates to
performance. Every time we go over this with someone new, the same ideas
and suggestions arise.
- compile sooner
- compile later
- specialize the warmup thresholds to specific URLs (or a similar key)
- compile while idle and keep multiple differently-specialized copies
- cache compiled code
- cache type information
- suppress GC
- GC when idle
- eliminate all or most bailouts
I can come up with reasons against every one of these, but I also feel
like there's a kernel of a good idea in each. It would be nice to have
ammunition in the form of a clear description of how the engine works
*in practice*, and preferably specific numbers we used to tune various
things. This isn't for shooting down these ideas, but rather for making
them more nuanced and increasing the chance of someone coming up with
something that *would* help, without having to explain the "why nots"
over and over again.
Some of these things are doable, would help some, but just aren't worth
the cost. Like say we saved initial type information + steady state type
information separately, and unioned each with what we observe at
runtime. So the first compile would take into account the first batch of
type info, the second would switch to the second batch. Lots of work,
more memory, more complexity -- could it be worth those costs? Seems
unlikely to me, but I'm no expert.
I would imagine the engine description looking something like "x% of
code never even runs, so we don't want to waste time doing anything
eager with it. We have to at least syntax parse before we can fetch all
of the code. x% of code only runs once, and the interpreter will execute
it faster than we could baseline compile & run it. Baseline is typically
some factor k faster than the interpreter, but has to generate ICs
before it gets
up to that speed. It also has to see accurate enough type info before
ion compiling, so that...." or whatever. Including subtleties such as
startup type info vs steady state type info. Essentially, "these are the
patterns of code the engine expects to see, and this is how it makes
them go fast."
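As a strawman for what I mean, here is a toy loop annotated with my
(possibly misremembered, and certainly tunable) understanding of when
each tier kicks in; the thresholds should not be taken as gospel.

    function addPoint(p, q) {
      return { x: p.x + q.x, y: p.y + q.y };
    }

    let acc = { x: 0, y: 0 };
    for (let i = 0; i < 20000; i++) {
      // First handful of calls: the interpreter runs addPoint and records type info.
      // After ~10 warm-up counts: Baseline compiles it and starts populating ICs
      //   for the property accesses and additions.
      // After ~1000 warm-up counts: Ion compiles it, specialized to the {x, y}
      //   shape and double arithmetic observed so far.
      // If a later call passes a different shape (say {x, y, z}), the Ion code
      //   bails out and the function may eventually be recompiled.
      acc = addPoint(acc, { x: i * 0.5, y: i * 0.25 });
    }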
I have a meta bug, bug 1299643, where I try to list the Google Suites
cases that seem to be JS bound. The latest round of Hasal results is at
<https://docs.google.com/spreadsheets/d/1chXshqmq2PNrGe7M6ETjLcJkkP3Myur8tPd3HmkaxYc/edit#gid=244337553>;
note that I haven't gone through them to sort out the latest set of
significantly slower, JS-bound cases, because they're using the old
profiler and I don't have any good way of categorizing the time there.