I suggest that we start by reducing the scope of the "unknowns". In particular, I am worried because we have a *lot* of crashes in JIT code and, in general, we don't know what kinds of crashes these are:
* valid JIT code touching a dead object (a GC bug that happens to be seen while executing JIT code)
* bad JIT code generation
* executing JIT code after we cleaned it up

We're prototyping a project now that does all of our crash stackwalking and analysis on the client instead of the server. This can potentially allow us to inspect more memory (not just the stack memory traditionally saved in a minidump) as well as have data on all crashes, not just the submitted ones.

I've advocated in bugs for being a lot more aggressive about poisoning deleted memory in release builds. I know we already do this for free(), and I believe we do it for some types of GCed objects. Do we do it for all GCed objects nowadays, and for JIT code that we believe is done? We have work to do to make the poison value point to inaccessible memory instead of a NOP slide (ping me for a bug#). If we can use different poison values for dead JIT (executable) memory and dead object (non-executable) memory, that would also help distinguish things.

If there are CPU-efficient ways to do this, we should also consider checking pointers against poison values more aggressively (and earlier), so that we can figure out what *kind* of access pattern is causing GC errors, rather than only discovering it in rather opaque JIT or GC code.

--BDS

On Thu, Apr 28, 2016 at 2:48 AM, Nicholas Nethercote <[email protected]> wrote:
> Hi,
>
> Project Uptime (https://wiki.mozilla.org/Platform/Uptime) is underway.
> Its goal is to reduce the crash rate of Firefox (desktop and mobile).
> And SpiderMonkey accounts for a significant fraction of those crashes.
>
> SM provides some particular challenges, in particular the JITs and the GC.
>
> First, JITs and GC are both inherently unsafe things. Lots of raw
> memory manipulation, code manipulation, areas where we have less
> protection than normal C++. (These are things that even Rust wouldn't
> help with much, because we'd have to write big chunks of them in
> unsafe code.)
>
> Second, crash reports from bugs in the JITs and the GC often have less
> info than normal crash reports. For the JITs that's because the stack
> traces are unhelpful -- e.g. so many crashes aggregate under
> EnterBaseline. For the GC that's because a GC crash is often
> triggered by buggy code (be it in the GC itself, or elsewhere) that
> ran substantially earlier.
>
> This is a good moment to think hard about how we can improve things.
>
> - Can we use static and dynamic analysis tools more? (Even simple
>   things like bug 1267551 can help.)
>
> - How can we get better data in JIT and GC crash reports?
>
> - Would "extended assertions" help? By this I mean verification passes
>   over complex data structures. Compilers often have these, e.g. after
>   each pass you can optionally run a pass that does a thorough sanity
>   check of the IR. Do we have that for the JITs? Would something like
>   that make sense for GC? ("Code generators and garbage collectors
>   should crash as early and as loudly as possible.")
>
> - What defensive programming measures can we add? What code
>   patterns are error-prone and should be avoided?
>
> - How can we respond to problems better? E.g. bug 1232229 is an example
>   where a more aggressive approach to backouts would likely have resulted
>   in a topcrash diagnosis occurring a lot earlier than it eventually did.
>
> - Could user telemetry be used to identify parts of SM that aren't
>   exercised much in Nightly/Aurora/Beta?
>
> I'd love to hear ideas.
>
> Nick
> _______________________________________________
> dev-tech-js-engine-internals mailing list
> [email protected]
> https://lists.mozilla.org/listinfo/dev-tech-js-engine-internals

