Re: All about crashes

Nicholas Nethercote Wed, 25 May 2016 18:47:06 -0700

On Wed, May 25, 2016 at 6:58 AM, Lawrence Mandel <lman...@mozilla.com> wrote:
>
> Wasn't sure how you wanted feedback. Here's some in email form.


Email is great, thank you.


> "Crashes are caused by defects"
>
> Reading this I think it implies defects in Firefox. This is not always the
> case. Crashes are also the result of interactions with third party software.
> Both software that we designed for (like NPAPI plug-ins) and software that
> we didn't (AV, malware), which you mention lower down in the doc.

Fair enough. I've modified my terminology. I now talk about crash
"causes" as the most generic category, and reserve "defects" for
erroneous code and hardware.

<aside>
In general it's good to be careful with terminology when dealing with
bugs and crashes, because it can help clarify things. E.g. when
talking about buggy code, there are three stages for a "bug" to
manifest:

- Erroneous code. When executed, it can (but doesn't necessarily) lead to...
- Erroneous runtime state. This can (but doesn't necessarily) lead to...
- Erroneous runtime behaviour. This includes crashes, but also other things.

At times "bug" and "buggy" are used to refer to all three. I tend to
use "defect" for erroneous code. People sometimes use "error" or
"fault" for erroneous behaviour. Erroneous runtime state is often
overlooked as a separate category, and the only specific term I've
heard for it is "infection" which I don't like much.
</aside>
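
To make the three stages concrete, here is a minimal sketch in Python.
The functions mean() and report() are hypothetical and exist only for
illustration. The same defect is executed on every call, but it only
sometimes produces erroneous state, and that state only sometimes
produces erroneous behaviour.

```python
def mean(values):
    # Erroneous code (the defect): the divisor should be len(values).
    return sum(values) / (len(values) - 1)


def report(values):
    # The average held here is the runtime state; it may or may not be wrong.
    avg = mean(values)
    # The rounded result is the visible behaviour; it may or may not be wrong.
    return round(avg)


# Defect executed, state still correct: 0 / 2 == 0.0, the true mean.
print(report([0, 0, 0]))

# Erroneous state (10 / 9 instead of 1.0), but rounding hides it,
# so the behaviour is still correct.
print(report([1] * 10))

# Erroneous state and erroneous behaviour: reports 8, the true mean is 4.
print(report([4, 4]))

# Erroneous behaviour in the form of a crash: division by zero.
try:
    report([5])
except ZeroDivisionError as exc:
    print("crash:", exc)
```

Only the last two calls would ever be noticed by a user, which is one
reason a defect can sit in the code for a long time before it shows up
as a crash.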

> "Improve ranking of crash clusters."
>
> I think this is weighting or estimating impact of a crash instead of volume
> of submissions, which is how we have historically processed the clusters.
> Severity is one component with startup crashes being worse than content
> crashes being worse than shutdown crashes. (Need to figure out weighting of
> gfx and other buckets of crashes.)

Yes, this is intended to cover more sophisticated ranking techniques,
ones that take into account more than just volume. I've tweaked the
wording slightly in an attempt to clarify this.

> Potential impacted population is another
> and we have data on the differences in population size on Beta vs Release
> for dimensions like OS version, gfx hardware, and gfx driver to make use of
> for the weighting.

Yes. There's a mention of the "Crystal Ball" tool which might be one
way to do this, though other ways are possible.
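
As a rough illustration of ranking by estimated impact rather than raw
volume, here is a small Python sketch. It is not how Crystal Ball or
any existing tool works. The weights, the field names and the example
signatures are all made up; only the ordering (startup worse than
content worse than shutdown) and the idea of scaling by the Beta vs
Release population difference come from the discussion above.

```python
from dataclasses import dataclass

# Hypothetical severity weights; only their relative order is taken
# from the thread (startup > content > shutdown).
SEVERITY_WEIGHT = {"startup": 3.0, "content": 2.0, "shutdown": 1.0}


@dataclass
class CrashCluster:
    signature: str           # made-up crash signature
    kind: str                # "startup", "content" or "shutdown"
    beta_reports: int        # number of crash reports seen on Beta
    population_ratio: float  # assumed Release/Beta population ratio for
                             # the affected OS version, gfx hardware, etc.


def impact_score(cluster):
    """Estimate impact as volume scaled by severity and by population."""
    return (cluster.beta_reports
            * SEVERITY_WEIGHT[cluster.kind]
            * cluster.population_ratio)


def rank(clusters):
    """Return the clusters ordered from highest to lowest estimated impact."""
    return sorted(clusters, key=impact_score, reverse=True)


clusters = [
    CrashCluster("shutdown_hang_example", "shutdown", 1000, 1.0),
    CrashCluster("startup_gfx_example", "startup", 200, 4.0),
]
for c in rank(clusters):
    print(c.signature, impact_score(c))
```

In this made-up example the startup crash ranks first even though it
has a fifth of the reports, because it is more severe and affects a
configuration that is much more common on Release than on Beta.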

> "Improve reproducibility of these crashes.
>
> Use rr to record crashes so they can be played back reliably."
>
> We're going to spin up a project to work on debugging in automation. We've
> talked about having the ability to run a test until it fails and pause at
> that point. I think that would be very helpful for this case.

Is this something that would happen for normal test jobs? Would there
be a way to attach a debugger to the paused job? Is it aimed at
reducing intermittent test failures?

It sounds interesting but how it works is a bit unclear to me from
your short description :)

> I didn't see a lot on mitigation strategies but I think this is a key
> piece as well. We will get some mitigation when e10s ships and content
> crashes no longer take down the whole browser. We've discussed mitigations
> for startup crashes (and implemented one for gfx). What other mitigations
> should we put in place to recover gracefully?

Good point. I added one dot point about that, but I don't have any
ideas beyond "disable bad gfx drivers" and "e10s all the things".

Nick
_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
