Re: Time for amber2?

Marek Olšák Thu, 14 May 2026 16:14:46 -0700

Here's a more detailed description of the problem and a possible solution.

First, the worst-case scenario: a small one-line commit that’s correct
and trivial causes a test failure in the CI. The maintainer of the
affected driver is asked for help and concludes that it’s likely a HW
bug, so the issue is forwarded to the HW team of the corresponding GPU
company. Now the management of that company has to allocate staff to
investigate the failure. Three months later, we may have a workaround.
Or not.


Second, the scale: The CI has lots of undocumented devices with
undocumented errata, and drivers with hacks and incomplete
implementations (that’s normal for any project). Any of those devices
can fail at any time for reasons that might not make sense, and any of
the drivers can fail for random reasons too. It’s not fair to ask the
contributor to keep everything conformant at every MR. Even if the
devices were documented with open source implementations (e.g. uarch
specs, HDL, RTL) and well documented drivers, it’s not reasonable to
ask the contributor to study them all.

Thus, we can’t expect the contributor to be solely responsible for
conformance of all devices at every MR in main.

It’s useful to keep drivers that have regular contributors conformant
at most commits in main, but why do we need to keep drivers without
contributors conformant? If somebody cares about those drivers but not
enough to contribute in main, they can contribute fixes during the RC
window or on their own schedule.

We need a two-tier system:

Tier 1:
- Devices are tested by the CI pre-merge.
- A contact person is required for CI failure assessment and closure
within a reasonable time. (if the person is on leave, a backup person
must be available, or else the device is moved to Tier 2)
- Highly recommended: A fully functional drm-shim for each CI job with
a user guide, how to print compiled shaders, etc.
- Links to HW documentation if available.
- If maintainers end up xfailing a significant number of failures
regularly, the device is moved to Tier 2. (due to not using the CI to
maintain conformance)
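
As a rough illustration only, the Tier 1 demotion rules above could be
expressed as a small check. Everything here is hypothetical: the
DeviceStatus fields, the assign_tier name, and the xfail_limit
parameter are assumptions made for the sketch, since the proposal does
not define what counts as "a significant number" of xfails.

```python
from dataclasses import dataclass


@dataclass
class DeviceStatus:
    """Hypothetical per-device record; field names are illustrative."""
    name: str
    contact_available: bool  # primary contact can assess CI failures
    backup_available: bool   # backup contact can step in
    recent_xfails: int       # xfails added by maintainers in the review period


def assign_tier(dev: DeviceStatus, xfail_limit: int) -> int:
    """Return 1 or 2 following the Tier 1 rules sketched above.

    xfail_limit is an assumed threshold; the proposal leaves it open.
    """
    # No contact and no backup: the device cannot stay in Tier 1.
    if not (dev.contact_available or dev.backup_available):
        return 2
    # Regularly xfailing many failures means the CI is not being used
    # to maintain conformance.
    if dev.recent_xfails >= xfail_limit:
        return 2
    return 1
```

For example, a device whose primary contact is on leave but has a
backup stays in Tier 1, while one with neither is moved to Tier 2.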

Tier 2:
- Pre-merge CI can’t run on the target devices / implementations. main
doesn’t have to work. The quality of release branches is up to
maintainers. The RC window can be extended.
- Only unit tests can run pre-merge, as well as any deviceless driver
tests, like the following.
- Optionally develop deviceless driver validation tests that verify
driver output (shader instructions, command buffers). LLVM LIT tests
are the perfect example - they validate all LLVM backends and prevent
regressions without any physical devices.
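
To make the deviceless idea concrete, here is a minimal sketch in the
spirit of LIT/FileCheck-style tests: compile an input, then check that
the expected instructions appear in the output in order. It is not
Mesa or LLVM code. toy_compile is a made-up stand-in for a driver's
shader compiler, and check_output is a simplified, hypothetical
matcher.

```python
def toy_compile(ops):
    """Hypothetical stand-in for a driver compiler.

    Lowers a list of (opcode, src0, src1) tuples to textual
    instructions, one per line, writing results to v0, v1, ...
    """
    lines = []
    for i, (op, a, b) in enumerate(ops):
        lines.append(f"v{i} = {op} {a}, {b}")
    return "\n".join(lines)


def check_output(output, checks):
    """Return True if every string in checks occurs in output, in order."""
    pos = 0
    for pattern in checks:
        idx = output.find(pattern, pos)
        if idx < 0:
            return False
        # Continue searching after this match so ordering is enforced.
        pos = idx + len(pattern)
    return True


# No GPU is needed: the test only inspects the generated text.
asm = toy_compile([("mul", "a", "b"), ("add", "v0", "c")])
assert check_output(asm, ["mul a, b", "add v0, c"])
```

A test like this can run pre-merge on any machine, because it
validates what the driver emits rather than what a device does with it.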


Marek

On Fri, May 1, 2026 at 5:21 AM Daniel Stone <[email protected]> wrote:
>
> Hi,
>
> On Thu, 30 Apr 2026 at 23:34, Timur Kristóf <[email protected]> wrote:
> > On Thursday, 30 April 2026 23:07:12 Central European Summer Time Marek Olšák
> > wrote:
> > > First of all, no contributor to shared code is required to fix issues
> > > in all drivers that their commit breaks. The goal is to stop using the
> > > pre-merge CI as a justification to force unrelated contributors to
> > > work on all drivers just because they are contributors. It would be a
> > > bit exploitative to assume that every contributor must debug all
> > > drivers that turn red due to a change. I think I understand that well
> > > because I have debugged 5+ drivers by myself in the past that are not
> > > my responsibility to maintain, and it does feel exploitative.
>
> There's a bit more nuance in this though. If one set of people is
> breaking 17 drivers every day because they can't be bothered to do the
> basics to keep things working and just want to yolo whatever they just
> thought of into the tree, it's 'unethical' and unfair on the rest of
> the people who then spend their entire time bisecting and fixing up
> what the others broke. (Those people then probably get accused of
> being freeloaders and exploiting the labour of the people breaking
> everything, because they don't get to spend any time on fun new stuff,
> given all their time is spent fixing what the others broke.)
>
> I think we've all taken it as axiomatic that there's a balance to be
> struck there: don't make others miserable because you can't be
> bothered spending five minutes thinking about why your new code breaks
> existing users, but on the other hand you absolutely should expect
> support from the relevant people to help work it out and resolve it.
>
> I'm pretty sure no-one is suggesting ripping up that social contract,
> but we should be clear about what we mean.
>
> > > Therefore, we could establish that each driver/HW combo in pre-merge
> > > CI has the following options:
> > > 1) a contact person for prompt CI issue resolution
> > > 2) unconditional xfail by the author (or removal from pre-merge CI if
> > > logs lack the information necessary to add xfail)
> >
> > I think we should establish both of those, in that order.
> > That is, if the contact person does not reply promptly, just let's add the
> > expected failure.
>
> Yeah, that's a pretty obvious baseline. So far it seems to have worked
> out in the usual way (people know who works on what so it's easy to
> ping them however), but if that's not working out, maybe someone could
> suggest a more formal document along the lines of MAINTAINERS or
> CODEOWNERS or whatever?
>
> Cheers,
> Daniel
