Re: Time for amber2?

Iago Toral Tue, 19 May 2026 22:37:00 -0700

On Thu, 14-05-2026 at 19:13 -0400, Marek Olšák wrote:
> Here's a more detailed description of the problem and a possible
> solution.
> 
> First, the worst case scenario: A small one-line commit that’s correct
> and trivial causes a test failure in the CI. The maintainer of the
> affected driver is asked for help, concludes that it’s likely a HW bug,
> and forwards it to the HW team of the corresponding GPU company.
> Now the management of the GPU company has to allocate staff to
> investigate the failure. 3 months later, we may have a workaround. Or
> not.
> 
> Second, the scale: The CI has lots of undocumented devices with
> undocumented errata, and drivers with hacks and incomplete
> implementations (that’s normal for any project). Any of those devices
> can fail at any time for reasons that might not make sense, and any of
> the drivers can fail for random reasons too. It’s not fair to ask the
> contributor to keep everything conformant at every MR. Even if the
> devices were documented with open source implementations (e.g. uarch
> specs, HDL, RTL) and well documented drivers, it’s not reasonable to
> ask the contributor to study them all.
> 
> Thus, we can’t expect the contributor to be solely responsible for
> conformance of all devices at every MR in main.
> 
> It’s useful to keep drivers that have regular contributors conformant
> at most commits in main, but why do we need to keep drivers without
> contributors conformant? If somebody cares about those drivers but not
> enough to contribute in main, they can contribute fixes during the RC
> window or on their own schedule.
> 
> We need a two-tier system:
> 
> Tier 1:
> - Devices are tested by the CI pre-merge.
> - A contact person is required for CI failure assessment and closure
> within a reasonable time. (if the person is on leave, a backup person
> must be available, or else the device is moved to Tier 2)



I think this makes sense, but we need to agree on what "reasonable
time" means to make sure everyone is on the same page.

Iago

> - Highly recommended: A fully functional drm-shim for each CI job with
> a user guide, how to print compiled shaders, etc.
> - Links to HW documentation if available.
> - If maintainers end up xfailing a significant number of failures
> regularly, the device is moved to Tier 2. (due to not using the CI to
> maintain conformance)
> 
> Tier 2:
> - Pre-merge CI can’t run on the target devices / implementations. main
> doesn’t have to work. The quality of release branches is up to
> maintainers. The RC window can be extended.
> - Only unit tests can run pre-merge, as well as any deviceless driver
> tests, like the following.
> - Optionally develop deviceless driver validation tests that verify
> driver output (shader instructions, command buffers). LLVM LIT tests
> are the perfect example - they validate all LLVM backends and prevent
> regressions without any physical devices.
> 
> 
> Marek
> 
> On Fri, May 1, 2026 at 5:21 AM Daniel Stone <[email protected]>
> wrote:
> > 
> > Hi,
> > 
> > On Thu, 30 Apr 2026 at 23:34, Timur Kristóf
> > <[email protected]> wrote:
> > > On Thursday, 30 April 2026, 23:07:12 Central European Summer Time,
> > > Marek Olšák wrote:
> > > > First of all, no contributor to shared code is required to fix
> > > > issues in all drivers that their commit breaks. The goal is to
> > > > stop using the pre-merge CI as a justification to force
> > > > unrelated contributors to work on all drivers just because they
> > > > are contributors. It would be a bit exploitative to assume that
> > > > every contributor must debug all drivers that turn red due to a
> > > > change. I think I understand that well because I have debugged
> > > > 5+ drivers by myself in the past that are not my responsibility
> > > > to maintain, and it does feel exploitative.
> > 
> > There's a bit more nuance in this though. If one set of people is
> > breaking 17 drivers every day because they can't be bothered to do the
> > basics to keep things working and just want to yolo whatever they just
> > thought of into the tree, it's 'unethical' and unfair on the rest of
> > the people who then spend their entire time bisecting and fixing up
> > what the others broke. (Those people then probably get accused of
> > being freeloaders and exploiting the labour of the people breaking
> > everything, because they don't get to spend any time on fun new stuff,
> > given all their time is spent fixing what the others broke.)
> > 
> > I think we've all taken it as axiomatic that there's a balance to be
> > struck there: don't make others miserable because you can't be
> > bothered spending five minutes thinking about why your new code breaks
> > existing users, but on the other hand you absolutely should expect
> > support from the relevant people to help work it out and resolve it.
> > 
> > I'm pretty sure no-one is suggesting ripping up that social contract,
> > but we should be clear about what we mean.
> > 
> > > > Therefore, we could establish that each driver/HW combo in
> > > > pre-merge CI has the following options:
> > > > 1) a contact person for prompt CI issue resolution
> > > > 2) unconditional xfail by the author (or removal from pre-merge
> > > > CI if logs lack the information necessary to add xfail)
> > > 
> > > I think we should establish both of those, in that order.
> > > That is, if the contact person does not reply promptly, let's just
> > > add the expected failure.
> > 
> > Yeah, that's a pretty obvious baseline. So far it seems to have worked
> > out in the usual way (people know who works on what so it's easy to
> > ping them however), but if that's not working out, maybe someone could
> > suggest a more formal document along the lines of MAINTAINERS or
> > CODEOWNERS or whatever?
> > 
> > Cheers,
> > Daniel
>
