Module: Mesa
Branch: main
Commit: 7a3fb60ac85300f0030c5edd2587bf4913c17f69
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=7a3fb60ac85300f0030c5edd2587bf4913c17f69

Author: Eric Anholt <[email protected]>
Date:   Thu Oct 19 10:21:04 2023 +0200

docs/ci: Add some links in the CI docs to how to track job flakes

and also how to figure out how many boards are available for sharding
management.

Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/25806>

---

 docs/ci/docker.rst |  2 +-
 docs/ci/index.rst  | 34 +++++++++++++++++++++++++++-------
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/docs/ci/docker.rst b/docs/ci/docker.rst
index 4a3c842416d..4e181335fa2 100644
--- a/docs/ci/docker.rst
+++ b/docs/ci/docker.rst
@@ -34,7 +34,7 @@ at the job's log for which specific tests failed).
 DUT requirements
 ----------------
 
-In addition to the general :ref:`CI-farm-expectations`, using
+In addition to the general :ref:`CI-job-user-expectations`, using
 Docker requires:
 
 * DUTs must have a stable kernel and GPU reset (if applicable).
diff --git a/docs/ci/index.rst b/docs/ci/index.rst
index 2b8797200f7..bd7e3d49103 100644
--- a/docs/ci/index.rst
+++ b/docs/ci/index.rst
@@ -148,10 +148,10 @@ If you're having issues with the Intel CI, your best bet is to ask about
 it on ``#dri-devel`` on OFTC and tag `Nico Cortes
 <https://gitlab.freedesktop.org/ngcortes>`__ (``ngcortes`` on IRC).
 
-.. _CI-farm-expectations:
+.. _CI-job-user-expectations:
 
-CI farm expectations
---------------------
+CI job user expectations
+-------------------------
 
 To make sure that testing of one vendor's drivers doesn't block
 unrelated work by other vendors, we require that a given driver's test
@@ -160,11 +160,23 @@ driver had CI and failed once a week, we would be seeing someone's
 code getting blocked on a spurious failure daily, which is an
 unacceptable cost to the project.
 
+To ensure that, driver maintainers with CI enabled should watch the Flakes panel
+of the `CI flakes dashboard
+<https://ci-stats-grafana.freedesktop.org/d/Ae_TLIwVk/mesa-ci-quality-false-positives?orgId=1>`__,
+particularly the "Flake jobs" pane, to inspect jobs in their driver where the
+automatic retry of a failing job produced a success a second time.
+Additionally, most CI reports test-level flakes to an IRC channel, and flakes
+reported as NEW are not expected and could cause spurious failures in jobs.
+Please track the NEW reports in jobs and add them as appropriate to the
+``-flakes.txt`` file for your driver.
+
 Additionally, the test farm needs to be able to provide a short enough
-turnaround time that we can get our MRs through marge-bot without the
-pipeline backing up.  As a result, we require that the test farm be
-able to handle a whole pipeline's worth of jobs in less than 15 minutes
-(to compare, the build stage is about 10 minutes).
+turnaround time that we can get our MRs through marge-bot without the pipeline
+backing up.  As a result, we require that the test farm be able to handle a
+whole pipeline's worth of jobs in less than 15 minutes (to compare, the build
+stage is about 10 minutes).  Given boot times and intermittent network delays,
+this generally means that the test runtime as reported by deqp-runner should be
+kept to 10 minutes.
 
 If a test farm is short the HW to provide these guarantees, consider dropping
 tests to reduce runtime.  dEQP job logs print the slowest tests at the end of
@@ -179,6 +191,14 @@ artifacts.  Or, you can add the following to your job to only run some fraction
 
 to just run 1/10th of the test list.
 
+For Collabora's LAVA farm, the `device types
+<https://lava.collabora.dev/scheduler/device_types>`__ page can tell you how
+many boards of a specific tag are currently available by adding the "Idle" and
+"Busy" columns.  For bare-metal, a gitlab admin can look at the `runners
+<https://gitlab.freedesktop.org/admin/runners>`__ page.  A pipeline should
+probably not create more jobs for a board type than there are boards, unless you
+clearly have some short-runtime jobs.
+
 If a HW CI farm goes offline (network dies and all CI pipelines end up
 stalled) or its runners are consistently spuriously failing (disk
 full?), and the maintainer is not immediately available to fix the
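
The per-driver ``*-flakes.txt`` files referenced above are the known-flake
lists typically fed to deqp-runner's ``--flakes`` option.  As a rough sketch,
assuming the usual deqp-runner list format of one test name or regular
expression per line with ``#`` comments, a hypothetical entry for a newly
reported flake could look like:

    # Reported as NEW by the IRC flake bot, 2023-10-19 (link the job here)
    dEQP-GLES31.functional.ssbo.layout.random.all_per_block_buffers.13
    # A regular expression can cover a whole group of flaky tests
    dEQP-VK.synchronization.signal_order.shared_binary_semaphore.*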

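For the fraction and sharding advice, a minimal GitLab CI sketch, assuming a
job that extends an existing per-board test template and a runner script that
honours the usual ``DEQP_FRACTION`` variable (the job name, template name and
shard count below are made up for illustration), might look like:

    # Hypothetical full-run job for one board type.
    a630-vk-full:
      extends:
        - .a630-test        # assumed existing per-board test template
      parallel: 4           # no more shards than there are boards of this type
      variables:
        DEQP_FRACTION: 10   # run 1/10th of the caselist to keep runtime near 10 minutes

GitLab expands ``parallel:`` into that many separate jobs, so its value is
what you compare against the "Idle" plus "Busy" board count from the device
types or runners pages.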