Hi Kamil,
On Thursday, 9 October 2025 19:05:21 CEST Kamil Konieczny wrote:
> Hi Janusz,
> On 2025-10-09 at 12:22:12 +0200, Janusz Krzysztofik wrote:
> > Hi Kamil,
> >
> > On Wednesday, 8 October 2025 18:42:42 CEST Kamil Konieczny wrote:
> > > Hi Janusz,
> > > On 2025-10-08 at 14:52:44 +0200, Janusz Krzysztofik wrote:
> > > > Hi Kamil,
> > > >
> > > > On Wednesday, 8 October 2025 14:14:47 CEST Kamil Konieczny wrote:
> > > > > Hi Janusz,
> > > > > On 2025-10-07 at 13:38:25 +0200, Janusz Krzysztofik wrote:
> > > > > > Subtests that measure time of resume after engine reset require
> > > > > > results
> > > > > > from at least 9 reset-resume cycles for reasonable calculation of a
> > > > > > median
> > > > > > value to be compared against a presumed limit. On most Gen12+
> > > > > > platforms,
> > > > > > as well as on some older ones like JSL, CHV, ILK or ELK, the
> > > > > > current limit
> > > > > > of 5 seconds for collecting those results proves too short.
> > > > > >
> > > > > > Raise the limit to an empirically determined value of 20 seconds
> > > > > > and break
> > > > > > the loop as soon as 9 results are collected.
> > > > > >
> > > > > > v2: Split out a change in handling of not enough measurements to a
> > > > > > separate patch (Kamil),
> > > > > > - reword commit message to be more distinct from other patches in
> > > > > > series (Kamil),
> > > > > > - reword commit message and description so they no longer state
> > > > > > the
> > > > > > scope of the issue is limited to Gen12+, and list other
> > > > > > (non-Gen12+)
> > > > > > platforms found also affected.
> > > > > >
> > > > > > Signed-off-by: Janusz Krzysztofik
> > > > > > <[email protected]>
> > > > > > ---
> > > > > > tests/intel/gem_eio.c | 5 ++++-
> > > > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/tests/intel/gem_eio.c b/tests/intel/gem_eio.c
> > > > > > index 0a00ef026e..79dcef8fa6 100644
> > > > > > --- a/tests/intel/gem_eio.c
> > > > > > +++ b/tests/intel/gem_eio.c
> > > > > > @@ -929,7 +929,7 @@ static void reset_stress(int fd, uint64_t ahnd,
> > > > > > const
> > > > intel_ctx_t *ctx0,
> > > > > > gem_write(fd, obj.handle, 0, &bbe, sizeof(bbe));
> > > > > >
> > > > > > igt_stats_init(&stats);
> > > > > > - igt_until_timeout(5) {
> > > > > > + igt_until_timeout(20) {
> > > > >
> > > > > What I wanted here was actually (in pseudocode):
> > > > >
> > > > > mtime = gen < 5 || gen >= 12 ? 20 : 5;
> > > >
> > > > That's incorrect. JSL, now mentioned in the commit description (see
> > > > also the changelog), is gen11, and it's the only one of that gen that
> > > > exhibits the problem. Moreover, some affected CI machines need more
> > > > time in unwedge-stress and not necessarily in reset-stress, some vice
> > > > versa, and still some need more time in both. That may sound strange,
> > > > but that's what results from my many trybot attempts look like. Also,
> > > > not all pre-gen5 machines require a higher limit on resume time, as
> > > > it is handled now and extended over gen12+ in the next patch. So
> > > > before I try to fulfil your expectation and use a formula here
> > > > instead of a constant, we have to agree on how precise that formula
> > > > should be. If you don't accept a simplified approach then I have to
> > > > spend more time finding out what exactly takes time on the kernel
> > > > side in each of those distinct cases, and maybe then I will be able
> > > > to formulate exact conditions for when we should wait longer and
> > > > when not.
> > > >
> > >
> > > One more note - maybe it is related to having two GTs: GT0 and GT1?
> >
> > Maybe, but that's only one of the potential factors; it doesn't cover
> > cases like DG2 or ILK, which I believe are two completely different
> > examples.
> >
> > >
> > > It could go with simplified formula here and just use some value,
> > > 20 or 10?
> >
> > I still don't understand what your goal here is. What issue do you
> > expect to be avoided or resolved by replacing the new, higher constant
> > value with a formula? If I understood your point then I should be able
> > to propose a solution.
> >
> > >
> > > Btw did you see results for v1? The gem_eio@kms subtest
> > > is failing due to the disk limit in CI, and in the logs there are
> > > 21 'Forcing GPU reset' messages.
> > >
> > > https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_13866/shard-dg2-5/igt@[email protected]
> > > https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_13866/shard-tglu-8/igt@[email protected]
> > >
> > > Is it not related to your series?
> >
> > Yes, that's related, in the sense that before, there was a shorter, 5s
> > limit for performing resets and measuring resume time, so less noise
> > was accumulated in dmesg than now, when we wait up to 20s for 9
> > measurements to be collected in order to avoid falsely reporting
> > success when we don't evaluate results because fewer than 9 have been
> > collected.
> >
> > > Maybe number of resets should also be lowered?
> >
> > The kms subtest consists of 3 exercises. The first one -- inflight --
> > triggers 7 resets (1 reset per ring); the remaining two are equivalents
> > of reset-stress and unwedge-stress, with 9 resets per each of the 2
> > scenarios required as a minimum for stable median calculation, so 25
> > resets in total.
> >
> > Are you sure we are free to lower that limit of 9 measurements
> > required for a stable median?
>
> Not sure.
>
> >
> > I think we should rather convince CI and display developers to limit the
> > amount of noise in dmesg generated in CI runs by display debugging.
> >
> > > Also test took over 20 seconds after it was killed.
> >
> > I can see the "Disk usage limit exceeded" event reported at timestamp
> > 340.998570, and the next subtest scheduled at 342.958470. Where do you
> > see those 20 seconds?
> >
> > Thanks,
> > Janusz
> >
>
> I did see it in dmesg4.txt, excerpt:
>
> 33669:<7>[ 494.568318] [IGT] gem_eio: executing
> 33835:<7>[ 496.717429] [IGT] gem_eio: starting subtest kms
> 39187:<7>[ 498.714559] [IGT] Forcing GPU reset
> 39767:<7>[ 498.903888] [IGT] Checking that the GPU recovered
>
> and later:
>
> 87173:<7>[ 516.070903] [IGT] Checking that the GPU recovered
> 88860:<7>[ 516.628325] [IGT] Forcing GPU reset
> 89701:<7>[ 518.764372] [IGT] kms_frontbuffer_tracking: executing
>
> so 516 - 496 is 20.
>
> Re-checked it:
> grep IGT 25-10-08-gem-eio-dmesg4.txt |grep ': executing'|grep -A5 gem_eio
> <7>[ 494.568318] [IGT] gem_eio: executing
> <7>[ 518.764372] [IGT] kms_frontbuffer_tracking: executing
OK, but that doesn't show when gem_eio was killed, so it doesn't mean gem_eio
"took over 20 seconds after it was killed". Timestamps found in the igt_runner
log show it took less than 2 seconds:
[494.326197] [085/130] (533s left) gem_eio (kms)
[496.519345] Starting subtest: kms
[516.673500] Disk usage limit exceeded.
[518.473021] Closing watchdogs
[518.475503] Initializing watchdogs
[518.475569] /dev/watchdog0
[518.503334] [FACT before any test] new: hardware.pci.gpu_at_addr.0000:03:00.0:
8086:56a0 Intel Dg2 (Gen12) DG2 [Arc A770]
[518.516396] [FACT before any test] new:
hardware.pci.drm_card_at_addr.0000:03:00.0: card0
[518.518650] [FACT before any test] new: kernel.kmod_is_loaded.i915: true
[518.519215] [FACT before any test] new: kernel.kmod_is_loaded.vgem: true
[518.521639] [086/130] (508s left) kms_frontbuffer_tracking
(fbcpsr-2p-primscrn-cur-indfb-draw-mmap-wc)
>
> Btw there were some KMS changes which lowered DRM logging so as not to
> trigger the disk limit; maybe that should also be used in this subtest,
> for example after the first reset.
Did you mean there were changes in the kernel, or in IGT? It would be very
helpful if there is an already accepted IGT method for decreasing the
verbosity of display code that we could reuse here.
Thanks,
Janusz
>
> Regards,
> Kamil
>
> >
> > >
> > > Regards,
> > > Kamil
> > >
> > > > >
> > > > > igt_until_timeout(mtime) {
> > > > >
> > > > > > const intel_ctx_t *ctx = context_create_safe(fd);
> > > > > > igt_spin_t *hang;
> > > > > > unsigned int i;
> > > > > > @@ -978,6 +978,9 @@ static void reset_stress(int fd, uint64_t ahnd,
> > > > > > const
> > > > intel_ctx_t *ctx0,
> > > > > > gem_sync(fd, obj.handle);
> > > > > > igt_spin_free(fd, hang);
> > > > > > intel_ctx_destroy(fd, ctx);
> > > > > > +
> > > > > > + if (stats.n_values > 8)
> > > > >
> > > > > Can it be a define as it is used in other places, for example:
> > > > > #define NUMBER_OF_MEASURED_CYCLES_NEEDED 9
> > > > >
> > > > > so you will use it elsewhere, and here it will be:
> > > > >
> > > > > if (stats.n_values >= NUMBER_OF_MEASURED_CYCLES_NEEDED)
> > > > > break;
> > > >
> > > > OK.
> > > >
> > > > >
> > > > > > }
> > > > > > check_wait_elapsed(name, fd, &stats);
> > > > >
> > > > > I did give you r-b for patch 1/5 but I am not sure how reliable
> > > > > the measurements are; should it be an assert instead of a skip?
> > > > > Maybe function check_wait_elapsed() should return a bool to tell
> > > > > if the median is ready, and in each place the subtest itself
> > > > > decides if it should skip or assert? Up to you.
> > > >
> > > > check_wait_elapsed() is called only from reset_stress(), which in
> > > > turn is called only by 3 subtests, all in scope of this series. Can
> > > > you suggest some criteria for when you think a subtest should skip
> > > > and when it should fail if not enough results have been collected?
> > > > I've chosen skip because we couldn't do much with fail other than
> > > > blocklisting the failing subtest, while CI can handle skips as
> > > > expected skips on selected platforms if we really can't find a
> > > > balance between a loop long enough to collect enough measurements
> > > > and one short enough not to exceed the per-test timeout on
> > > > platforms with many engines.
> > > >
> > > > Thanks,
> > > > Janusz
> > > >
> > > >
> > > > >
> > > > > Regards,
> > > > > Kamil
> > > > >
> > > > > > igt_stats_fini(&stats);
> > > > >