On 15/06/2026 16:32, Timur Kristóf wrote:
On Monday, June 15, 2026 5:23:52 PM Central European Summer Time Tvrtko
Ursulin wrote:
On 15/06/2026 15:52, Timur Kristóf wrote:
On Monday, June 15, 2026 4:32:23 PM Central European Summer Time Tvrtko

Ursulin wrote:
On 25/05/2026 12:45, Timur Kristóf wrote:
Different interrupts may have different timestamp sources,
which shouldn't be compared.

If we compare the timestamps of retry faults to timestamps
of other interrupts, it may result in all retry fault
interrupts being filtered out, because of the different
time stamp source.

This issue was observed on Strix Halo.
Solved by storing the timestamp of the last page fault interrupt.

Hi,

This one may require access to AMD docs to review. For example I am
immediately curious as to how many different clock sources on a single
IH there are

As far as I know there are various timestamp sources in the GPU and some
interrupts use different ones. I am not aware of any documentation on this
topic, unfortunately.

how does that relate to the timestamp_src field

The timestamp_src field is set differently when the timestamp source is
different. So, it could happen that we accidentally filter out all page
faults when we shouldn't.

and if there are indeed multiple clock domains should the patch perhaps
be
generalized to something like
ih->processed_timestamp[entry->timestamp_src] or something?

For the context of this patch, I think it doesn't matter how many
different
kinds of time stamps there are. What's important is that we just shouldn't
compare timestamps of page faults with time stamps of other interrupts.

True, thank you!

Another question is why the backward timestamp check is needed only for
fault interrupts? I do not see it elsewhere.

Correct, this is only used for retry fault interrupts and only when they are
dispatched to the soft IH ring.

The reason this was added is because when retry faults are enabled and the GPU
hits a VM fault, it keeps spamming the CPU with many interrupts for the same
fault until the fault is resolved. The CPU needs to filter out the faults which
it is already handling, otherwise we would end up handling the same fault
multiple times.

Got it, thank you!

(As a side note, I should also probably look into how to reduce the frequency
of how often these interrupts are repeated.)


Let me also ask two more things below.

As far as I see the timestamp doesn't really matter for other interrupts
as we only use it to filter out page faults and nothing else.

Hope this helps,
Timur

---

    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c | 5 ++++-
    drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h | 1 +
    2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c index
13bec8461cde..52258f1341c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.c
@@ -437,9 +437,12 @@ bool amdgpu_gmc_filter_faults(struct amdgpu_device
*adev,>

        uint32_t hash;
        
        /* Stale retry fault if timestamp goes backward */

-       if (amdgpu_ih_ts_after(timestamp, ih->processed_timestamp))
+       if (timestamp == adev->gmc.processed_fault_timestamp ||
+               amdgpu_ih_ts_after(timestamp, adev-

gmc.processed_fault_timestamp))

First thing is whether you are confident the equality check is either
safe or required?

I don't see why it wouldn't be safe. But maybe it isn't required.
What do you suggest instead?

Safe as is whether it has potential to swallow a legitimate unseen faults.

Looking at amdgpu_gmc_filter_faults() a bit lower down, it does appear to filter out repeated faults on the same address. Would it be safe to rely on that instead of the timestamp equality check?

I appreciate that may cause a transient interrupt handling storm if the clock granularity is poor, but maybe that is better than losing a fault.

For example can two blocks fault with the same timestamp on different
addresses?

They might. But keep in mind that the GFX block just keeps spamming the
interrupts until the fault is handled. So, if we filter one out by mistake, we
know we will just receive the same fault again very soon.

Or from a different angle, is the clock granularity good enough to not
coalesce two separate faults to a single timestamp?

I am not sure about that.

I guess if the equality filter can be removed then this concern also goes away.

Regards,

Tvrtko


                return true;

+       adev->gmc.processed_fault_timestamp = MAX(timestamp,
adev->gmc.processed_fault_timestamp); +

Doesn't a plain assign work here? The if above has already verified new
timestamp is larger than the old.

Regards,

Tvrtko

        /* If we don't have space left in the ring buffer return

immediately */

        stamp = max(timestamp, AMDGPU_GMC_FAULT_TIMEOUT + 1) -
        
                AMDGPU_GMC_FAULT_TIMEOUT;

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h index
676e3aaa1f27..77eb15380284 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gmc.h
@@ -361,6 +361,7 @@ struct amdgpu_gmc {

        u64 noretry_flags;
        u64 init_pte_flags;

+       u64 processed_fault_timestamp;

        bool flush_tlb_needs_extra_type_0;
        bool flush_tlb_needs_extra_type_2;





Reply via email to