On 2/16/26 11:58 AM, Matt Coster wrote:

[...]

I added the regressions list onto CC, because this seems like a problem
worth tracking.

Noticed that and wondered what change caused the regression.

From our side at least, I don't believe this is a regression at all. We
haven't been able to reproduce this issue on any of the platforms we
have available (although we did stumble on a somewhat related bugfix
while trying).

There are already two proposed fixes available, and both Geert and I are available to test any other fixes too.

And yes, this is a regression: it causes a kernel crash which did not happen before.


I think this one:

330e76d31697 ("drm/imagination: Add power domain control")

This commit added support for multiple power domains on Imagination GPUs
in essentially the same manner as other drivers (as mentioned by Geert
below). Nothing in there is specific to the Renesas platforms where this
bug can be reproduced, and it is required to support other fully-functional
platforms such as the TI AM68.

That does not imply that the code is not buggy.

[...]

Well, it's a judgement call. 330e76d31697 was merged less than a year
ago, so I'd not be surprised at all if Linus reverted it in a case
like this. But it seems it no longer reverts cleanly, which
complicates things.

Reverting this change feels backwards: we're saying that Imagination's
support for multiple power domains is broken because one platform has
issues when we attempt to control its power domains. I fully agree that
we should be working towards resolving this issue, but I don't agree
that ripping out this handling (which, again, is essentially the same as
the handling in other GPU drivers) is a reasonable solution (even in the
short term).

As it currently stands, we're only declaring two GPUs as "fully
supported": AXE-1-16M and BXS-4-64 as made explicit in commit
1c21f240fbc1e ("drm/imagination: Warn or error on unsupported
hardware").

Actually, this change landed only very recently and completely broke the GX6250 on the R-Car M3W, which used to probe and work before. Now the driver does not even probe, failing with an "Unsupported GPU" message, which is clearly bogus.

I would say the aforementioned commit introduced another regression -- something that used to work before no longer works; functionality was lost because of the change.

In order to reach the stage where that check occurs, the GPU
must be powered on so registers containing the GPU ID can be read.
However, the original error report happens long after this and would now
require the exp_hw_support module parameter to be set for this point to
be reached (as the driver would not attempt to initialise an
"unsupported" device without this explicit opt-in).

I don't think it is a good solution to selectively block specific hardware just to avoid triggering the power domain issue. The power domain issue should be fixed, not hidden.

It's not currently clear to me whether the crash can be reproduced on
the affected Renesas platform without this opt-in; it's feasible that
this is the case, but we know it occurs during an error unwind after
device firmware fails to initialise, and the exp_hw_support check happens
much earlier than this.

Either way, we should consider adding additional "supported" checks
based on compatible strings earlier during initialisation to prevent
fundamental issues like this one from even being possible on unsupported
and/or unvalidated hardware.

Actually no, the hardware support should be as broad as possible, without imposing artificial limitations like that. This allows users to test the driver and find errors which, once fixed, ultimately improve the quality of the driver.

All that being said, ripping out power domain handling will completely
break one of the only two platforms we currently fully support: BXS-4-64
which also has two power domains and is present in several TI SoCs (and
their DTs).

I am not saying the code should be removed, I am saying the regression should be fixed. There are already two proposed fixes available, so it would be good to move them forward.

1c21f240fbc1e ("drm/imagination: Warn or error on unsupported hardware") should be reverted; it is not helping anything. A warning message would be fine, though.

But SoCs which have hierarchical
power domains and which manage to probe this driver without having
firmware available for the GPU will simply end up with a crashed kernel,
which is really not good.

Does the patch Matt mentioned fix the crash? His "this does not fix the
underlying issue [...]" (see quote earlier) makes it sound like the
crash or some other problem (theoretical or practical? regression or
not?) remains. If that's the case and no quick fix is in sight, I guess
it would be best if someone affected could post a revert and then we can
ask Linus whether he wants to pick it up.

The patch I posted (and applied) is somewhat orthogonal to the issue at
hand. Geert originally suggested using the _attach_list() helper as a
way to mitigate the issue, and we agree that it's a reasonable cleanup
effort to make use of it, but I believe the "fix" in the originally
proposed patch came from removing the device links that are required on
our platform to ensure the GPU power domains come up (and down) in the
correct sequence.
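
For readers following along, the pattern under discussion looks roughly
like this (a hedged sketch, not the actual drm/imagination code; the
function name and domain count are illustrative):

	static int gpu_power_domains_init(struct device *dev)
	{
		struct device *pd_dev;
		struct device_link *link;
		int i;

		for (i = 0; i < GPU_NUM_PM_DOMAINS; i++) {
			/* Attach each power domain listed in the DT node. */
			pd_dev = dev_pm_domain_attach_by_id(dev, i);
			if (IS_ERR(pd_dev))
				return PTR_ERR(pd_dev);

			/*
			 * The device links are what make runtime PM bring the
			 * domains up before the GPU and tear them down after it;
			 * dropping them breaks the power sequencing we rely on.
			 */
			link = device_link_add(dev, pd_dev,
					       DL_FLAG_STATELESS |
					       DL_FLAG_PM_RUNTIME);
			if (!link)
				return -EINVAL;
		}

		return 0;
	}
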


I don't think that patch would fix the crash.  The Adreno and Panfrost
GPU drivers do similar things (explicit multi-PM Domain handling),
so I am wondering if the issue can be triggered with them too (e.g. on
unbind).

My current understanding of the situation is that the fix proposed by
Marek in the Renesas driver[2] works, but is not suitable since
pm_runtime_barrier() should be inserted by the caller, not the power
driver. But it seems that's not always possible (particularly when using
devm), so I don't really understand where we go from here. I still don't
see anything we're doing substantially differently (before or after the
commit I mentioned above) from anybody else.
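
For context, the kind of fix being discussed amounts to flushing pending
runtime-PM work before the domains are detached, roughly (a sketch, not
Marek's actual patch; the function name is illustrative, and where the
barrier belongs -- caller or power driver -- is exactly the open
question):

	static void example_unbind(struct device *dev)
	{
		/* Flush any pending runtime-PM requests before detach. */
		pm_runtime_barrier(dev);
		dev_pm_domain_detach(dev, true);
	}
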

As far as I understand that, Geert was waiting for input from PM people?

At the beginning of this thread, almost a month ago, there was the following exchange, hence I was under the impression that a fix was coming:

"
>> Can you add the links back in for a V2 or I can properly send the
>> attached patch instead, I don't mind either way.
>
> Please move forward with your patch, you are the expert.
> I prefer not to be blamed for any breakage ;-)
"
