Re: [e-users] enlightenment + nvidia + resume from suspend => problem

Carsten Haitzler Wed, 05 Jan 2022 11:53:06 -0800

On Wed, 5 Jan 2022 17:21:46 +0100 "[email protected]" <[email protected]>
said:


> Carsten Haitzler <[email protected]> wrote (on Wed, 5 Jan 2022,
> 14:50):
> >
> > On Wed, 5 Jan 2022 13:57:39 +0100 "[email protected]"
> > <[email protected]> said:
> >
> > > Carsten Haitzler <[email protected]> wrote (on Wed, 5 Jan 2022,
> > > 11:54):
> > > >
> > > > On Wed, 5 Jan 2022 08:41:05 +0100 "[email protected]"
> > > > <[email protected]> said:
> > > >
> > > > > Carsten Haitzler <[email protected]> wrote (on Wed, 5 Jan 2022,
> > > > > 0:37):
> > > > > >
> > > > > > On Tue, 4 Jan 2022 22:31:26 +0100 "[email protected]"
> > > > > > <[email protected]> said:
> > > > > >
> > > > > > > Carsten Haitzler <[email protected]> wrote (on Tue, 4 Jan 2022,
> > > > > > > 15:21):
> > > > > > > >
> > > > > > > > On Tue, 4 Jan 2022 11:56:00 +0100 "[email protected]"
> > > > > > > > <[email protected]> said:
> > > > > > > >
> > > > > > > > > Carsten Haitzler <[email protected]> wrote (on Mon,
> > > > > > > > > 3 Jan 2022, 22:49):
> > > > > > > > > >
> > > > > > > > > > On Mon, 3 Jan 2022 22:28:19 +0100 "[email protected]"
> > > > > > > > > > <[email protected]> said:
> > > > > > > > > >
> > > > > > > > > > > Carsten Haitzler <[email protected]> wrote (on Mon,
> > > > > > > > > > > 3 Jan 2022, 21:36):
> > > > > > > > > > > >
> > > > > > > > > > > > On Mon, 3 Jan 2022 19:34:41 +0100 "[email protected]"
> > > > > > > > > > > > <[email protected]> said:
> > > > > > > > > > > >
> > > > > > > > > > > > > Carsten Haitzler <[email protected]> wrote
> > > > > > > > > > > > > (on Mon, 3 Jan 2022, 19:13):
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Mon, 3 Jan 2022 17:07:43 +0100
> > > > > > > > > > > > > > "[email protected]" <[email protected]> said:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I've a brand new amd laptop with an nvidia mobile
> > > > > > > > > > > > > > > GPU. It arrived with TuxedoOS (ubuntu 20.04 +
> > > > > > > > > > > > > > > budgie wm) preinstalled. That setup works fine
> > > > > > > > > > > > > > > out of the box, but I want to replace budgie with
> > > > > > > > > > > > > > > enlightenment, because that's what I always use
> > > > > > > > > > > > > > > on linux.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I've compiled E 0.25 from git (using
> > > > > > > > > > > > > > > https://github.com/batden/esteem), and it seemed
> > > > > > > > > > > > > > > to work fine. Unfortunately, when I tested
> > > > > > > > > > > > > > > suspend+resume, I had a problem. The desktop
> > > > > > > > > > > > > > > resumes, but only with minimal brightness, and
> > > > > > > > > > > > > > > then it seems to freeze (no keyboard/mouse). I
> > > > > > > > > > > > > > > can ssh into the laptop, and killing
> > > > > > > > > > > > > > > enlightenment sends me back to the lightdm login
> > > > > > > > > > > > > > > prompt.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > dmesg has this:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > [11814.110778] PM: suspend exit
> > > > > > > > > > > > > > > [11814.630838] NVRM: GPU at PCI:0000:01:00:
> > > > > > > > > > > > > > > GPU-589fde69-1161-f26b-1773-e5bcda70d601
> > > > > > > > > > > > > > > [11814.630845] NVRM: Xid (PCI:0000:01:00): 13,
> > > > > > > > > > > > > > > pid=5525, Graphics Exception: Shader Program
> > > > > > > > > > > > > > > Header 11 Error
> > > > > > > > > > > > > > > [11814.630855] NVRM: Xid (PCI:0000:01:00): 13,
> > > > > > > > > > > > > > > pid=5525, Graphics Exception: Shader Program
> > > > > > > > > > > > > > > Header 18 Error
> > > > > > > > > > > > > > > [11814.630865] NVRM: Xid (PCI:0000:01:00): 13,
> > > > > > > > > > > > > > > pid=5525, Graphics Exception: ESR
> > > > > > > > > > > > > > > 0x405840=0xa2040800
> > > > > > > > > > > > > > > [11814.630877] NVRM: Xid (PCI:0000:01:00): 13,
> > > > > > > > > > > > > > > pid=5525, Graphics Exception: ESR
> > > > > > > > > > > > > > > 0x405848=0x80000000
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The problem happens with both the sw and the
> > > > > > > > > > > > > > > opengl compositors.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > When I suspend from the lightdm prompt or from the
> > > > > > > > > > > > > > > budgie desktop, resuming works fine. So it seems
> > > > > > > > > > > > > > > something is happening/not happening with the
> > > > > > > > > > > > > > > nvidia card when the suspend is started from E.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Does anyone have any idea how to debug this?
> > > > > > > > > > > > > > i suspect it may have to do with vblank interrupts.
> > > > > > > > > > > > > > the nvidia driver doesn't produce them anymore? a
> > > > > > > > > > > > > > quick way to test this:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > touch ~/.ecore-no-vsync
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > restart e then do your suspend/resume
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks for your reply. Unfortunately the problem
> > > > > > > > > > > > > seems to be somewhere else, as resuming still fails
> > > > > > > > > > > > > the same way. Anything else to try? Could rebuilding
> > > > > > > > > > > > > E in debugging mode help?
> > > > > > > > > > > >
> > > > > > > > > > > > probably not - btw - those shader exceptions might have
> > > > > > > > > > > > to do with it. evas caches binaries for shaders. rm -rf
> > > > > > > > > > > > ~/.cache/evas_gl_common_caches/ - but beyond that the
> > > > > > > > > > > > only thing left is your driver. those are its shaders it
> > > > > > > > > > > > compiled.
> > > > > > > > > > > >
> > > > > > > > > > > > google for it: "Graphics Exception: Shader Program
> > > > > > > > > > > > Header 11 Error"
> > > > > > > > > > > >
> > > > > > > > > > > > seems to actually be OS independent and happen on
> > > > > > > > > > > > windows too.
> > > > > > > > > > > >
> > > > > > > > > > > > https://forums.developer.nvidia.com/t/screen-system-is-dead-on-resume-unable-to-resume-with-all-current-drivers/29872/57?page=3
> > > > > > > > > > > >
> > > > > > > > > > > > this has been there for a long time... and it seems it
> > > > > > > > > > > > doesn't get resolved.
> > > > > > > > > > > >
> > > > > > > > > > > > https://github.com/Bumblebee-Project/Bumblebee/issues/739
> > > > > > > > > > >
> > > > > > > > > > > Yeah, I've tried googling for this too, but found no
> > > > > > > > > > > solutions either.
> > > > > > > > > > >
> > > > > > > > > > > > it could be that evas uses egl+gles and the nvidia
> > > > > > > > > > > > driver implementation for egl+gles is buggy - you can
> > > > > > > > > > > > rebuild efl to use full desktop opengl+glx
> > > > > > > > > > > > (-Dopengl=full).
> > > > > > > > > > >
> > > > > > > > > > > I've deleted the evas cache, and set the compositor to SW
> > > > > > > > > > > to make sure that it's not an evas egl problem. The
> > > > > > > > > > > exceptions are still there. Actually there are 3
> > > > > > > > > > > exceptions for the kernel thread "[irq/92-nvidia]", and 1
> > > > > > > > > > > for Xorg. When the compositor was set to opengl there
> > > > > > > > > > > were more exceptions, and one of them was for the
> > > > > > > > > > > enlightenment process.
> > > > > > > > > > >
> > > > > > > > > > > So my guess is, that this may not be a problem in E, but
> > > > > > > > > > > maybe a missing/extra step during suspend/resume. I'll
> > > > > > > > > > > look into this tomorrow.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your help, Laszlo
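One way to tally which processes the Xid reports name is to group the NVRM lines from dmesg by pid. The following is a minimal Python sketch, assuming the log format quoted earlier in this thread. The function name `count_xid_by_pid` and the sample text are illustrative only and are not part of any nvidia tool.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
# [11814.630845] NVRM: Xid (PCI:0000:01:00): 13, pid=5525, Graphics Exception: ...
# Assumption: the pid is always numeric, as in the log quoted in this thread.
XID_RE = re.compile(r"NVRM: Xid \(PCI:\s*[0-9a-fA-F:.]+\): (\d+), pid=(\d+)")


def count_xid_by_pid(dmesg_text):
    """Return a Counter mapping (xid_code, pid) to the number of matching lines."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if m:
            counts[(int(m.group(1)), int(m.group(2)))] += 1
    return counts


# Two of the lines from the original report, one log entry per line.
SAMPLE = """\
[11814.110778] PM: suspend exit
[11814.630845] NVRM: Xid (PCI:0000:01:00): 13, pid=5525, Graphics Exception: Shader Program Header 11 Error
[11814.630855] NVRM: Xid (PCI:0000:01:00): 13, pid=5525, Graphics Exception: Shader Program Header 18 Error
"""

if __name__ == "__main__":
    for (xid, pid), n in sorted(count_xid_by_pid(SAMPLE).items()):
        print(f"Xid {xid} pid {pid}: {n} line(s)")
```

On a live system the input would be saved dmesg output, and each pid could then be looked up with ps to see whether it belongs to Xorg, enlightenment, or a kernel thread.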
> > > > > > > > > >
> > > > > > > > > > hmm i wonder why the nvidia driver is complaining -
> > > > > > > > > > something is using a shader program of some sort and it's
> > > > > > > > > > not happy at all. there is something deeper going on here.
> > > > > > > > > > but yes - with e using opengl for compositing it'll be
> > > > > > > > > > driving the gpu (via opengl) and thus more chance of
> > > > > > > > > > something going wrong.
> > > > > > > > >
> > > > > > > > > I've found another strange thing. In my original
> > > > > > > > > configuration I used amdgpu+nvidia X drivers. Now I switched
> > > > > > > > > to modesetting+nvidia. Resuming fails again, but there is a
> > > > > > > > > different new problem. After starting E from lightdm as
> > > > > > > > > usual, I press ctrl+alt+end to restart E, it fades to black
> > > > > > > > > as usual, then it switches to something that looks like a
> > > > > > > > > console (empty black screen with a cursor line) and stays
> > > > > > > > > there. I cannot restore the desktop until I kill E. No
> > > > > > > > > exceptions from nvidia in the dmesg this time. Any idea for
> > > > > > > > > this?
> > > > > > > >
> > > > > > > > so this is an optimus setup of some sort but now with amd +
> > > > > > > > nvidia... i might imagine something goes wrong setting up randr
> > > > > > > > maybe? simotek found his optimus setup required a forced
> > > > > > > > refresh of randr info ... and e has that in it (otherwise edid
> > > > > > > > info would not be populated right). check ~/.e-log.log - it
> > > > > > > > will tell you what e is doing randr-wise and what it sees, but
> > > > > > > > you should end up with some kind of screen. perhaps go back
> > > > > > > > away from modesetting to amdgpu + nvidia?
> > > > > > >
> > > > > > > I've switched off the optimus stuff, and checked what happens
> > > > > > > with the nvidia only setup. Unfortunately it failed with the
> > > > > > > usual GPU error.
> > > > > > >
> > > > > > > Then I switched back to amdgpu+nvidia again, and saved the log
> > > > > > > file. Maybe you can see something in it:
> > > > > > >
> > > > > > > https://drive.google.com/file/d/1r69Bw43uMS8xWM2wemqxUvIAr0xH76pp/view?usp=sharing
> > > > > >
> > > > > > resume has nothing odd to do with randr.. but this smells a bit
> > > > > > weird:
> > > > > >
> > > > > > ERROR: ecore_animator thread - epoll_wait(..., 200) at 3870,51700
> > > > > > should have slept ~ 0,01667s but took 1,65593s!
> > > > > >
> > > > > > that smells very wrong - the animator thread asked to sleep for
> > > > > > 16.67ms but slept 1650ms instead ... and this is measuring
> > > > > > monotonic time - not wall clock. monotonic stops ticking when
> > > > > > suspended. this thread is dedicated to ticking for animation so
> > > > > > will not be blocked by the mainloop... this is kernel not sleeping
> > > > > > for anywhere near the time it should.
> > > > > >
> > > > > > so with amdgpu+nvidia it works? i'm not sure from your mail.
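The check behind that log line can be approximated outside of EFL: ask for a sleep of one frame, measure the elapsed time on the monotonic clock, and compare. This is a rough Python sketch and not how ecore_animator is implemented. It uses `time.sleep` instead of `epoll_wait`, and the helper names and the factor of 10 are arbitrary assumptions.

```python
import time


def measure_sleep(requested_s):
    """Sleep for requested_s seconds and return the elapsed monotonic time."""
    start = time.monotonic()
    time.sleep(requested_s)
    return time.monotonic() - start


def overslept(requested_s, elapsed_s, factor=10.0):
    """True when the sleep took more than `factor` times the requested duration.

    The default factor is an arbitrary threshold chosen for illustration.
    """
    return elapsed_s > requested_s * factor


if __name__ == "__main__":
    frame = 1.0 / 60.0  # about 16.67 ms, one frame at 60 Hz
    elapsed = measure_sleep(frame)
    print(f"asked for {frame:.5f}s, slept {elapsed:.5f}s")
    if overslept(frame, elapsed):
        print("sleep overshot badly")
```

With the numbers from the log, `overslept(0.01667, 1.65593)` is true, which is the kind of overshoot that should not occur on a healthy system.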
> > > > >
> > > > > None of amdgpu+nvidia, modesetting+nvidia, and nvidia alone work - GPU
> > > > > shader error when resuming. Desktop is at minimal brightness, no
> > > > > inputs accepted.
> > > >
> > > > Well it could be E is hung - you will only know if you send a SEGV
> > > > signal (kill -SEGV `pidof enlightenment`) then collect a backtrace with
> > > > gdb and see where it's at.
> > >
> > > Actually it seems that it is not E that is hung, but rather the X
> > > server. When I kill E, it gets restarted (new PID) but the desktop
> > > remains frozen. I have to kill enlightenment_start to get back to the
> > > lightdm login prompt.
> >
> > wow.. well then... maybe e hit on an xorg/nvidia driver bug? some people
> > have reported bad things with sddm - somehow it has caused e to launch in
> > wayland .. or xwayland (i dont know how it could do the latter so i assume
> > it launched in wl mode).
> >
> > > > > With modesetting+nvidia there is a new problem: restarting E with
> > > > > ctrl+alt+end does not work (switches to console mode). Suspend/resume
> > > > > is not involved in this, and there is no GPU error.
> > > >
> > > > I can't help a lot with nvidia - I gave up on them years ago because
> > > > they didn't want to play ball with Wayland like everyone else and
> > > > frankly having their kernel driver keep breaking on kernel upgrades
> > > > (kernel changes api/abi - nvidia driver can't build anymore and i'm
> > > > forced to manually downgrade my kernel). I can say that all of my
> > > > machines run arch linux (except some of my arm devices - they are
> > > > special and mostly used as testbeds and not stable systems) and they
> > > > all use either amd or intel graphics and suspend/resume works.
> > >
> > > Well, I originally wanted to buy an amd CPU+amd GPU laptop, but none
> > > of the ones I found ticked all the boxes. Now I have amd CPU+nvidia GPU
> > > and an ugly shader error... :-/
> >
> > well this is personal - but i'd just veto any choices that involve an nvidia
> > gpu. if nvidia drivers were all oss like amd - i wouldn't have as much of an
> > issue. i know it doesn't help you now, but maybe in future choices.
> 
> I agree with you, full amd would have been better. But unfortunately I
> was in a hurry, because on my old laptop the power selector chip died,
> and now the laptop can not be used from battery any more. So it became
> a desktop, and I needed mobility.

Something to keep in mind for the future. :) At least nvidia now are beginning
to play nice Wayland-wise by supporting gbm, but I made my decision already
years ago and have been happy ever since. :)

> > > After some googling, I found that it's possible to disable the nvidia
> > > GPU in nvidia-settings, and use amdgpu exclusively. I've tried this,
> > > and E+resume works as it should! Unfortunately I have no external
> > > monitor outputs in this mode, because only nvidia is wired to the
> > > hdmi/DP ports. Oh well.
> >
> > well wow.. so something to do with nvidia maybe optimus ... but... hmmm.
> > but at least see if you can get a backtrace from e to see where it is stuck
> > - if it is. that will tell me some information at least.
> 
> I changed back nvidia-settings to use nvidia optimus mode (to generate
> a backtrace for you), but guess what, resuming works now!!! There is
> no shader error in dmesg. After looking around more closely, it seems
> I've changed the "Prime profiles" from "Intel (Power Saving Mode)" -
> [this was actually the amdgpu only mode, where E worked] to "NVIDIA On
> Demand" mode. There is a third option here which is "NVIDIA
> (Performance mode)" - this is the mode I was using before. So in
> Performance Mode, nvidia-smi shows that E has some parts which run on
> the nvidia GPU, but in "On Demand" mode E is run on amdgpu. And
> resuming works this way.
> 
> The only problem I see is that after resuming the external monitor
> stays black, but xrandr thinks it is connected. I'm looking at this
> now.
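One thing worth checking is whether xrandr lists the external output as connected but without an active mode. In `xrandr --query` output, an active output carries a geometry such as 1920x1080+0+0 on its line and an inactive one does not. Below is a minimal Python sketch of that check. The output names in the sample are made up, and the helper name `connected_but_inactive` is hypothetical.

```python
import re

# An active output line looks like:
#   HDMI-0 connected 1920x1080+1920+0 (normal left inverted ...) 527mm x 296mm
# When an output is connected but has no mode set, the geometry is absent:
#   HDMI-0 connected (normal left inverted right x axis y axis)
OUTPUT_RE = re.compile(r"^(\S+) connected\b(.*)$")
GEOMETRY_RE = re.compile(r"\d+x\d+\+\d+\+\d+")


def connected_but_inactive(xrandr_text):
    """Return names of outputs reported as connected but with no active mode."""
    inactive = []
    for line in xrandr_text.splitlines():
        m = OUTPUT_RE.match(line)
        if not m:
            continue
        # Only look at the part before the "(normal left ...)" list.
        before_paren = m.group(2).split("(", 1)[0]
        if not GEOMETRY_RE.search(before_paren):
            inactive.append(m.group(1))
    return inactive


# Hypothetical sample; real output names depend on the driver setup.
SAMPLE = """\
Screen 0: minimum 8 x 8, current 1920 x 1080, maximum 32767 x 32767
eDP-1 connected primary 1920x1080+0+0 (normal left inverted right x axis y axis) 344mm x 194mm
   1920x1080     60.00*+
HDMI-1-0 connected (normal left inverted right x axis y axis)
   1920x1080     60.00 +
DP-1-0 disconnected (normal left inverted right x axis y axis)
"""

if __name__ == "__main__":
    print(connected_but_inactive(SAMPLE))
```

If the external output does show a geometry after resume and the panel is still black, then the mode is set as far as X is concerned and the problem is more likely further down in the driver.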

Well - you're getting somewhere. You seem to have stumbled on some
nvidia/optimus related driver bug. it seems maybe e just happens to trigger it
by luck (or un-luck). This does happen - things only get tested with specific
workloads. When a new workload appears, then it sometimes triggers different
code paths that SHOULD work but have a bug and now the bug is exposed. It
requires people to then test, reproduce and then fix it. It may be deep in the
nvidia blob. Maybe in the glue binding it to the amdgpu driver with optimus. I
don't know. I haven't seen this issue, but I have known to keep away from
anything optimus related as while there is, in theory, some cool stuff here
tech-wise, it's problematic and has a history of problems.

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
Carsten Haitzler - [email protected]



_______________________________________________
enlightenment-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/enlightenment-users
