Re: [e-users] enlightenment + nvidia + resume from suspend => problem

[email protected] Sat, 08 Jan 2022 08:05:11 -0800

Carsten Haitzler <[email protected]> wrote (on Thu, Jan 6, 2022, 21:58):
>
> On Thu, 6 Jan 2022 20:50:43 +0100 "[email protected]" <[email protected]>
> said:
>
> > Carsten Haitzler <[email protected]> wrote (on Wed, Jan 5, 2022, 20:51):
> > >
> > > On Wed, 5 Jan 2022 17:21:46 +0100 "[email protected]"
> > > <[email protected]> said:
> > >
> > > > Carsten Haitzler <[email protected]> wrote (on Wed, Jan 5, 2022, 14:50):
> > > > >
> > > > > On Wed, 5 Jan 2022 13:57:39 +0100 "[email protected]"
> > > > > <[email protected]> said:
> > > > >
> > > > > > Carsten Haitzler <[email protected]> wrote
> > > > > > (on Wed, Jan 5, 2022, 11:54):
> > > > > > >
> > > > > > > On Wed, 5 Jan 2022 08:41:05 +0100 "[email protected]"
> > > > > > > <[email protected]> said:
> > > > > > >
> > > > > > > > Carsten Haitzler <[email protected]> wrote
> > > > > > > > (on Wed, Jan 5, 2022, 0:37):
> > > > > > > > >
> > > > > > > > > On Tue, 4 Jan 2022 22:31:26 +0100 "[email protected]"
> > > > > > > > > <[email protected]> said:
> > > > > > > > >
> > > > > > > > > > Carsten Haitzler <[email protected]> wrote
> > > > > > > > > > (on Tue, Jan 4, 2022, 15:21):
> > > > > > > > > > >
> > > > > > > > > > > On Tue, 4 Jan 2022 11:56:00 +0100 "[email protected]"
> > > > > > > > > > > <[email protected]> said:
> > > > > > > > > > >
> > > > > > > > > > > > Carsten Haitzler <[email protected]> wrote
> > > > > > > > > > > > (on Mon, Jan 3, 2022, 22:49):
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Mon, 3 Jan 2022 22:28:19 +0100 
> > > > > > > > > > > > > "[email protected]"
> > > > > > > > > > > > > <[email protected]> said:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Carsten Haitzler <[email protected]> wrote
> > > > > > > > > > > > > > (on Mon, Jan 3, 2022, 21:36):
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Mon, 3 Jan 2022 19:34:41 +0100
> > > > > > > > > > > > > > > "[email protected]" <[email protected]> said:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Carsten Haitzler <[email protected]> wrote
> > > > > > > > > > > > > > > > (on Mon, Jan 3, 2022, 19:13):
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > On Mon, 3 Jan 2022 17:07:43 +0100
> > > > > > > > > > > > > > > > > "[email protected]" <[email protected]>
> > > > > > > > > > > > > > > > > said:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Hi,
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I've a brand new amd laptop with an nvidia
> > > > > > > > > > > > > > > > > > mobile GPU. It arrived with TuxedoOS (ubuntu
> > > > > > > > > > > > > > > > > > 20.04 + budgie wm) preinstalled. That setup
> > > > > > > > > > > > > > > > > > works fine out of the box, but I want to
> > > > > > > > > > > > > > > > > > replace budgie with enlightenment, because
> > > > > > > > > > > > > > > > > > that's what I always use on linux.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > I've compiled E 0.25 from git (using
> > > > > > > > > > > > > > > > > > https://github.com/batden/esteem), and it
> > > > > > > > > > > > > > > > > > seemed to work fine. Unfortunately, when I
> > > > > > > > > > > > > > > > > > tested suspend+resume, I had a problem. The
> > > > > > > > > > > > > > > > > > desktop resumes, but only with minimal
> > > > > > > > > > > > > > > > > > brightness, and then it seems to freeze (no
> > > > > > > > > > > > > > > > > > keyboard/mouse). I can ssh into the laptop,
> > > > > > > > > > > > > > > > > > and killing enlightenment sends me back to
> > > > > > > > > > > > > > > > > > the lightdm login prompt.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > dmesg has this:
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > [11814.110778] PM: suspend exit
> > > > > > > > > > > > > > > > > > [11814.630838] NVRM: GPU at PCI:0000:01:00: GPU-589fde69-1161-f26b-1773-e5bcda70d601
> > > > > > > > > > > > > > > > > > [11814.630845] NVRM: Xid (PCI:0000:01:00): 13, pid=5525, Graphics Exception: Shader Program Header 11 Error
> > > > > > > > > > > > > > > > > > [11814.630855] NVRM: Xid (PCI:0000:01:00): 13, pid=5525, Graphics Exception: Shader Program Header 18 Error
> > > > > > > > > > > > > > > > > > [11814.630865] NVRM: Xid (PCI:0000:01:00): 13, pid=5525, Graphics Exception: ESR 0x405840=0xa2040800
> > > > > > > > > > > > > > > > > > [11814.630877] NVRM: Xid (PCI:0000:01:00): 13, pid=5525, Graphics Exception: ESR 0x405848=0x80000000
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > The problem happens with both the sw and the
> > > > > > > > > > > > > > > > > > opengl compositors.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > When I suspend from the lightdm prompt or
> > > > > > > > > > > > > > > > > > from the budgie desktop, resuming works 
> > > > > > > > > > > > > > > > > > fine.
> > > > > > > > > > > > > > > > > > So it seems something is happening/not
> > > > > > > > > > > > > > > > > > happening with the nvidia card when the
> > > > > > > > > > > > > > > > > > suspend is started from E.
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > Does anyone have any idea how to debug this?
> > > > > > > > > > > > > > > > > i suspect it may have to do with vblank
> > > > > > > > > > > > > > > > > interrupts. the nvidia driver doesn't produce
> > > > > > > > > > > > > > > > > them anymore? a quick way to test this:
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > touch ~/.ecore-no-vsync
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > restart e then do your suspend/resume
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Thanks for your reply. Unfortunately the problem
> > > > > > > > > > > > > > > > seems to be somewhere else, as resuming still
> > > > > > > > > > > > > > > > fails the same way. Anything else to try? Could
> > > > > > > > > > > > > > > > rebuilding E in debugging mode help?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > probably not - btw - those shader exceptions might
> > > > > > > > > > > > > > > have to do with it. evas caches binaries for
> > > > > > > > > > > > > > > shaders. rm -rf ~/.cache/evas_gl_common_caches/ -
> > > > > > > > > > > > > > > but beyond that the only thing left is your 
> > > > > > > > > > > > > > > driver.
> > > > > > > > > > > > > > > those are its shaders it compiled.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > google for it: "Graphics Exception: Shader Program
> > > > > > > > > > > > > > > Header 11 Error"
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > seems to actually be OS independent and happen on
> > > > > > > > > > > > > > > windows too.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://forums.developer.nvidia.com/t/screen-system-is-dead-on-resume-unable-to-resume-with-all-current-drivers/29872/57?page=3
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > this has been there for a long time... and it 
> > > > > > > > > > > > > > > seems
> > > > > > > > > > > > > > > it doesn't get resolved.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > https://github.com/Bumblebee-Project/Bumblebee/issues/739
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Yeah, I've tried googling for this too, but found no
> > > > > > > > > > > > > > solutions either.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > it could be that evas uses egl+gles and the nvidia
> > > > > > > > > > > > > > > driver implementation for egl+gles is buggy - you
> > > > > > > > > > > > > > > can rebuild efl to use full desktop opengl+glx
> > > > > > > > > > > > > > > (-Dopengl=full).
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I've deleted the evas cache, and set the compositor
> > > > > > > > > > > > > > to SW to make sure that it's not an evas egl 
> > > > > > > > > > > > > > problem.
> > > > > > > > > > > > > > The exceptions are still there. Actually there are 3
> > > > > > > > > > > > > > exceptions for the kernel thread "[irq/92-nvidia]",
> > > > > > > > > > > > > > and 1 for Xorg. When the compositor was set to opengl
> > > > > > > > > > > > > > there were more exceptions, and one of them was for
> > > > > > > > > > > > > > the enlightenment process.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So my guess is, that this may not be a problem in E,
> > > > > > > > > > > > > > but maybe a missing/extra step during 
> > > > > > > > > > > > > > suspend/resume.
> > > > > > > > > > > > > > I'll look into this tomorrow.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Thanks for your help, Laszlo
> > > > > > > > > > > > >
> > > > > > > > > > > > > hmm i wonder why the nvidia driver is complaining -
> > > > > > > > > > > > > something is using a shader program of some sort and
> > > > > > > > > > > > > it's not happy at all. there is something deeper going
> > > > > > > > > > > > > on here. but yes - with e using opengl for compositing
> > > > > > > > > > > > > it'll be driving the gpu (via opengl) and thus more
> > > > > > > > > > > > > chance of something going wrong.
> > > > > > > > > > > >
> > > > > > > > > > > > I've found another strange thing. In my original
> > > > > > > > > > > > configuration I used amdgpu+nvidia X drivers. Now I
> > > > > > > > > > > > switched to modesetting+nvidia. Resuming fails again, 
> > > > > > > > > > > > but
> > > > > > > > > > > > there is a different new problem. After starting E from
> > > > > > > > > > > > lightdm as usual, I press ctrl+alt+end to restart E, it
> > > > > > > > > > > > fades to black as usual, then it switches to something
> > > > > > > > > > > > that looks like a console (empty black screen with a
> > > > > > > > > > > > cursor line) and stays there. I can not restore the
> > > > > > > > > > > > desktop until I kill E.  No exceptions from nvidia in 
> > > > > > > > > > > > the
> > > > > > > > > > > > dmesg this time. Any idea for this?
> > > > > > > > > > >
> > > > > > > > > > > so this is an optimus setup of some sort but now with amd 
> > > > > > > > > > > +
> > > > > > > > > > > nvidia... i might imagine something goes wrong setting up
> > > > > > > > > > > randr maybe? simotek found his optimus setup required a
> > > > > > > > > > > forced refresh of randr info ... and e has that in it
> > > > > > > > > > > (otherwise edid info would not be populated right). check
> > > > > > > > > > > ~/.e-log.log - it will tell you what e is doing randr-wise
> > > > > > > > > > > and what it sees, but you should end up with some kind of
> > > > > > > > > > > screen. perhaps go back away from modesetting to amdgpu +
> > > > > > > > > > > nvidia?
> > > > > > > > > >
> > > > > > > > > > I've switched off the optimus stuff, and checked what 
> > > > > > > > > > happens
> > > > > > > > > > with the nvidia only setup. Unfortunately it failed with the
> > > > > > > > > > usual GPU error.
> > > > > > > > > >
> > > > > > > > > > Then I switched back to amdgpu+nvidia again, and saved the 
> > > > > > > > > > log
> > > > > > > > > > file. Maybe you can see something in it:
> > > > > > > > > >
> > > > > > > > > > https://drive.google.com/file/d/1r69Bw43uMS8xWM2wemqxUvIAr0xH76pp/view?usp=sharing
> > > > > > > > >
> > > > > > > > > resume has nothing odd to do with randr.. but this smells a 
> > > > > > > > > bit
> > > > > > > > > weird:
> > > > > > > > >
> > > > > > > > > ERROR: ecore_animator thread - epoll_wait(..., 200) at
> > > > > > > > > 3870,51700 should have slept ~ 0,01667s but took 1,65593s!
> > > > > > > > >
> > > > > > > > > that smells very wrong - the animator thread asked to sleep 
> > > > > > > > > for
> > > > > > > > > 16.67ms but slept 1650ms instead ... and this is measuring
> > > > > > > > > monotonic time - not wall clock. monotonic stops ticking when
> > > > > > > > > suspended. this thread is dedicated to ticking for animation 
> > > > > > > > > so
> > > > > > > > > will not be blocked by the mainloop... this is kernel not
> > > > > > > > > sleeping for anywhere near the time it should.
> > > > > > > > >
> > > > > > > > > so with amdgpu+nvidia it works? i'm not sure from your mail.
> > > > > > > >
> > > > > > > > None of amdgpu+nvidia, modesetting+nvidia, and nvidia alone work
> > > > > > > > - GPU shader error when resuming. Desktop is at minimal
> > > > > > > > brightness, no inputs accepted.
> > > > > > >
> > > > > > > Well it could be E is hung - you will only know if you send a SEGV
> > > > > > > signal (kill -SEGV `pidof enlightenment`) then collect a backtrace
> > > > > > > with gdb and see where it's at.
> > > > > >
> > > > > > Actually it seems that it is not E that is hung, but rather the X
> > > > > > server. When I kill E, it gets restarted (new PID) but the desktop
> > > > > > remains frozen. I have to kill enlightenment_start to get back to
> > > > > > the lightdm login prompt.
> > > > >
> > > > > wow.. well then... maybe e hit on an xorg/nvidia driver bug? some
> > > > > people have reported bad things with sddm - somehow it has caused e
> > > > > to launch in wayland .. or xwayland (i dont know how it could do the
> > > > > latter so i assume it launched in wl mode).
> > > > >
> > > > > > > > With modesetting+nvidia there is a new problem: restarting E
> > > > > > > > with ctrl+alt+end does not work (switches to console mode).
> > > > > > > > Suspend/resume is not involved in this, and there is no GPU
> > > > > > > > error.
> > > > > > >
> > > > > > > I can't help a lot with nvidia - I gave up on them years ago
> > > > > > > because they didn't want to play ball with Wayland like everyone
> > > > > > > else and frankly having their kernel driver keep breaking on
> > > > > > > kernel upgrades (kernel changes api/abi - nvidia driver can't
> > > > > > > build anymore and i'm forced to manually downgrade my kernel). I
> > > > > > > can say that all of my machines run arch linux (except some of my
> > > > > > > arm devices - they are special and mostly used as testbeds and
> > > > > > > not stable systems) and they all use either amd or intel graphics
> > > > > > > and suspend/resume works.
> > > > > >
> > > > > > Well, I originally wanted to buy an amd CPU+amd GPU laptop, but none
> > > > > > of those I found ticked all the boxes. Now I have amd CPU+nvidia GPU
> > > > > > and an ugly shader error... :-/
> > > > >
> > > > > well this is personal - but i'd just veto any choices that involve an
> > > > > nvidia gpu. if nvidia drivers were all oss like amd - i wouldn't have
> > > > > as much of an issue. i know it doesn't help you now, but maybe in
> > > > > future choices.
> > > >
> > > > I agree with you, full amd would have been better. But unfortunately I
> > > > was in a hurry, because on my old laptop the power selector chip died,
> > > > and now the laptop can not be used from battery any more. So it became
> > > > a desktop, and I needed mobility.
> > >
> > > Something to keep in mind for the future. :) At least nvidia now are
> > > beginning to play nice Wayland-wise by supporting gbm, but I made my
> > > decision already years ago and have been happy ever since. :)
> > >
> > > > > > After some googling, I found that it's possible to disable the
> > > > > > nvidia GPU in nvidia-settings, and use amdgpu exclusively. I've
> > > > > > tried this, and E+resume works as it should! Unfortunately I have
> > > > > > no external monitor outputs in this mode, because only nvidia is
> > > > > > wired to the hdmi/DP ports. Oh well.
> > > > >
> > > > > well wow.. so something to do with nvidia maybe optimus ... but...
> > > > > hmmm. but at least see if you can get a backtrace from e to see where
> > > > > it is stuck - if it is. that will tell me some information at least.
> > > >
> > > > I changed back nvidia-settings to use nvidia optimus mode (to generate
> > > > a backtrace for you), but guess what, resuming works now!!! There is
> > > > no shader error in dmesg. After looking around more closely, it seems
> > > > I've changed the "Prime profiles" from "Intel (Power Saving Mode)" -
> > > > [this was actually the amdgpu only mode, where E worked] to "NVIDIA On
> > > > Demand" mode. There is a third option here which is "NVIDIA
> > > > (Performance mode)" - this is the mode I was using before. So in
> > > > Performance Mode, nvidia-smi shows that E has some parts which run on
> > > > the nvidia GPU, but in "On Demand" mode E is run on amdgpu. And
> > > > resuming works this way.
> > > >
> > > > The only problem I see is that after resuming the external monitor
> > > > stays black, but xrandr thinks it is connected. I'm looking at this
> > > > now.
> > >
> > > Well - you're getting somewhere. You seem to have stumbled on some
> > > nvidia/optimus related driver bug. it seems maybe e just happens to
> > > trigger it by luck (or un-luck). This does happen - things only get
> > > tested with specific workloads. When a new workload appears, then it
> > > sometimes triggers different code paths that SHOULD work but have a bug
> > > and now the bug is exposed. It requires people to then test, reproduce
> > > and then fix it. It may be deep in the nvidia blob. Maybe in the glue
> > > binding it to the amdgpu driver with optimus. I don't know. I haven't
> > > seen this issue, but I have known to keep away from anything optimus
> > > related as while there is, in theory, some cool stuff here tech-wise,
> > > it's problematic and has a history of problems.
> >
> > Just an update on the resume+external monitor stays black issue. It
> > seems I can make the monitor work correctly by unplugging then
> > re-plugging the hdmi cable. Unfortunately using only xrandr to try to
> > fix the problem without touching the cable causes the usual nvidia
> > shader exception, which randomly triggers a sigsegv in the X server.
> > It's unpredictable as hell.
>
> oh... this seems like you definitely have deeper problems. e is just seemingly
> good at finding/exposing them.
>
> > On the other hand the budgie wm (which seems to be based on mutter)
> > has no problems with correctly resuming the external monitor. Looking
> > at the source code of mutter I see some nvidia specific quirks, like
> > NV_robustness_video_memory_purge. I'm going to try to hack this out of
> > mutter and see whether it would fail.
>
> indeed efl (evas) has nothing like this, but if xorg is crashing.. then you
> have deeper issues that e can't really solve like with the above. :)


Finally, I was able to get rid of the shader errors. By adding the
"NVreg_EnableS0ixPowerManagement=1" parameter to nvidia.ko, E can
finally suspend+resume without fatal problems. The external monitor
is still not detected after resume, but reapplying the screen setup
fixes that without segfaulting the X server. That's good enough for
me.
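
For anyone finding this thread later: a module parameter like the one
above is usually made persistent with a modprobe.d entry. What follows
is only a minimal sketch, not necessarily the exact steps used here.
The file name nvidia-s0ix.conf is an assumption (any *.conf file in
/etc/modprobe.d/ should do), and the initramfs step assumes an
Ubuntu-style system.

```shell
# Hypothetical file name; modprobe reads every *.conf file in /etc/modprobe.d/.
conf_file="nvidia-s0ix.conf"
# Option line that sets the parameter for the nvidia kernel module.
option_line="options nvidia NVreg_EnableS0ixPowerManagement=1"

# Write the file into the current directory first, so it can be reviewed.
printf '%s\n' "$option_line" > "$conf_file"
cat "$conf_file"

# Then, as root (deliberately not run by this sketch):
#   install -m 644 nvidia-s0ix.conf /etc/modprobe.d/
#   update-initramfs -u   # assumption: Ubuntu, nvidia.ko loaded from initramfs
# After a reboot the active value should show up in:
#   grep EnableS0ixPowerManagement /proc/driver/nvidia/params
```

The file is generated locally rather than written straight into /etc,
so nothing on the system changes until it is copied by hand.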

Thanks for your help+ideas,
Laszlo


_______________________________________________
enlightenment-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/enlightenment-users
