On Thu, May 20, 2021 at 10:10:14AM +0200, Matthieu Herrb wrote:
> On Thu, May 20, 2021 at 09:53:02AM +0200, Peter N. M. Hansteen wrote:
> > On Thu, May 20, 2021 at 08:53:14AM +1000, Jonathan Gray wrote:
> > > On Wed, May 19, 2021 at 06:32:01PM +0200, Peter N. M. Hansteen wrote:
> > > > On Wed, May 19, 2021 at 04:43:44PM +0200, Peter N. M. Hansteen wrote:
> > > > > > outdated...)
> > > > > 
> > > > > I tried the first, that only seemed to have the effect of having
> > > > > the freeze come faster. So I commented out that part of the xorg.conf
> > > > > and I'm trying the steps in the README now, but for some reasons I
> > > > > don't get any dumps in /var/crash as expected. Then again I could well
> > > > > be missing some crucial step.
> > > > 
> > > > Still no luck getting coredumps, but when I sshed in after the last
> > > > freeze the last two lines of dmesg were
> > > > 
> > > > [drm] *ERROR* ring sdma0 timeout, signaled seq=110053, emitted seq=110053
> > > > [drm] *ERROR* Process information: process  pid 0 thread  pid 0
> > > > 
> > > > the [drm] part has me suspect this is related (but I don't know what
> > > > sdma signifies in this context)
> > > 
> > > sdma is the asynchronous System DMA engine
> > > 
> > > Ring timeouts like this are a known problem with amdgpu which persist
> > > across multiple major drm versions.
> > 
> > Looking at what appears in the log (/var/log/messages) the time when X
> > freezes corresponds very well with when those messages are recorded.
> > 
> > The question is, how do I usefully debug this? I've gone over the README's
> > procedure a few times now and it unfortunately does not produce any
> > coredumps or traces.
> 
> When the X server is locked up I can still ssh into the machine and
> attach a debugger to the running process. I've got a few backtraces
> from that, but without full symbols it's even harder to understand
> what's going on.
> 
> I suspect issues with our futex implementation; in every case I find
> one thread stuck in a drm ioctl while others are blocked on futex
> waits.
> 
> Running an X server + Mesa fully built with debug symbols seems to make
> the issue less frequent, and when it happened I didn't have time to
> launch a debugger on it so far...
> 
> 
> > 
> > One option is of course to trade up or sideways to something like
> > https://www.power.no/data-og-tilbehoer/pc-og-mac/baerbar-pc/asus-zenbook-s-ux393ea-pure2-13-laptop/p-1115705/
> > (Intel Core i7-1165G7 with Iris Xe graphics), but would that have a better
> > chance of success (or for that matter be helpful to the project)?
> 
> Which window manager / desktop environment are you using? I've tried
> going back to WindowMaker (from xfwm4) and it also seems not to trigger
> the lockups on a Ryzen Vega. But it hasn't been long enough. Sometimes
> I can run for days without a lockup and sometimes it locks up after
> minutes.

To trigger ring timeouts on amdgpu I use graphics/piglit:

    piglit run -s quick <outdir>

This rapidly opens and closes windows and takes quite a while if left
to complete.
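
To check after such a run whether a ring timeout was actually hit, a
small script along these lines could scan dmesg output. This is only a
sketch: the pattern is modelled on the single error line quoted earlier
in this thread, other drm versions may word the message differently,
and the function name and the script name ring_timeouts.py are made up
for illustration.

```python
import re
import sys

# Assumption: the message format matches the line quoted in this thread,
#   [drm] *ERROR* ring sdma0 timeout, signaled seq=110053, emitted seq=110053
# Other drm versions may print something different.
RING_TIMEOUT = re.compile(
    r"\[drm\] \*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def find_ring_timeouts(text):
    """Return a list of (ring, signaled, emitted) tuples found in text."""
    return [
        (m.group("ring"), int(m.group("signaled")), int(m.group("emitted")))
        for m in RING_TIMEOUT.finditer(text)
    ]


if __name__ == "__main__":
    # Read the log from stdin and print one summary line per timeout.
    for ring, signaled, emitted in find_ring_timeouts(sys.stdin.read()):
        print(f"ring={ring} signaled={signaled} emitted={emitted}")
```

It could be fed from the running kernel buffer or from the log file:

    dmesg | python3 ring_timeouts.py
    python3 ring_timeouts.py < /var/log/messages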

> 
> BTW this also leads me to wonder if KARL could have an impact on the
> issue in case there is some un-initialized memory access somewhere in
> the code...
> 
> 
> > 
> > All the best,
> > Peter
> > 
> > -- 
> > Peter N. M. Hansteen, member of the first RFC 1149 implementation team
> > http://bsdly.blogspot.com/ http://www.bsdly.net/ http://www.nuug.no/
> > "Remember to set the evil bit on all malicious network traffic"
> > delilah spamd[29949]: 85.152.224.147: disconnected after 42673 seconds.
> 
> -- 
> Matthieu Herrb
> 
> 
