Re: [e-users] E crash with Nvidia

Florian Schaefer Thu, 09 Sep 2021 16:55:19 -0700

On Fri, Sep 10, 2021 at 12:36:24AM +0100, Carsten Haitzler wrote:
> On Fri, 10 Sep 2021 08:28:30 +0900 Florian Schaefer <list...@netego.de> said:
> 
> > On Thu, Sep 09, 2021 at 08:32:47AM +0100, Carsten Haitzler wrote:
> > > On Thu, 9 Sep 2021 09:20:28 +0900 Florian Schaefer <list...@netego.de> said:
> > > 
> > > > On Wed, Sep 08, 2021 at 11:08:00AM +0100, Carsten Haitzler wrote:
> > > > > On Wed, 8 Sep 2021 17:35:12 +0900 Florian Schaefer <list...@netego.de>
> > > > > said:
> > > > > 
> > > > > > Seems to me to have been good last words this time. ;) So I have been
> > > > > > running this all day now and I think I did not have a segfault due to
> > > > > > procstat so far. Thanks for the fixes and I like the new indicator icon. :)
> > > > > > 
> > > > > > That being said, I still had some crashes today and I am thinking that
> > > > > > perhaps finally I might have something true to the topic of this
> > > > > > thread. At least it crashes within libnvidia and I do not get an ASAN
> > > > > > trace.
> > > > > > 
> > > > > > For what it's worth, I tried to record a trace as well as I can.
> > > > > > 
> > > > > > https://pastebin.com/p41b7GKW
> > > > > > 
> > > > > > This happens reproducibly when I change from X running E to the text
> > > > > > console and then back to the graphics screen. (I did quite a lot of
> > > > > > these switches lately for running gdb while E is still crashed.) When I
> > > > > > have an "empty" E running it is fine. However, as soon as some window
> > > > > > is open it reliably segfaults upon returning to X. Any ideas?
> > > > > 
> > > > > time to stop asan and use valgrind. that can at least say if the memory
> > > > > nvidia is accessing is beyond some array e provided - the shader flush
> > > > > basically has e provide a block of mem containing vertexes etc. for the
> > > > > gpu to draw. this array is expanded as new triangles are added then
> > > > > flushed to the gpu at some point during rendering. that might be the
> > > > > only thing i can think of that might be an efl bug - we use a dud
> > > > > pointer? but then you could figure this out from valgrind + gdb...
> > > > > maybe. valgrind would see the errant pointer and perhaps if it's just
> > > > > beyond some other block of mem or if that block was freed recently etc.
> > > > 
> > > > So there are things that valgrind can do that asan cannot. More stuff to
> > > > learn. :)
> > > 
> > > Yeah. Valgrind is actually a cpu interpreter. it literally interprets every
> > > instruction and while doing that tracks memory state. it also traps
> > > malloc/free and so on too and tracks what memory has been allocated, freed
> > > down to the byte, if it has been written to or not etc. - doing all of this
> > > it can see every issue. it may have no DEBUG to tell you more than "code in
> > > this library causes problem X", or with full gdb debug it can use that
> > > memory address to tell you the file, line number, function name and so on
> > > too. This is why valgrind is slow. it's literally interpreting everything a
> > > process under valgrind does.
> > > 
> > > Asan has the compiler do the above instead. So when the compiler generates
> > > the binary code for an application or library, it ADDS code that runs
> > > natively that does tracking. This means that simple instructions that just
> > > do add/sub/compare etc. just get generated as normal. instructions that
> > > access memory get tracking code added like valgrind. this means only the
> > > code that the compiler generates will get tracked (e.g. efl and
> > > enlightenment), and other code that efl calls (stuff in libc, libjpeg,
> > > opengl libs etc.) will not be. this is a major difference in design and
> > > makes asan massively faster. it's actually usable day to day on a decently
> > > fast machine. it does mean e uses a lot more memory as asan needs extra
> > > memory in the process to do the tracking of every byte and its history and
> > > it does need to execute more instructions whenever reading/writing to some
> > > memory etc. ... but not all the code your cpu runs will have this extra
> > > work because it's only these actions, and any libraries called that are
> > > not built with asan will also not do this extra work. thus - asan can't find
> > > anything in a library you did not build with asan support. thus sometimes
> > > you still have to pull out ye-olde valgrind. valgrind is an amazing tool.
> > > it's just slow. if you seem to have issues in e/efl the first port of call
> > > is to try asan. it's fast enough to run day to day and not very intrusive
> > > in that you can rebuild efl+e and then just ctrl+alt+end to restart e and
> > > presto - asan is on. as long as you have pre set-up a proper ASAN_OPTIONS
> > > env var ... also i suggest you:
> > > 
> > > export EINA_FREEQ_TOTAL_MAX=0
> > > export EINA_FREEQ_MEM_MAX=0
> > > export EINA_FREEQ_FILL_MAX=0
> > > 
> > > as well. this may make e/efl a little more crashy and will also remove a
> > > minor optimization (freeq is a ... free queue - it takes things that need
> > > to be freed and adds them to a queue to free some time later = freeq will
> > > collect things to free up until some limit. it will, when items are added
> > > to the queue, fill their memory with some pattern like 0x555555 or
> > > 0x777777 etc. - or well up to the first N bytes of that memory object, and
> > > then when it actually does the free later will check that that pattern still is
> > > there. if it's not, something wrote to that memory that SHOULD have been
> > > left alone as the object was queued to be freed - it can give you an
> > > indication that something is wrong but not exactly where). as freeq waits
> > > until the app is idle (has nothing to do but wait for input or things to
> > > happen) it runs through the queue then freeing objects so avoiding the
> > > work of the free until then. it's an efl self-check mechanism put in to hunt
> > > down bugs and get a little optimization in return for the extra work it has
> > > to do. by setting the above to zero you basically disable freeq and force
> > > it to free immediately which is what you want for both valgrind and asan
> > > so they detect the problems right. note efl knows when it runs under valgrind
> > > and auto disables freeq on its own. but with asan, it does not.
> > > 
> > > i hope that helps explain the above (roughly - i glossed over a lot of
> > > details to make it easier to explain in a short amount of time)
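For reference, the ASAN environment described above could be set up roughly like this before restarting E. This is only a sketch: the three EINA_FREEQ_* variables come from the mail, while the ASAN_OPTIONS values and the log path are assumptions chosen for illustration, not project defaults.

```shell
#!/bin/sh
# Sketch: environment for running an ASAN-built E/EFL.
# The ASAN_OPTIONS values below are illustrative assumptions.

asan_env() {
    # Disable the EFL free queue so objects are freed immediately
    # (variables quoted from the mail above).
    export EINA_FREEQ_TOTAL_MAX=0
    export EINA_FREEQ_MEM_MAX=0
    export EINA_FREEQ_FILL_MAX=0

    # abort_on_error=1: call abort() on the first report so gdb or a core
    #                   dump can catch it.
    # detect_leaks=0:   skip leak reports at exit.
    # log_path=...:     write reports to files (hypothetical location).
    export ASAN_OPTIONS="abort_on_error=1:detect_leaks=0:log_path=${HOME:-/tmp}/e-asan.log"
}

asan_env
```

These have to be set in the environment E starts from (e.g. in .xinitrc), since a restart of E inherits the environment of the running session.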
> > 
> > Ahm, yeah, thanks for the explanations. I wasn't expecting such a ...
> > verbose ... reply. But it is appreciated. Even though I did probably not
> > fully understand everything I now see that valgrind is more than meets
> > the eye and that the same is true for eina. ;)
> > 
> > > > Anyway, I tried to follow the debugging instructions on E.org as well as
> > > > I can (after having finally recompiled everything without asan, but
> > > > leaving the debugging symbols in place).
> > > > 
> > > > Three observations:
> > > > 
> > > > 1. The valgrind option --db-attach seems to be deprecated since 2015 and
> > > > is not available any more. So I just omitted this. I hope that's fine.
> > > 
> > > i know. :( you now need a separate shell running gdb to attach gdb to the
> > > process then tell it to run. painful. :(
> > > 
> > > > 2. Then I tried to use the ".xinitrc-debug" method. Upon starting E the
> > > > startup apparently went into an infinite loop, generating pages and
> > > > pages of valgrind and E startup messages (a few valgrind messages with
> > > > something-something exiting 0) and generating many 120MB core dumps. So
> > > > I never got to the point where I would actually get anything but a black
> > > > screen from X.
> > > 
> > > aaah with valgrind you want to probably bypass enlightenment_start - this
> > > means any issue will drop you out of your login session but you will have a
> > > chance to debug it. to avoid enlightenment_start do:
> > > 
> > > export E_START=1
> > > valgrind --tool=memcheck ... enlightenment
> > > 
> > > 
> > > FYI when i valgrind i do:
> > > 
> > > valgrind --suppressions=$HOME/.zsh/vgd.supp --tool=memcheck \
> > >   --num-callers=64 --show-reachable=no --read-var-info=yes \
> > >   --leak-check=yes --leak-resolution=high --undef-value-errors=yes \
> > >   --track-origins=yes --vgdb-error=0 --vgdb=full \
> > >   --redzone-size=512 --freelist-vol=100000000
> > > 
> > > :) the suppressions file is a file i keep to tell valgrind to ignore that
> > > issue - e.g. it's a common optimization in libc or freetype or something that it
> > > should just pretend is not an issue. you can drop that option because you
> > > won't maintain that file and that file is highly system specific.
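Putting the pieces above together, the two-terminal workflow that replaces --db-attach might look like the sketch below. The function names and the log file path are hypothetical; the valgrind options are a subset of the ones quoted above, and `target remote | vgdb` is how gdb is attached to valgrind's built-in gdbserver. Nothing runs until one of the functions is called by hand.

```shell
#!/bin/sh
# Sketch only: defines two helpers, runs nothing by itself.

# Terminal 1 (e.g. the last line of .xinitrc): start enlightenment directly
# under valgrind, without the enlightenment_start wrapper.
run_e_under_valgrind() {
    # Per the mail above, E_START=1 is what avoids enlightenment_start.
    export E_START=1
    # --vgdb-error=0 makes valgrind wait for gdb before running the program.
    # The log file location is a hypothetical choice.
    exec valgrind --tool=memcheck --vgdb=full --vgdb-error=0 \
        --track-origins=yes --num-callers=64 \
        --log-file="$HOME/e-valgrind.log" enlightenment
}

# Terminal 2 (text console or ssh): attach gdb via vgdb, then type
# "continue" at the gdb prompt to let enlightenment start.
attach_gdb() {
    gdb -ex 'target remote | vgdb' "$(command -v enlightenment)"
}
```

Since valgrind sits waiting for gdb with --vgdb-error=0, a black screen right after startup is expected until the second terminal attaches and continues.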
> > 
> > Hmm, this valgrind stuff is more difficult than I expected. First I was
> > struggling to get the X server and enlightenment to start properly. I
> > finally settled on just creating the .xinitrc and let the rest be sorted
> > out with startx.
> > 
> > But then, again, if I just start enlightenment without valgrind it
> > works. With valgrind enabled everything stops at a black screen and the
> > only way to get a responsive interface again is to reboot the machine.
> > 
> > So here's what I do: https://pastebin.com/yzhy4gj1
> > 
> > The first part shows my .xinitrc. At the end you see two alternative
> > exec commands. The one with valgrind causes everything to hang. The one
> > without works just fine.
> > 
> > Even though with valgrind enabled I cannot really do anything at least
> > there is still heaps of stuff in the logfile, so that output is also
> > included. Many "lost bytes" (not really dangerous, right?) and an
> > unhandled instruction in e_comp_x_randr.c. Hmmm.
> 
> unhandled instruction. that means your compiler is outputting instructions
> valgrind does not know how to interpret. e.g. it is optimizing for a newer x86
> instruction. you might want to compile with -march=x86-64 in CFLAGS or something
> very conservative. you also might want to avoid --trace-children=yes if you are
> running enlightenment directly (avoiding enlightenment_start).
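A conservative rebuild along those lines could be sketched as follows. Everything specific here is an assumption on my side: an x86-64 machine, gcc or clang, a meson-based source tree, and a fresh build directory with the hypothetical name build-valgrind. The idea is simply to restrict the compiler to the baseline instruction set so valgrind can decode everything it sees.

```shell
#!/bin/sh
# Sketch: conservative compiler flags for a valgrind-friendly build.
# Assumes x86-64 and gcc/clang; adjust -march for other machines.

# -march=x86-64 limits code generation to the baseline 64-bit instruction set,
# -mtune=generic avoids tuning for one specific CPU,
# -O0 -g keeps debug info easy to follow in valgrind and gdb.
CFLAGS="-O0 -g -march=x86-64 -mtune=generic"
CXXFLAGS="$CFLAGS"
export CFLAGS CXXFLAGS

# Hypothetical helper: run inside the efl or enlightenment source directory.
# A fresh build directory is used so the new flags are picked up at configure time.
rebuild_conservative() {
    meson setup build-valgrind && ninja -C build-valgrind
}
```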


OK, thanks for the additional suggestions. And another recompile... ;)
Let's see whether this and omitting the --trace-children makes a
difference. I don't know whether I will manage to do this today but I
will let you know the results when I have something.

Cheers
Florian


_______________________________________________
enlightenment-users mailing list
enlightenment-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/enlightenment-users
