On 30 July 2013 17:14, Irek Szczesniak <[email protected]> wrote: > On Tue, Jul 30, 2013 at 4:49 PM, Glenn Fowler <[email protected]> wrote: >> >> On Tue, 30 Jul 2013 16:16:28 +0200 Roland Mainz wrote: >>> On Tue, Jul 30, 2013 at 4:12 PM, Glenn Fowler <[email protected]> wrote: >>> > >>> > On Tue, 30 Jul 2013 15:40:48 +0200 Roland Mainz wrote: >>> >> --047d7b6d8d7cb1934204e2bac182 >>> >> Content-Type: text/plain; charset=ISO-8859-1 >>> > >>> >> On Mon, Jul 29, 2013 at 3:34 PM, Roland Mainz <[email protected]> >>> >> wrote: >>> >> > On Sat, Jul 27, 2013 at 6:17 PM, Glenn Fowler <[email protected]> >>> >> > wrote: >>> >> [snip] >>> >> > I hit (again) one of the nasty and hard to reproduce problems (which >>> >> > means: No... I don't have a testcase... ;-( ) related to job >>> >> > management: >>> >> > -- snip -- >>> >> [snip] >>> >> > -- snip -- >>> >> > >>> >> > ... this hang occured because ksh93 tries to do job management from >>> >> > within the signal handler... which is IMO _far_ from being safe to >>> >> > do... >>> >> > >>> >> > David: Is there *ANY* way the job management can be moved out of the >>> >> > signal handler and just rely on the siginfo data ? >>> > >>> >> Attached (as >>> >> "astksh20130727_prototype_process_sigchld_outside_signalhandler001.diff.txt") >>> >> is a prototype (hacked) patch for discussion. >>> > >>> >> * The good news are: >>> >> - SIGCHLD processing now seems to be reliable. I can't find a simple >>> >> testcase to break it (but see the item about the testsuite below) >>> >> - SIGCHLD processing now works without problems under "valgrind" control >>> >> - The patch seems to fix a lot of Cedric's issues with signals being >>> >> lost. Even the testcase where a $ ... while ! wait ; sleep 8 ; done # >>> >> was loosing signals now works... if I replace the "sleep 8" with >>> >> "compound -a dummy ; poll -t 8 dummy" (so there seems to be a seperate >>> >> problem with "sleep" ... somehow) >>> > >>> >> * The bad news are: >>> >> - The patch doesn't work very well for interactive shells >>> >> - The "sigchld.sh" testsuite module fails in several places... AFAIK >>> >> the reason is that we don't check the list of queued CHLD signals in >>> >> all places where they should be checked (e.g. each time before we >>> >> execute a shell statement and before and after running external >>> >> processes and each time wait(1) internally wakes up from a signal) >>> >> - $ jobs -l # reports already purged processes... erm... no clue why... >>> >> ;-( >>> > >>> >> * Notes about interactive mode: >>> >> An interactive shell AFAIK requires to report terminating childs >>> >> "immediately" if some shell options are set (erm... David: Which >>> >> option is this again ?). AFAIK this can be archived by two ways: >>> >> 1. wake up from time-to-time (e.g. polling) ... but this has a bad >>> >> effect in CPU usage (imagine 1000 shell sessions in a multi-user >>> >> system waking-up 10 times per second... or imagine the impact for >>> >> battery life in a laptop) >>> >> 2. add some glue to the editor code to wake up for each signald and >>> >> process it when |sfpkrd()| returns... >>> >> -- snip -- >>> >> (gdb) where >>> >> #0 0x00007fc2be588860 in __poll_nocancel () at >>> >> ../sysdeps/unix/syscall-template.S:81 >>> >> #1 0x000000000050280d in sfpkrd (fd=0, argbuf=0x7fff33545080, n=80, >>> >> rc=0, tm=-1, action=1) >>> >> at >>> >> /home/test001/work/ast_ksh_20130727/build_i386_64bit_debug_patched/src/lib/libast/sfio/sfpkrd.c:142 >>> >> #2 0x0000000000410fb9 in ed_read (context=0x7fc2bf063960, fd=0, >>> >> buff=0x7fff33545080 "\240PT3\377\177", size=80, reedit=0) >>> >> at >>> >> /home/test001/work/ast_ksh_20130727/build_i386_64bit_debug_patched/src/cmd/ksh93/edit/edit.c:884 >>> >> -- snip -- >>> >> ... that's the strategy my patch attempts... >>> > >>> > for the current code do you see failures when the vmalloc debug method is >>> > not used? >> >>> Sporadically... yes. But I'm not sure whether this is an issue in >>> vmalloc or something else. I'm not even sure whether the job lkist >>> handling code in ksh93 is the root cause of most of the trouble. This >>> is all made much harder because Linux doesn't have much usefull tools >>> to instrument signal handlers so I have to walk the code by hand. >> >> it might be good to nail down this problem first >> otherwise we might be applying fixes for something that's not broken > > Chime in on this topic. sh_fault() is one piece of brokenness from head to > tail. > Just from looking at it for five minutes I can tell you: > - SIGWINCH handling is broken. You can't seriously assume that > nv_putval() is save to use in a signal handler. Practical testing > shows that if you run unset LINES COLUMNS and recreate the variables > in a loop and resize the window (just drag the right bottom border > around in Gnome or KDE) the storm of WINCH signals will trigger a SEGV > sooner or later in a random location of shell code > - A lot of variables like shp->st.trapcom[sig] are accessed from > inside and outside the signal handler without protection, i.e. atomic > load/store instructions. You permanently risk a broken state if > signals of different types meet each other in sh_fault() > - shp->trapnote & | = anything <-- what happens in nested signal > handler calls? Guess? > - Another notorious abuse: > 482 if(shp->siginfo) > 483 { > 484 do ip = shp->siginfo[sig]; > 485 while > (asocasptr(&shp->siginfo[sig],ip,0)!=ip); > 486 } > How is this suppose to be signal safe? You read from shp->siginfo[sig] > unsafe but do the write using atomic instructions? Likewise filling > the queue must be done with atomic instructions to handle nested > signal handler calls. > - All variables read or written from both a signal handler and user > code must be of storage type volatile. Missing here, too. > Surprisingly, if you look at the assembler code, the compiler caches a > lot of values in registers or hidden temporary stack variables. May be > another source of trouble. May? Not may. Correction: It is. > > From what I can glance in 5 minutes sh_fault() is notoriously signal > unsafe by violating every known rule about writing signal handlers > (i.e. keep the code as short as possible, use only async signal safe > functions, use volatile storage and atomic r/w instructions for > communication) and should be thrown away. No amount of one line fixes > can rescue it. Please. Just throw this **** away and rewrite it from > scratch.
I'd refrain from using strong language but yes, sh_fault() is a source of raised eyebrows. Why does it have to be so complex? Why can't sh_fault() just record all siginfo structures in a queue and let the sh_chksig() do all the work. The prodding of the shp->sig* variables from both signal handler and user code sides is unsafe beyond measure and combined with the code prodding in the job.* variables makes it a nightmare. +1 for Irek's analysis. Ced -- Cedric Blancher <[email protected]> Institute Pasteur _______________________________________________ ast-developers mailing list [email protected] http://lists.research.att.com/mailman/listinfo/ast-developers
