Re: [ast-developers] Hang because job management _within_ the SIGCHLD signal handler... / was: Re: AT&T Software Technology ast alpha software download update

Cedric Blancher Tue, 30 Jul 2013 09:22:47 -0700

On 30 July 2013 17:14, Irek Szczesniak <[email protected]> wrote:
> On Tue, Jul 30, 2013 at 4:49 PM, Glenn Fowler <[email protected]> wrote:
>>
>> On Tue, 30 Jul 2013 16:16:28 +0200 Roland Mainz wrote:
>>> On Tue, Jul 30, 2013 at 4:12 PM, Glenn Fowler <[email protected]> wrote:
>>> >
>>> > On Tue, 30 Jul 2013 15:40:48 +0200 Roland Mainz wrote:
>>> >> --047d7b6d8d7cb1934204e2bac182
>>> >> Content-Type: text/plain; charset=ISO-8859-1
>>> >
>>> >> On Mon, Jul 29, 2013 at 3:34 PM, Roland Mainz <[email protected]> 
>>> >> wrote:
>>> >> > On Sat, Jul 27, 2013 at 6:17 PM, Glenn Fowler <[email protected]> 
>>> >> > wrote:
>>> >> [snip]
>>> >> > I hit (again) one of the nasty and hard to reproduce problems (which
>>> >> > means: No... I don't have a testcase... ;-( ) related to job
>>> >> > management:
>>> >> > -- snip --
>>> >> [snip]
>>> >> > -- snip --
>>> >> >
>>> >> > ... this hang occured because ksh93 tries to do job management from
>>> >> > within the signal handler... which is IMO _far_ from being safe to
>>> >> > do...
>>> >> >
>>> >> > David: Is there *ANY* way the job management can be moved out of the
>>> >> > signal handler and just rely on the siginfo data ?
>>> >
>>> >> Attached (as 
>>> >> "astksh20130727_prototype_process_sigchld_outside_signalhandler001.diff.txt")
>>> >> is a prototype (hacked) patch for discussion.
>>> >
>>> >> * The good news are:
>>> >> - SIGCHLD processing now seems to be reliable. I can't find a simple
>>> >> testcase to break it (but see the item about the testsuite below)
>>> >> - SIGCHLD processing now works without problems under "valgrind" control
>>> >> - The patch seems to fix a lot of Cedric's issues with signals being
>>> >> lost. Even the testcase where a $ ... while ! wait ; sleep 8 ; done #
>>> >> was loosing signals now works... if I replace the "sleep 8" with
>>> >> "compound -a dummy ; poll -t 8 dummy" (so there seems to be a seperate
>>> >> problem with "sleep" ... somehow)
>>> >
>>> >> * The bad news are:
>>> >> - The patch doesn't work very well for interactive shells
>>> >> - The "sigchld.sh" testsuite module fails in several places... AFAIK
>>> >> the reason is that we don't check the list of queued CHLD signals in
>>> >> all places where they should be checked (e.g. each time before we
>>> >> execute a shell statement and before and after running external
>>> >> processes and each time wait(1) internally wakes up from a signal)
>>> >> - $ jobs -l # reports already purged processes... erm... no clue why... 
>>> >> ;-(
>>> >
>>> >> * Notes about interactive mode:
>>> >> An interactive shell AFAIK requires to report terminating childs
>>> >> "immediately" if some shell options are set (erm... David: Which
>>> >> option is this again ?). AFAIK this can be archived by two ways:
>>> >> 1. wake up from time-to-time (e.g. polling) ... but this has a bad
>>> >> effect in CPU usage (imagine 1000 shell sessions in a multi-user
>>> >> system waking-up 10 times per second... or imagine the impact for
>>> >> battery life in a laptop)
>>> >> 2. add some glue to the editor code to wake up for each signald and
>>> >> process it when |sfpkrd()| returns...
>>> >> -- snip --
>>> >> (gdb) where
>>> >> #0  0x00007fc2be588860 in __poll_nocancel () at
>>> >> ../sysdeps/unix/syscall-template.S:81
>>> >> #1  0x000000000050280d in sfpkrd (fd=0, argbuf=0x7fff33545080, n=80,
>>> >> rc=0, tm=-1, action=1)
>>> >>     at 
>>> >> /home/test001/work/ast_ksh_20130727/build_i386_64bit_debug_patched/src/lib/libast/sfio/sfpkrd.c:142
>>> >> #2  0x0000000000410fb9 in ed_read (context=0x7fc2bf063960, fd=0,
>>> >> buff=0x7fff33545080 "\240PT3\377\177", size=80, reedit=0)
>>> >>     at 
>>> >> /home/test001/work/ast_ksh_20130727/build_i386_64bit_debug_patched/src/cmd/ksh93/edit/edit.c:884
>>> >> -- snip --
>>> >> ... that's the strategy my patch attempts...
>>> >
>>> > for the current code do you see failures when the vmalloc debug method is 
>>> > not used?
>>
>>> Sporadically... yes. But I'm not sure whether this is an issue in
>>> vmalloc or something else. I'm not even sure whether the job lkist
>>> handling code in ksh93 is the root cause of most of the trouble. This
>>> is all made much harder because Linux doesn't have much usefull tools
>>> to instrument signal handlers so I have to walk the code by hand.
>>
>> it might be good to nail down this problem first
>> otherwise we might be applying fixes for something that's not broken
>
> Chime in on this topic. sh_fault() is one piece of brokenness from head to 
> tail.
> Just from looking at it for five minutes I can tell you:
> - SIGWINCH handling is broken. You can't seriously assume that
> nv_putval() is save to use in a signal handler. Practical testing
> shows that if you run unset LINES COLUMNS and recreate the variables
> in a loop and resize the window (just drag the right bottom border
> around in Gnome or KDE) the storm of WINCH signals will trigger a SEGV
> sooner or later in a random location of shell code
> - A lot of variables like shp->st.trapcom[sig] are accessed from
> inside and outside the signal handler without protection, i.e. atomic
> load/store instructions. You permanently risk a broken state if
> signals of different types meet each other in sh_fault()
> - shp->trapnote & | = anything <-- what happens in nested signal
> handler calls? Guess?
> - Another notorious abuse:
>    482                                  if(shp->siginfo)
>    483                                  {
>    484                                          do ip = shp->siginfo[sig];
>    485                                          while
> (asocasptr(&shp->siginfo[sig],ip,0)!=ip);
>    486                                  }
> How is this suppose to be signal safe? You read from shp->siginfo[sig]
> unsafe but do the write using atomic instructions? Likewise filling
> the queue must be done with atomic instructions to handle nested
> signal handler calls.
> - All variables read or written from both a signal handler and user
> code must be of storage type volatile. Missing here, too.
> Surprisingly, if you look at the assembler code, the compiler caches a
> lot of values in registers or hidden temporary stack variables. May be
> another source of trouble. May? Not may. Correction: It is.
>
> From what I can glance in 5 minutes sh_fault() is notoriously signal
> unsafe by violating every known rule about writing signal handlers
> (i.e. keep the code as short as possible, use only async signal safe
> functions, use volatile storage and atomic r/w instructions for
> communication) and should be thrown away. No amount of one line fixes
> can rescue it. Please. Just throw this **** away and rewrite it from
> scratch.


I'd refrain from using strong language but yes, sh_fault() is a source
of raised eyebrows. Why does it have to be so complex? Why can't
sh_fault() just record all siginfo structures in a queue and let the
sh_chksig() do all the work. The prodding of the shp->sig* variables
from both signal handler and user code sides is unsafe beyond measure
and combined with the code prodding in the job.* variables makes it a
nightmare.

+1 for Irek's analysis.

Ced
-- 
Cedric Blancher <[email protected]>
Institute Pasteur
_______________________________________________
ast-developers mailing list
[email protected]
http://lists.research.att.com/mailman/listinfo/ast-developers

Re: [ast-developers] Hang because job management _within_ the SIGCHLD signal handler... / was: Re: AT&T Software Technology ast alpha software download update

Reply via email to