Re: shell hung in fork system call

2023-07-10 Thread Konstantin Belousov
On Mon, Jul 10, 2023 at 09:39:35AM +, John F Carr wrote:
> 
> 
> > On Jul 9, 2023, at 19:59, Konstantin Belousov  wrote:
> > 
> > On Sun, Jul 09, 2023 at 11:36:03PM +, John F Carr wrote:
> >> 
> >> 
> >>> On Jul 9, 2023, at 19:25, Konstantin Belousov  wrote:
> >>> 
> >>> On Sun, Jul 09, 2023 at 10:41:27PM +, John F Carr wrote:
>  Kernel and system at a146207d66f320ed239c1059de9df854b66b55b7 plus some 
>  irrelevant local changes, four 64 bit ARM processors, make.conf sets 
>  CPUTYPE?=cortex-a57.
>  
>  I typed ^C while /bin/sh was starting a pipeline and my shell got hung 
>  in the middle of fork().
>  
> > From the terminal:
>  
>  # git log --oneline --|more
>  ^C^C^C
>  load: 3.26  cmd: sh 95505 [fork] 5308.67r 0.00u 0.03s 0% 2860k
>  mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
>  fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
>  load: 3.16  cmd: sh 95505 [fork] 5311.75r 0.00u 0.03s 0% 2860k
>  mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
>  fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
>  
>  According to ps -d on another terminal the shell has no children:
>  
>  PID TT  STAT   TIME COMMAND
>  [...]
>  873 u0  IWs 0:00.00 `-- login [pam] (login)
>  874 u0  I   0:00.17   `-- -sh (sh)
>  95504 u0  I   0:00.01 `-- su -
>  95505 u0  D+  0:00.05   `-- -su (sh)
>  [...]
>  
>  Nothing on the (115200 bps serial) console.  No change in system 
>  performance.
>  
>  The system is busy copying a large amount of data from the network to a 
>  ZFS pool on spinning disks.  The git|more pipeline could have taken some 
>  time to get going while I/O requests worked their way through the queue. 
>   It would not have touched the busy pool, only the zroot pool on an SSD.
>  
>  Has anything changed recently that might cause this?
> >>> 
> >>> There was some change around fork, but your sleep seems to be not from
> >>> that change.  Can you show the wait channel for the process?  Do something
> >>> like
> >>> $ ps alxww
> >>> 
> >> 
> >> UID   PID  PPID  C PRI NI   VSZ   RSS MWCHAN   STAT TTTIME COMMAND
> >>   0 95505 95504  2  20  0 13508  2876 fork D+   u0 0:00.13 -su (sh)
> >> 
> >> This is probably the same information displayed as [fork] in the output 
> >> from ^T.
> >> 
> >> Does it correspond to the source line
> >> 
> >> pause("fork", hz / 2);
> >> 
> >> ?
> > 
> > Yes, it is rate-limiting code.  Still it is interesting to see the whole
> > ps output.
> > 
> > Do you have 7a70f17ac4bd64dc1a5020f in your source?
> 
> No, I do not have that commit.
> 
> The comment mentions livelock.  CPU use as reported by iostat did not change 
> after the process hung.

It is livelocking, but the looping could be rate-limited, similar to
what you see with pause. You need that revision definitely.

Is the problem reproducable on your machine? If yes, you can try one
additional fix, below.

commit 840ce1801ef1c1ab9b10c4c6e7b02403d2e749ee
Author: Konstantin Belousov 
Date:   Mon Jul 10 03:29:43 2023 +0300

sigqueue_delete_set_proc(): initialize sq_proc for worklist

This should fix leaks for the p_killpg_cnt counter, because
sigqueue_flush() drops ksi's.

Sponsored by:   The FreeBSD Foundation
MFC after:  1 week

diff --git a/sys/kern/kern_sig.c b/sys/kern/kern_sig.c
index 18756d53e98c..ecfde7a549fc 100644
--- a/sys/kern/kern_sig.c
+++ b/sys/kern/kern_sig.c
@@ -683,7 +683,7 @@ sigqueue_delete_set_proc(struct proc *p, const sigset_t 
*set)
 
PROC_LOCK_ASSERT(p, MA_OWNED);
 
-   sigqueue_init(, NULL);
+   sigqueue_init(, p);
sigqueue_move_set(>p_sigqueue, , set);
 
FOREACH_THREAD_IN_PROC(p, td0)



Re: shell hung in fork system call

2023-07-10 Thread John F Carr



> On Jul 9, 2023, at 19:59, Konstantin Belousov  wrote:
> 
> On Sun, Jul 09, 2023 at 11:36:03PM +, John F Carr wrote:
>> 
>> 
>>> On Jul 9, 2023, at 19:25, Konstantin Belousov  wrote:
>>> 
>>> On Sun, Jul 09, 2023 at 10:41:27PM +, John F Carr wrote:
 Kernel and system at a146207d66f320ed239c1059de9df854b66b55b7 plus some 
 irrelevant local changes, four 64 bit ARM processors, make.conf sets 
 CPUTYPE?=cortex-a57.
 
 I typed ^C while /bin/sh was starting a pipeline and my shell got hung in 
 the middle of fork().
 
> From the terminal:
 
 # git log --oneline --|more
 ^C^C^C
 load: 3.26  cmd: sh 95505 [fork] 5308.67r 0.00u 0.03s 0% 2860k
 mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
 fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
 load: 3.16  cmd: sh 95505 [fork] 5311.75r 0.00u 0.03s 0% 2860k
 mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
 fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
 
 According to ps -d on another terminal the shell has no children:
 
 PID TT  STAT   TIME COMMAND
 [...]
 873 u0  IWs 0:00.00 `-- login [pam] (login)
 874 u0  I   0:00.17   `-- -sh (sh)
 95504 u0  I   0:00.01 `-- su -
 95505 u0  D+  0:00.05   `-- -su (sh)
 [...]
 
 Nothing on the (115200 bps serial) console.  No change in system 
 performance.
 
 The system is busy copying a large amount of data from the network to a 
 ZFS pool on spinning disks.  The git|more pipeline could have taken some 
 time to get going while I/O requests worked their way through the queue.  
 It would not have touched the busy pool, only the zroot pool on an SSD.
 
 Has anything changed recently that might cause this?
>>> 
>>> There was some change around fork, but your sleep seems to be not from
>>> that change.  Can you show the wait channel for the process?  Do something
>>> like
>>> $ ps alxww
>>> 
>> 
>> UID   PID  PPID  C PRI NI   VSZ   RSS MWCHAN   STAT TTTIME COMMAND
>>   0 95505 95504  2  20  0 13508  2876 fork D+   u0 0:00.13 -su (sh)
>> 
>> This is probably the same information displayed as [fork] in the output from 
>> ^T.
>> 
>> Does it correspond to the source line
>> 
>> pause("fork", hz / 2);
>> 
>> ?
> 
> Yes, it is rate-limiting code.  Still it is interesting to see the whole
> ps output.
> 
> Do you have 7a70f17ac4bd64dc1a5020f in your source?

No, I do not have that commit.

The comment mentions livelock.  CPU use as reported by iostat did not change 
after the process hung.






Re: shell hung in fork system call

2023-07-09 Thread Konstantin Belousov
On Sun, Jul 09, 2023 at 11:36:03PM +, John F Carr wrote:
> 
> 
> > On Jul 9, 2023, at 19:25, Konstantin Belousov  wrote:
> > 
> > On Sun, Jul 09, 2023 at 10:41:27PM +, John F Carr wrote:
> >> Kernel and system at a146207d66f320ed239c1059de9df854b66b55b7 plus some 
> >> irrelevant local changes, four 64 bit ARM processors, make.conf sets 
> >> CPUTYPE?=cortex-a57.
> >> 
> >> I typed ^C while /bin/sh was starting a pipeline and my shell got hung in 
> >> the middle of fork().
> >> 
> >>> From the terminal:
> >> 
> >> # git log --oneline --|more
> >> ^C^C^C
> >> load: 3.26  cmd: sh 95505 [fork] 5308.67r 0.00u 0.03s 0% 2860k
> >> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
> >> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
> >> load: 3.16  cmd: sh 95505 [fork] 5311.75r 0.00u 0.03s 0% 2860k
> >> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
> >> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
> >> 
> >> According to ps -d on another terminal the shell has no children:
> >> 
> >>  PID TT  STAT   TIME COMMAND
> >> [...]
> >>  873 u0  IWs 0:00.00 `-- login [pam] (login)
> >>  874 u0  I   0:00.17   `-- -sh (sh)
> >> 95504 u0  I   0:00.01 `-- su -
> >> 95505 u0  D+  0:00.05   `-- -su (sh)
> >> [...]
> >> 
> >> Nothing on the (115200 bps serial) console.  No change in system 
> >> performance.
> >> 
> >> The system is busy copying a large amount of data from the network to a 
> >> ZFS pool on spinning disks.  The git|more pipeline could have taken some 
> >> time to get going while I/O requests worked their way through the queue.  
> >> It would not have touched the busy pool, only the zroot pool on an SSD.
> >> 
> >> Has anything changed recently that might cause this?
> > 
> > There was some change around fork, but your sleep seems to be not from
> > that change.  Can you show the wait channel for the process?  Do something
> > like
> > $ ps alxww
> > 
> 
>  UID   PID  PPID  C PRI NI   VSZ   RSS MWCHAN   STAT TTTIME COMMAND
>0 95505 95504  2  20  0 13508  2876 fork D+   u0 0:00.13 -su (sh)
> 
> This is probably the same information displayed as [fork] in the output from 
> ^T.
> 
> Does it correspond to the source line
> 
>   pause("fork", hz / 2);
> 
> ?

Yes, it is rate-limiting code.  Still it is interesting to see the whole
ps output.

Do you have 7a70f17ac4bd64dc1a5020f in your source?



Re: shell hung in fork system call

2023-07-09 Thread John F Carr



> On Jul 9, 2023, at 19:25, Konstantin Belousov  wrote:
> 
> On Sun, Jul 09, 2023 at 10:41:27PM +, John F Carr wrote:
>> Kernel and system at a146207d66f320ed239c1059de9df854b66b55b7 plus some 
>> irrelevant local changes, four 64 bit ARM processors, make.conf sets 
>> CPUTYPE?=cortex-a57.
>> 
>> I typed ^C while /bin/sh was starting a pipeline and my shell got hung in 
>> the middle of fork().
>> 
>>> From the terminal:
>> 
>> # git log --oneline --|more
>> ^C^C^C
>> load: 3.26  cmd: sh 95505 [fork] 5308.67r 0.00u 0.03s 0% 2860k
>> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
>> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
>> load: 3.16  cmd: sh 95505 [fork] 5311.75r 0.00u 0.03s 0% 2860k
>> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
>> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
>> 
>> According to ps -d on another terminal the shell has no children:
>> 
>>  PID TT  STAT   TIME COMMAND
>> [...]
>>  873 u0  IWs 0:00.00 `-- login [pam] (login)
>>  874 u0  I   0:00.17   `-- -sh (sh)
>> 95504 u0  I   0:00.01 `-- su -
>> 95505 u0  D+  0:00.05   `-- -su (sh)
>> [...]
>> 
>> Nothing on the (115200 bps serial) console.  No change in system performance.
>> 
>> The system is busy copying a large amount of data from the network to a ZFS 
>> pool on spinning disks.  The git|more pipeline could have taken some time to 
>> get going while I/O requests worked their way through the queue.  It would 
>> not have touched the busy pool, only the zroot pool on an SSD.
>> 
>> Has anything changed recently that might cause this?
> 
> There was some change around fork, but your sleep seems to be not from
> that change.  Can you show the wait channel for the process?  Do something
> like
> $ ps alxww
> 

 UID   PID  PPID  C PRI NI   VSZ   RSS MWCHAN   STAT TTTIME COMMAND
   0 95505 95504  2  20  0 13508  2876 fork D+   u0 0:00.13 -su (sh)

This is probably the same information displayed as [fork] in the output from ^T.

Does it correspond to the source line

pause("fork", hz / 2);

?




Re: shell hung in fork system call

2023-07-09 Thread Konstantin Belousov
On Sun, Jul 09, 2023 at 10:41:27PM +, John F Carr wrote:
> Kernel and system at a146207d66f320ed239c1059de9df854b66b55b7 plus some 
> irrelevant local changes, four 64 bit ARM processors, make.conf sets 
> CPUTYPE?=cortex-a57.
> 
> I typed ^C while /bin/sh was starting a pipeline and my shell got hung in the 
> middle of fork().
> 
> >From the terminal:
> 
> # git log --oneline --|more
> ^C^C^C
> load: 3.26  cmd: sh 95505 [fork] 5308.67r 0.00u 0.03s 0% 2860k
> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
> load: 3.16  cmd: sh 95505 [fork] 5311.75r 0.00u 0.03s 0% 2860k
> mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
> fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
> 
> According to ps -d on another terminal the shell has no children:
> 
>   PID TT  STAT   TIME COMMAND
> [...]
>   873 u0  IWs 0:00.00 `-- login [pam] (login)
>   874 u0  I   0:00.17   `-- -sh (sh)
> 95504 u0  I   0:00.01 `-- su -
> 95505 u0  D+  0:00.05   `-- -su (sh)
> [...]
> 
> Nothing on the (115200 bps serial) console.  No change in system performance.
> 
> The system is busy copying a large amount of data from the network to a ZFS 
> pool on spinning disks.  The git|more pipeline could have taken some time to 
> get going while I/O requests worked their way through the queue.  It would 
> not have touched the busy pool, only the zroot pool on an SSD.
> 
> Has anything changed recently that might cause this?

There was some change around fork, but your sleep seems to be not from
that change.  Can you show the wait channel for the process?  Do something
like
$ ps alxww




shell hung in fork system call

2023-07-09 Thread John F Carr
Kernel and system at a146207d66f320ed239c1059de9df854b66b55b7 plus some 
irrelevant local changes, four 64 bit ARM processors, make.conf sets 
CPUTYPE?=cortex-a57.

I typed ^C while /bin/sh was starting a pipeline and my shell got hung in the 
middle of fork().

>From the terminal:

# git log --oneline --|more
^C^C^C
load: 3.26  cmd: sh 95505 [fork] 5308.67r 0.00u 0.03s 0% 2860k
mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 
load: 3.16  cmd: sh 95505 [fork] 5311.75r 0.00u 0.03s 0% 2860k
mi_switch+0x198 sleepq_switch+0xfc sleepq_timedwait+0x40 _sleep+0x264 
fork1+0x67c sys_fork+0x34 do_el0_sync+0x4c8 handle_el0_sync+0x44 

According to ps -d on another terminal the shell has no children:

  PID TT  STAT   TIME COMMAND
[...]
  873 u0  IWs 0:00.00 `-- login [pam] (login)
  874 u0  I   0:00.17   `-- -sh (sh)
95504 u0  I   0:00.01 `-- su -
95505 u0  D+  0:00.05   `-- -su (sh)
[...]

Nothing on the (115200 bps serial) console.  No change in system performance.

The system is busy copying a large amount of data from the network to a ZFS 
pool on spinning disks.  The git|more pipeline could have taken some time to 
get going while I/O requests worked their way through the queue.  It would not 
have touched the busy pool, only the zroot pool on an SSD.

Has anything changed recently that might cause this?