On Wed, Jul 2, 2025 at 2:38 AM K R <daharmaster...@gmail.com> wrote:
>
> On Tue, Jul 1, 2025 at 3:08 PM Martin Pieuchot <m...@grenadille.net> wrote:
> >
> > On 01/07/25(Tue) 14:29, K R wrote:
> > > On Tue, Jul 1, 2025 at 9:07 AM Claudio Jeker <cje...@diehard.n-r-g.com> 
> > > wrote:
> > > >
> > > > On Tue, Jul 01, 2025 at 04:57:14AM -0300, K R wrote:
> > > > > On Mon, Jun 30, 2025 at 2:39 AM K R <daharmaster...@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Jun 27, 2025 at 3:34 AM Martin Pieuchot 
> > > > > > <m...@grenadille.net> wrote:
> > > > > > >
> > > > > > > On 26/06/25(Thu) 11:02, K R wrote:
> > > > > > > > On Wed, Jun 25, 2025 at 1:30 PM K R <daharmaster...@gmail.com> 
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > [...]
> > > > > > > > > > Hi Alexander,
> > > > > > > > > >
> > > > > > > > > > The good news: I can consistently reproduce the hang 
> > > > > > > > > > problem.  But the
> > > > > > > > > > bad news is that even with a WITNESS kernel and 
> > > > > > > > > > kern.witness.watch=2
> > > > > > > > > > (or even 3) I don't see any message or kernel panic.
> > > > > > >
> > > > > > > Do you mind sharing your recipe to reproduce the hang?
> > > > >
> > > > > Another one.  This time I've included output from 'show witness /b', 
> > > > > below:
> > > > >
> > > > > pmap_tlb_shootwait: spun outStopped at  db_enter+0x14:  popq    %rbp
> > > >
> > > > I think the number one question here is why did pmap_tlb_shootwait spin
> > > > out?
> > > > This is waiting for other CPUs to finish their work.
> > > > So first of all what value has tlb_shoot_wait (x /qx tlb_shoot_wait)
> > > > The probably also the values of tlb_shoot_first_pcid, tlb_shoot_addr1 
> > > > and
> > > > tlb_shoot_addr2 are of interest. (use x /qx for the addrs and x /x for
> > > > first_pcid).
> > >
> > > Hi Claudio, please see below the output from the commands you
> > > requested.  If you need something else, please let me know.
> >
> > Thanks, could you try the diff below and see if it helps?
>
> Hi Martin, thanks!  I've just applied your patch, built a new kernel
> and now the machine is running the stress test.  I'll keep you posted.

Another crash.  Please see below:

pmap_tlb_shootwait: spun outStopped at  db_enter+0x14:  popq    %rbp

ddb{1}> trace
db_enter() at db_enter+0x14
pmap_page_remove(fffffd81618e56f8) at pmap_page_remove+0x38e
uvm_anfree_list(fffffd9c9f929cb0,0) at uvm_anfree_list+0x63
amap_wipeout(fffffd9c9273d6f0) at amap_wipeout+0x198
uvm_unmap_detach(ffff800055e660f0,0) at uvm_unmap_detach+0xbe
sys_munmap(ffff800055db9750,ffff800055e661f0,ffff800055e66170) at sys_munmap+0x
185
syscall(ffff800055e661f0) at syscall+0x5f9
Xsyscall() at Xsyscall+0x128
end of kernel
end trace frame: 0x25ba4903ca0, count: -8

ddb{1}> show uvm
Current UVM status:
  pagesize=4096 (0x1000), pagemask=0xfff, pageshift=12
  32128633 VM pages: 51836 active, 176494 inactive, 1 wired, 12995723 free (296
7136 zero)
  freemin=1070954, free-target=1427938, inactive-target=0, wired-max=10709544
  faults=531662531, traps=528617067, intrs=1230847, ctxswitch=48166333 fpuswitc
h=0
  softint=313110, syscalls=19991158, kmapent=11
  fault counts:
    noram=0, noanon=0, noamap=0, pgwait=0, pgrele=0
    relocks=288463(4638), upgrades=0(0) anget(retries)=285483(0), amapcopy=3623
6106
    neighbor anon/obj pg=36882/128185, gets(lock/unlock)=351322/293131
    cases: anon=269766, anoncow=15717, obj=336363, prcopy=10291, przero=5310293
80
  daemon and swap counts:
    woke=0, revs=0, scans=0, obscans=0, anscans=0
    busy=0, freed=0, reactivate=0, deactivate=0
    pageouts=0, pending=0, nswget=0
    nswapdev=1
    swpages=16777883, swpginuse=0, swpgonly=0 paging=0
  kernel pointers:
    objs(kern)=0xffffffff82ba5810

ddb{1}> show all locks
Process 9957 (nfdump) thread 0xffff800055ded9e8 (463186)
exclusive rwlock amaplk r = 0 (0xfffffd8bd97d62a0)
exclusive rwlock vmmaplk r = 0 (0xfffffd9ca4496c68)
Process 10322 (nfdump) thread 0xffff800055db8a80 (106827)
exclusive rwlock amaplk r = 0 (0xfffffd8bac1cf148)
shared rwlock vmmaplk r = 0 (0xfffffd9ca82444c8)
Process 10322 (nfdump) thread 0xffff800055dec048 (377455)
exclusive kernel_lock &kernel_lock r = 0 (0xffffffff82ba7508)
exclusive rrwlock inode r = 0 (0xfffffd9cf3193e60)
Process 61123 (nfdump) thread 0xffff800055db9750 (48955)
exclusive rwlock amaplk r = 0 (0xfffffd8bd97d6450)
Process 61123 (nfdump) thread 0xffff800055dec2d8 (397827)
exclusive rwlock amaplk r = 0 (0xfffffd8bac1cff58)
shared rwlock vmmaplk r = 0 (0xfffffd9ca12a5120)

ddb{1}> show witness /b
Number of known direct relationships is 406

Lock order reversal between "&mp->mnt_lock"(rwlock) and "&ip->i_lock"(rrwlock)!

witness: incomplete path, depth 4

ddb{1}> ps
   PID     TID   PPID    UID  S       FLAGS  WAIT          COMMAND
 81994  445896  78898      0  3    0x100083  kqread        top
  9957  160463  18893      0  7         0x3                nfdump
  9957  395677  18893      0  3   0x4000083  fsleep        nfdump
  9957  204084  18893      0  3   0x4000003  vmmaplk       nfdump
  9957   56145  18893      0  3   0x4000083  fsleep        nfdump
  9957  318603  18893      0  3   0x4000083  fsleep        nfdump
  9957  254799  18893      0  3   0x4000083  fsleep        nfdump
  9957    8915  18893      0  3   0x4000083  fsleep        nfdump
  9957  463186  18893      0  7   0x4000003                nfdump
 10322  161786  61442      0  7         0x3                nfdump
 10322  238857  61442      0  3   0x4000083  fsleep        nfdump
 10322  106827  61442      0  7   0x4000003                nfdump
 10322  301683  61442      0  3   0x4000083  fsleep        nfdump
 10322  504174  61442      0  3   0x4000083  fsleep        nfdump
 10322    3469  61442      0  3   0x4000083  fsleep        nfdump
 10322  374393  61442      0  3   0x4000083  fsleep        nfdump
 10322  377455  61442      0  7   0x4000003                nfdump
 61123  226290  28815      0  3        0x83  fsleep        nfdump
 61123  488365  28815      0  3   0x4000083  fsleep        nfdump
*61123   48955  28815      0  7   0x4000003                nfdump
 61123  119112  28815      0  3   0x4000083  fsleep        nfdump
 61123  521289  28815      0  3   0x4000083  fsleep        nfdump
 61123  217110  28815      0  3   0x4000083  fsleep        nfdump
 61123   33442  28815      0  3   0x4000083  fsleep        nfdump
 61123  397827  28815      0  7   0x4000003                nfdump
 18893  493617      1      0  3    0x10008b  sigsusp       sh
 61442  378246      1      0  3    0x10008b  sigsusp       sh
 28815   86747      1      0  3    0x10008b  sigsusp       sh
 78898  518168  73998      0  3    0x10008b  sigsusp       ksh
 73998   78599      1      0  3    0x100080  kqread        tmux
 88247  481547      1      0  3    0x100083  ttyin         getty
  2851  350752      1      0  3    0x100083  ttyin         getty
 95636   97892      1      0  3    0x100083  ttyin         getty
 99165  336734      1      0  3    0x100083  ttyin         getty
 94509  344932      1      0  3    0x100083  ttyin         getty
 27762  274695      1      0  3    0x100083  ttyin         ksh
 17715   49168      1      0  3    0x100098  kqread        cron
 69026  332831      1     99  3   0x1100090  kqread        sndiod
 54694  478450      1    110  3    0x100090  kqread        sndiod
  5697  443271  97255     95  3   0x1100092  kqread        smtpd
 46243   99327  97255    103  3   0x1100092  kqread        smtpd
 50084  224069  97255     95  3   0x1100092  kqread        smtpd
 86000  123418  97255     95  3    0x100092  kqread        smtpd
 82343  450981  97255     95  3   0x1100092  kqread        smtpd
 24636   20127  97255     95  3   0x1100092  kqread        smtpd
 97255  111651      1      0  3    0x100080  kqread        smtpd
 12541   37192      1      0  3        0x88  kqread        sshd
 85367  250626      1      0  3    0x100080  kqread        ntpd
 28096  515227  14026     83  3    0x100092  kqread        ntpd
 14026  502374      1     83  3   0x1100092  kqread        ntpd
 86221   70314  72159     74  3   0x1100092  bpf           pflogd
 72159  268073      1      0  3        0x80  sbwait        pflogd
  6108   37249  51475     73  3   0x1100090  kqread        syslogd
 51475   59727      1      0  3    0x100082  sbwait        syslogd
 41019  133599      1      0  3    0x100080  kqread        resolvd
 97128   82772  95469     77  3    0x100092  kqread        dhcpleased
 53784  299579  95469     77  3    0x100092  kqread        dhcpleased
 95469  522301      1      0  3        0x80  kqread        dhcpleased
 40382  114795  49808    115  3    0x100092  kqread        slaacd
 16896  174980  49808    115  3    0x100092  kqread        slaacd
 49808   29152      1      0  3    0x100080  kqread        slaacd
 41856  209048      0      0  3     0x14200  bored         smr
 32903  448281      0      0  3     0x14200  pgzero        zerothread
 83463  270619      0      0  3     0x14200  aiodoned      aiodoned
 98998  117726      0      0  3     0x14200  syncer        update
 89280  387876      0      0  3     0x14200  cleaner       cleaner
 96217  481684      0      0  3     0x14200  reaper        reaper
 57847  323601      0      0  3     0x14200  pgdaemon      pagedaemon
 44359  303835      0      0  3     0x14200  bored         wsdisplay0
 88760  328401      0      0  3     0x14200  usbtsk        usbtask
 33892  238971      0      0  3     0x14200  usbatsk       usbatsk
 40440  389398      0      0  3  0x40014200  acpi0         acpi0
 46014  489943      0      0  7  0x40014200                idle31
 95730  109730      0      0  7  0x40014200                idle30
 69686  366444      0      0  7  0x40014200                idle29
 95754  517295      0      0  7  0x40014200                idle28
 69388  237366      0      0  7  0x40014200                idle27
 97978  394450      0      0  7  0x40014200                idle26
 36311  201255      0      0  7  0x40014200                idle25
 20892  262557      0      0  7  0x40014200                idle24
 82220   30818      0      0  7  0x40014200                idle23
 14077  250183      0      0  7  0x40014200                idle22
 80410  225236      0      0  7  0x40014200                idle21
 87632   27892      0      0  7  0x40014200                idle20
 74401  120036      0      0  7  0x40014200                idle19
 90877  181432      0      0  7  0x40014200                idle18
 86406  145349      0      0  7  0x40014200                idle17
 19938  148411      0      0  7  0x40014200                idle16
 43568  369660      0      0  7  0x40014200                idle15
 88265  368457      0      0  7  0x40014200                idle14
 39599   71270      0      0  3  0x40014200                idle13
 58064  267069      0      0  7  0x40014200                idle12
  9278  103115      0      0  7  0x40014200                idle11
 35091  353089      0      0  3  0x40014200                idle10
 65872  256306      0      0  7  0x40014200                idle9
  1552  514975      0      0  3  0x40014200                idle8
 85383  240017      0      0  7  0x40014200                idle7
 85335  105423      0      0  3  0x40014200                idle6
 75978   17921      0      0  7  0x40014200                idle5
 11959  522673      0      0  3  0x40014200                idle4
 51461  450362      0      0  7  0x40014200                idle3
 47181  512214      0      0  7  0x40014200                idle2
 75561  241987      0      0  3  0x40014200                idle1
 44257    7285      0      0  3     0x14200  bored         sensors
 67558   43722      0      0  3     0x14200  bored         softnet3
 33902  372321      0      0  3     0x14200  bored         softnet2
 89790  128317      0      0  3     0x14200  bored         softnet1
 13715  273891      0      0  3     0x14200  bored         softnet0
 42976  459414      0      0  3     0x14200  bored         systqmp
 31099  120209      0      0  3     0x14200  bored         systq
 99963  372420      0      0  3     0x14200  tmoslp        softclockmp
  3258  343907      0      0  3  0x40014200  tmoslp        softclock
 36969  310856      0      0  3  0x40014200                idle0
     1  511358      0      0  3        0x82  wait          init
     0       0     -1      0  3     0x10200  scheduler     swapper

>
> Thanks again,
> --Kor
>
> >
> > > > [...]
> > > > The else case of the above if block does pmap_tlb_shootwait() but then
> > > > after that a 2nd pmap_tlb_shootwait() is issued. Not sure if this is 
> > > > safe
> > > > in all cases. Especially the first pmap_tlb_shootwait() is done with the
> > > > pm_mtx locked.
> >
> > I believe there's an issue with holding the pm_mtx.
> >
> > Index: arch/amd64/amd64/pmap.c
> > ===================================================================
> > RCS file: /cvs/src/sys/arch/amd64/amd64/pmap.c,v
> > diff -u -p -r1.180 pmap.c
> > --- arch/amd64/amd64/pmap.c     19 Jun 2025 12:01:08 -0000      1.180
> > +++ arch/amd64/amd64/pmap.c     1 Jul 2025 18:06:28 -0000
> > @@ -2947,7 +2947,6 @@ enter_now:
> >                 if (nocache && (opte & PG_N) == 0) /* XXX impossible? */
> >                         wbinvd_on_all_cpus();
> >                 pmap_tlb_shootpage(pmap, va, shootself);
> > -               pmap_tlb_shootwait();
> >                 PTE_BASE[pl1_i(va)] = npte;
> >         }
> >
> >
> >
> >

Reply via email to