Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Johnston
On Mon, Mar 07, 2022 at 09:54:26PM +0100, Ronald Klop wrote:
>  
> From: Mark Johnston
> Date: Monday, 7 March 2022 16:13
> To: Ronald Klop
> CC: bob prohaska, Mark Millard,
> freebsd-...@freebsd.org, freebsd-current
> > I haven't been able to reproduce any crashes running poudriere in an
> > arm64 AWS instance, though.  Could you please try the patch below and
> > confirm whether it fixes your panics?  I verified that the apparent
> > problem described above is gone with the patch.
> > 
> > diff --git a/sys/kern/kern_rmlock.c b/sys/kern/kern_rmlock.c
> > index 0cdcfb8fec62..e51c25136ae0 100644
> > --- a/sys/kern/kern_rmlock.c
> > +++ b/sys/kern/kern_rmlock.c
> > @@ -437,6 +437,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
> >  {
> >     struct thread *td = curthread;
> >     struct pcpu *pc;
> > +   int cpuid;
> >  
> >     if (SCHEDULER_STOPPED())
> >         return (1);
> > @@ -452,6 +453,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
> >     atomic_interrupt_fence();
> >  
> >     pc = get_pcpu();
> > +   cpuid = pc->pc_cpuid;
> >     rm_tracker_add(pc, tracker);
> >     sched_pin();
> >  
> > @@ -463,7 +465,7 @@ _rm_rlock(struct rmlock *rm, struct rm_priotracker *tracker, int trylock)
> >      * conditional jump.
> >      */
> >     if (__predict_true(0 == (td->td_owepreempt |
> > -       CPU_ISSET(pc->pc_cpuid, &rm->rm_writecpus))))
> > +       CPU_ISSET(cpuid, &rm->rm_writecpus))))
> >         return (1);
> >  
> >     /* We do not have a read token and need to acquire one. */
> > 
> > 
> > 
> 
> Hi,
> 
> With this patch it panicked again:
>   x0: a5a31500
>   x1: a5a0e000
>   x2:2
>   x3: a00076c4e9a0
>   x4:0
>   x5:e672743c8f9e5
>   x6:dc89f70500ab1
>   x7:   14
>   x8: a5a31518
>   x9:1
>  x10: a5a0e000
>  x11:0
>  x12:0
>  x13:a
>  x14: 1013e6b85a8ecbe4
>  x15: 1dce740d11a5
>  x16: 3ea86e2434bf
>  x17: fff2
>  x18: fe661800 (g_ctx + fcf9fa54)
>  x19: a00076c4e9a0
>  x20: fec39000 (g_ctx + fd577254)
>  x21:2
>  x22: 419b6090 (g_ctx + 402f42e4)
>  x23: 00c0b137 (lockstat_enabled + 0)
>  x24:  100
>  x25: 00c0b000 (version + a0)
>  x26: 00c0b000 (version + a0)
>  x27: 00c0b000 (version + a0)
>  x28:0
>  x29: fe661800 (g_ctx + fcf9fa54)
>   sp: fe661800
>   lr: 0154ea50 (zio_dva_throttle + 154)
>  elr: 0154ea80 (zio_dva_throttle + 184)
> spsr: 6045
>  far: 2b753286b0b8
> panic: Unknown kernel exception 0 esr_el1 2000000
> cpuid = 1
> time = 1646685857
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> do_el1h_sync() at do_el1h_sync+0x184
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x2000000
> zio_dva_throttle() at zio_dva_throttle+0x184
> zio_execute() at zio_execute+0x58
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at  kdb_enter+0x44: undefined   f901c11f
> db>  

ZFS doesn't make use of rm locks as far as I can see, so this is a
little weird.  I reverted the original rmlock commit in main, so it may
be worth verifying that the problem really is gone before digging
deeper.  In other words, I'm a bit suspicious that this is a different
bug.



Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Millard



On 2022-Mar-7, at 12:54, Ronald Klop  wrote:

> From: Mark Johnston
> Date: Monday, 7 March 2022 16:13
> To: Ronald Klop
> CC: bob prohaska, Mark Millard,
> freebsd-...@freebsd.org, freebsd-current
> Subject: Re: panic: data abort in critical section or under mutex (was: Re:
> panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb
> 28))
> 
> On Mon, Mar 07, 2022 at 02:46:09PM +0100, Ronald Klop wrote:
> > Dear Mark Johnston,
> >
> > I did some binary search in the kernels and came to the conclusion that 
> > https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
> >  still works and 
> > https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
> >  panics.
> >
> > I suspect your commit in 
> > https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.
> >
> > Last panic:
> >
> > panic: vm_fault failed: 0046e708 error 1
> > cpuid = 1
> > time = 1646660058
> > KDB: stack backtrace:
> > db_trace_self() at db_trace_self
> > db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> > vpanic() at vpanic+0x174
> > panic() at panic+0x44
> > data_abort() at data_abort+0x2e8
> > handle_el1h_sync() at handle_el1h_sync+0x10
> > --- exception, esr 0x9604
> > _rm_rlock_debug() at _rm_rlock_debug+0x8c
> > osd_get() at osd_get+0x5c
> > zio_execute() at zio_execute+0xf8
> > taskqueue_run_locked() at taskqueue_run_locked+0x178
> > taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> > fork_exit() at fork_exit+0x74
> > fork_trampoline() at fork_trampoline+0x14
> > KDB: enter: panic
> > [ thread pid 0 tid 100129 ]
> > Stopped at  kdb_enter+0x44: undefined   f902011f
> > db>
> >
> > A more recent kernel (912df91) still panics. See below.
> >
> > Do you have time to look into this? What can I provide in information to 
> > help?
> 
> Hmm.  So after my rmlock commits, we have the following disassembly for
> _rm_rlock() (with a few annotations added by me).  Note that the pcpu
> pointer is stored in register x18 by convention.
> 
>    0x0046e304 <+0>:     stp     x29, x30, [sp, #-16]!
>    0x0046e308 <+4>:     mov     x29, sp
>    0x0046e30c <+8>:     ldr     x8, [x18]
>    0x0046e310 <+12>:    ldr     x9, [x18]
>    0x0046e314 <+16>:    ldr     x10, [x18]
>    0x0046e318 <+20>:    cmp     x9, x10
>    0x0046e31c <+24>:    b.ne    0x0046e3cc <_rm_rlock+200>  // b.any
>    0x0046e320 <+28>:    ldr     x9, [x18]
>    0x0046e324 <+32>:    ldrh    w9, [x9, #314]
>    0x0046e328 <+36>:    cbnz    w9, 0x0046e3c0 <_rm_rlock+188>
>    0x0046e32c <+40>:    str     wzr, [x1, #32]
>    0x0046e330 <+44>:    stp     x0, x8, [x1, #16]
>    0x0046e334 <+48>:    ldrb    w9, [x0, #10]
>    0x0046e338 <+52>:    tbz     w9, #4, 0x0046e358 <_rm_rlock+84>
>    0x0046e33c <+56>:    ldr     x9, [x18]
>    0x0046e340 <+60>:    ldr     w10, [x9, #888]
>    0x0046e344 <+64>:    add     w10, w10, #0x1
>    0x0046e348 <+68>:    str     w10, [x9, #888]
>    0x0046e34c <+72>:    ldr     x9, [x18]
>    0x0046e350 <+76>:    ldr     w9, [x9, #888]
>    0x0046e354 <+80>:    cbz     w9, 0x0046e3f4 <_rm_rlock+240>
>    0x0046e358 <+84>:    ldr     w9, [x8, #1212]
>    0x0046e35c <+88>:    add     x10, x18, #0x90
>    0x0046e360 <+92>:    add     w9, w9, #0x1
>    0x0046e364 <+96>:    str     w9, [x8, #1212]  <--- critical_enter
>    0x0046e368 <+100>:   str     x10, [x1, #8]
>    0x0046e36c <+104>:   ldr     x9, [x18, #144]
>    0x0046e370 <+108>:   str     x9, [x1]
>    0x0046e374 <+112>:   str     x1, [x9, #8]
>    0x0046e378 <+116>:   str     x1, [x18, #144]
>    0x0046e37c <+120>:   ldr     x9, [x18]
>    0x0046e380 <+124>:   ldr     w10, [x9, #356]
>    0x0046e384 <+128>:   add     w10, w10, #0x1
>    0x0046e388 <+132>:   str     w10, [x9, #356]
>    0x0046e38c <+136>:   ldr     w9, [x8, #1212]
>    0x0046e390 <+140>:   sub     w9, w9, #0x1
>    0x0046e394 <+144>:   str     w9, [x8, #1212]  <--- critical_exit
>    0x0046e398 <+148>:   ldrb    w8, [x8, #304]
>    0x0046e39c <+152>:   ldr     w9, [x18, #60]   <--- loading &pc->pc_cpuid
>    ...
> 
> A (the?) problem is that the compiler is treating "pc" as an alias
> for x18, but the rmlock code assumes that the pcpu pointer is loaded
> once, as it dereferences "pc" outside of the critical section.  On
> arm64, if a context switch occurs between the store at _rm_rlock+144 and
> the load at +152, and the thread is migrated to another CPU, then we'll
> end up using the wrong CPU ID in the rm->rm_writecpus test.
> 
> I suspect the problem is unique to arm64 as its get_pcpu()
> implementation is different from the others in 

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Ronald Klop


From: Mark Johnston
Date: Monday, 7 March 2022 16:13
To: Ronald Klop
CC: bob prohaska, Mark Millard,
freebsd-...@freebsd.org, freebsd-current
Subject: Re: panic: data abort in critical section or under mutex (was: Re:
panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb
28))


On Mon, Mar 07, 2022 at 02:46:09PM +0100, Ronald Klop wrote:
> Dear Mark Johnston,
>
> I did some binary search in the kernels and came to the conclusion that
> https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
> still works and
> https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
> panics.
>
> I suspect your commit in
> https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.
>
> Last panic:
>
> panic: vm_fault failed: 0046e708 error 1
> cpuid = 1
> time = 1646660058
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> data_abort() at data_abort+0x2e8
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x9604
> _rm_rlock_debug() at _rm_rlock_debug+0x8c
> osd_get() at osd_get+0x5c
> zio_execute() at zio_execute+0xf8
> taskqueue_run_locked() at taskqueue_run_locked+0x178
> taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at  kdb_enter+0x44: undefined   f902011f
> db>
>
> A more recent kernel (912df91) still panics. See below.
>
> Do you have time to look into this? What can I provide in information to help?

Hmm.  So after my rmlock commits, we have the following disassembly for
_rm_rlock() (with a few annotations added by me).  Note that the pcpu
pointer is stored in register x18 by convention.

   0x0046e304 <+0>:     stp     x29, x30, [sp, #-16]!
   0x0046e308 <+4>:     mov     x29, sp
   0x0046e30c <+8>:     ldr     x8, [x18]
   0x0046e310 <+12>:    ldr     x9, [x18]
   0x0046e314 <+16>:    ldr     x10, [x18]
   0x0046e318 <+20>:    cmp     x9, x10
   0x0046e31c <+24>:    b.ne    0x0046e3cc <_rm_rlock+200>  // b.any
   0x0046e320 <+28>:    ldr     x9, [x18]
   0x0046e324 <+32>:    ldrh    w9, [x9, #314]
   0x0046e328 <+36>:    cbnz    w9, 0x0046e3c0 <_rm_rlock+188>
   0x0046e32c <+40>:    str     wzr, [x1, #32]
   0x0046e330 <+44>:    stp     x0, x8, [x1, #16]
   0x0046e334 <+48>:    ldrb    w9, [x0, #10]
   0x0046e338 <+52>:    tbz     w9, #4, 0x0046e358 <_rm_rlock+84>
   0x0046e33c <+56>:    ldr     x9, [x18]
   0x0046e340 <+60>:    ldr     w10, [x9, #888]
   0x0046e344 <+64>:    add     w10, w10, #0x1
   0x0046e348 <+68>:    str     w10, [x9, #888]
   0x0046e34c <+72>:    ldr     x9, [x18]
   0x0046e350 <+76>:    ldr     w9, [x9, #888]
   0x0046e354 <+80>:    cbz     w9, 0x0046e3f4 <_rm_rlock+240>
   0x0046e358 <+84>:    ldr     w9, [x8, #1212]
   0x0046e35c <+88>:    add     x10, x18, #0x90
   0x0046e360 <+92>:    add     w9, w9, #0x1
   0x0046e364 <+96>:    str     w9, [x8, #1212]  <--- critical_enter
   0x0046e368 <+100>:   str     x10, [x1, #8]
   0x0046e36c <+104>:   ldr     x9, [x18, #144]
   0x0046e370 <+108>:   str     x9, [x1]
   0x0046e374 <+112>:   str     x1, [x9, #8]
   0x0046e378 <+116>:   str     x1, [x18, #144]
   0x0046e37c <+120>:   ldr     x9, [x18]
   0x0046e380 <+124>:   ldr     w10, [x9, #356]
   0x0046e384 <+128>:   add     w10, w10, #0x1
   0x0046e388 <+132>:   str     w10, [x9, #356]
   0x0046e38c <+136>:   ldr     w9, [x8, #1212]
   0x0046e390 <+140>:   sub     w9, w9, #0x1
   0x0046e394 <+144>:   str     w9, [x8, #1212]  <--- critical_exit
   0x0046e398 <+148>:   ldrb    w8, [x8, #304]
   0x0046e39c <+152>:   ldr     w9, [x18, #60]   <--- loading &pc->pc_cpuid
   ...

A (the?) problem is that the compiler is treating "pc" as an alias
for x18, but the rmlock code assumes that the pcpu pointer is loaded
once, as it dereferences "pc" outside of the critical section.  On
arm64, if a context switch occurs between the store at _rm_rlock+144 and
the load at +152, and the thread is migrated to another CPU, then we'll
end up using the wrong CPU ID in the rm->rm_writecpus test.

I suspect the problem is unique to arm64 as its get_pcpu()
implementation is different from the others in that it doesn't use
volatile-qualified inline assembly.  This has been the case since
https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
.

I haven't been able to reproduce any crashes running poudriere in an
arm64 AWS instance, though.  Could you 

Re: vnet jails loose network connectivity

2022-03-07 Thread Johan Hendriks



On 04/03/2022 15:36, Johan Hendriks wrote:
Hello all, I use jails for some testing, but I cannot seem to make the
setup stable.
I use vnet jails with a bridge, but when I put some load on it, some
jails lose their network connectivity.

My setup is as follows: an haproxy jail with internal IP 10.233.185.20,
using binat to make it publicly accessible.

Then a varnish jail and two web servers, all in the 10.233.185.x range.

If I give it a little load with hey (hey -h2 -n 10 -c 20 -z 60s
https://wp.test.nl), then during the test the haproxy jail becomes
unreachable: it cannot be pinged from the host machine or from
the other jails. Restarting the jails solves it. When I left the system
alone for some time, I also saw the varnish jail become unresponsive.

If I run tcpdump on the epair${name}a interface, I do see the packets
from the host machine to the jail, but the jail itself is not reachable.

There is nothing in the logs of the host or of the jail itself. I can
ping the jail's IP address from within the jail itself.

I do not think I have a special setup, but I could be doing something
wrong.

my jail.conf

# Global settings applied to all jails.
$domain = "test.nl";
$subdomain = "";

exec.start = "/bin/sh /etc/rc";
exec.stop = "/bin/sh /etc/rc.shutdown";
exec.clean;

mount.fstab = "/storage/jails/$name.fstab";

exec.system_user  = "root";
exec.jail_user    = "root";
mount.devfs;
sysvshm="new";
sysvsem="new";
allow.raw_sockets;
allow.set_hostname = 0;
allow.sysvipc;
enforce_statfs = "2";
devfs_ruleset = "11";

path = "/storage/jails/${name}";
host.hostname = "${name}${subdomain}.${domain}";

# Networking
$uplinkdev    = "vtnet1";
$epid = "${ip}";
$subnet   = "10.233.185.";
$cidr = "/24";
$ipv4_addr    = "${subnet}${ip}${cidr}";
vnet;
vnet.interface    = "vnet0";

$epair=epair${ip};
vnet;
#vnet.interface    = "${epair}b";  # default vnet interface
exec.prestart = "ifconfig bridge0 > /dev/null 2>&1 || ( ifconfig bridge0 create up && ifconfig bridge0 addm $uplinkdev )";
exec.prestart    += "ifconfig ${epair} create up description jail_${name}   || echo 'Skipped creating epair (exists?)'";
exec.prestart    += "ifconfig bridge0 addm ${epair}a   || echo 'Skipped adding bridge member (already member?)'";

exec.created  = "ifconfig ${epair}b name vnet0";
exec.start    = "/bin/sh /etc/rc";
exec.consolelog   = "/var/log/jail/$name.test.nl";
exec.stop = "/bin/sh /etc/rc.shutdown";
exec.poststop = "ifconfig bridge0 deletem ${epair}a";
exec.poststop    += "ifconfig ${epair}a destroy";

varnish01 {
    $ip = 16;
    mount.fstab = "";
    path = "/storage/jails/${name}";
}

web01 {
    $ip = 18;
}

web02 {
    $ip = 19;
}

haproxy {
    $ip = 20;
    mount.fstab = "";
    path = "/storage/jails/${name}";
}

My ifconfig

bridge0: flags=8843 metric 0 mtu 1500
    ether 58:9c:fc:10:ff:82
    inet 10.233.185.1 netmask 0xffffff00 broadcast 10.233.185.255
    id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
    maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
    root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
    member: epair20a flags=143
    ifmaxaddr 0 port 13 priority 128 path cost 2000
    member: epair19a flags=143
    ifmaxaddr 0 port 53 priority 128 path cost 2000
    member: epair18a flags=143
    ifmaxaddr 0 port 48 priority 128 path cost 2000
    member: epair16a flags=143
    ifmaxaddr 0 port 28 priority 128 path cost 2000
    groups: bridge
    nd6 options=9
epair16a: flags=8963 metric 0 mtu 1500
    description: jail_varnish01
    options=8
    ether 02:76:32:8e:0e:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T )
    status: active
    nd6 options=29
epair18a: flags=8963 metric 0 mtu 1500
    description: jail_web01
    options=8
    ether 02:6d:be:b8:36:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T )
    status: active
    nd6 options=29
epair19a: flags=8963 metric 0 mtu 1500
    description: jail_web02
    options=8
    ether 02:54:fd:77:9a:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T )
    status: active
    nd6 options=29
epair20a: flags=8963 metric 0 mtu 1500
    description: jail_haproxy
    options=8
    ether 02:f8:58:06:78:0a
    groups: epair
    media: Ethernet 10Gbase-T (10Gbase-T )
    status: active
    nd6 options=29

This is on both 13-STABLE and 14-HEAD.


For the sake of testing I tried it with FreeBSD 13.0-RELEASE-p7, and that
works fine. It is an exact copy of the setup I use on 14-CURRENT and
13-STABLE (I did a ZFS send and receive of the jails and copied
jail.conf, pf.conf and so on). I ran the hey command against the
13.0-RELEASE host multiple times.


hey -h2 -n 10 -c 30 -z 300s https://wp.test.nl

Summary:
  Total:    300.0045 secs
  Slowest:    0.1137 secs
  Fastest:    0.0006 secs
  Average:    0.0090 secs
  Requests/sec:    4627.4504


Response time histogram:
  0.001 [1]    |
  0.012 [977291] 

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Johnston
On Mon, Mar 07, 2022 at 10:03:51AM -0800, Mark Millard wrote:
> 
> 
> On 2022-Mar-7, at 08:45, Mark Johnston  wrote:
> 
> > On Mon, Mar 07, 2022 at 04:25:22PM +0000, Andrew Turner wrote:
> >> 
> >>> On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
> >>> ...
> >>> A (the?) problem is that the compiler is treating "pc" as an alias
> >>> for x18, but the rmlock code assumes that the pcpu pointer is loaded
> >>> once, as it dereferences "pc" outside of the critical section.  On
> >>> arm64, if a context switch occurs between the store at _rm_rlock+144 and
> >>> the load at +152, and the thread is migrated to another CPU, then we'll
> >>> end up using the wrong CPU ID in the rm->rm_writecpus test.
> >>> 
> >>> I suspect the problem is unique to arm64 as its get_pcpu()
> >>> implementation is different from the others in that it doesn't use
> >>> volatile-qualified inline assembly.  This has been the case since
> >>> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
> >>> .
> >>> 
> >>> I haven't been able to reproduce any crashes running poudriere in an
> >>> arm64 AWS instance, though.  Could you please try the patch below and
> >>> confirm whether it fixes your panics?  I verified that the apparent
> >>> problem described above is gone with the patch.
> >> 
> >> Alternatively (or additionally) we could do something like the following. 
> >> There are only a few MI users of get_pcpu with the main place being in rm 
> >> locks.
> >> 
> >> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
> >> index 09f6361c651c..59b890e5c2ea 100644
> >> --- a/sys/arm64/include/pcpu.h
> >> +++ b/sys/arm64/include/pcpu.h
> >> @@ -58,7 +58,14 @@ struct pcpu;
> >> 
> >> register struct pcpu *pcpup __asm ("x18");
> >> 
> >> -#define get_pcpu()  pcpup
> >> +static inline struct pcpu *
> >> +get_pcpu(void)
> >> +{
> >> +   struct pcpu *pcpu;
> >> +
> >> +   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
> >> +   return (pcpu);
> >> +}
> >> 
> >> static inline struct thread *
> >> get_curthread(void)
> > 
> > Indeed, I think this is probably the best solution.

Thinking a bit more, even with that patch, code like this may not behave
the same on arm64 as on other platforms:

critical_enter();
ptr = &PCPU_GET(foo);
critical_exit();
bar = *ptr;

since as far as I can see the compiler may translate it to

critical_enter();
critical_exit();
bar = PCPU_GET(foo);
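
The difference between the two sequences can be shown with a small user-space model. This is only a sketch under stated assumptions: `sketch_x18`, `SKETCH_PCPU_PTR()` and the `sketch_read_foo_*()` functions are hypothetical stand-ins, not kernel interfaces, and the migration is forced by hand at the point where the critical section has already been left.

```c
#include <assert.h>

/* Hypothetical per-CPU area; "foo" mirrors the field name used above. */
struct sketch_pcpu {
	int foo;
};

static struct sketch_pcpu sketch_cpus[2] = { { 100 }, { 200 } };

/* Stands in for x18: always points at the pcpu of the running CPU. */
static struct sketch_pcpu *sketch_x18 = &sketch_cpus[0];

/* Stand-in for an address that is computed relative to the current x18. */
#define	SKETCH_PCPU_PTR(member)	(&sketch_x18->member)

/*
 * Models the transformed sequence: the address is only formed when it
 * is dereferenced, i.e. after the thread has moved to another CPU.
 */
static int
sketch_read_foo_sunk(int migrate_to)
{
	assert(migrate_to >= 0 && migrate_to < 2);
	sketch_x18 = &sketch_cpus[0];		/* section entered on CPU 0 */
	sketch_x18 = &sketch_cpus[migrate_to];	/* migrated after the exit */
	return (*SKETCH_PCPU_PTR(foo));
}

/* Models copying the value while the thread still cannot migrate. */
static int
sketch_read_foo_copied(int migrate_to)
{
	int bar;

	assert(migrate_to >= 0 && migrate_to < 2);
	sketch_x18 = &sketch_cpus[0];
	bar = *SKETCH_PCPU_PTR(foo);		/* read inside the section */
	sketch_x18 = &sketch_cpus[migrate_to];
	return (bar);
}
```

In this model, `sketch_read_foo_sunk(1)` returns CPU 1's value even though the section was entered on CPU 0, while `sketch_read_foo_copied(1)` returns CPU 0's value.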

> Is this just partially reverting:
> 
> https://cgit.freebsd.org/src/commit/?id=63c858a04d56
> 
> If so, there might need to be comments about why the updated
> code is as it will be.
> 
> Looks like stable/13 picked up sensitivity to the get_pcpu
> details in rmlock in:
> 
> https://cgit.freebsd.org/src/commit/?h=stable/13&id=543157870da5
> 
> (a 2022-03-04 commit) and stable/13 also has the get_pcpu
> misdefinition in:
> 
> https://cgit.freebsd.org/src/commit/sys/arm64/include/pcpu.h?h=stable/13&id=63c858a04d56
> 
> . So an MFC would be appropriate in order for aarch64
> to be reliable for any variations in get_pcpu in stable/13
> (and for 13.1 to be so as well).

I reverted the rmlock commit in stable/13 already.  Either get_pcpu()
will be fixed shortly or 13.1 will ship without the rmlock commit.



Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Millard



On 2022-Mar-7, at 08:45, Mark Johnston  wrote:

> On Mon, Mar 07, 2022 at 04:25:22PM +0000, Andrew Turner wrote:
>> 
>>> On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
>>> ...
>>> A (the?) problem is that the compiler is treating "pc" as an alias
>>> for x18, but the rmlock code assumes that the pcpu pointer is loaded
>>> once, as it dereferences "pc" outside of the critical section.  On
>>> arm64, if a context switch occurs between the store at _rm_rlock+144 and
>>> the load at +152, and the thread is migrated to another CPU, then we'll
>>> end up using the wrong CPU ID in the rm->rm_writecpus test.
>>> 
>>> I suspect the problem is unique to arm64 as its get_pcpu()
>>> implementation is different from the others in that it doesn't use
>>> volatile-qualified inline assembly.  This has been the case since
>>> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
>>> .
>>> 
>>> I haven't been able to reproduce any crashes running poudriere in an
>>> arm64 AWS instance, though.  Could you please try the patch below and
>>> confirm whether it fixes your panics?  I verified that the apparent
>>> problem described above is gone with the patch.
>> 
>> Alternatively (or additionally) we could do something like the following. 
>> There are only a few MI users of get_pcpu with the main place being in rm 
>> locks.
>> 
>> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
>> index 09f6361c651c..59b890e5c2ea 100644
>> --- a/sys/arm64/include/pcpu.h
>> +++ b/sys/arm64/include/pcpu.h
>> @@ -58,7 +58,14 @@ struct pcpu;
>> 
>> register struct pcpu *pcpup __asm ("x18");
>> 
>> -#define get_pcpu()  pcpup
>> +static inline struct pcpu *
>> +get_pcpu(void)
>> +{
>> +   struct pcpu *pcpu;
>> +
>> +   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
>> +   return (pcpu);
>> +}
>> 
>> static inline struct thread *
>> get_curthread(void)
> 
> Indeed, I think this is probably the best solution.

Is this just partially reverting:

https://cgit.freebsd.org/src/commit/?id=63c858a04d56

If so, there might need to be comments about why the updated
code is as it will be.

Looks like stable/13 picked up sensitivity to the get_pcpu
details in rmlock in:

https://cgit.freebsd.org/src/commit/?h=stable/13&id=543157870da5

(a 2022-03-04 commit) and stable/13 also has the get_pcpu
misdefinition in:

https://cgit.freebsd.org/src/commit/sys/arm64/include/pcpu.h?h=stable/13&id=63c858a04d56

. So an MFC would be appropriate in order for aarch64
to be reliable for any variations in get_pcpu in stable/13
(and for 13.1 to be so as well).

===
Mark Millard
marklmi at yahoo.com




Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Johnston
On Mon, Mar 07, 2022 at 04:25:22PM +0000, Andrew Turner wrote:
> 
> > On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
> > ...
> > A (the?) problem is that the compiler is treating "pc" as an alias
> > for x18, but the rmlock code assumes that the pcpu pointer is loaded
> > once, as it dereferences "pc" outside of the critical section.  On
> > arm64, if a context switch occurs between the store at _rm_rlock+144 and
> > the load at +152, and the thread is migrated to another CPU, then we'll
> > end up using the wrong CPU ID in the rm->rm_writecpus test.
> > 
> > I suspect the problem is unique to arm64 as its get_pcpu()
> > implementation is different from the others in that it doesn't use
> > volatile-qualified inline assembly.  This has been the case since
> > https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
> > .
> > 
> > I haven't been able to reproduce any crashes running poudriere in an
> > arm64 AWS instance, though.  Could you please try the patch below and
> > confirm whether it fixes your panics?  I verified that the apparent
> > problem described above is gone with the patch.
> 
> Alternatively (or additionally) we could do something like the following. 
> There are only a few MI users of get_pcpu with the main place being in rm 
> locks.
> 
> diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
> index 09f6361c651c..59b890e5c2ea 100644
> --- a/sys/arm64/include/pcpu.h
> +++ b/sys/arm64/include/pcpu.h
> @@ -58,7 +58,14 @@ struct pcpu;
> 
>  register struct pcpu *pcpup __asm ("x18");
> 
> -#define get_pcpu()  pcpup
> +static inline struct pcpu *
> +get_pcpu(void)
> +{
> +   struct pcpu *pcpu;
> +
> +   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
> +   return (pcpu);
> +}
> 
>  static inline struct thread *
>  get_curthread(void)

Indeed, I think this is probably the best solution.



Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Andrew Turner

> On 7 Mar 2022, at 15:13, Mark Johnston  wrote:
> ...
> A (the?) problem is that the compiler is treating "pc" as an alias
> for x18, but the rmlock code assumes that the pcpu pointer is loaded
> once, as it dereferences "pc" outside of the critical section.  On
> arm64, if a context switch occurs between the store at _rm_rlock+144 and
> the load at +152, and the thread is migrated to another CPU, then we'll
> end up using the wrong CPU ID in the rm->rm_writecpus test.
> 
> I suspect the problem is unique to arm64 as its get_pcpu()
> implementation is different from the others in that it doesn't use
> volatile-qualified inline assembly.  This has been the case since
> https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
> .
> 
> I haven't been able to reproduce any crashes running poudriere in an
> arm64 AWS instance, though.  Could you please try the patch below and
> confirm whether it fixes your panics?  I verified that the apparent
> problem described above is gone with the patch.

Alternatively (or additionally) we could do something like the following. There 
are only a few MI users of get_pcpu with the main place being in rm locks.

diff --git a/sys/arm64/include/pcpu.h b/sys/arm64/include/pcpu.h
index 09f6361c651c..59b890e5c2ea 100644
--- a/sys/arm64/include/pcpu.h
+++ b/sys/arm64/include/pcpu.h
@@ -58,7 +58,14 @@ struct pcpu;

 register struct pcpu *pcpup __asm ("x18");

-#define get_pcpu()  pcpup
+static inline struct pcpu *
+get_pcpu(void)
+{
+   struct pcpu *pcpu;
+
+   __asm __volatile("mov   %0, x18" : "=&r"(pcpu));
+   return (pcpu);
+}

 static inline struct thread *
 get_curthread(void)

Andrew



Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Johnston
On Mon, Mar 07, 2022 at 02:46:09PM +0100, Ronald Klop wrote:
> Dear Mark Johnston,
> 
> I did some binary search in the kernels and came to the conclusion that 
> https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
>  still works and 
> https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
>  panics.
> 
> I suspect your commit in 
> https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.
> 
> Last panic:
> 
> panic: vm_fault failed: 0046e708 error 1
> cpuid = 1
> time = 1646660058
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> data_abort() at data_abort+0x2e8
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x9604
> _rm_rlock_debug() at _rm_rlock_debug+0x8c
> osd_get() at osd_get+0x5c
> zio_execute() at zio_execute+0xf8
> taskqueue_run_locked() at taskqueue_run_locked+0x178
> taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at  kdb_enter+0x44: undefined   f902011f
> db>
> 
> A more recent kernel (912df91) still panics. See below.
> 
> Do you have time to look into this? What can I provide in information to help?

Hmm.  So after my rmlock commits, we have the following disassembly for
_rm_rlock() (with a few annotations added by me).  Note that the pcpu
pointer is stored in register x18 by convention.

   0x0046e304 <+0>:     stp     x29, x30, [sp, #-16]!
   0x0046e308 <+4>:     mov     x29, sp
   0x0046e30c <+8>:     ldr     x8, [x18]
   0x0046e310 <+12>:    ldr     x9, [x18]
   0x0046e314 <+16>:    ldr     x10, [x18]
   0x0046e318 <+20>:    cmp     x9, x10
   0x0046e31c <+24>:    b.ne    0x0046e3cc <_rm_rlock+200>  // b.any
   0x0046e320 <+28>:    ldr     x9, [x18]
   0x0046e324 <+32>:    ldrh    w9, [x9, #314]
   0x0046e328 <+36>:    cbnz    w9, 0x0046e3c0 <_rm_rlock+188>
   0x0046e32c <+40>:    str     wzr, [x1, #32]
   0x0046e330 <+44>:    stp     x0, x8, [x1, #16]
   0x0046e334 <+48>:    ldrb    w9, [x0, #10]
   0x0046e338 <+52>:    tbz     w9, #4, 0x0046e358 <_rm_rlock+84>
   0x0046e33c <+56>:    ldr     x9, [x18]
   0x0046e340 <+60>:    ldr     w10, [x9, #888]
   0x0046e344 <+64>:    add     w10, w10, #0x1
   0x0046e348 <+68>:    str     w10, [x9, #888]
   0x0046e34c <+72>:    ldr     x9, [x18]
   0x0046e350 <+76>:    ldr     w9, [x9, #888]
   0x0046e354 <+80>:    cbz     w9, 0x0046e3f4 <_rm_rlock+240>
   0x0046e358 <+84>:    ldr     w9, [x8, #1212]
   0x0046e35c <+88>:    add     x10, x18, #0x90
   0x0046e360 <+92>:    add     w9, w9, #0x1
   0x0046e364 <+96>:    str     w9, [x8, #1212]  <--- critical_enter
   0x0046e368 <+100>:   str     x10, [x1, #8]
   0x0046e36c <+104>:   ldr     x9, [x18, #144]
   0x0046e370 <+108>:   str     x9, [x1]
   0x0046e374 <+112>:   str     x1, [x9, #8]
   0x0046e378 <+116>:   str     x1, [x18, #144]
   0x0046e37c <+120>:   ldr     x9, [x18]
   0x0046e380 <+124>:   ldr     w10, [x9, #356]
   0x0046e384 <+128>:   add     w10, w10, #0x1
   0x0046e388 <+132>:   str     w10, [x9, #356]
   0x0046e38c <+136>:   ldr     w9, [x8, #1212]
   0x0046e390 <+140>:   sub     w9, w9, #0x1
   0x0046e394 <+144>:   str     w9, [x8, #1212]  <--- critical_exit
   0x0046e398 <+148>:   ldrb    w8, [x8, #304]
   0x0046e39c <+152>:   ldr     w9, [x18, #60]   <--- loading &pc->pc_cpuid
   ...

A (the?) problem is that the compiler is treating "pc" as an alias
for x18, but the rmlock code assumes that the pcpu pointer is loaded
once, as it dereferences "pc" outside of the critical section.  On
arm64, if a context switch occurs between the store at _rm_rlock+144 and
the load at +152, and the thread is migrated to another CPU, then we'll
end up using the wrong CPU ID in the rm->rm_writecpus test.
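
To make the window concrete, here is a minimal user-space model of the scenario. It is a sketch and not kernel code: `model_x18`, `model_migrate()`, `model_fast_path()` and the two `model_rlock_*()` functions are hypothetical stand-ins, the write-CPU set is reduced to a plain bitmask, and the migration is forced at exactly the point between the critical_exit store and the pc_cpuid load.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU structure. */
struct model_pcpu {
	int pc_cpuid;
};

static struct model_pcpu model_cpus[2] = { { 0 }, { 1 } };

/* Plays the role of x18: always points at the pcpu of the running CPU. */
static struct model_pcpu *model_x18 = &model_cpus[0];

/* Models a context switch that moves the thread to another CPU. */
static void
model_migrate(int cpu)
{
	assert(cpu >= 0 && cpu < 2);
	model_x18 = &model_cpus[cpu];
}

/* Nonzero means the fast path is taken: the bit for cpuid is clear. */
static int
model_fast_path(int cpuid, uint32_t writecpus)
{
	return ((writecpus & (UINT32_C(1) << cpuid)) == 0);
}

/* "pc" is only an alias for x18, so pc_cpuid is read after migrating. */
static int
model_rlock_alias(uint32_t writecpus, int migrate_to)
{
	model_x18 = &model_cpus[0];	/* tracker was added on CPU 0 */
	model_migrate(migrate_to);	/* switch inside the window */
	return (model_fast_path(model_x18->pc_cpuid, writecpus));
}

/* The CPU ID is saved before the window, so the test uses CPU 0's bit. */
static int
model_rlock_snapshot(uint32_t writecpus, int migrate_to)
{
	int cpuid;

	model_x18 = &model_cpus[0];
	cpuid = model_x18->pc_cpuid;
	model_migrate(migrate_to);
	return (model_fast_path(cpuid, writecpus));
}
```

With only CPU 0's bit set in the mask and a migration to CPU 1, `model_rlock_alias(0x1, 1)` takes the fast path because it tests CPU 1's bit, whereas `model_rlock_snapshot(0x1, 1)` tests CPU 0's bit and does not.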

I suspect the problem is unique to arm64 as its get_pcpu()
implementation is different from the others in that it doesn't use
volatile-qualified inline assembly.  This has been the case since
https://cgit.freebsd.org/src/commit/?id=63c858a04d56529eddbddf85ad04fc8e99e73762
.

I haven't been able to reproduce any crashes running poudriere in an
arm64 AWS instance, though.  Could you please try the patch below and
confirm whether it fixes your panics?  I verified that the apparent
problem described above is gone with the patch.

diff --git a/sys/kern/kern_rmlock.c b/sys/kern/kern_rmlock.c
index 0cdcfb8fec62..e51c25136ae0 100644
--- a/sys/kern/kern_rmlock.c
+++ 

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Mark Millard


On 2022-Mar-7, at 05:46, Ronald Klop  wrote:

> Dear Mark Johnston,
> 
> I did some binary search in the kernels and came to the conclusion that 
> https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
>  still works and 
> https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
>  panics.
> 
> I suspect your commit in 
> https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.
> 
> Last panic:
> 
> panic: vm_fault failed: 0046e708 error 1
> cpuid = 1
> time = 1646660058
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x174
> panic() at panic+0x44
> data_abort() at data_abort+0x2e8
> handle_el1h_sync() at handle_el1h_sync+0x10
> --- exception, esr 0x9604
> _rm_rlock_debug() at _rm_rlock_debug+0x8c
> osd_get() at osd_get+0x5c
> zio_execute() at zio_execute+0xf8
> taskqueue_run_locked() at taskqueue_run_locked+0x178
> taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 0 tid 100129 ]
> Stopped at  kdb_enter+0x44: undefined   f902011f
> db>

Was this a WITNESS/DEBUG kernel? Non-WITNESS? Non-debug?

Which aarch64 variant? Bob's was Cortex-A53 (RPi3).

> A more recent kernel (912df91) still panics. See below.
> 
> Do you have time to look into this? What information can I provide to help?
> 
> Regards,
> Ronald.
> 
> From: Ronald Klop 
> Date: Monday, 7 March 2022 11:38
> To: Mark Millard 
> CC: bob prohaska , freebsd-current 
> , freebsd-...@freebsd.org
> Subject: Re: panic: data abort in critical section or under mutex (was: Re: 
> panic: Unknown kernel exception 0 esr_el1 200 (on 14-CURRENT/aarch64 Feb 
> 28))
> 
> Yes, I spoke too soon too. Often it crashes as soon as I start a parallel 
> poudriere build. But this time it got very far. As soon as the nightly backups 
> kicked in, it was game over again.
> I had read Bob's mail on the arm@ ML, but I wanted to leave the conclusion 
> that it is the same problem to the developers. (I have seen enough wrong 
> guesses about causes in my work.)
> 
> I will need to go further into the binary search of working kernels.
> 
> This was: FreeBSD 14.0-CURRENT #0 912df91: Wed Mar  2 00:36:35 UTC 2022
> Fatal data abort:
>   x0: 00f1efd8  x0: 00f1efd8 (mac_policy_rm + 0) (mac_policy_rm + 0)
>   x1:2  x1:2
>   x2: 0087dcf2  x2: 0087dcf2 (cam_status_table + 2f28a)
>  (cam_status_table + 2f28a)  x3: 0087dcf2
>   x3: 0087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)
>   x4:  102  x4:  102
>   x5:7  x5:1
>   x6:0  x6:   ff
>   x7:0  x7: a00011fc2800
>   x8:1
>   x8:1  x9: 00f37c10

Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Ronald Klop

Dear Mark Johnston,

I did some binary search in the kernels and came to the conclusion that 
https://cgit.freebsd.org/src/commit/?id=1517b8d5a7f58897200497811de1b18809c07d3e
 still works and 
https://cgit.freebsd.org/src/commit/?id=407c34e735b5d17e2be574808a09e6d729b0a45a
 panics.

I suspect your commit in 
https://cgit.freebsd.org/src/commit/?id=c84bb8cd771ce4bed58152e47a32dda470bef23a.

Last panic:

panic: vm_fault failed: 0046e708 error 1
cpuid = 1
time = 1646660058
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x30
vpanic() at vpanic+0x174
panic() at panic+0x44
data_abort() at data_abort+0x2e8
handle_el1h_sync() at handle_el1h_sync+0x10
--- exception, esr 0x9604
_rm_rlock_debug() at _rm_rlock_debug+0x8c
osd_get() at osd_get+0x5c
zio_execute() at zio_execute+0xf8
taskqueue_run_locked() at taskqueue_run_locked+0x178
taskqueue_thread_loop() at taskqueue_thread_loop+0xc8
fork_exit() at fork_exit+0x74
fork_trampoline() at fork_trampoline+0x14
KDB: enter: panic
[ thread pid 0 tid 100129 ]
Stopped at  kdb_enter+0x44: undefined   f902011f
db>

A more recent kernel (912df91) still panics. See below.

Do you have time to look into this? What information can I provide to help?

Regards,
Ronald.


From: Ronald Klop 
Date: Monday, 7 March 2022 11:38
To: Mark Millard 
CC: bob prohaska , freebsd-current 
, freebsd-...@freebsd.org
Subject: Re: panic: data abort in critical section or under mutex (was: Re: 
panic: Unknown kernel exception 0 esr_el1 200 (on 14-CURRENT/aarch64 Feb 
28))


Yes, I spoke too soon too. Often it crashes as soon as I start a parallel 
poudriere build. But this time it got very far. As soon as the nightly backups 
kicked in, it was game over again.
I had read Bob's mail on the arm@ ML, but I wanted to leave the conclusion 
that it is the same problem to the developers. (I have seen enough wrong 
guesses about causes in my work.)

I will need to go further into the binary search of working kernels.

This was: FreeBSD 14.0-CURRENT #0 912df91: Wed Mar  2 00:36:35 UTC 2022
Fatal data abort:
  x0: 00f1efd8  x0: 00f1efd8 (mac_policy_rm + 0) (mac_policy_rm + 0)
  x1:2  x1:2
  x2: 0087dcf2  x2: 0087dcf2 (cam_status_table + 2f28a)
 (cam_status_table + 2f28a)  x3: 0087dcf2
  x3: 0087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)
  x4:  102  x4:  102
  x5:7  x5:1
  x6:0  x6:   ff
  x7:0  x7: a00011fc2800
  x8:1
  x8:1  x9: 00f37c10
  x9: 419d9090 (pcpu0 + 90) (g_ctx + 40278fe4)
 x10: a0017be2a600 x10: a10fa600

fts(3) not checking for readdir(3) errors

2022-03-07 Thread Ganael Laplanche
Hello,

For one of my projects, I've received a patch to our implementation of fts(3) 
which does not check for readdir(3) errors. The patch seemed obvious and 
looked OK to me so I merged it to my project.

I think we should merge it to FreeBSD too so I've opened a PR (with the patch) 
here:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=262038

Could a src committer have a look at it, please?

Thanks in advance,
Best regards,

-- 
Ganael LAPLANCHE 
http://www.martymac.org | http://contribs.martymac.org
FreeBSD: martymac , http://www.FreeBSD.org





Re: panic: data abort in critical section or under mutex (was: Re: panic: Unknown kernel exception 0 esr_el1 2000000 (on 14-CURRENT/aarch64 Feb 28))

2022-03-07 Thread Ronald Klop

Yes, I spoke too soon too. Often it crashes as soon as I start a parallel 
poudriere build. But this time it got very far. As soon as the nightly backups 
kicked in, it was game over again.
I had read Bob's mail on the arm@ ML, but I wanted to leave the conclusion 
that it is the same problem to the developers. (I have seen enough wrong 
guesses about causes in my work.)

I will need to go further into the binary search of working kernels.

This was: FreeBSD 14.0-CURRENT #0 912df91: Wed Mar  2 00:36:35 UTC 2022
Fatal data abort:
 x0: 00f1efd8  x0: 00f1efd8 (mac_policy_rm + 0) (mac_policy_rm + 0)
 x1:2  x1:2
 x2: 0087dcf2  x2: 0087dcf2 (cam_status_table + 2f28a)
(cam_status_table + 2f28a)  x3: 0087dcf2
 x3: 0087dcf2 (cam_status_table + 2f28a) (cam_status_table + 2f28a)
 x4:  102  x4:  102
 x5:7  x5:1
 x6:0  x6:   ff
 x7:0  x7: a00011fc2800
 x8:1
 x8:1  x9: 00f37c10
 x9: 419d9090 (pcpu0 + 90) (g_ctx + 40278fe4)
x10: a0017be2a600 x10: a10fa600
x11: 394aed08d0003a48
x12: 350001a8b946a108 x11:0
x12: 00f37c10 x13: badecce4 (pcpu0 + 90)
x13: a0001fbde6b0 x14:0
x14: 4965ae49 x15:1
x15:  1000193 x16: 016a4238
x16: 000100482d38 (__stop_set_modmetadata_set + d00) (__stop_set_modmetadata_set + 448)

RE: [Intel AlderLake] Read files to FAT32 or UFS partition cause data corrupt due to P-Core-Core

2022-03-07 Thread Chen, Alvin W
Hi guys,
Any progress on this issue?



Regards,
Alvin Chen
Dell | Commercial Client Group
office +86-10-82862506, fax +86-10-82861554, Dell Lync 8672506 
weike_c...@dell.com


Internal Use - Confidential

-Original Message-
From: Konstantin Belousov  
Sent: February 24, 2022 9:24
To: Alexander Motin
Cc: Mike Karels; Tomoaki AOKI; Chen, Alvin W; freebsd-current@freebsd.org
Subject: Re: [Intel AlderLake] Read files to FAT32 or UFS partition cause 
data corrupt due to P-Core


[EXTERNAL EMAIL] 

On Wed, Feb 23, 2022 at 12:25:24PM -0500, Alexander Motin wrote:
> On 22.02.2022 19:00, Konstantin Belousov wrote:
> > On Tue, Feb 22, 2022 at 06:53:09PM -0500, Alexander Motin wrote:
> > > On 22.02.2022 18:41, Konstantin Belousov wrote:
> > > > On Tue, Feb 22, 2022 at 06:38:24PM -0500, Alexander Motin wrote:
> > > > > On 22.02.2022 18:30, Konstantin Belousov wrote:
> > > > > > As another blind guess, try to disable pcid, vm.pmap.pcid_enabled=0.
> > > > > 
> > > > > Do you mean it to be a workaround for TrueNAS 12, or it should 
> > > > > provide some information?  The system is at the office and has 
> > > > > no IPMI, so I can't switch the boot device from home right now.
> > > > I intended to see if it is the cause or related feature.
> > > 
> > > I'll try that on the 12 tomorrow, if applicable.
> > 
> > Yes should be relevant still.
> 
> It did the trick.  I repeatedly got successful boots with pcid 
> disabled, and failed ones with the default (enabled).  Attached you 
> will find verbose serial console output captures with pcid disabled 
> and enabled, though without the cpuinfo patch.  During the testing I 
> had only one P core and one E core enabled to reduce noise.  Only after 
> that did I find that the P core had SMT enabled, but I then repeated 
> the test without SMT as well, so it is indeed irrelevant.
> 
> I'm curious: what in pcid could differentiate the P and E cores, and 
> has it been fixed in the latest stable/13, or am I just "unlucky" not to 
> reproduce it there?

I am curious as well.  PCID works on both big Intel cores and on small cores 
like Apollo Lake etc.  So the fact that it does not interact properly in P/E 
settings means either that there is something in the spec I did not account 
for, or that there is a bug in the silicon.

I have no idea why we work on stable/13 and HEAD.  There were enough changes 
to the PCID code there, but they were mostly restructuring and polishing.

So the only way to get more understanding is to bisect to see which commit on 
HEAD fixed the boot.
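
As a rough illustration of what such a bisection does, here is a small
sketch in C.  It assumes the commits between a known-failing one and a
known-booting one are numbered in order, and that every commit after
the fix boots.  The names first_fixed() and boots_from() are
hypothetical; in practice the "test" is building and booting a kernel
by hand, for example under git bisect.

```c
#include <stddef.h>

/*
 * Test callback: returns nonzero if the kernel built from the commit at
 * index idx boots.  Hypothetical; stands in for a manual build and boot.
 */
typedef int (*boots_fn)(size_t idx, void *arg);

/*
 * Binary search for the first commit that boots.
 * Preconditions: commit "bad" fails, commit "good" boots, bad < good,
 * and all commits from the fix onward boot.
 */
size_t
first_fixed(size_t bad, size_t good, boots_fn boots, void *arg)
{
	while (good - bad > 1) {
		size_t mid = bad + (good - bad) / 2;

		if (boots(mid, arg))
			good = mid;	/* fix is at mid or earlier */
		else
			bad = mid;	/* fix is after mid */
	}
	return (good);
}

/*
 * Stand-in predicate for trying the sketch: pretends the fix landed at
 * the commit index stored in *arg.
 */
int
boots_from(size_t idx, void *arg)
{
	return (idx >= *(size_t *)arg);
}
```

With 100 candidate commits this needs about seven build-and-boot
cycles instead of up to 100.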