Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer
On Mon, Jan 13, 2020 at 09:00:30PM +, Andrew Doran wrote:
> I reproduced it on native x86.  It's a bug in the CPU topology code.  Now
> fixed with revision 1.11 src/sys/kern/subr_cpu.c - sorry about that.

I confirm, I now see user activity on all CPUs. Thanks !

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Andrew Doran
On Mon, Jan 13, 2020 at 09:17:28PM +0100, Manuel Bouyer wrote:

> On Mon, Jan 13, 2020 at 07:11:21PM +, Andrew Doran wrote:
> > On Mon, Jan 13, 2020 at 07:36:41PM +0100, Manuel Bouyer wrote:
> > 
> > > On Mon, Jan 13, 2020 at 06:33:08PM +, Andrew Doran wrote:
> > > > On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:
> > > > 
> > > > > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > > > > > It also sets rsp and rbp. I think rbp is not set by anything else,
> > > > > > at least in the Xen case.
> > > > > > The different rbp value would explain why in one case we hit a
> > > > > > KASSERT() in lwp_startup later.
> > > > > > But I don't know what pcb_rbp contains; I couldn't find where the
> > > > > > pcb for idlelwp is initialized.
> > > > > 
> > > > > I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> > > > > does. It doesn't cause the lwp_startup() KASSERT as calling
> > > > > cpu_switchto() does; it also doesn't change the scheduler behavior.
> > > > 
> > > > Wait - do you mean that everything works now?  Or that everything still
> > > > runs on CPU0?
> > > 
> > > No, everything still runs on CPU0
> > 
> > Hmm, I don't understand why.  I'll set up Xen and try it out.  It might take
> > me a day or two.
> 
> OK thanks. 

I reproduced it on native x86.  It's a bug in the CPU topology code.  Now
fixed with revision 1.11 src/sys/kern/subr_cpu.c - sorry about that.

> OK, so I removed the call to cpu_switchto() before idle_loop(),
> and added a few KASSERTS.
> I guess you can back out the prev == NULL case from cpu_switchto().

Will do.  Thank you Manuel.

Andrew


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer
On Mon, Jan 13, 2020 at 07:11:21PM +, Andrew Doran wrote:
> On Mon, Jan 13, 2020 at 07:36:41PM +0100, Manuel Bouyer wrote:
> 
> > On Mon, Jan 13, 2020 at 06:33:08PM +, Andrew Doran wrote:
> > > On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:
> > > 
> > > > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > > > > It also sets rsp and rbp. I think rbp is not set by anything else,
> > > > > at least in the Xen case.
> > > > > The different rbp value would explain why in one case we hit a
> > > > > KASSERT() in lwp_startup later.
> > > > > But I don't know what pcb_rbp contains; I couldn't find where the
> > > > > pcb for idlelwp is initialized.
> > > > 
> > > > I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> > > > does. It doesn't cause the lwp_startup() KASSERT as calling
> > > > cpu_switchto() does; it also doesn't change the scheduler behavior.
> > > 
> > > Wait - do you mean that everything works now?  Or that everything still
> > > runs on CPU0?
> > 
> > No, everything still runs on CPU0
> 
> Hmm, I don't understand why.  I'll set up Xen and try it out.  It might take
> me a day or two.

OK thanks. 

> [...]
> 
> The assertion in lwp_startup() is because I made MI changes so that prevlwp
> is never NULL when calling cpu_switchto(), when fixing some problems with MP
> support on !x86 and making things more correct.  lwp_startup()/mi_switch() now
> need to unlock prevlwp after it is finished in cpu_switchto().  I never
> expected anybody but mi_switch() to call cpu_switchto().

OK, so I removed the call to cpu_switchto() before idle_loop(),
and added a few KASSERTS.
I guess you can back out the prev == NULL case from cpu_switchto().

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Andrew Doran
On Mon, Jan 13, 2020 at 07:36:41PM +0100, Manuel Bouyer wrote:

> On Mon, Jan 13, 2020 at 06:33:08PM +, Andrew Doran wrote:
> > On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:
> > 
> > > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > > > It also sets rsp and rbp. I think rbp is not set by anything else,
> > > > at least in the Xen case.
> > > > The different rbp value would explain why in one case we hit a KASSERT()
> > > > in lwp_startup later.
> > > > But I don't know what pcb_rbp contains; I couldn't find where the pcb
> > > > for idlelwp is initialized.
> > > 
> > > I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> > > does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
> > > does; it also doesn't change the scheduler behavior.
> > 
> > Wait - do you mean that everything works now?  Or that everything still runs
> > on CPU0?
> 
> No, everything still runs on CPU0

Hmm, I don't understand why.  I'll set up Xen and try it out.  It might take
me a day or two.

> > The very first thing that idle_loop() does on amd64/i386 is set up the frame
> > pointer - ebp/rbp.
> > 
> >  <idle_loop>:
> >    0:   55                      push   %rbp
> >    1:   48 89 e5                mov    %rsp,%rbp
> >    4:   41 56                   push   %r14
> >    6:   41 55                   push   %r13
> 
> OK, so it's OK that my patch doesn't change anything.
> And so I still don't understand the KASSERT when cpu_switchto() is called
> before idle_loop().

The assertion in lwp_startup() is because I made MI changes so that prevlwp
is never NULL when calling cpu_switchto(), when fixing some problems with MP
support on !x86 and making things more correct.  lwp_startup()/mi_switch() now
need to unlock prevlwp after it is finished in cpu_switchto().  I never
expected anybody but mi_switch() to call cpu_switchto().

Andrew
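
As an illustration of the invariant described above, here is a toy
user-space model in C. It is not NetBSD code: struct toy_lwp,
toy_switchto(), toy_lwp_startup() and toy_mi_switch() are hypothetical
names, and a plain boolean stands in for the real LWP lock. It only sketches
why a caller that passes prev == NULL leaves the incoming LWP with nothing
to unlock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an LWP; 'locked' models the per-LWP lock. */
struct toy_lwp {
	const char *name;
	bool locked;
};

/*
 * Toy stand-in for the MD switch routine: it only reports which LWP was
 * switched away from, so the incoming LWP can finish the handoff.
 */
struct toy_lwp *
toy_switchto(struct toy_lwp *prev, struct toy_lwp *next)
{
	(void)next;
	return prev;
}

/*
 * Toy stand-in for the first code run by the incoming LWP.
 * Returns false where the kernel in this thread trips the
 * "prev != NULL" assertion.
 */
bool
toy_lwp_startup(struct toy_lwp *prev)
{
	if (prev == NULL)
		return false;
	prev->locked = false;	/* release the LWP we switched away from */
	return true;
}

/* Toy stand-in for the MI switch path: lock prev, switch, hand prev over. */
bool
toy_mi_switch(struct toy_lwp *prev, struct toy_lwp *next)
{
	assert(prev != NULL);	/* the MI path always has a previous LWP */
	prev->locked = true;
	return toy_lwp_startup(toy_switchto(prev, next));
}
```

Calling toy_mi_switch() with two LWPs succeeds and leaves the previous one
unlocked, while toy_lwp_startup(toy_switchto(NULL, next)) fails, which
mirrors the direct cpu_switchto(NULL, ...) call from cpu_hatch().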


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer
On Mon, Jan 13, 2020 at 06:33:08PM +, Andrew Doran wrote:
> On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:
> 
> > On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > > It also sets rsp and rbp. I think rbp is not set by anything else, at least
> > > in the Xen case.
> > > The different rbp value would explain why in one case we hit a KASSERT()
> > > in lwp_startup later.
> > > But I don't know what pcb_rbp contains; I couldn't find where the pcb for
> > > idlelwp is initialized.
> > 
> > I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> > > does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
> > does; it also doesn't change the scheduler behavior.
> 
> Wait - do you mean that everything works now?  Or that everything still runs
> on CPU0?

No, everything still runs on CPU0

> 
> The very first thing that idle_loop() does on amd64/i386 is set up the frame
> pointer - ebp/rbp.
> 
>  <idle_loop>:
>    0:   55                      push   %rbp
>    1:   48 89 e5                mov    %rsp,%rbp
>    4:   41 56                   push   %r14
>    6:   41 55                   push   %r13

OK, so it's OK that my patch doesn't change anything.
And so I still don't understand the KASSERT when cpu_switchto() is called
before idle_loop().

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Andrew Doran
On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:

> On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > It also sets rsp and rbp. I think rbp is not set by anything else, at least
> > in the Xen case.
> > The different rbp value would explain why in one case we hit a KASSERT()
> > in lwp_startup later.
> > But I don't know what pcb_rbp contains; I couldn't find where the pcb for
> > idlelwp is initialized.
> 
> I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
> does; it also doesn't change the scheduler behavior.

Wait - do you mean that everything works now?  Or that everything still runs
on CPU0?

The very first thing that idle_loop() does on amd64/i386 is set up the frame
pointer - ebp/rbp.

 <idle_loop>:
    0:   55                      push   %rbp
    1:   48 89 e5                mov    %rsp,%rbp
    4:   41 56                   push   %r14
    6:   41 55                   push   %r13

Andrew
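
The same effect can be seen from C, as a rough sketch: a function that
keeps a frame pointer sets it up itself on entry, whatever value the
register held in the caller. The example below assumes GCC or Clang
(__builtin_frame_address and the noinline attribute are compiler extensions),
and callee_frame(), frames_differ() and frame_demo() are hypothetical names
used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return this function's own frame address. __builtin_frame_address(0) is a
 * GCC/Clang builtin; noinline keeps the function from being merged into its
 * caller, so it has a frame of its own.
 */
__attribute__((noinline)) uintptr_t
callee_frame(void)
{
	return (uintptr_t)__builtin_frame_address(0);
}

/* Return 1 if the callee established a frame different from ours. */
__attribute__((noinline)) int
frames_differ(void)
{
	uintptr_t mine = (uintptr_t)__builtin_frame_address(0);
	uintptr_t theirs = callee_frame();

	return mine != theirs;
}

/* Small demonstration: the callee does not reuse the caller's frame. */
void
frame_demo(void)
{
	assert(frames_differ());
}
```

This is why loading rbp by hand just before calling idle_loop() is not
expected to change anything: the prologue saves the incoming value and
replaces it with the current stack pointer.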


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer
On Mon, Jan 13, 2020 at 05:43:51PM +0100, Manuel Bouyer wrote:
> On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> > It also sets rsp and rbp. I think rbp is not set by anything else, at least
> > in the Xen case.
> > The different rbp value would explain why in one case we hit a KASSERT()
> > in lwp_startup later.
> > But I don't know what pcb_rbp contains; I couldn't find where the pcb for
> > idlelwp is initialized.
> 
> I tried the attached patch, which should set rsp/rbp as cpu_switchto()
> does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
> does; it also doesn't change the scheduler behavior.

With the patch this time

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--
Index: sys/arch/xen/x86/cpu.c
===
RCS file: /cvsroot/src/sys/arch/xen/x86/cpu.c,v
retrieving revision 1.131
diff -u -p -u -r1.131 cpu.c
--- sys/arch/xen/x86/cpu.c  23 Nov 2019 19:40:38 -  1.131
+++ sys/arch/xen/x86/cpu.c  13 Jan 2020 16:40:50 -
@@ -739,7 +739,16 @@ cpu_hatch(void *v)
 
    aprint_debug_dev(ci->ci_dev, "running\n");
 
-   cpu_switchto(NULL, ci->ci_data.cpu_idlelwp, true);
+#ifdef __x86_64__
+   asm("movq %0, %%rsp" : : "r" (pcb->pcb_rsp));
+   asm("movq %0, %%rbp" : : "r" (pcb->pcb_rbp));
+#else
+   asm("movl %0, %%esp" : : "r" (pcb->pcb_esp));
+   asm("movl %0, %%ebp" : : "r" (pcb->pcb_ebp));
+#endif
+   KASSERT(ci->ci_curlwp == ci->ci_data.cpu_idlelwp);
+
+   //cpu_switchto(NULL, ci->ci_data.cpu_idlelwp, true);
 
    idle_loop(NULL);
    KASSERT(false);


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer
On Mon, Jan 13, 2020 at 04:59:50PM +0100, Manuel Bouyer wrote:
> It also sets rsp and rbp. I think rbp is not set by anything else, at least
> in the Xen case.
> The different rbp value would explain why in one case we hit a KASSERT()
> in lwp_startup later.
> But I don't know what pcb_rbp contains; I couldn't find where the pcb for
> idlelwp is initialized.

I tried the attached patch, which should set rsp/rbp as cpu_switchto()
does. It doesn't cause the lwp_startup() KASSERT as calling cpu_switchto()
does; it also doesn't change the scheduler behavior.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer
On Mon, Jan 13, 2020 at 02:49:52PM +, Andrew Doran wrote:
> > Now I get a different panic:
> > [   1.000] vcpu0 at hypervisor0
> > [   1.000] vcpu0: 64 page colors
> > [   1.000] vcpu0: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 0x6fb
> > [   1.000] vcpu0: node 0, package 0, core 1, smt 0
> > [   1.000] vcpu1 at hypervisor0
> > [   1.000] vcpu1: 2 page colors
> > [   1.000] vcpu1: starting
> > [   1.000] vcpu1: is started.
> > [   1.000] vcpu1: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 0x6fb
> > [   1.000] vcpu1: node 0, package 0, core 0, smt 0
> > [...]
> > [   1.030] UVM: using package allocation scheme, 1 package(s) per bucket
> > [   1.030] Xen vcpu1 clock: using event channel 7
> > [   1.8809493] vcpu1: running
> > [   1.8809493] panic: kernel diagnostic assertion "prev != NULL" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_lwp.c", line 1021
> > [   1.8809493] cpu1: Begin traceback...
> > [   1.8809493] vpanic(c057f868,d77abf74,d77abf98,c03cc3e5,c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0) at netbsd:vpanic+0x134
> > [   1.8809493] kern_assert(c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0,0,0,c13a6900,c03c60c0) at netbsd:kern_assert+0x23
> > [   1.8809493] lwp_startup(0,c13a6900,8b1000,c0674200,0,c010007a,0,0,0,0) at netbsd:lwp_startup+0x155
> > [   1.8809493] cpu1: End traceback...
> > 
> > If I remove the call to cpu_switchto() in cpu_hatch() it boots, but it seems
> > that all user processes are running on cpu0 only ...
> 
> I looked and the only thing cpu_switchto() is doing there is setting curlwp,
> but that's already set in cpu_start_secondary(), so it's not needed.

It also sets rsp and rbp. I think rbp is not set by anything else, at least
in the Xen case.
The different rbp value would explain why in one case we hit a KASSERT()
in lwp_startup later.
But I don't know what pcb_rbp contains; I couldn't find where the pcb for
idlelwp is initialized.


> 
> > I can't see what extra work the cpu_switchto() could be doing that would
> > matter, except maybe the %ebp/rbp init. Any idea ?
> 
> Right I don't think cpu_switchto() matters there.  The strategy for
> assigning LWPs to CPUs in the scheduler has changed.  If the machine is not
> busy everything is likely to stay on CPU0.  Are you putting much load on it?

I just tried a build.sh -j4
CPU0 is 100% busy, the others are 100% idle:

load averages:  3.02,  2.14,  1.26;   up 0+00:51:59        16:59:03
61 processes: 5 runnable, 54 sleeping, 2 on CPU
CPU0 states: 39.3% user,  0.0% nice, 60.7% system,  0.0% interrupt,  0.0% idle
CPU1 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU2 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
CPU3 states:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
Memory: 1402M Act, 168K Inact, 16K Wired, 14M Exec, 1352M File, 1932M Free
Swap: 

  PID USERNAME PRI NICE   SIZE   RES STATE      TIME   WCPU    CPU COMMAND
21392 bouyer    33    0    29M 5964K RUN/0      0:00  2.00%  0.10% as
    0 root       0    0     0K   11M CPU/3      0:30  0.00%  0.00% [system]
   81 bouyer    85    0    20M 3596K kqueue/0   0:19  0.00%  0.00% tmux
  226 bouyer    43    0    16M 1900K CPU/0      0:00  0.00%  0.00% top
16883 bouyer    33    0  8992K 2212K RUN/0      0:00  0.00%  0.00% nbmake
21137 bouyer    33    0  7844K 1220K RUN/0      0:00  0.00%  0.00% sed
12098 bouyer    33    0  4288K  164K RUN/0      0:00  0.00%  0.00% sh
22411 bouyer    33    0  4288K  164K RUN/0      0:00  0.00%  0.00% cc
   42 root      85    0    80M 5768K poll/0     0:00  0.00%  0.00% sshd

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Andrew Doran
On Mon, Jan 13, 2020 at 03:16:05PM +0100, Manuel Bouyer wrote:

> On Mon, Jan 13, 2020 at 12:02:13PM +, Andrew Doran wrote:
> > Ah yes it does, I saw something that made me think it affected x86_64 only. 
> > I'll make the change on i386 too.
> 
> thanks.
> 
> Now I get a different panic:
> [   1.000] vcpu0 at hypervisor0
> [   1.000] vcpu0: 64 page colors
> [   1.000] vcpu0: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 0x6fb
> [   1.000] vcpu0: node 0, package 0, core 1, smt 0
> [   1.000] vcpu1 at hypervisor0
> [   1.000] vcpu1: 2 page colors
> [   1.000] vcpu1: starting
> [   1.000] vcpu1: is started.
> [   1.000] vcpu1: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 0x6fb
> [   1.000] vcpu1: node 0, package 0, core 0, smt 0
> [...]
> [   1.030] UVM: using package allocation scheme, 1 package(s) per bucket
> [   1.030] Xen vcpu1 clock: using event channel 7
> [   1.8809493] vcpu1: running
> [   1.8809493] panic: kernel diagnostic assertion "prev != NULL" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_lwp.c", line 1021
> [   1.8809493] cpu1: Begin traceback...
> [   1.8809493] vpanic(c057f868,d77abf74,d77abf98,c03cc3e5,c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0) at netbsd:vpanic+0x134
> [   1.8809493] kern_assert(c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0,0,0,c13a6900,c03c60c0) at netbsd:kern_assert+0x23
> [   1.8809493] lwp_startup(0,c13a6900,8b1000,c0674200,0,c010007a,0,0,0,0) at netbsd:lwp_startup+0x155
> [   1.8809493] cpu1: End traceback...
> 
> If I remove the call to cpu_switchto() in cpu_hatch() it boots, but it seems
> that all user processes are running on cpu0 only ...

I looked and the only thing cpu_switchto() is doing there is setting curlwp,
but that's already set in cpu_start_secondary(), so it's not needed.

> I can't see what extra work the cpu_switchto() could be doing that would
> matter, except maybe the %ebp/rbp init. Any idea ?

Right I don't think cpu_switchto() matters there.  The strategy for
assigning LWPs to CPUs in the scheduler has changed.  If the machine is not
busy everything is likely to stay on CPU0.  Are you putting much load on it?

Andrew


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer
On Mon, Jan 13, 2020 at 12:02:13PM +, Andrew Doran wrote:
> Ah yes it does, I saw something that made me think it affected x86_64 only. 
> I'll make the change on i386 too.

thanks.

Now I get a different panic:
[   1.000] vcpu0 at hypervisor0
[   1.000] vcpu0: 64 page colors
[   1.000] vcpu0: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 0x6fb
[   1.000] vcpu0: node 0, package 0, core 1, smt 0
[   1.000] vcpu1 at hypervisor0
[   1.000] vcpu1: 2 page colors
[   1.000] vcpu1: starting
[   1.000] vcpu1: is started.
[   1.000] vcpu1: Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz, id 0x6fb
[   1.000] vcpu1: node 0, package 0, core 0, smt 0
[...]
[   1.030] UVM: using package allocation scheme, 1 package(s) per bucket
[   1.030] Xen vcpu1 clock: using event channel 7
[   1.8809493] vcpu1: running
[   1.8809493] panic: kernel diagnostic assertion "prev != NULL" failed: file "/dsk/l1/misc/bouyer/HEAD/clean/src/sys/kern/kern_lwp.c", line 1021
[   1.8809493] cpu1: Begin traceback...
[   1.8809493] vpanic(c057f868,d77abf74,d77abf98,c03cc3e5,c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0) at netbsd:vpanic+0x134
[   1.8809493] kern_assert(c057f868,c057f802,c05b0f71,c05b0ce4,3fd,0,0,0,c13a6900,c03c60c0) at netbsd:kern_assert+0x23
[   1.8809493] lwp_startup(0,c13a6900,8b1000,c0674200,0,c010007a,0,0,0,0) at netbsd:lwp_startup+0x155
[   1.8809493] cpu1: End traceback...

If I remove the call to cpu_switchto() in cpu_hatch() it boots, but it seems
that all user processes are running on cpu0 only ...
I can't see what extra work the cpu_switchto() could be doing that would
matter, except maybe the %ebp/rbp init. Any idea ?

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Andrew Doran
On Mon, Jan 13, 2020 at 12:56:22PM +0100, Manuel Bouyer wrote:
> On Mon, Jan 13, 2020 at 11:42:17AM +, Andrew Doran wrote:
> > Hi Manuel,
> > 
> > On Mon, Jan 13, 2020 at 10:56:23AM +0100, Manuel Bouyer wrote:
> > > Hello,
> > > A current Xen domU kernel fails to boot with:
> > > [   1.000] hypervisor0 at mainbus0: Xen version 4.11.3nb1
> > > [   1.000] vcpu0 at hypervisor0
> > > [   1.000] vcpu0: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> > > [   1.000] vcpu0: node 0, package 0, core 1, smt 1
> > > [   1.000] vcpu1 at hypervisor0
> > > [   1.000] vcpu1: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> > > [   1.000] vcpu1: node 0, package 1, core 0, smt 0
> > > [   1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface
> > > [   1.000] xencons0 at hypervisor0: Xen Virtual Console Driver
> > > [   1.9901295] uvm_fault(0x80d5c120, 0x0, 1) -> e
> > > [   1.9901295] fatal page fault in supervisor mode
> > > [   1.9901295] trap type 6 code 0 rip 0x8020209f cs 0x8 rflags 0x10246 cr2 0x28 ilevel 0 rsp 0xb7802b19de88
> > > [   1.9901295] curlwp 0xb7800083b500 pid 0.15 lowest kstack 0xb7802b1992c0
> > > kernel: page fault trap, code=0
> > > Stopped in pid 0.15 (system) at netbsd:cpu_switchto+0xf:  movq    28(%r13),%rax
> > > cpu_switchto() at netbsd:cpu_switchto+0xf
> > > 
> > > both amd64 and i386. A boot with vcpus=1 succeeds, so I guess something is
> > > missing in initialisations of secondary CPUs.
> > > This happens with the 202001101800Z but the problem is probably older than
> > > that (the testbed used vcpus=1 until today)
> > > 
> > > Any idea ?
> > 
> > It should work now with revision 1.199 of 
> > src/sys/arch/amd64/amd64/locore.S. 
> 
> The same problem happens with i386.

Ah yes it does, I saw something that made me think it affected x86_64 only. 
I'll make the change on i386 too.

Andrew


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Manuel Bouyer
On Mon, Jan 13, 2020 at 11:42:17AM +, Andrew Doran wrote:
> Hi Manuel,
> 
> On Mon, Jan 13, 2020 at 10:56:23AM +0100, Manuel Bouyer wrote:
> > Hello,
> > A current Xen domU kernel fails to boot with:
> > [   1.000] hypervisor0 at mainbus0: Xen version 4.11.3nb1
> > [   1.000] vcpu0 at hypervisor0
> > [   1.000] vcpu0: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> > [   1.000] vcpu0: node 0, package 0, core 1, smt 1
> > [   1.000] vcpu1 at hypervisor0
> > [   1.000] vcpu1: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> > [   1.000] vcpu1: node 0, package 1, core 0, smt 0
> > [   1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface
> > [   1.000] xencons0 at hypervisor0: Xen Virtual Console Driver
> > [   1.9901295] uvm_fault(0x80d5c120, 0x0, 1) -> e
> > [   1.9901295] fatal page fault in supervisor mode
> > [   1.9901295] trap type 6 code 0 rip 0x8020209f cs 0x8 rflags 0x10246 cr2 0x28 ilevel 0 rsp 0xb7802b19de88
> > [   1.9901295] curlwp 0xb7800083b500 pid 0.15 lowest kstack 0xb7802b1992c0
> > kernel: page fault trap, code=0
> > Stopped in pid 0.15 (system) at netbsd:cpu_switchto+0xf:  movq    28(%r13),%rax
> > cpu_switchto() at netbsd:cpu_switchto+0xf
> > 
> > both amd64 and i386. A boot with vcpus=1 succeeds, so I guess something is
> > missing in initialisations of secondary CPUs.
> > This happens with the 202001101800Z but the problem is probably older than
> > that (the testbed used vcpus=1 until today)
> > 
> > Any idea ?
> 
> It should work now with revision 1.199 of src/sys/arch/amd64/amd64/locore.S. 

The same problem happens with i386.

> Nothing else in tree calls cpu_switchto() with prevlwp=NULL any more.  Can
> Xen's cpu_hatch() call idle_loop() directly?

Maybe it could, but cpu_switchto() does some extra work (switch the stack,
set curlwp at least). Maybe this is already done but I'll have to double check.

-- 
Manuel Bouyer 
 NetBSD: 26 years of experience will always make the difference
--


Re: Xen MP panics in cpu_switchto()

2020-01-13 Thread Andrew Doran
Hi Manuel,

On Mon, Jan 13, 2020 at 10:56:23AM +0100, Manuel Bouyer wrote:
> Hello,
> A current Xen domU kernel fails to boot with:
> [   1.000] hypervisor0 at mainbus0: Xen version 4.11.3nb1
> [   1.000] vcpu0 at hypervisor0
> [   1.000] vcpu0: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> [   1.000] vcpu0: node 0, package 0, core 1, smt 1
> [   1.000] vcpu1 at hypervisor0
> [   1.000] vcpu1: Intel(R) Xeon(TM) CPU 3.00GHz, id 0xf64
> [   1.000] vcpu1: node 0, package 1, core 0, smt 0
> [   1.000] xenbus0 at hypervisor0: Xen Virtual Bus Interface
> [   1.000] xencons0 at hypervisor0: Xen Virtual Console Driver
> [   1.9901295] uvm_fault(0x80d5c120, 0x0, 1) -> e
> [   1.9901295] fatal page fault in supervisor mode
> [   1.9901295] trap type 6 code 0 rip 0x8020209f cs 0x8 rflags 0x10246 cr2 0x28 ilevel 0 rsp 0xb7802b19de88
> [   1.9901295] curlwp 0xb7800083b500 pid 0.15 lowest kstack 0xb7802b1992c0
> kernel: page fault trap, code=0
> Stopped in pid 0.15 (system) at netbsd:cpu_switchto+0xf:  movq    28(%r13),%rax
> cpu_switchto() at netbsd:cpu_switchto+0xf
> 
> both amd64 and i386. A boot with vcpus=1 succeeds, so I guess something is
> missing in initialisations of secondary CPUs.
> This happens with the 202001101800Z but the problem is probably older than
> that (the testbed used vcpus=1 until today)
> 
> Any idea ?

It should work now with revision 1.199 of src/sys/arch/amd64/amd64/locore.S. 
Nothing else in tree calls cpu_switchto() with prevlwp=NULL any more.  Can
Xen's cpu_hatch() call idle_loop() directly?

Thank you,
Andrew
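
For readers following the trace: the faulting instruction loads from
28(%r13) and cr2 (the faulting address) is 0x28. Assuming the debugger
prints the displacement in hexadecimal, which is what cr2 suggests, that is
what a load through a NULL base pointer at offset 0x28 looks like, and it is
consistent with prevlwp being NULL. The C sketch below shows the arithmetic
only; struct toy_layout and fault_address() are hypothetical and say nothing
about the real struct lwp layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: five 8-byte words, then the field being loaded. */
struct toy_layout {
	uint64_t pad[5];	/* bytes 0x00..0x27 */
	uint64_t field;		/* byte offset 0x28 */
};

/*
 * Address that a load of 'field' would touch for a given base pointer
 * value. Integer arithmetic only; nothing is dereferenced.
 */
uintptr_t
fault_address(uintptr_t base)
{
	return base + offsetof(struct toy_layout, field);
}

/* With a NULL (zero) base, the touched address equals the field offset. */
void
fault_address_demo(void)
{
	assert(fault_address(0) == 0x28);
}
```

A small faulting address that matches the displacement of the faulting
instruction is usually a good hint that the base register held NULL.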