bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-23 Thread Ludovic Courtès
Hi,

Ludovic Courtès  skribis:

> Of course, the thing boots just fine on that machine when using
> ‘exec.static’.

It’s frustrating I did not get to the bottom of it, but time passes, so
I pushed this workaround in Guix commit
3fb3bd3da530a5f82a169b1fa451474f9d90c3b6.

Ludo’.





bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-17 Thread Ludovic Courtès
Hi,

Ludovic Courtès  skribis:

> … so ‘exec_load’ is doing its job, it seems.

Turns out that may not be the case.

Here’s a *bad* mapping on the second ‘task_resume’ breakpoint (when
‘exec’ is about to start):

--8<---cut here---start->8---
  db> show all threads
  TASKTHREADS
0 gnumach (f5f7cf00): 7 threads:
0 (f5f7be18) .W..N. 0xc11dac04
1 (f5f7bcd0) R..O..(idle_thread_continue)
2 (f5f7bb88) .W.ON.(reaper_thread_continue) 0xc12015d4
3 (f5f7ba40) .W.ON.(swapin_thread_continue) 0xc11f8e2c
4 (f5f7b8f8) .W.ON.(sched_thread_continue) 0
5 (f5f7b7b0) .W.ON.(io_done_thread_continue) 0xc1201f74
6 (f5f7b668) .W.ON.(net_thread_continue) 0xc11db0a8
1 ext2fs (f5f7ce40): 6 threads:
0 (f5f7b520) RF
1 (f5f7b290) .W.O..(mach_msg_receive_continue) 0
2 (f5f7b148) .W.O..(mach_msg_receive_continue) 0
3 (f5f7b000) .W.O..(mach_msg_continue) 0
4 (f67d3e20) .W.O..(mach_msg_receive_continue) 0
5 (f67d3cd8) .W.O..(mach_msg_continue) 0
2 exec (f5f7cd80): (f5f7b3d8) ..SO..(thread_bootstrap_return)
   db> trace
  task_resume(f593e010,fb7d9010,f5f73e80,c106972a)
  ipc_kobject_server(f593e000,3,18,0)+0x1eb
  mach_msg_trap(b4c0,3,18,20,8)+0x1703
  > user space <
  db> x/tbx 0xcbc 0xf5f7b3d8

  no memory is assigned to address 0cbc
  0
  db> show map $map2
Map 0xf5f6ff30: name="exec", pmap=0xf5f71fa8,ref=1,nentries=5
  size=290816,resident:225280,wired=0
  version=13
 map entry 0xf625ec08: start=0x0, end=0x1000
 prot=1/7/copy, object=0x0, offset=0x0
 map entry 0xf625ebb0: start=0x1000, end=0x26000
 prot=5/7/copy, object=0xf5f6ad70, offset=0x0
  Object 0xf5f6ad70: size=0x25000, 1 references
  37 resident pages, 0 absent pages, 0 paging ops
   memory object=0x0 (offset=0x0),control=0x0, name=0xf5f82780
   uninitialized,temporaryinternal,copy_strategy=0
   shadow=0x0 (offset=0x0),copy=0x0
 map entry 0xf625eb58: start=0x26000, end=0x34000
 prot=1/7/copy, object=0xf5f6ad20, offset=0x0
  Object 0xf5f6ad20: size=0xe000, 1 references
  14 resident pages, 0 absent pages, 0 paging ops
   memory object=0x0 (offset=0x0),control=0x0, name=0xf5f82730
   uninitialized,temporaryinternal,copy_strategy=0
   shadow=0x0 (offset=0x0),copy=0x0
 map entry 0xf625eb00: start=0x34000, end=0x37000
 prot=3/7/copy, object=0xf5f6acd0, offset=0x0
  Object 0xf5f6acd0: size=0x3000, 1 references
  3 resident pages,--db_more--
--8<---cut here---end--->8---

Compare with what a “good” mapping looks like at that same moment:

--8<---cut here---start->8---
  start ext2fs: Hurd server bootstrap: ext2fs[device:hd0s1]Kernel Breakpoint 
trap,
   eip 0xc1030d5b
  Breakpoint at  task_resume: pushl   %ebp
  db> show all threads
  TASKTHREADS
0 gnumach (f5f7cf00): 7 threads:
0 (f5f7be18) .W..N. 0xc11dac04
1 (f5f7bcd0) R..O..(idle_thread_continue)
2 (f5f7bb88) .W.ON.(reaper_thread_continue) 0xc12015d4
3 (f5f7ba40) .W.ON.(swapin_thread_continue) 0xc11f8e2c
4 (f5f7b8f8) .W.ON.(sched_thread_continue) 0
5 (f5f7b7b0) .W.ON.(io_done_thread_continue) 0xc1201f74
6 (f5f7b668) .W.ON.(net_thread_continue) 0xc11db0a8
1 ext2fs (f5f7ce40): 6 threads:
0 (f5f7b520) RF
1 (f5f7b290) .W.O..(mach_msg_receive_continue) 0
2 (f5f7b148) .W.O..(mach_msg_receive_continue) 0
3 (f5f7b000) .W.O..(mach_msg_continue) 0
4 (f67d2e20) .W.O..(mach_msg_receive_continue) 0
5 (f67d2cd8) .W.O..(mach_msg_continue) 0
2 exec (f5f7cd80): (f5f7b3d8) ..SO..(thread_bootstrap_return)
  db> x/tbx 0xcbc 0xf5f7b3d8
  8
  db> show map $map2
  Map 0xf5f6ff30: name="exec", pmap=0xf5f71fa8,ref=1,nentries=5
  size=290816,resident:229376,wired=0
  version=14
   map entry 0xf625ec08: start=0x0, end=0x1000
   prot=1/7/copy, object=0xf5f6ad70, offset=0x0
Object 0xf5f6ad70: size=0x1000, 1 references
1 resident pages, 0 absent pages, 0 paging ops
 memory object=0x0 (offset=0x0),control=0x0, name=0xf5f82780
 uninitialized,temporary  internal,copy_strategy=0
 shadow=0x0 (offset=0x0),copy=0x0
   map entry 0xf625ebb0: start=0x1000, end=0x26000
   prot=5/7/copy, object=0xf5f6ad20, offset=0x0
Object 0xf5f6ad20: size=0x25000, 1 references
37 resident pages, 0 absent pages, 0 paging ops
 memory object=0x0 (offset=0x0),control=0x0, name=0xf5f82730
 uninitialized,temporary  internal,copy_strategy=0
 shadow=0x0 (offset=0x0),copy=0x0
   map entry 0xf625eb58: start=0x26000, end=0x34000
   prot=1/7/copy, 

bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-10 Thread Ludovic Courtès
Ludovic Courtès  skribis:

> Through a dichotomy I tried to see how far it goes.  The info I have so
> far is that ld.so errors out from elf/rtld.c:563 (line 565 is not
> reached):
>
> 558:  if (bootstrap_map.l_addr || ! 
> bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)])
> 559:{
> 560:  /* Relocate ourselves so we can do normal function calls and
> 561: data access using the global offset table.  */
> 562:
> 563:  ELF_DYNAMIC_RELOCATE (_map, 0, 0, 0);
> 564:}
> 565:  bootstrap_map.l_relocated = 1;
> ...
> 578:  __rtld_malloc_init_stubs ();

Via brute force¹, I found that ‘__assert_fail’ is hit, with its first
argument in $eax being:

--8<---cut here---start->8---
db> x/c 0x28604,80  

ELF32_R_TYPE (reloc->r_info) == R_386_RELATIVE\000\000map->l_in 

fo[VERSYMIDX (DT_VERSYM)] != NULL\000\000Fatal glibc error: Too 

 many audit mo  

--8<---cut here---end--->8---

This comes from i386/dl-machine.h:

--8<---cut here---start->8---
auto inline void
__attribute ((always_inline))
elf_machine_rel_relative (Elf32_Addr l_addr, const Elf32_Rel *reloc,
  void *const reloc_addr_arg)
{
  Elf32_Addr *const reloc_addr = reloc_addr_arg;
  assert (ELF32_R_TYPE (reloc->r_info) == R_386_RELATIVE);
  *reloc_addr += l_addr;
}
--8<---cut here---end--->8---

How can we get there?  Looking at ‘_dl_start’, it could be that
‘elf_machine_load_address’ returns a bogus value and we end up reading
wrong ELF data?  Or it could be memory corruption somewhere.  Or…?

Thing is, it’s not fully deterministic (happens 9 times out of 10 with
KVM, never happens without KVM).

Ideas?  :-)

Ludo’.

¹ Building with ‘-fno-optimize-sibling-calls’ didn’t help get nicer
  backtraces, but that’s prolly because all that early relocation code
  is inlined.





bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-09 Thread Samuel Thibault
Ludovic Courtès, le dim. 09 oct. 2022 18:09:07 +0200, a ecrit:
> So it would seem that ‘_dl_start’ is called and somehow then a tail-call
> to ‘_dl_fatal_printf’ is made.

Perhaps you can build glibc without tail-call optimization?
(-fno-optimize-sibling-calls)

Samuel





bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-09 Thread Ludovic Courtès
Hi!

Ludovic Courtès  skribis:

> $ addr2line -e  
> /gnu/store/m8afvcgwmrfhvjpd7b0xllk8vv5isd6j-glibc-cross-i586-pc-gnu-2.33/lib/ld.so.1
>  0x1000 0x11627 0x11bb
> ??:0
> /tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/glibc-2.33/elf/dl-misc.c:333
> :?
>
>
> That’s ‘_dl_fatal_printf’ calling ‘_exit’; it’s trying to tell us
> something.
>
> I’ll try and rebuild the system with the debugging patches at
> , to
> get early ld.so output, for lack of a better solution…

I tried adapted the patches above and tried them, but it seems that
‘_dl_sysdep_start’ isn’t even reached.  For example, I set a breakpoint
on ‘mach_task_self’ (called from ‘__mach_init’, called from
‘_dl_sysdep_start’), but that’s never reached (I’m assuming ‘break/tu’
is reliable, is it?).

The user-space backtrace upon trap remains unhelpful:

--8<---cut here---start->8---
start ext2fs: Hurd server bootstrap: ext2fs[device:hd0s1]Kernel Breakpoint 
trap,
 eip 0xc1030d5b 

Breakpoint at  task_resume: pushl   %ebp

db> debug traps /on 

db> b task_terminate

set breakpoint #2   

db> c   

Kernel Debug trap trap, eip 0xc1030d5b  

 execkernel: Page fault (14), code=6

Stopped at  0x1000: pushl   0x4(%ebx)   

> user space <  
> 
0x1000(bf24,0,0,1160b,0)

0x11627(bf9c,0,0,0,2)   

0x11bb()

--8<---cut here---end--->8---

… where:

--8<---cut here---start->8---
$ addr2line -e 
/gnu/store/4p1kab1c4h7h3kvgcm1hbjja4y5k9x4p-glibc-cross-i586-pc-gnu-2.33/lib/ld.so.1
 0x11627 0x11bb
/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/glibc-2.33/elf/dl-misc.c:333
:?
$ objdump -S 
/gnu/store/4p1kab1c4h7h3kvgcm1hbjja4y5k9x4p-glibc-cross-i586-pc-gnu-2.33/lib/ld.so.1
 --start-address=0x11b0 |head -40

/gnu/store/4p1kab1c4h7h3kvgcm1hbjja4y5k9x4p-glibc-cross-i586-pc-gnu-2.33/lib/ld.so.1:
 file format elf32-i386


Disassembly of section .text:

11b0 <_start>:
11b0:   89 e0   mov%esp,%eax
11b2:   83 ec 0csub$0xc,%esp
11b5:   50  push   %eax
11b6:   e8 b5 0a 00 00  call   1c70 <_dl_start>
11bb:   83 c4 10add$0x10,%esp
--8<---cut here---end--->8---

So it would seem that ‘_dl_start’ is called and somehow then a tail-call
to ‘_dl_fatal_printf’ is made.

Through a dichotomy I tried to see how far it goes.  The info I have so
far is that ld.so errors out from elf/rtld.c:563 (line 565 is not
reached):

--8<---cut here---start->8---
558:  if (bootstrap_map.l_addr || ! 
bootstrap_map.l_info[VALIDX(DT_GNU_PRELINKED)])
559:{
560:  /* Relocate ourselves so we can do normal function calls and
561: data access using the global offset table.  */
562:
563:  ELF_DYNAMIC_RELOCATE (_map, 0, 0, 0);
564:}
565:  bootstrap_map.l_relocated = 1;
...
578:  __rtld_malloc_init_stubs ();
--8<---cut here---end--->8---

It’s hard to be more precise because ELF_DYNAMIC_RELOCATE is a macro
that expands to quite a lot of code.

I don’t see the code path that would lead to a ‘_dl_fatal_printf’ call
though.

Ideas?  :-)

Ludo’.





bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-08 Thread Ludovic Courtès
Hi Samuel,

Samuel Thibault  skribis:

> About the backtrace:
>
>> user space <
> 0x1000(bf24,0,0,1160b,0)
> 0x11627(bf9c,0,0,0,2)
> 0x11bb()
>
> That is quite surprising actually: in my ld.so there is nothing useful
> at 0x1000. Perhaps you can check what 0x11627 is all about?

Sure:

--8<---cut here---start->8---
$ addr2line -e  
/gnu/store/m8afvcgwmrfhvjpd7b0xllk8vv5isd6j-glibc-cross-i586-pc-gnu-2.33/lib/ld.so.1
 0x1000 0x11627 0x11bb
??:0
/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/glibc-2.33/elf/dl-misc.c:333
:?
--8<---cut here---end--->8---

That’s ‘_dl_fatal_printf’ calling ‘_exit’; it’s trying to tell us
something.

I’ll try and rebuild the system with the debugging patches at
, to
get early ld.so output, for lack of a better solution…

>> Program Header:
>> LOAD off0x vaddr 0x paddr 0x align 2**12
>>  filesz 0x0dd8 memsz 0x0dd8 flags r--
>
> We don't have this section in the Debian glibc. It'd probably be useful
> to know what this is about.

Address 0 is for the ‘_begin’ symbol, passed by -Wl,-defsym:

--8<---cut here---start->8---
i586-pc-gnu-gcc   -nostdlib -nostartfiles -r -o 
/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/elf/librtld.os 
'-Wl,-(' 
/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/elf/dl-allobjs.os 
/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/elf/rtld-libc.a -lgcc 
'-Wl,-)' \
  
-Wl,-Map,/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/elf/librtld.os.map
i586-pc-gnu-gcc   -nostdlib -nostartfiles -shared -o 
/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/elf/ld.so.new  
   \
  -Wl,-z,combreloc -Wl,-z,relro -Wl,--hash-style=both -Wl,-z,defs   
\
  
/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/elf/librtld.os 
-Wl,--version-script=/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/ld.map
  \
  -Wl,-soname=ld.so.1   \
  -Wl,-defsym=_begin=0
i586-pc-gnu-readelf -s 
/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/elf/ld.so.new \
  | gawk '($7 ~ /^UND(|EF)$/ && $1 != "0:" && $4 != "REGISTER") { print; p=1 } 
END { exit p != 0 }'
mv -f /tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/elf/ld.so.new 
/tmp/guix-build-glibc-cross-i586-pc-gnu-2.33.drv-0/build/elf/ld.so
--8<---cut here---end--->8---

And indeed:

--8<---cut here---start->8---
$ objdump -t  
/gnu/store/m8afvcgwmrfhvjpd7b0xllk8vv5isd6j-glibc-cross-i586-pc-gnu-2.33/lib/ld.so.1|grep
 _begin
 l   *ABS*    _begin
--8<---cut here---end--->8---

That ‘-Wl,-defsym=_begin=0’ flag was removed in glibc commit
6f043e0ee7e477f50a44024ed0cb579d5e3f511d (April 2022).

On darnassus it’s different but then it’s Debian’s glibc 2.35, natively
built, so I don’t what conclusions can be drawn:

--8<---cut here---start->8---
ludo@darnassus:~$ /lib/ld.so.1 --version
ld.so (Debian GLIBC 2.35-1) stable release version 2.35.
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
ludo@darnassus:~$ objdump -x /lib/ld.so.1 |head -40

/lib/ld.so.1: file format elf32-i386
/lib/ld.so.1
architecture: i386, flags 0x0150:
HAS_SYMS, DYNAMIC, D_PAGED
start address 0x0001cc40

Program Header:
LOAD off0x vaddr 0x paddr 0x align 2**12
 filesz 0x00038494 memsz 0x00038494 flags r-x
LOAD off0x00038c00 vaddr 0x00039c00 paddr 0x00039c00 align 2**12
 filesz 0x1ca8 memsz 0x1e34 flags rw-
 DYNAMIC off0x00039f24 vaddr 0x0003af24 paddr 0x0003af24 align 2**2
 filesz 0x00b8 memsz 0x00b8 flags rw-
NOTE off0x0114 vaddr 0x0114 paddr 0x0114 align 2**2
 filesz 0x0024 memsz 0x0024 flags r--
--8<---cut here---end--->8---

Thanks for your feedback!

Ludo’.





bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-08 Thread Samuel Thibault
Ludovic Courtès, le ven. 07 oct. 2022 10:24:22 +0200, a ecrit:
> trap, eip 0xc10305c1  
>   
> Breakpoint at  task_terminate:  pushl   %ebp  
>   
> db> show all threads  
>   
> TASKTHREADS   
>   
>   0 gnumach (f5f7cf00): 7 threads:
>   
>   0 (f5f7be18) .W..N. 0xc11dac04  
>   
>   1 (f5f7bcd0) R..O..(idle_thread_continue)   
>   
>   2 (f5f7bb88) .W.ON.(reaper_thread_continue) 0xc12015d4  
>   
>   3 (f5f7ba40) .W.ON.(swapin_thread_continue) 0xc11f8e2c  
>   
>   4 (f5f7b8f8) .W.ON.(sched_thread_continue) 0
>   
>   5 (f5f7b7b0) .W.ON.(io_done_thread_continue) 0xc1201f74 
>   
>   6 (f5f7b668) .W.ON.(net_thread_continue) 0xc11db0a8 
>   
>   1 ext2fs (f5f7ce40): 6 threads: 
>   
>   0 (f5f7b520) .W.O.F(mach_msg_continue) 0
>   
>   1 (f5f7b290) .W.O..(mach_msg_receive_continue) 0
>   
>   2 (f5f7b148) .W.O..(mach_msg_receive_continue) 0
>   
>   3 (f5f7b000) .W.O..(mach_msg_continue) 0
>   
>   4 (f67d4e20) .W.O..(mach_msg_receive_continue) 0
>   
>   5 (f67d4cd8) .W.O..(mach_msg_continue) 0
>   
>   2 exec (f5f7cd80): (f5f7b3d8) R.
>   
> db> trace/t 0xf5f7b3d8
>   
> task_terminate(f625eb10,0,f5f7cd80,f5f7b3d8,c11da940) 
>   
> exception_try_task(1,1,bffefffc,,c1202b4c)+0x58   
>   
> exception(1,1,bffefffc,c10096da,f5957fbc)+0x7a
>   
> interrupted_pc(1,1,bffefffc,c102ce99,c1202b40)
>   
> trap_name(1,f5957f80,f5f73f4c,f5f73f58)   
>   
> vm_fault(f5f6ff30,bffef000,3,0,0,c1008ee4,f5f82550,fb7d9000)+0x74a
>   
> user_trap(f5f7a718)+0x2df 
>   
> > Page fault (14) at 0x1000 < 
> >   
> > user space <
> >   
> db> show map $map2
>   
> Map 0xf5f6ff30: name="exec", pmap=0xf5f71fa8,ref=1,nentries=5 
>   
> size=290816,resident:290816,wired=0   
>   
> version=14
>   
>  map entry 0xf625ec08: start=0x0, end=0x1000  
>   
>  prot=1/7/copy, object=0xf5f6a7d0, offset=0x0 
>   
>   Object 0xf5f6a7d0: size=0x1000, 1 references
>   
>   1 resident pages, 0 absent pages, 0 paging ops  
>   
>memory object=0x0 (offset=0x0),control=0x0, name=0xf5938968
>   
>uninitialized,temporary  internal,copy_strategy=0  
>   
>shadow=0x0 (offset=0x0),copy=0x0   
>   
>  map entry 0xf625ebb0: start=0x1000, end=0x26000  
>

bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-08 Thread Samuel Thibault
Ludovic Courtès, le ven. 07 oct. 2022 00:10:15 +0200, a ecrit:
> Samuel Thibault  skribis:
> 
> > Ludovic Courtès, le jeu. 06 oct. 2022 15:14:13 +0200, a ecrit:
> >> such that it would stop on each trap, hopefully allowing me to see why
> >> exec is not starting.
> >
> > Also, better use exec.static to have static addresses.
> 
> Thanks for the hint.
> 
> Of course, the thing boots just fine on that machine when using
> ‘exec.static’.

Uh. At least you have a workaround :)

> So the issue might be somewhere in ld.so, or triggered by ld.so.
> Any debugging tricks here?

maybe for a start check with 

show map $map2

what is actually mapped, whether it's just ld.so, or also exec, etc.

Perhaps you can also try to run other programs than exec, like small
dumb programs.

Samuel





bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-08 Thread Samuel Thibault
Ludovic Courtès, le jeu. 06 oct. 2022 15:14:13 +0200, a ecrit:
> such that it would stop on each trap, hopefully allowing me to see why
> exec is not starting.

Also, better use exec.static to have static addresses.

We use the dynamic version of exec "just because we can", but that makes
debugging difficult.

Samuel





bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-07 Thread Ludovic Courtès
Hi!

Samuel Thibault  skribis:

> Ludovic Courtès, le ven. 07 oct. 2022 00:10:15 +0200, a ecrit:

[...]

>> Of course, the thing boots just fine on that machine when using
>> ‘exec.static’.
>
> Uh. At least you have a workaround :)

Yup.  :-)

>> So the issue might be somewhere in ld.so, or triggered by ld.so.
>> Any debugging tricks here?
>
> maybe for a start check with 
>
> show map $map2
>
> what is actually mapped, whether it's just ld.so, or also exec, etc.

On the first trap (page fault) I see:

--8<---cut here---start->8---
db> show all tasks  

 ID TASK NAME [THREADS] 

  0 f5f7cf00 gnumach [7]

  1 f5f7ce40 ext2fs [6] 

  2 f5f7cd80 exec [1]   

db> show map $map2  

Map 0xf5f6ff30: name="exec", pmap=0xf5f71fa8,ref=1,nentries=5   

size=290816,resident:290816,wired=0 

version=14  

 map entry 0xf625ec08: start=0x0, end=0x1000

 prot=1/7/copy, object=0xf5f6a7d0, offset=0x0   

  Object 0xf5f6a7d0: size=0x1000, 1 references  

  1 resident pages, 0 absent pages, 0 paging ops

   memory object=0x0 (offset=0x0),control=0x0, name=0xf5938968  

   uninitialized,temporary  internal,copy_strategy=0

   shadow=0x0 (offset=0x0),copy=0x0 

 map entry 0xf625ebb0: start=0x1000, end=0x26000

 prot=5/7/copy, object=0xf5f6ad70, offset=0x0   

  Object 0xf5f6ad70: size=0x25000, 1 references 

  37 resident pages, 0 absent pages, 0 paging ops   

   memory object=0x0 (offset=0x0),control=0x0, name=0xf5f82780  

   uninitialized,temporary  internal,copy_strategy=0

   shadow=0x0 (offset=0x0),copy=0x0 

 map entry 0xf625eb58: start=0x26000, end=0x34000   

 prot=1/7/copy, object=0xf5f6ad20, offset=0x0   

  Object 0xf5f6ad20: size=0xe000, 1 references  

  14 resident pages, 0 absent pages, 0 paging ops   

   memory object=0x0 (offset=0x0),control=0x0, name=0xf5f82730  

   uninitialized,temporary  internal,copy_strategy=0

   shadow=0x0 (offset=0x0),copy=0x0 

 map entry 0xf625eb00: start=0x34000, end=0x37000   

 prot=3/7/copy, object=0xf5f6acd0, offset=0x0   

  Object 0xf5f6acd0: size=0x3000, 1 references  

  3 resident pages, 0 absent pages, 0 paging ops

   memory object=0x0 (offset=0x0),control=0x0, name=0xf5f826e0  

   uninitialized,temporary  internal,copy_strategy=0

   shadow=0x0 (offset=0x0),copy=0x0 

 map entry 0xf625eaa8: start=0xbfff, end=0xc000 

 prot=3/7/copy, object=0xf5f6ac80, offset=0x0   

bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-06 Thread Ludovic Courtès
Samuel Thibault  skribis:

> Ludovic Courtès, le jeu. 06 oct. 2022 15:14:13 +0200, a ecrit:
>> such that it would stop on each trap, hopefully allowing me to see why
>> exec is not starting.
>
> Also, better use exec.static to have static addresses.

Thanks for the hint.

Of course, the thing boots just fine on that machine when using
‘exec.static’.

So the issue might be somewhere in ld.so, or triggered by ld.so.
Any debugging tricks here?

Ludo’.





bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-06 Thread Ludovic Courtès
Hi!

As suggested by Samuel on IRC, I did that early on in kdb:

  debug traps /on

such that it would stop on each trap, hopefully allowing me to see why
exec is not starting.

--8<---cut here---start->8---
module 0: ext2fs --multiboot-command-line=${kernel-command-line} 
--host-priv-por
t=${host-port} --device-master-port=${device-port} 
--exec-server-task=${exec-tas
k} --store-type=typed --x-xattr-translator-records ${root} $(task-create) 
$(task
-resume)

module 1: exec 
/gnu/store/99sqiayswrxxb80331pl7jxin18wv28b-hurd-0.9-1.91a5167/hu   
 
rd/exec $(exec-task=task-create)

2 multiboot modules 

task loaded: ext2fs --multiboot-command-line=root=device:hd0s1 
root=3367134b-cfb
d-1e90-2f38-dfd13367134b 
gnu.system=/gnu/store/m66ccpdzdbcd3k2fdvyaj8cgmk23lybn-
system gnu.load=/gnu/store/m66ccpdzdbcd3k2fdvyaj8cgmk23lybn-system/boot 
--host-p
riv-port=1 --device-master-port=2 --exec-server-task=3 --store-type=typed 
--x-xa
ttr-translator-records device:hd0s1 

task loaded: exec 
/gnu/store/99sqiayswrxxb80331pl7jxin18wv28b-hurd-0.9-1.91a5167  
  
/hurd/exec  



start ext2fs: Hurd server bootstrap: ext2fs[device:hd0s1] execkernel: Page 
fault
 (14), code=6   

Stopped at  0x1000: pushl   0x4(%ebx)   

> user space <  
> 
0x1000(bf24,0,0,1160b,0)

0x11627(bf9c,0,0,0,2)   

0x11bb()

db> show all threads

TASKTHREADS 

  0 gnumach (f5f7cf00): 7 threads:  

  0 (f5f7be18) .W..N. 0xc11dac04

  1 (f5f7bcd0) R..O..(idle_thread_continue) 

  2 (f5f7bb88) .W.ON.(reaper_thread_continue) 0xc12015d4

  3 (f5f7ba40) .W.ON.(swapin_thread_continue) 0xc11f8e2c

  4 (f5f7b8f8) .W.ON.(sched_thread_continue) 0  

  5 (f5f7b7b0) .W.ON.(io_done_thread_continue) 0xc1201f74   

  6 (f5f7b668) .W.ON.(net_thread_continue) 0xc11db0a8   

  1 ext2fs (f5f7ce40): 6 threads:   

  0 (f5f7b520) .W.O.F(mach_msg_continue) 0  

  1 (f5f7b290) .W.O..(mach_msg_receive_continue) 0  

  2 (f5f7b148) .W.O..(mach_msg_receive_continue) 0  

  3 (f5f7b000) .W.O..(mach_msg_continue) 0  

  4 (f67d4e20) .W.O..(mach_msg_receive_continue) 0  

  5 (f67d4cd8) .W.O..(mach_msg_continue) 0  

  2 exec (f5f7cd80): (f5f7b3d8) R.  

--8<---cut here---end--->8---

Then lots of page faults with the same stack trace, seemingly endlessly:

--8<---cut here---start->8---
db> c   

kernel: Page fault (14), code=6 

Stopped at  0x1000: pushl   0x4(%ebx)   

> user space <  
> 
0x1000(bf24,0,0,1160b,0)  

bug#58320: Hurd VM fails to boot on AMD EPYC (kvm-amd)

2022-10-05 Thread Ludovic Courtès
Hello!

On AMD EPYC processors, as found on the build nodes of ci.guix.gnu.org,
childhurd VMs fail to boot when running with ‘qemu-system-i386
-enable-kvm’ (the kvm-amd Linux kernel module is used), with the Hurd
startup process hanging before /hurd/exec has been started:

--8<---cut here---start->8---
module 0: ext2fs --multiboot-command-line=${kernel-command-line} 
--host-priv-por
t=${host-port} --device-master-port=${device-port} 
--exec-server-task=${exec-tas
k} --store-type=typed --x-xattr-translator-records ${root} $(task-create) 
$(task
-resume)

module 1: exec 
/gnu/store/99sqiayswrxxb80331pl7jxin18wv28b-hurd-0.9-1.91a5167/hu   
 
rd/exec $(exec-task=task-create)

2 multiboot modules 

task loaded: ext2fs --multiboot-command-line=root=device:hd0s1 
root=3367134b-cfb
d-1e90-2f38-dfd13367134b 
gnu.system=/gnu/store/m66ccpdzdbcd3k2fdvyaj8cgmk23lybn-
system gnu.load=/gnu/store/m66ccpdzdbcd3k2fdvyaj8cgmk23lybn-system/boot 
--host-p
riv-port=1 --device-master-port=2 --exec-server-task=3 --store-type=typed 
--x-xa
ttr-translator-records device:hd0s1 

task loaded: exec 
/gnu/store/99sqiayswrxxb80331pl7jxin18wv28b-hurd-0.9-1.91a5167  
  
/hurd/exec  



start ext2fs: Hurd server bootstrap: ext2fs[device:hd0s1] exec  

--8<---cut here---end--->8---

Kdb shows these two tasks:

--8<---cut here---start->8---
Stopped   at  mach
ine_idle+0x26:  nop 

machine_idle(0,c102f380,f5f7bcd0,c102e3fa)+0x26 

idle_thread_continue(f5f7ade0,36366000,f5f73868,f5f73890,0)+0x73

> user space <  
> 
db> show all threads

TASKTHREADS 

  0 gnumach (f5f7cf00): 7 threads:  

  0 (f5f7be18) .W..N. 0xc11dac04

  1 (f5f7bcd0) R.   

  2 (f5f7bb88) .W.ON.(reaper_thread_continue) 0xc12015d4

  3 (f5f7ba40) .W.ON.(swapin_thread_continue) 0xc11f8e2c

  4 (f5f7b8f8) .W.ON.(sched_thread_continue) 0  

  5 (f5f7b7b0) .W.ON.(io_done_thread_continue) 0xc1201f74   

  6 (f5f7b668) .W.ON.(net_thread_continue) 0xc11db0a8   

  1 ext2fs (f5f7ce40): 6 threads:   

  0 (f5f7b520) .W.O.F(mach_msg_continue) 0  

  1 (f5f7b290) .W.O.F(mach_msg_receive_continue) 0  

  2 (f5f7b148) .W.O..(mach_msg_receive_continue) 0  

  3 (f5f7b000) .W.O..(mach_msg_continue) 0  

  4 (f67d3e20) .W.O..(mach_msg_receive_continue) 0  

  5 (f67d3cd8) .W.O..(mach_msg_continue) 0  

db> trace/t 0xf5f7b520  

Continuation mach_msg_continue  

> user space <  
> 
0x80ccaec() 

--8<---cut here---end--->8---

For ext2fs.static, that just means thread 0 is here:

--8<---cut here---start->8---
$ addr2line -e