Frank,

Thank you very much for your detailed account of how the exception location
might be discovered using the kernel debugger. I think this may be a long
exercise, but I wanted to at least acknowledge your message.
On Wednesday, 29 July 2020 09:24:17 CEST Frank Mehnert wrote:
>
> I want to encourage you to take the program counter value seriously. The
> message says that there was an access to the memory at address 0x38
> (sounds like an access to offset 0x38 of an object where the object pointer
> was not initialized) and the corresponding program counter in userland
> is 0x3a8bd9. From that value I guess that your host is AMD64.

Yes, that is correct. I also assumed that the error was related to a null
reference.

> Now the question is of course: Which application triggered this exception?
> If you know the answer then you should disassemble the corresponding binary
> with
>
>   objdump -ldC <filename> | less
>
> and search for the program counter. If your binary was compiled with
> debugging information, you will even see the source code around the
> faulting instruction.
>
> If your binary was not compiled with debugging information:
>
> 1. If the application is compiled within the L4Re tree then use the
>    binary from the package build directory because that one is not
>    stripped, for example
>
>      build-x86-64/pkg/hello/server/src/OBJ-amd64_gen-l4f/hello
>
>    rather than
>
>      build-x86-64/bin/amd64_gen/l4f/hello
>
>    because the latter binary is stripped (i.e. contains no debugging
>    information) if CONFIG_BID_STRIP_PROGS is set to 'y'.

This is a useful reminder, but I think I must have experienced difficulties
before with the bin subdirectory's contents, so I tend to access the
appropriate binaries inside their package directories anyway. It's probably
just good fortune that something in my mind remembers the right kind of
location to investigate.

> 2. If you compiled the binary yourself, make sure to add the '-g' flag
>    to the compiler options. For L4Re applications using the L4Re build
>    infrastructure this is done automatically, see 1.

I think that getting programs built outside the L4Re build framework would
be too advanced for me.
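Searching a long disassembly by hand in less works, but the lookup can also be
scripted. The following is only a minimal sketch, assuming the usual
"address: bytes mnemonic" layout of objdump -d output; the helper name
find_instruction, the symbol name in the sample header and the sample text
itself (modelled on the listing quoted later in this message) are
illustrative and not part of any L4Re tooling. It returns the instruction
starting at or before the given program counter, plus a few preceding
instructions for context.

```python
import re

# Matches instruction lines such as " 100491b:  ff 50 70  callq *0x70(%rax)".
# Symbol headers ("000000000100490f <name>:") and source-location lines do
# not match, because the colon must directly follow the hexadecimal address.
INSN_RE = re.compile(r"^\s*([0-9a-fA-F]+):\s+(.*\S)\s*$")


def find_instruction(disassembly, pc, context=4):
    """Return (address, text) pairs ending at the last instruction <= pc."""
    insns = []
    for line in disassembly.splitlines():
        match = INSN_RE.match(line)
        if match:
            insns.append((int(match.group(1), 16), match.group(2)))
    insns.sort()
    # Index of the last instruction that starts at or before pc.
    last = -1
    for index, (address, _) in enumerate(insns):
        if address <= pc:
            last = index
    if last < 0:
        return []
    return insns[max(0, last - context):last + 1]


# Hypothetical excerpt in objdump -d layout (the symbol name is made up).
SAMPLE = """\
000000000100490f <hypothetical_symbol>:
 100490f:\t49 8b 04 24          \tmov    (%r12),%rax
 1004913:\t4c 89 ee             \tmov    %r13,%rsi
 1004916:\t31 d2                \txor    %edx,%edx
 1004918:\t4c 89 e7             \tmov    %r12,%rdi
 100491b:\tff 50 70             \tcallq  *0x70(%rax)
"""

for address, text in find_instruction(SAMPLE, 0x100491b, context=2):
    print(f"{address:x}: {text}")
```

Matching on "at or before" rather than on the exact address is a deliberate
choice: a reported address that falls inside an instruction still resolves to
that instruction. For a real binary, the text passed in would be the saved
output of the objdump command quoted above.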
> Next question: Is your binary linked statically or does it use dynamic
> libraries? You can find this out by doing
>
>   objdump -p <filename>
>
> If the output contains at least one line with 'NEEDED' then your binary
> uses dynamic libraries and looking for the program counter can be more
> difficult if the fault happens in a dynamic library because the library
> code is relocated to an unknown address when the library is loaded at
> program start.
>
> Therefore for debugging it's always advisable to use statically linked
> binaries. If your application uses the L4Re build infrastructure, set
>
>   MODE = static
>
> in the Makefile. If you use your own Makefile, make sure to add
>
>   -static
>
> to the linker flags.
>
> Exploring your application binary is always the first advisable strategy
> for such an exception.

Here, I was using shared libraries, so I have now switched the linking of
the offending program to be static.

[Details of the current thread and the return instruction address...]

> Remember: You are inspecting the region mapper thread which is != the
> thread which triggered the exception! Therefore, if you press <space>
> at the word marked as 'Return frame: IP', you will see the code for
> 'enter_kdebug()'. That doesn't help you.

This was certainly very useful advice, saving me quite some potential
frustration, along with this:

> Now use the 'lp' view to see the list of present threads in the system. The
> cursor is placed at the current thread (the region mapper of your
> application). Look around at threads with the same 'sp' value (sp = space,
> the address space of the application).
> See this example:
>
>   id cpu name     pr sp wait to state
>   20   0 hello     2 1c   1d    ready,rcv_wait
>   1d   0 #hello   ff 1c         ready
>    d   0 moe      ff  c    -    ready,rcv_wait
>    b   0 sigma0    1  a    -    ready,rcv_wait
>    9   1 -----     0  1         ready
>    8   3 -----     0  1         ready
>    7   2 -----     0  1         ready
>    6   0 -----     0  1         ready
>
> (this setup emulates 4 CPUs, thus there are 4 idle threads)
>
> Thread '1d' is the region mapper thread of the hello application. 'hello'
> has 2 threads, thread 1d and thread 20. Thread 20 is currently waiting
> for an IPC from thread 1d. Therefore thread 20 is the one you want to
> inspect. Go there and press enter. Then move the TCB stack cursor down
> to 'Return frame: IP' as I told you before, see there:

OK, so following these instructions, I think I have correctly identified the
waiting thread in the same "space" as the region mapper thread. Navigating
to the return instruction address does indeed show the reported address:

  L4Re[rm]: unhandled read page fault at 0x70 pc=0x100491b

And if I look in the objdump output, at least on some occasions, I can find
an instruction that would cause the exception. The code looks like this:

  100490f:   49 8b 04 24     mov    (%r12),%rax
  1004913:   4c 89 ee        mov    %r13,%rsi
  1004916:   31 d2           xor    %edx,%edx
  1004918:   4c 89 e7        mov    %r12,%rdi
  100491b:   ff 50 70        callq  *0x70(%rax)

It is at this final instruction that the exception occurs, and the offset is
as reported, too. The awkward thing here, though, is that the offending
instruction is a virtual method call within the same instance:

  this->flush_flexpage(flexpage);

As I think I noted in my previous message, concurrency issues may be involved
here, and I rather think I may need to step back and consider whether I am
doing things well enough.

Paul

_______________________________________________
l4-hackers mailing list
[email protected]
http://os.inf.tu-dresden.de/mailman/listinfo/l4-hackers
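The arithmetic behind the faulting call can be sketched as follows. This is
one possible reading rather than a diagnosis: it assumes that the read fault
comes from 'callq *0x70(%rax)' fetching its call target, and that the word
loaded by 'mov (%r12),%rax' is the object's virtual table pointer, as is
usual for C++ on AMD64. The helper name decode_indirect_call_fault is made up
for illustration.

```python
POINTER_SIZE = 8  # bytes per pointer, and so per table slot, on AMD64


def decode_indirect_call_fault(fault_address, displacement):
    """For 'callq *disp(%reg)', recover the register value and slot index."""
    base = fault_address - displacement
    if base < 0 or displacement % POINTER_SIZE:
        raise ValueError("fault does not match a pointer-sized table access")
    return base, displacement // POINTER_SIZE


# Values from the message: fault at 0x70, displacement 0x70 in the call.
base, slot = decode_indirect_call_fault(0x70, 0x70)
print(f"table pointer = {base:#x}, slot index = {slot}")
```

Under those assumptions the table pointer comes out as zero and the slot
index as 14. Since the earlier 'mov (%r12),%rax' completed without faulting,
this would suggest that the object itself was readable but its first word was
zero at that moment, which might fit an object being cleared, reused or not
yet constructed while another thread calls into it.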
