Hi Stephen, On Sat, Apr 5, 2025 at 12:30 PM Stephen Brennan <stephen.s.bren...@oracle.com> wrote: > > Hi Tao, > > Thanks for digging through the code with me. I'm not very confident or > familiar with it, so I don't have 100% confident answers, but I'll do my > best to answer:
Thanks for your explanation! I think your modification is reasonable and I have run tests against your patch, no regression is found. Though I still haven't figured out whether it is a gdb issue or crash specific, however I think it is OK to accept your patch for now. So I will ack your No.1 and No.2 patch. Though No.2 patch needs some cleanup on the code, but I think that is just a minor change, I have no objection to its major part. So ack. Thanks, Tao Liu > > Tao Liu <l...@redhat.com> writes: > > Hi Stephen, > > > > On Fri, Apr 4, 2025 at 11:43 AM Stephen Brennan > > <stephen.s.bren...@oracle.com> wrote: > >> > >> Stephen Brennan <stephen.s.bren...@oracle.com> writes: > >> > Tao Liu <l...@redhat.com> writes: > >> >> Hi Stephen, > >> >> > >> >> Thanks for reporting the issue and patch. Please check if I understood > >> >> you correctly. The correct output you are expecting in your case is: > >> >> > >> >> crash> sym ata_dummy_port_ops > >> >> ffffffffc0a71580 (?) ata_dummy_port_ops [libata] > >> >> crash> mod -s libata > >> >> MODULE NAME TEXT_BASE SIZE > >> >> OBJECT FILE > >> >> ffffffffc0a7b640 libata ffffffffc0a47000 520192 > >> >> > >> >> /usr/lib/debug/lib/modules/6.12.0-0.11.8.el9uek.x86_64/kernel/drivers/ata/libata.ko.debug > >> >> crash> sym ata_dummy_port_ops > >> >> ffffffffc0a71580 (B) ata_dummy_port_ops [libata] > >> >> crash> p/x &ata_dummy_port_ops > >> >> $1 = 0xffffffffc0a6fe80 <<-------- should be 0xffffffffc0a71580, > >> >> same as sym's output, right? > >> > > >> > Yes, that's correct. > >> > > >> >> If that is the case, then after applied your patch, the issue still > >> >> exists on my machine: > >> >> > >> >> crash> sym fuse_miscdevice > >> >> ffffffffc05f7fe0 (?) fuse_miscdevice [fuse] > >> >> crash> mod -s fuse > >> >> MODULE NAME TEXT_BASE > >> >> SIZE OBJECT FILE > >> >> ffffffffc05f8dc0 fuse ffffffffc05da000 > >> >> 233472 > >> >> /usr/lib/debug/lib/modules/6.11.5-300.fc41.x86_64/kernel/fs/fuse/fuse.ko.debug > >> >> crash> sym fuse_miscdevice > >> >> ffffffffc05f7fe0 (b) fuse_miscdevice [fuse] > >> >> crash> p/x &fuse_miscdevice > >> >> $1 = 0xffffffffc05daf20 << ------ unmatch with the previous value. > >> > > >> > Indeed, I checked with fuse on Fedora (6.11.4-301.fc41.x86_64, so almost > >> > the same version). And I see the same behavior with fuse_miscdevice. > >> > > >> > In your case and mine, the offset is 0x1d0c0, which could be a clue to > >> > how this mismatch happens. > >> > > >> >> The .data section of fuse.ko.debug have non-zero address: > >> >> > >> >> $ readelf -S > >> >> /usr/lib/debug/lib/modules/6.11.5-300.fc41.x86_64/kernel/fs/fuse/fuse.ko.debug > >> >> Section Headers: > >> >> [Nr] Name Type Address Offset > >> >> Size EntSize Flags Link Info Align > >> >> ... > >> >> [50] .data NOBITS 0000000000000e20 00000100 > >> >> 000000000000080f 0000000000000000 WA 0 0 32 > >> >> > >> >> Could you please re-check your patch for this, or is there something > >> >> I'm missing? > >> > > >> > From what I can tell, the important difference here may be that the > >> > symbol you selected is a static variable (which becomes a local ELF > >> > symbol). Apparently, my patch only resolves the issue for *global > >> > variables*, but not for local variables. > >> > > >> > $ eu-readelf -s /usr/lib/debug/lib/modules/$(uname > >> > -r)/kernel/fs/fuse/fuse.ko.debug | grep -P "fuse_(miscdevice|mutex)" > >> > 85: 0000000000000100 80 OBJECT LOCAL DEFAULT 50 > >> > fuse_miscdevice > >> > 1041: 00000000000005c0 32 OBJECT GLOBAL DEFAULT 50 fuse_mutex > >> > > >> > Here is the behavior of crash's master branch on these two symbols: > >> > > >> > crash> sym fuse_miscdevice > >> > ffffffffc02fafe0 (?) fuse_miscdevice [fuse] > >> > crash> sym fuse_mutex > >> > ffffffffc02fb4a0 (?) fuse_mutex [fuse] > >> > crash> mod -s fuse > >> > MODULE NAME TEXT_BASE > >> > SIZE OBJECT FILE > >> > ffffffffc02fbdc0 fuse ffffffffc02dd000 > >> > 233472 > >> > /usr/lib/debug/lib/modules/6.11.4-301.fc41.x86_64/kernel/fs/fuse/fuse.ko.debug > >> > crash> sym fuse_miscdevice > >> > ffffffffc02fafe0 (b) fuse_miscdevice [fuse] > >> > crash> sym fuse_mutex > >> > ffffffffc02fb4a0 (B) fuse_mutex [fuse] > >> > crash> p/x &fuse_miscdevice > >> > $1 = 0xffffffffc02ddf20 > >> > crash> p/x &fuse_mutex > >> > $2 = 0xffffffffc02fa680 > >> > > >> > Both fuse_miscdevice, and fuse_mutex are mismatched (with different > >> > offsets). And now, here's the behavior of crash with my patch: > >> > > >> > crash> sym fuse_miscdevice > >> > ffffffffc02fafe0 (?) fuse_miscdevice [fuse] > >> > crash> sym fuse_mutex > >> > ffffffffc02fb4a0 (?) fuse_mutex [fuse] > >> > crash> mod -s fuse > >> > MODULE NAME TEXT_BASE > >> > SIZE OBJECT FILE > >> > ffffffffc02fbdc0 fuse ffffffffc02dd000 > >> > 233472 > >> > /usr/lib/debug/lib/modules/6.11.4-301.fc41.x86_64/kernel/fs/fuse/fuse.ko.debug > >> > crash> sym fuse_miscdevice > >> > ffffffffc02fafe0 (b) fuse_miscdevice [fuse] > >> > crash> sym fuse_mutex > >> > ffffffffc02fb4a0 (B) fuse_mutex [fuse] > >> > crash> p/x &fuse_miscdevice > >> > $1 = 0xffffffffc02ddf20 > >> > crash> p/x &fuse_mutex > >> > $2 = 0xffffffffc02fb4a0 > >> > > >> > Here the global fuse_mutex is correct, but the static/local > >> > fuse_miscdevice is still incorrect. This indicates that there's more yet > >> > to fix. (But at least my patch does part of the work!) > >> > > >> > I'm wondering whether this has anything to do with ELF relocations. > >> > >> Hello Tao, > > > > Thanks for your detailed debug info, which is quite helpful! For your > > No.1 patch which fixes the global variable relocation, it looks good. > > However, for your No.2 patch in the attached file, I have some > > questions. > >> > >> I've continued debugging and verified that the root cause for this was > >> GDB performing relocations to the DWARF info incorrectly. At the bottom > > > > 1) Is this a gdb bug? If it does, then the proper way is to post No.2 > > patch to gdb mailing list and get it fixed there, then we can do gdb > > patch backporting to crash. Frankly I haven't figured out whether it > > is gdb or crash bug so far by myself... > > I'm also not sure whether it's a GDB or Crash bug. > > Honestly, I think the "bug" is in Linux for producing non-zero section > addresses in a relocatable file. It seems to be universal that section > addresses are zero for relocatble ELF files. > > > 2) I'm quoting part of the commit log in your No.2 patch has follows > > (Please don't send patch as attachment next time, because I cannot > > comment directly on it) > > Understood, sorry. > > > ... > > 1. In default_symfile_offsets(), GDB detects the nonzero sh_addr field > > and fails to apply the user-supplied section offsets. The result is > > that later, in symfile_relocate_debug_section(), relocations are > > applied to the DWARF info using the wrong section addresses, which > > > > Didn't we have corrected the .data section address in No.1 patch of > > the .ko? Then why the "wrong section addresses"? > > The flow is like this: > > 1. crash executes "add-symbol-file [...] -s .data ADDRESS" > 2. add_symbol_file_command() parses these and stores them in a list > 3. We eventually make our way down to syms_from_objfile_1(), which seems > to be the main logic. > a. addr_info_make_relative() takes this list of section offsets, > along with the BFD file, and matches each one against the BFD > section. The function description indicates that the purpose is to > make the addresses provided be relative to the addresses in the > BFD section, which I suppose is why this number is subtracted. > > b. Then, default_symfile_offsets() is called from > syms_from_objfile_1(). This function only cares about ET_REL > relocatable files. It verifies that all sections have a zero > sh_addr, and assuming that is so, it updates the load address and > sets various fields on the BFD file to allow relocations to work > correctly. > 4. Relocations get applied using the values set in 3.b. Since the values > in 3.b were tampered with in 3.a, relocations that are based on the > address of the affected address are done incorrectly. > > I'm still not certain who is at fault, though! > > > results in invalid addresses for variables. Clearly, this has > > happened before, because crash has special-cased the "__ksymtab*" > > > > Which line of code do you mean the special-cased __ksymtab? I didn't > > find one in gdb/symfile.c. > > It is this part of the diff. I've removed it in my commit and also > removed the check for nen-zero section addresses. > > --- gdb-16.2/gdb/symfile.c.orig > +++ gdb-16.2/gdb/symfile.c > @@ -633,7 +633,26 @@ default_symfile_offsets (struct objfile *objfile, > for (cur_sec = abfd->sections; cur_sec != NULL; cur_sec = > cur_sec->next) > /* We do not expect this to happen; just skip this step if the > relocatable file has a section with an assigned VMA. */ > - if (bfd_section_vma (cur_sec) != 0) > + if (bfd_section_vma (cur_sec) != 0 > + /* > + * Kernel modules may have some non-zero VMAs, i.e., like the > + * __ksymtab and __ksymtab_gpl sections in this example: > + * > + * Section Headers: > + * [Nr] Name Type Address > Offset > + * Size EntSize Flags Link Info > Align > + * ... > + * [ 8] __ksymtab PROGBITS 0000000000000060 > 0000ad90 > + * 0000000000000010 0000000000000000 A 0 0 > 16 > + * [ 9] .rela__ksymtab RELA 0000000000000000 > 0000ada0 > + * 0000000000000030 0000000000000018 43 8 > 8 > + * [10] __ksymtab_gpl PROGBITS 0000000000000070 > 0000add0 > + * 00000000000001a0 0000000000000000 A 0 0 > 16 > + * ... > + * > + * but they should be treated as if they are NULL. > + */ > + && strncmp (bfd_section_name (cur_sec), "__k", 3) != 0) > break; > > if (cur_sec == NULL) > > > > section names to avoid this condition. To resolve this, we simply > > drop the check for nonzero sh_addr altogether: the only ET_REL files > > we encounter should be kernel modules, so there's no real reason to > > be picky. > > > > 2. Even with that fixed, the user-supplied section addresses are > > clobbered by the addr_info_make_relative(), which subtracts out > > section offsets during its operation. To resolve this, undo the > > operation for ET_REL files where a section address was provided by > > the user (i.e. crash). > > ... > > > > According to your description, it seems to me the gdb cannot handle > > the nonzero sh_addr in the .ko case, but .ko isn't the only one whose > > .data VMA is non-zero, see this case: > > > > $ readelf -S /lib64/libc.so.6 > > ... > > [30] .data PROGBITS 00000000001e8000 001e8000 > > 00000000000016c8 0000000000000000 WA 0 0 32 > > ... > > > > I don't know if gdb also fails in the userspace .so case? I mean, if > > gdb can handle the user case with no problem, then probably we don't > > need to hack gdb like this. > > One important difference is that libc.so.6 is a ET_DYN dynamic object > file. A key difference between dynamic executables and relocatable files > is the symbol table: symbols in dynamic executables have an absolute > virtual address, and they only change based on the load address. Symbols > in relocatable files have an address which is an offset from the > beginning of their section. So if a relocation is done based on the > section address, and the section address is incorrect, then the > relocation will be wrong. > > Stephen > -- Crash-utility mailing list -- devel@lists.crash-utility.osci.io To unsubscribe send an email to devel-le...@lists.crash-utility.osci.io https://${domain_name}/admin/lists/devel.lists.crash-utility.osci.io/ Contribution Guidelines: https://github.com/crash-utility/crash/wiki