On Sunday, August 25, 2019 at 1:37:57 AM UTC-4, Waldek Kozaczuk wrote: > > > On Wednesday, August 21, 2019 at 3:22:58 AM UTC-4, Nadav Har'El wrote: >> >> Hi, thanks. I don't have satisfactory answers for everything, but I wrote >> below some answers with background information which might be helpful. >> Please read all the way to the hand - I have answers sprinkled all over. >> >> On Tue, Aug 20, 2019 at 6:25 PM Waldek Kozaczuk <[email protected]> >> wrote: >> >>> Thanks for the review. You can hopefully find some more related >>> background info in this group post from a couple of days ago - >>> https://groups.google.com/forum/#!msg/osv-dev/kmkqlZt2DRY/ijpde19DDQAJ >>> >>> On Tuesday, August 20, 2019 at 9:19:12 AM UTC-4, Nadav Har'El wrote: >>>> >>>> >>>> On Tue, Aug 20, 2019 at 7:20 AM Waldemar Kozaczuk <[email protected]> >>>> wrote: >>>> >>>>> The commit >>>>> https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb >>>>> addressed the issue #993 to relax handling of missing symbols in >>>>> relocate_pltgot() >>>>> when loading ELF objects with BIND_NOW. This patch on other hand >>>>> attempts to close >>>>> the gap of handling missing symbols this time in relocate_rela() >>>>> during loading phase >>>>> of ELF objects by dynamic linker. >>>>> >>>> >>>> Do you have an understanding *why* issue #1023 is different from >>>> anything we've seen >>>> before, where the function is relocated by relocate_rela() and not >>>> relocate_pltgot()? >>>> My guess was that this function is somehow accessed as a *variable* >>>> (e.g., its pointer is >>>> copied or something) instead of as a real function call, but this was >>>> just a guess... >>>> >>> >>> Fixing #1023 was not a priority to me until I discovered it was >>> preventing me from running ffmpeg ( >>> https://github.com/cloudius-systems/osv-apps/tree/master/ffmpeg) that I >>> am helping someone else to run. This app requires compiling from source and >>> it used to work at some point. At first, it would fail with missing >>> *versionsort* symbol. After I added this, it would fail with a missing >>> symbol from the family of posix_spawnattr_* functions. This rang a bell >>> and this is how it led me to the issue 1023. After making simple >>> modification like in an original RFC patch (see same mailing link) the app >>> would happily work. As a matter of fact, I was able to run a version of >>> ffmpeg from host. >>> >>> Honestly, I do not understand why this app, in this case, needs symbols >>> resolution processed through relocate_rela() vs relocate_pltgot(). >>> Actually, that is why I was asking the question about R_X86_64_GLOB_DAT >>> which I would think should apply to variables, not functions. But I think >>> you guess about being function pointers is a good one. Could it be >>> different compiler flags? >>> >> >> What bothers me is that I don't know... One of the lines you showed was: >> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_destroy >> of type: 6 >> >> So I went and took a look at the source code of librsvg, and it doesn't >> seem to use posix_spawnattr_destroy at all, let alone use it in some >> strange way like assume it is a variable or something... >> > > I dug a little deeper and it turns out librsvg was rewritten couple of > years to use Rust inside but still provide C interfaces - see > https://github.com/GNOME/librsvg/tree/8c5587d7c7e86cbfea1f3b14420d4bdba1f14d31/rsvg_internals/src > and > couple of presentations from main Readme. The Rust code seems to be using > libc crate - https://github.com/rust-lang/libc - which seems to be > providing Rust placeholders to all libc symbols - > https://github.com/rust-lang/libc/blob/692dc544dbaddf660d1aef37bf1846d714e3f7cf/src/unix/linux_like/linux/mod.rs#L2717 > including > the one you were looking for in librsvg code and could not find. I have > very little experience with Rust and this specific libc crate but I am > speculating that any Rust app/library that depends on libc gets all its > symbols whether they are used or NOT. This kind of is similar o the symbols > you originally reported missing in the issue 1023 which was about trying to > run the httpserver in Rust. > Another explanation is that librsvg is linked as dylib and it should as cdylib. Per this link - https://doc.rust-lang.org/edition-guide/rust-2018/platform-and-target-support/cdylib-crates-for-c-interoperability.html - dylib has extra stuff that cdylib linked lib does not.
> >> So I'm still at a complete loss what really happens that we're trying to >> solve, or why we only started to see this problem now, after running dozens >> of different executables without this problem. >> >> That being said, I don't dispute you are solving a real problem that >> really prevents you from running things, so I won't refuse to commit this, >> even if we don't fully understand all the details. >> > > Just like with the solution you proposed in > https://github.com/cloudius-systems/osv/issues/993 to delay resolution of > missing symbols in the hope the will not be used, I am trying to fill > similar gap and let more *unmodified binaries from Linux host *run on > OSv. Except in the case of 1023, where missing symbols are processed in > "rela" phase, we cannot rely on a similar mechanism like > elf_resolve_pltgot to catch symbol later when actually used. But > similarly, as with 993 I want to ignore symbols identified as missing > during rela phase with hope they will not be actually used and create a > mechanism that would detect that missing symbol was actually used. > >> >>>> >>>>> The solution is similar to what we do in relocate_pltgot() >>>>> - let missing symbol get ignored and hope they will not be used. >>>>> However in this case we cannot rely on lazy symbol resolution >>>>> logic that would catch missing symbol later when used and abort. >>>>> Instead we place an indicator of missing symbol (special invalid >>>>> address) >>>>> which would trigger page fault when symbol accessed. This indicator >>>>> allows >>>>> us to distinguish missing symbol used scenario from other page fault >>>>> related ones. >>>>> This unfortunately does not tell us which missing symbol was used. >>>>> In future we could place indicator + offset where offset would point >>>>> to a name of the missing symbol in some missing symbols table. >>>>> >>>>> Fixes #1023 >>>>> >>>> >>>> Did you test this on an application (like the rust example #1023) where >>>> before the >>>> application couldn't run, and now it can? >>>> >>> I did test with ffmpeg which is the original reason for this patch. I >>> did not try to recreate it with the original rust example which I should. >>> >> I did run the original version of Rust httpserver which was used to identify the issue #1023 by reverting this patch https://github.com/cloudius-systems/osv-apps/commit/6329570cc685a6ba543c2ddecfb8b7173c3b3832 or manually changing cdylib do dylib in Cargo.toml. And now it works with my patch. > >>> >>>> Or does the application really try to use >>>> the invalid pointer, and crash at that point? >>>> >>> In the ffmpeg case it never used the symbols it needed to be resolved >>> through relocate_rela(): >>> >>> /usr/lib/libmount.so.1: ignoring missing symbol versionsort of type: 6 >>> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_destroy >>> of type: 6 >>> /usr/lib/librsvg-2.so.2: ignoring missing symbol >>> posix_spawnattr_setsigmask of type: 6 >>> /usr/lib/librsvg-2.so.2: ignoring missing symbol >>> posix_spawnattr_setsigdefault of type: 6 >>> /usr/lib/librsvg-2.so.2: ignoring missing symbol >>> posix_spawn_file_actions_adddup2 of type: 6 >>> /usr/lib/librsvg-2.so.2: ignoring missing symbol >>> posix_spawn_file_actions_init of type: 6 >>> /usr/lib/librsvg-2.so.2: ignoring missing symbol >>> posix_spawn_file_actions_destroy of type: 6 >>> /usr/lib/librsvg-2.so.2: ignoring missing symbol >>> posix_spawnattr_setflags of type: 6 >>> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_init of >>> type: 6 >>> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnp of type: 6 >>> /usr/lib/libgmp.so.10: ignoring missing symbol obstack_vprintf of type: 1 >>> /usr/lib/libasound.so.2: ignoring missing symbol versionsort of type: 6 >>> /usr/lib/libnsl.so.1: ignoring missing symbol _libc_intl_domainname of >>> type: 6 >>> /usr/lib/libasyncns.so.0: ignoring missing symbol __res_query of type: 6 >>> /usr/lib/libasyncns.so.0: ignoring missing symbol __res_search of type: 6 >>> >>> So no it would not crash. Fffmpeg is a pretty complex app (toolbox for >>> all kinds of media processing use cases) and depends on many shared >>> libraries (or at least the version I compiled and the one provided as a >>> package by Fedora or Ubuntu). I guess in all the use cases I have tested >>> the particular symbols though missing were not used. I think that applies >>> to many libraries - many symbols provided but not used by application >>> actually. >>> >>> >>>> >>>> >>>> >>>>> >>>>> Signed-off-by: Waldemar Kozaczuk <[email protected]> >>>>> --- >>>>> arch/x64/arch-elf.cc | 20 ++++++++++++++++---- >>>>> arch/x64/arch-elf.hh | 2 ++ >>>>> arch/x64/mmu.cc | 4 ++++ >>>>> 3 files changed, 22 insertions(+), 4 deletions(-) >>>>> >>>>> diff --git a/arch/x64/arch-elf.cc b/arch/x64/arch-elf.cc >>>>> index ebe40996..4c513150 100644 >>>>> --- a/arch/x64/arch-elf.cc >>>>> +++ b/arch/x64/arch-elf.cc >>>>> @@ -70,16 +70,28 @@ bool object::arch_relocate_rela(u32 type, u32 sym, >>>>> void *addr, >>>>> memcpy(addr, sm.relocated_addr(), sm.size()); >>>>> break; >>>>> } >>>>> - case R_X86_64_64: >>>>> - *static_cast<void**>(addr) = symbol(sym).relocated_addr() + >>>>> addend; >>>>> + case R_X86_64_64: { >>>>> + auto _sym = symbol(sym, true); >>>>> + if (_sym.symbol) { >>>>> + *static_cast<void**>(addr) = _sym.relocated_addr() + >>>>> addend; >>>>> + } else { >>>>> + *static_cast<void**>(addr) = MISSING_SYMBOL_INDICATOR; >>>>> + } >>>>> break; >>>>> + } >>>>> >>>> The reason I set the address to MISSING_SYMBOL_INDICATOR is NOT to >>> prevent the app from crashing but to better inform why it crashes if >>> missing symbol is used later. Otherwise it would show up as a regular null >>> pointer fault. >>> >>>> case R_X86_64_RELATIVE: >>>>> *static_cast<void**>(addr) = _base + addend; >>>>> break; >>>>> case R_X86_64_JUMP_SLOT: >>>>> - case R_X86_64_GLOB_DAT: >>>>> - *static_cast<void**>(addr) = symbol(sym).relocated_addr(); >>>>> + case R_X86_64_GLOB_DAT: { >>>>> + auto _sym = symbol(sym, true); >>>>> + if (_sym.symbol) { >>>>> + *static_cast<void**>(addr) = _sym.relocated_addr(); >>>>> + } else { >>>>> + *static_cast<void**>(addr) = MISSING_SYMBOL_INDICATOR; >>>>> + } >>>>> break; >>>>> + } >>>>> // The next 3 types are intended to relocate symbols of thread >>>>> local variables >>>>> // defined with __thread modifier >>>>> // >>>>> diff --git a/arch/x64/arch-elf.hh b/arch/x64/arch-elf.hh >>>>> index 1811ceb5..b4c63967 100644 >>>>> --- a/arch/x64/arch-elf.hh >>>>> +++ b/arch/x64/arch-elf.hh >>>>> @@ -34,6 +34,8 @@ enum { >>>>> R_X86_64_IRELATIVE = 37, // word64 indirect(B + A) >>>>> }; >>>>> >>>>> +void * const MISSING_SYMBOL_INDICATOR = (void*)0xffffeeeeddddcccc; >>>>> >>>> >>>> Can you please remind me why this is an invalid pointer? Does it not >>>> have enough f's in the beginning to be a valid pointer? >>>> >>> I tried to pick something that would never be a valid pointer. Is there >>> a better address I should use? >>> >> >> Although x86_64 supports 64-bit pointers, they knew that applications >> don't really need - yet - to address the full 2^64 bits of memory. >> So the processor allows you to choose how many bits you *really* want. >> The default is 48 bits, allowing you to address 262 terabytes of memory, >> and quite enough for today's standards. This uses 4-level page tables. You >> can also choose 57 bits, and 5-level page tables ( >> https://en.wikipedia.org/wiki/Intel_5-level_paging) but OSv neither does >> this nor want it. >> >> When 48-bit addresses are supported, valid addresses are so-called >> "canonical" addresses, where all the highest bits cannot be the same. For >> 48 bits it means the last legal positive addresses is 0x00007FFFFFFFFFFF, >> while the most negative address is 0xFFFF800000000000. If you take an >> address which is not canonical in that sense , for example, >> 0xF000000000000000, then this address is completely illegal. If you take an >> address which is canonical, but simply not in the page table, you get a >> regular page fault - this is what happens with the "0" address. But you >> need to make sure you never allocate it. >> >> You chose 0xffffeeeeddddcccc. This is in fact a legal canonical address. >> In it may be possible that our malloc() (see mempool.cc) allocates it. >> Whether it does, I don't know. >> What happens if you use a non-canonical address? I'm not even sure if you >> get the normal #PF or something else like #GP. >> > Do you suggest to use a different address then? Or maybe a different > mechanism? The bottom line is that if I only ignore the missing symbol in > *arch_relocate_rela* and unfortunately it does get used later, the user > will get regular "0" address page fault in most cases which would not tell > him that missing symbol was used in such case. Thoughts? > >> >> >>>> + >>>>> /* for pltgot relocation */ >>>>> #define ARCH_JUMP_SLOT R_X86_64_JUMP_SLOT >>>>> >>>>> diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc >>>>> index 2f1ba5e2..441b6c45 100644 >>>>> --- a/arch/x64/mmu.cc >>>>> +++ b/arch/x64/mmu.cc >>>>> @@ -6,6 +6,7 @@ >>>>> */ >>>>> >>>>> #include "arch-cpu.hh" >>>>> +#include "arch-elf.hh" >>>>> #include <osv/debug.hh> >>>>> #include <osv/sched.hh> >>>>> #include <osv/mmu.hh> >>>>> @@ -28,6 +29,9 @@ void page_fault(exception_frame *ef) >>>>> if (!pc) { >>>>> abort("trying to execute null pointer"); >>>>> } >>>>> + if (pc == MISSING_SYMBOL_INDICATOR) { >>>>> + abort("trying to execute missing symbol"); >>>>> >>>> >>>> Do you have a test case where you actually see this message? >>>> >>> Yes. I was able to manually create tests that would trigger this with >>> missing symbol scenario with R_X86_64_GLOB_DAT and R_X86_64_64 relocation >>> type. In the latter case, I used the graalvm app that uses mprotect and I >>> manually misspelled to make it not found and the message was properly >>> handled. I also conducted a similar test for R_X86_64_GLOB_DAT. >>> >> >> I didn't understand how you recreated this. Did you have a test case >> where the code really used the function relocated via R_X86_64_GLOB_DAT? >> >> >>> >>>> Because I wonder if the invalid pointer actually gets *executed* (so pc >>>> = ...) - it is also possible the pointer get followed, not executed. I >>>> think "pc" isn't the general indication of where the page fault happened. >>>> >>> Not sure I understand what you are saying here. I did happen in the 2 >>> test scenarios I ran. I see that this particular page fault logic looks at >>> RIP value which only apply to function execution, right? Do we have another >>> page fault handler when data is actually read/written to where we should >>> supply similar logic? >>> >> >> When you get a page fault, there can be two things that happened: One >> option is that the instruction tried to access some memory address which >> isn't mapped, or mapped for wrong permissions (e.g., the instruction tried >> to write into memory mapped read-only). In this case, we have the broken >> address in the cr2 register. Another option is that the *instruction* >> itself could not be executed - because the current pc (program counter) >> points to memory not mapped as executable - or - often - pc is 0 means that >> someone tried to *execute* the null pointer. >> >> Because this last case happens commonly, we have a special message for it >> in our page_fault() handler. But it only happens if someone tries to >> *execute* a null pointer. If you try to read or write from a null pointer, >> you won't get that message. >> >> Similarly, your test for pc will only catch trying to executing this fake >> address - not reading or writing for it. To catch the latter you would need >> to also check "addr" (i.e., cr2). >> >> In all the examples we discussed so far, these relocations were used for >> functions, so presumably if ever used, these addresses will indeed be >> *executed*, so your test would be good enough. But if this relocation is >> used for a variable, then this address might be read or written, not just >> executed. Moreover, if the address is "close" (e.g., same page) as the fake >> address, it is probably an indication that someone tried to read or write a >> field inside that variable, instead of its first byte. >> >> >>>> // The following code may sleep. So let's verify the fault did not >>>>> happen >>>>> // when preemption was disabled, or interrupts were disabled. >>>>> assert(sched::preemptable()); >>>>> -- >>>>> 2.20.1 >>>>> >>>>> -- >>>>> You received this message because you are subscribed to the Google >>>>> Groups "OSv Development" group. >>>>> To unsubscribe from this group and stop receiving emails from it, send >>>>> an email to [email protected]. >>>>> To view this discussion on the web visit >>>>> https://groups.google.com/d/msgid/osv-dev/20190820042043.25133-1-jwkozaczuk%40gmail.com >>>>> . >>>>> >>>> -- >>> You received this message because you are subscribed to the Google >>> Groups "OSv Development" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> To view this discussion on the web visit >>> https://groups.google.com/d/msgid/osv-dev/3a4fe9b0-e2a1-46fc-8982-b8a06fee2be3%40googlegroups.com >>> >>> <https://groups.google.com/d/msgid/osv-dev/3a4fe9b0-e2a1-46fc-8982-b8a06fee2be3%40googlegroups.com?utm_medium=email&utm_source=footer> >>> . >>> >> -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/a99534fb-12a8-49a4-8850-633aff62be12%40googlegroups.com.
