To add to this: the bottom line is that this smaller version of the patch:
diff --git a/arch/x64/arch-elf.cc b/arch/x64/arch-elf.cc
index ebe40996..4c513150 100644
--- a/arch/x64/arch-elf.cc
+++ b/arch/x64/arch-elf.cc
@@ -70,16 +70,28 @@ bool object::arch_relocate_rela(u32 type, u32 sym, void
*addr,
memcpy(addr, sm.relocated_addr(), sm.size());
break;
}
- case R_X86_64_64:
- *static_cast<void**>(addr) = symbol(sym).relocated_addr() + addend;
+ case R_X86_64_64: {
+ auto _sym = symbol(sym, true);
+ if (_sym.symbol) {
+ *static_cast<void**>(addr) = _sym.relocated_addr() + addend;
+ }
break;
+ }
case R_X86_64_RELATIVE:
*static_cast<void**>(addr) = _base + addend;
break;
case R_X86_64_JUMP_SLOT:
- case R_X86_64_GLOB_DAT:
- *static_cast<void**>(addr) = symbol(sym).relocated_addr();
+ case R_X86_64_GLOB_DAT: {
+ auto _sym = symbol(sym, true);
+ if (_sym.symbol) {
+ *static_cast<void**>(addr) = _sym.relocated_addr();
+ }
break;
+ }
// The next 3 types are intended to relocate symbols of thread local
variables
// defined with __thread modifier
//
would suffice to make ffmpeg run.
The remaining part is to better inform the user if such a missing symbol is
ACTUALLY used.
On Tuesday, August 20, 2019 at 11:25:13 AM UTC-4, Waldek Kozaczuk wrote:
>
> Thanks for the review. You can hopefully find some more related background
> info in this group post from a couple of days ago -
> https://groups.google.com/forum/#!msg/osv-dev/kmkqlZt2DRY/ijpde19DDQAJ
>
> On Tuesday, August 20, 2019 at 9:19:12 AM UTC-4, Nadav Har'El wrote:
>>
>>
>> On Tue, Aug 20, 2019 at 7:20 AM Waldemar Kozaczuk <[email protected]>
>> wrote:
>>
>>> The commit
>>> https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb
>>> addressed the issue #993 to relax handling of missing symbols in
>>> relocate_pltgot()
>>> when loading ELF objects with BIND_NOW. This patch on other hand
>>> attempts to close
>>> the gap of handling missing symbols this time in relocate_rela() during
>>> loading phase
>>> of ELF objects by dynamic linker.
>>>
>>
>> Do you have an understanding *why* issue #1023 is different from anything
>> we've seen
>> before, where the function is relocated by relocate_rela() and not
>> relocate_pltgot()?
>> My guess was that this function is somehow accessed as a *variable*
>> (e.g., its pointer is
>> copied or something) instead of as a real function call, but this was
>> just a guess...
>>
>
> Fixing #1023 was not a priority to me until I discovered it was preventing
> me from running ffmpeg (
> https://github.com/cloudius-systems/osv-apps/tree/master/ffmpeg) that I
> am helping someone else to run. This app requires compiling from source and
> it used to work at some point. At first, it would fail with missing
> *versionsort* symbol. After I added this, it would fail with a missing
> symbol from the family of posix_spawnattr_* functions. This rang a bell
> and this is how it led me to the issue 1023. After making simple
> modification like in an original RFC patch (see same mailing link) the app
> would happily work. As a matter of fact, I was able to run a version of
> ffmpeg from host.
>
> Honestly, I do not understand why this app, in this case, needs symbols
> resolution processed through relocate_rela() vs relocate_pltgot().
> Actually, that is why I was asking the question about R_X86_64_GLOB_DAT
> which I would think should apply to variables, not functions. But I think
> you guess about being function pointers is a good one. Could it be
> different compiler flags?
>
>>
>>
>>> The solution is similar to what we do in relocate_pltgot()
>>> - let missing symbol get ignored and hope they will not be used.
>>> However in this case we cannot rely on lazy symbol resolution
>>> logic that would catch missing symbol later when used and abort.
>>> Instead we place an indicator of missing symbol (special invalid address)
>>> which would trigger page fault when symbol accessed. This indicator
>>> allows
>>> us to distinguish missing symbol used scenario from other page fault
>>> related ones.
>>> This unfortunately does not tell us which missing symbol was used.
>>> In future we could place indicator + offset where offset would point
>>> to a name of the missing symbol in some missing symbols table.
>>>
>>> Fixes #1023
>>>
>>
>> Did you test this on an application (like the rust example #1023) where
>> before the
>> application couldn't run, and now it can?
>>
> I did test with ffmpeg which is the original reason for this patch. I did
> not try to recreate it with the original rust example which I should.
>
>
>> Or does the application really try to use
>> the invalid pointer, and crash at that point?
>>
> In the ffmpeg case it never used the symbols it needed to be resolved
> through relocate_rela():
>
> /usr/lib/libmount.so.1: ignoring missing symbol versionsort of type: 6
> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_destroy
> of type: 6
> /usr/lib/librsvg-2.so.2: ignoring missing symbol
> posix_spawnattr_setsigmask of type: 6
> /usr/lib/librsvg-2.so.2: ignoring missing symbol
> posix_spawnattr_setsigdefault of type: 6
> /usr/lib/librsvg-2.so.2: ignoring missing symbol
> posix_spawn_file_actions_adddup2 of type: 6
> /usr/lib/librsvg-2.so.2: ignoring missing symbol
> posix_spawn_file_actions_init of type: 6
> /usr/lib/librsvg-2.so.2: ignoring missing symbol
> posix_spawn_file_actions_destroy of type: 6
> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_setflags
> of type: 6
> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_init of
> type: 6
> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnp of type: 6
> /usr/lib/libgmp.so.10: ignoring missing symbol obstack_vprintf of type: 1
> /usr/lib/libasound.so.2: ignoring missing symbol versionsort of type: 6
> /usr/lib/libnsl.so.1: ignoring missing symbol _libc_intl_domainname of
> type: 6
> /usr/lib/libasyncns.so.0: ignoring missing symbol __res_query of type: 6
> /usr/lib/libasyncns.so.0: ignoring missing symbol __res_search of type: 6
>
> So no it would not crash. Fffmpeg is a pretty complex app (toolbox for all
> kinds of media processing use cases) and depends on many shared libraries
> (or at least the version I compiled and the one provided as a package by
> Fedora or Ubuntu). I guess in all the use cases I have tested the
> particular symbols though missing were not used. I think that applies to
> many libraries - many symbols provided but not used by application actually.
>
>
>>
>>
>>
>>>
>>> Signed-off-by: Waldemar Kozaczuk <[email protected]>
>>> ---
>>> arch/x64/arch-elf.cc | 20 ++++++++++++++++----
>>> arch/x64/arch-elf.hh | 2 ++
>>> arch/x64/mmu.cc | 4 ++++
>>> 3 files changed, 22 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/arch/x64/arch-elf.cc b/arch/x64/arch-elf.cc
>>> index ebe40996..4c513150 100644
>>> --- a/arch/x64/arch-elf.cc
>>> +++ b/arch/x64/arch-elf.cc
>>> @@ -70,16 +70,28 @@ bool object::arch_relocate_rela(u32 type, u32 sym,
>>> void *addr,
>>> memcpy(addr, sm.relocated_addr(), sm.size());
>>> break;
>>> }
>>> - case R_X86_64_64:
>>> - *static_cast<void**>(addr) = symbol(sym).relocated_addr() +
>>> addend;
>>> + case R_X86_64_64: {
>>> + auto _sym = symbol(sym, true);
>>> + if (_sym.symbol) {
>>> + *static_cast<void**>(addr) = _sym.relocated_addr() + addend;
>>> + } else {
>>> + *static_cast<void**>(addr) = MISSING_SYMBOL_INDICATOR;
>>> + }
>>> break;
>>> + }
>>>
>> The reason I set the address to MISSING_SYMBOL_INDICATOR is NOT to
> prevent the app from crashing but to better inform why it crashes if
> missing symbol is used later. Otherwise it would show up as a regular null
> pointer fault.
>
>> case R_X86_64_RELATIVE:
>>> *static_cast<void**>(addr) = _base + addend;
>>> break;
>>> case R_X86_64_JUMP_SLOT:
>>> - case R_X86_64_GLOB_DAT:
>>> - *static_cast<void**>(addr) = symbol(sym).relocated_addr();
>>> + case R_X86_64_GLOB_DAT: {
>>> + auto _sym = symbol(sym, true);
>>> + if (_sym.symbol) {
>>> + *static_cast<void**>(addr) = _sym.relocated_addr();
>>> + } else {
>>> + *static_cast<void**>(addr) = MISSING_SYMBOL_INDICATOR;
>>> + }
>>> break;
>>> + }
>>> // The next 3 types are intended to relocate symbols of thread
>>> local variables
>>> // defined with __thread modifier
>>> //
>>> diff --git a/arch/x64/arch-elf.hh b/arch/x64/arch-elf.hh
>>> index 1811ceb5..b4c63967 100644
>>> --- a/arch/x64/arch-elf.hh
>>> +++ b/arch/x64/arch-elf.hh
>>> @@ -34,6 +34,8 @@ enum {
>>> R_X86_64_IRELATIVE = 37, // word64 indirect(B + A)
>>> };
>>>
>>> +void * const MISSING_SYMBOL_INDICATOR = (void*)0xffffeeeeddddcccc;
>>>
>>
>> Can you please remind me why this is an invalid pointer? Does it not have
>> enough f's in the beginning to be a valid pointer?
>>
> I tried to pick something that would never be a valid pointer. Is there a
> better address I should use?
>
>>
>> +
>>> /* for pltgot relocation */
>>> #define ARCH_JUMP_SLOT R_X86_64_JUMP_SLOT
>>>
>>> diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
>>> index 2f1ba5e2..441b6c45 100644
>>> --- a/arch/x64/mmu.cc
>>> +++ b/arch/x64/mmu.cc
>>> @@ -6,6 +6,7 @@
>>> */
>>>
>>> #include "arch-cpu.hh"
>>> +#include "arch-elf.hh"
>>> #include <osv/debug.hh>
>>> #include <osv/sched.hh>
>>> #include <osv/mmu.hh>
>>> @@ -28,6 +29,9 @@ void page_fault(exception_frame *ef)
>>> if (!pc) {
>>> abort("trying to execute null pointer");
>>> }
>>> + if (pc == MISSING_SYMBOL_INDICATOR) {
>>> + abort("trying to execute missing symbol");
>>>
>>
>> Do you have a test case where you actually see this message?
>>
> Yes. I was able to manually create tests that would trigger this with
> missing symbol scenario with R_X86_64_GLOB_DAT and R_X86_64_64 relocation
> type. In the latter case, I used the graalvm app that uses mprotect and I
> manually misspelled to make it not found and the message was properly
> handled. I also conducted a similar test for R_X86_64_GLOB_DAT.
>
>
>> Because I wonder if the invalid pointer actually gets *executed* (so pc =
>> ...) - it is also possible the pointer get followed, not executed. I think
>> "pc" isn't the general indication of where the page fault happened.
>>
> Not sure I understand what you are saying here. I did happen in the 2 test
> scenarios I ran. I see that this particular page fault logic looks at RIP
> value which only apply to function execution, right? Do we have another
> page fault handler when data is actually read/written to where we should
> supply similar logic?
>
>>
>>
>> // The following code may sleep. So let's verify the fault did not
>>> happen
>>> // when preemption was disabled, or interrupts were disabled.
>>> assert(sched::preemptable());
>>> --
>>> 2.20.1
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "OSv Development" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/osv-dev/20190820042043.25133-1-jwkozaczuk%40gmail.com
>>> .
>>>
>>
--
You received this message because you are subscribed to the Google Groups "OSv
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/osv-dev/70594378-49ae-4dc1-8786-a9eb73a97267%40googlegroups.com.