On Sunday, August 25, 2019 at 1:37:57 AM UTC-4, Waldek Kozaczuk wrote:
>
>
> On Wednesday, August 21, 2019 at 3:22:58 AM UTC-4, Nadav Har'El wrote:
>>
>> Hi, thanks. I don't have satisfactory answers for everything, but I wrote 
>> below some answers with background information which might be helpful. 
>> Please read all the way to the hand - I have answers sprinkled all over.
>>
>> On Tue, Aug 20, 2019 at 6:25 PM Waldek Kozaczuk <[email protected]> 
>> wrote:
>>
>>> Thanks for the review. You can hopefully find some more related 
>>> background info in this group post from a couple of days ago - 
>>> https://groups.google.com/forum/#!msg/osv-dev/kmkqlZt2DRY/ijpde19DDQAJ
>>>
>>> On Tuesday, August 20, 2019 at 9:19:12 AM UTC-4, Nadav Har'El wrote:
>>>>
>>>>
>>>> On Tue, Aug 20, 2019 at 7:20 AM Waldemar Kozaczuk <[email protected]> 
>>>> wrote:
>>>>
>>>>> The commit 
>>>>> https://github.com/cloudius-systems/osv/commit/f5cc12d56dd986d2c7982c5738b1f859702b07fb
>>>>> addressed the issue #993 to relax handling of missing symbols in 
>>>>> relocate_pltgot()
>>>>> when loading ELF objects with BIND_NOW. This patch on other hand 
>>>>> attempts to close
>>>>> the gap of handling missing symbols this time in relocate_rela() 
>>>>> during loading phase
>>>>> of ELF objects by dynamic linker.
>>>>>
>>>>
>>>> Do you have an understanding *why* issue #1023 is different from 
>>>> anything we've seen
>>>> before, where the function is relocated by relocate_rela() and not 
>>>> relocate_pltgot()?
>>>> My guess was that this function is somehow accessed as a *variable* 
>>>> (e.g., its pointer is
>>>> copied or something) instead of as a real function call, but this was 
>>>> just a guess...
>>>>
>>>
>>> Fixing #1023 was not a priority to me until I discovered it was 
>>> preventing me from running ffmpeg (
>>> https://github.com/cloudius-systems/osv-apps/tree/master/ffmpeg) that I 
>>> am helping someone else to run. This app requires compiling from source and 
>>> it used to work at some point. At first, it would fail with missing 
>>> *versionsort* symbol. After I added this, it would fail with a missing 
>>> symbol from the family of posix_spawnattr_* functions. This rang a bell 
>>> and this is how it led me to the issue 1023. After making simple 
>>> modification like in an original RFC patch (see same mailing link) the app 
>>> would happily work. As a matter of fact, I was able to run a version of 
>>> ffmpeg from host. 
>>>
>>> Honestly, I do not understand why this app, in this case, needs symbols 
>>> resolution processed through relocate_rela() vs relocate_pltgot(). 
>>> Actually, that is why I was asking the question about R_X86_64_GLOB_DAT 
>>> which I would think should apply to variables, not functions. But I think 
>>> you guess about being function pointers is a good one. Could it be 
>>> different compiler flags?
>>>
>>
>> What bothers me is that I don't know... One of the lines you showed was:
>> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_destroy 
>> of type: 6
>>
>> So I went and took a look at the source code of librsvg, and it doesn't 
>> seem to use posix_spawnattr_destroy at all, let alone use it in some 
>> strange way like assume it is a variable or something...
>>
>  
> I dug a little deeper and it turns out librsvg was rewritten couple of 
> years to use Rust inside but still provide C interfaces - see 
> https://github.com/GNOME/librsvg/tree/8c5587d7c7e86cbfea1f3b14420d4bdba1f14d31/rsvg_internals/src
>  and 
> couple of presentations from main Readme. The Rust code seems to be using 
> libc crate - https://github.com/rust-lang/libc - which seems to be 
> providing Rust placeholders to all libc symbols -  
> https://github.com/rust-lang/libc/blob/692dc544dbaddf660d1aef37bf1846d714e3f7cf/src/unix/linux_like/linux/mod.rs#L2717
>  including 
> the one you were looking for in librsvg code and could not find. I have 
> very little experience with Rust and this specific libc crate but I am 
> speculating that any Rust app/library that depends on libc gets all its 
> symbols whether they are used or NOT. This kind of is similar o the symbols 
> you originally reported missing in the issue 1023 which was about trying to 
> run the httpserver in Rust.
>
Another explanation is that librsvg is linked as dylib and it should as 
cdylib. Per this link - 
https://doc.rust-lang.org/edition-guide/rust-2018/platform-and-target-support/cdylib-crates-for-c-interoperability.html
 - 
dylib has extra stuff that cdylib linked lib does not. 

>
>> So I'm still at a complete loss what really happens that we're trying to 
>> solve, or why we only started to see this problem now, after running dozens 
>> of different executables without this problem.
>>
>> That being said, I don't dispute you are solving a real problem that 
>> really prevents you from running things, so I won't refuse to commit this, 
>> even if we don't fully understand all the details. 
>>
>
> Just like with the solution you proposed in 
> https://github.com/cloudius-systems/osv/issues/993 to delay resolution of 
> missing symbols in the hope the will not be used, I am trying to fill 
> similar gap and let more *unmodified binaries from Linux host *run on 
> OSv. Except in the case of 1023, where missing symbols are processed in 
> "rela" phase, we cannot rely on a similar mechanism like 
> elf_resolve_pltgot to catch symbol later when actually used. But 
> similarly, as with 993 I want to ignore symbols identified as missing 
> during rela phase with hope they will not be actually used and create a 
> mechanism that would detect that missing symbol was actually used.
>
>>
>>>>
>>>>> The solution is similar to what we do in relocate_pltgot()
>>>>> - let missing symbol get ignored and hope they will not be used.
>>>>> However in this case we cannot rely on lazy symbol resolution
>>>>> logic that would catch missing symbol later when used and abort.
>>>>> Instead we place an indicator of missing symbol (special invalid 
>>>>> address)
>>>>> which would trigger page fault when symbol accessed. This indicator 
>>>>> allows
>>>>> us to distinguish missing symbol used scenario from other page fault 
>>>>> related ones.
>>>>> This unfortunately does not tell us which missing symbol was used.
>>>>> In future we could place indicator + offset where offset would point
>>>>> to a name of the missing symbol in some missing symbols table.
>>>>>
>>>>> Fixes #1023
>>>>>
>>>>
>>>> Did you test this on an application (like the rust example #1023) where 
>>>> before the
>>>> application couldn't run, and now it can? 
>>>>
>>> I did test with ffmpeg which is the original reason for this patch. I 
>>> did not try to recreate it with the original rust example which I should. 
>>>
>> I did run the original version of Rust httpserver which was used to 
identify the issue #1023 by reverting this patch 
https://github.com/cloudius-systems/osv-apps/commit/6329570cc685a6ba543c2ddecfb8b7173c3b3832
 or 
manually changing cdylib do dylib in Cargo.toml. And now it works with my 
patch.

>  
>>>
>>>> Or does the application really try to use
>>>> the invalid pointer, and crash at that point?
>>>>
>>> In the ffmpeg case it never used the symbols it needed to be resolved 
>>> through relocate_rela():
>>>
>>> /usr/lib/libmount.so.1: ignoring missing symbol versionsort of type: 6
>>> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_destroy 
>>> of type: 6
>>> /usr/lib/librsvg-2.so.2: ignoring missing symbol 
>>> posix_spawnattr_setsigmask of type: 6
>>> /usr/lib/librsvg-2.so.2: ignoring missing symbol 
>>> posix_spawnattr_setsigdefault of type: 6
>>> /usr/lib/librsvg-2.so.2: ignoring missing symbol 
>>> posix_spawn_file_actions_adddup2 of type: 6
>>> /usr/lib/librsvg-2.so.2: ignoring missing symbol 
>>> posix_spawn_file_actions_init of type: 6
>>> /usr/lib/librsvg-2.so.2: ignoring missing symbol 
>>> posix_spawn_file_actions_destroy of type: 6
>>> /usr/lib/librsvg-2.so.2: ignoring missing symbol 
>>> posix_spawnattr_setflags of type: 6
>>> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnattr_init of 
>>> type: 6
>>> /usr/lib/librsvg-2.so.2: ignoring missing symbol posix_spawnp of type: 6
>>> /usr/lib/libgmp.so.10: ignoring missing symbol obstack_vprintf of type: 1
>>> /usr/lib/libasound.so.2: ignoring missing symbol versionsort of type: 6
>>> /usr/lib/libnsl.so.1: ignoring missing symbol _libc_intl_domainname of 
>>> type: 6
>>> /usr/lib/libasyncns.so.0: ignoring missing symbol __res_query of type: 6
>>> /usr/lib/libasyncns.so.0: ignoring missing symbol __res_search of type: 6
>>>
>>> So no it would not crash. Fffmpeg is a pretty complex app (toolbox for 
>>> all kinds of media processing use cases) and depends on many shared 
>>> libraries (or at least the version I compiled and the one provided as a 
>>> package by Fedora or Ubuntu). I guess in all the use cases I have tested 
>>> the particular symbols though missing were not used. I think that applies 
>>> to many libraries - many symbols provided but not used by application 
>>> actually.
>>>
>>>
>>>>
>>>>  
>>>>
>>>>>
>>>>> Signed-off-by: Waldemar Kozaczuk <[email protected]>
>>>>> ---
>>>>>  arch/x64/arch-elf.cc | 20 ++++++++++++++++----
>>>>>  arch/x64/arch-elf.hh |  2 ++
>>>>>  arch/x64/mmu.cc      |  4 ++++
>>>>>  3 files changed, 22 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/arch/x64/arch-elf.cc b/arch/x64/arch-elf.cc
>>>>> index ebe40996..4c513150 100644
>>>>> --- a/arch/x64/arch-elf.cc
>>>>> +++ b/arch/x64/arch-elf.cc
>>>>> @@ -70,16 +70,28 @@ bool object::arch_relocate_rela(u32 type, u32 sym, 
>>>>> void *addr,
>>>>>          memcpy(addr, sm.relocated_addr(), sm.size());
>>>>>          break;
>>>>>      }
>>>>> -    case R_X86_64_64:
>>>>> -        *static_cast<void**>(addr) = symbol(sym).relocated_addr() + 
>>>>> addend;
>>>>> +    case R_X86_64_64: {
>>>>> +        auto _sym = symbol(sym, true);
>>>>> +        if (_sym.symbol) {
>>>>> +            *static_cast<void**>(addr) = _sym.relocated_addr() + 
>>>>> addend;
>>>>> +        } else {
>>>>> +            *static_cast<void**>(addr) = MISSING_SYMBOL_INDICATOR;
>>>>> +        }
>>>>>          break;
>>>>> +    }
>>>>>
>>>> The reason I set the address to MISSING_SYMBOL_INDICATOR  is NOT to 
>>> prevent the app from crashing but to better inform why it crashes if 
>>> missing symbol is used later. Otherwise it would show up as a regular null 
>>> pointer fault.
>>>
>>>>      case R_X86_64_RELATIVE:
>>>>>          *static_cast<void**>(addr) = _base + addend;
>>>>>          break;
>>>>>      case R_X86_64_JUMP_SLOT:
>>>>> -    case R_X86_64_GLOB_DAT:
>>>>> -        *static_cast<void**>(addr) = symbol(sym).relocated_addr();
>>>>> +    case R_X86_64_GLOB_DAT: {
>>>>> +        auto _sym = symbol(sym, true);
>>>>> +        if (_sym.symbol) {
>>>>> +            *static_cast<void**>(addr) = _sym.relocated_addr();
>>>>> +        } else {
>>>>> +            *static_cast<void**>(addr) = MISSING_SYMBOL_INDICATOR;
>>>>> +        }
>>>>>          break;
>>>>> +    }
>>>>>      // The next 3 types are intended to relocate symbols of thread 
>>>>> local variables
>>>>>      // defined with __thread modifier
>>>>>      //
>>>>> diff --git a/arch/x64/arch-elf.hh b/arch/x64/arch-elf.hh
>>>>> index 1811ceb5..b4c63967 100644
>>>>> --- a/arch/x64/arch-elf.hh
>>>>> +++ b/arch/x64/arch-elf.hh
>>>>> @@ -34,6 +34,8 @@ enum {
>>>>>      R_X86_64_IRELATIVE = 37, //  word64 indirect(B + A)
>>>>>  };
>>>>>
>>>>> +void * const MISSING_SYMBOL_INDICATOR = (void*)0xffffeeeeddddcccc;
>>>>>
>>>>
>>>> Can you please remind me why this is an invalid pointer? Does it not 
>>>> have enough f's in the beginning to be a valid pointer?
>>>>
>>> I tried to pick something that would never be a valid pointer. Is there 
>>> a better address I should use? 
>>>
>>
>> Although x86_64 supports 64-bit pointers, they knew that applications 
>> don't really need - yet - to address the full 2^64 bits of memory.
>> So the processor allows you to choose how many bits you *really* want. 
>> The default is 48 bits, allowing you to address 262 terabytes of memory, 
>> and quite enough for today's standards. This uses 4-level page tables. You 
>> can also choose 57 bits, and 5-level page tables (
>> https://en.wikipedia.org/wiki/Intel_5-level_paging) but OSv neither does 
>> this nor want it.
>>
>> When 48-bit addresses are supported, valid addresses are so-called 
>> "canonical" addresses, where all the highest bits cannot be the same. For 
>> 48 bits it means the last legal positive addresses is 0x00007FFFFFFFFFFF, 
>> while the most negative address is 0xFFFF800000000000. If you take an 
>> address which is not canonical in that sense , for example, 
>> 0xF000000000000000, then this address is completely illegal. If you take an 
>> address which is canonical, but simply not in the page table, you get a 
>> regular page fault - this is what happens with the "0" address. But you 
>> need to make sure you never allocate it.
>>
>> You chose 0xffffeeeeddddcccc. This is in fact a legal canonical address. 
>> In it may be possible that our malloc() (see mempool.cc) allocates it. 
>> Whether it does, I don't know.
>> What happens if you use a non-canonical address? I'm not even sure if you 
>> get the normal #PF or something else like #GP. 
>>
> Do you suggest to use a different address then? Or maybe a different 
> mechanism? The bottom line is that if I only ignore the missing symbol in 
> *arch_relocate_rela* and unfortunately it does get used later, the user 
> will get regular "0" address page fault in most cases which would not tell 
> him that missing symbol was used in such case. Thoughts?
>
>>
>>
>>>> +
>>>>>  /* for pltgot relocation */
>>>>>  #define ARCH_JUMP_SLOT R_X86_64_JUMP_SLOT
>>>>>
>>>>> diff --git a/arch/x64/mmu.cc b/arch/x64/mmu.cc
>>>>> index 2f1ba5e2..441b6c45 100644
>>>>> --- a/arch/x64/mmu.cc
>>>>> +++ b/arch/x64/mmu.cc
>>>>> @@ -6,6 +6,7 @@
>>>>>   */
>>>>>
>>>>>  #include "arch-cpu.hh"
>>>>> +#include "arch-elf.hh"
>>>>>  #include <osv/debug.hh>
>>>>>  #include <osv/sched.hh>
>>>>>  #include <osv/mmu.hh>
>>>>> @@ -28,6 +29,9 @@ void page_fault(exception_frame *ef)
>>>>>      if (!pc) {
>>>>>          abort("trying to execute null pointer");
>>>>>      }
>>>>> +    if (pc == MISSING_SYMBOL_INDICATOR) {
>>>>> +        abort("trying to execute missing symbol");
>>>>>
>>>>
>>>> Do you have a test case where you actually see this message?
>>>>
>>> Yes. I was able to manually create tests that would trigger this with 
>>> missing symbol scenario with R_X86_64_GLOB_DAT and R_X86_64_64 relocation 
>>> type. In the latter case, I used the graalvm app that uses mprotect and I 
>>> manually misspelled to make it not found and the message was properly 
>>> handled. I also conducted a similar test for R_X86_64_GLOB_DAT.
>>>
>>
>> I didn't understand how you recreated this. Did you have a test case 
>> where the code really used the function relocated via R_X86_64_GLOB_DAT?
>>
>>  
>>>
>>>> Because I wonder if the invalid pointer actually gets *executed* (so pc 
>>>> = ...) - it is also possible the pointer get followed, not executed. I 
>>>> think "pc" isn't the general indication of where the page fault happened.
>>>>
>>> Not sure I understand what you are saying here. I did happen in the 2 
>>> test scenarios I ran. I see that this particular page fault logic looks at 
>>> RIP value which only apply to function execution, right? Do we have another 
>>> page fault handler when data is actually read/written to where we should 
>>> supply similar logic?
>>>
>>
>> When you get a page fault, there can be two things that happened: One 
>> option is that the instruction tried to access some memory address which 
>> isn't mapped, or mapped for wrong permissions (e.g., the instruction tried 
>> to write into memory mapped read-only). In this case, we have the broken 
>> address in the cr2 register. Another option is that the *instruction* 
>> itself could not be executed - because the current pc (program counter) 
>> points to memory not mapped as executable - or - often - pc is 0 means that 
>> someone tried to *execute* the null pointer.
>>
>> Because this last case happens commonly, we have a special message for it 
>> in our page_fault() handler. But it only happens if someone tries to 
>> *execute* a null pointer. If you try to read or write from a null pointer, 
>> you won't get that message.
>>
>> Similarly, your test for pc will only catch trying to executing this fake 
>> address - not reading or writing for it. To catch the latter you would need 
>> to also check "addr" (i.e., cr2). 
>>
>> In all the examples we discussed so far, these relocations were used for 
>> functions, so presumably if ever used, these addresses will indeed be 
>> *executed*, so your test would be good enough. But if this relocation is 
>> used for a variable, then this address might be read or written, not just 
>> executed. Moreover, if the address is "close" (e.g., same page) as the fake 
>> address, it is probably an indication that someone tried to read or write a 
>> field inside that variable, instead of its first byte.
>>
>>
>>>>      // The following code may sleep. So let's verify the fault did not 
>>>>> happen
>>>>>      // when preemption was disabled, or interrupts were disabled.
>>>>>      assert(sched::preemptable());
>>>>> -- 
>>>>> 2.20.1
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "OSv Development" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/osv-dev/20190820042043.25133-1-jwkozaczuk%40gmail.com
>>>>> .
>>>>>
>>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "OSv Development" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/osv-dev/3a4fe9b0-e2a1-46fc-8982-b8a06fee2be3%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/osv-dev/3a4fe9b0-e2a1-46fc-8982-b8a06fee2be3%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/a99534fb-12a8-49a4-8850-633aff62be12%40googlegroups.com.

Reply via email to