The other test tst-rename.so passes most of the time but sometimes fails 
with similar stack trace. 

I wonder if this code is deterministic:
void object::unload_needed()
{
    _needed.clear();
    _used_by_resolve_plt_got.clear();
}

The _needed is a vector but _used_by_resolve_plt_got is an unordered set.

On Sunday, December 1, 2019 at 5:30:55 PM UTC-5, Waldek Kozaczuk wrote:
>
> The problem is that order in which objects are unloaded (calling 
> program::remove_object()) is completely non-deterministic and based on 
> where the last reference through shared_ptr is removed and remove_object() 
> called.
>
> We could save the order in which init functions are called by pushing 
> objects removed in init_library() on another stack to capture the reversed 
> object. But when would we act on it?
>
> On Sunday, December 1, 2019 at 5:21:13 PM UTC-5, Waldek Kozaczuk wrote:
>>
>> Here is a little bit better-improved stacktrace after I used osv load_elf 
>> t manually load unit_test object:
>>
>> osv load-elf /lib64/libboost_unit_test_framework.so.1.69.0 
>> 0x0000000000088000
>>
>> (gdb) bt
>> #0  0x00000000403a3cf2 in processor::cli_hlt () at 
>> arch/x64/processor.hh:247
>> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
>> #2  osv::halt () at arch/x64/power.cc:26
>> #3  0x0000000040239dae in abort (fmt=fmt@entry=0x4064155f "Aborted\n") at 
>> runtime.cc:132
>> #4  0x00000000402028ab in abort () at runtime.cc:98
>> #5  0x000000004021981e in osv::generate_signal (siginfo=..., 
>> ef=0xffff80000123c068) at libc/signal.cc:124
>> #6  0x0000000040463f7f in osv::handle_mmap_fault (addr=<optimized out>, 
>> sig=<optimized out>, ef=<optimized out>) at libc/signal.cc:139
>> #7  0x000000004033da62 in mmu::vm_fault (addr=17592187015168, 
>> addr@entry=17592187015648, ef=ef@entry=0xffff80000123c068) at 
>> core/mmu.cc:1337
>> #8  0x000000004039dc30 in page_fault (ef=0xffff80000123c068) at 
>> arch/x64/mmu.cc:42
>> #9  <signal handler called>
>> #10 0x00001000000ed1e0 in std::__cxx11::basic_string<char, 
>> std::char_traits<char>, std::allocator<char> >::~basic_string() ()
>> #11 0x000000004023a357 in __cxxabiv1::__cxa_finalize (dso=<optimized 
>> out>) at runtime.cc:183
>> #12 0x000010000041f5f7 in __do_global_dtors_aux ()
>> #13 0x00002000001008f0 in ?? ()
>> #14 0x0000000040352f34 in elf::object::run_fini_funcs 
>> (this=0xffffa000015d0a00) at core/elf.cc:1105
>> Backtrace stopped: frame did not save the PC
>>
>> It looks like it crashes when trying to call the destructor of a string 
>> object.
>>
>> Also here is what ELF spec (
>> https://refspecs.linuxfoundation.org/elf/elf.pdf) says about calling 
>> termination functions (FINI*) in the "Initialization and Termination 
>> Functions" section:
>> "Similarly, shared objects may have termination functions, which are 
>> executed with the atexit(BA_OS) mechanism after the base process begins its 
>> termination sequence. The *order in which the dynamic linker calls 
>> termination functions is the exact reverse order of their corresponding 
>> initialization functions*. If a shared object has a termination 
>> function, but no initialization function, the termination function will 
>> execute in the order it would have as if the shared object's initialization 
>> function was present. The dynamic linker ensures that it will not execute 
>> any initialization or termination functions more than once."
>>
>> Clearly our linker does not behave like this. Obviously I do not have 
>> 100% proof that the incorrect order in which dynamic linker executes 
>> termination functions leads to the crash we are seeing. But based on the 
>> printouts from another test that succeeds it seems likely to be a culprit.
>>
>> On Sunday, December 1, 2019 at 7:36:20 AM UTC-5, Nadav Har'El wrote:
>>>
>>>
>>> On Sat, Nov 30, 2019 at 5:32 AM Waldek Kozaczuk <[email protected]> 
>>> wrote:
>>>
>>>> I think I have solved the mystery. It looks like another bug in the 
>>>> dynamic linker. This time related to an order in which objects are 
>>>> unloaded 
>>>> and FINI* functions are executed.
>>>>
>>>
>>> I'm afraid I didn't quite follow the details below and what's wrong with 
>>> the "wrong" order or what caused it.
>>>
>>> I wonder if it's related to to issue 
>>> https://github.com/cloudius-systems/osv/issues/334 but I don't know how.
>>>
>>>  
>>>
>>>>
>>>> As you can tell in the last test boost_filesystem, FINI function got 
>>>> executed after the libboost_unit_test_framework FINI function got 
>>>> called and unloaded. If you look at the first one, the order is opposite 
>>>> and that why it works.
>>>>
>>>
>>> Why is this order wrong?
>>>
>>> If I understand correctly, our test loads (DT_NEED) 
>>> libboost_unit_test_framework 
>>> and that loads boost_filesystem
>>> So you'd expect the program::remove_object() code to first finalize 
>>> unit_test_framework, and only then unload its dependency (boost_filesystem).
>>> So it sounds the order you call wrong, is right, no? If it's not right - 
>>> then at least it was as intended ;-) And if this intent is wrong, it should 
>>> be fixed (in remove_object())
>>>
>>> But there is a complication: in modules/tests/Makefile I see our test 
>>> depends on boost_filesystem *directly*.
>>> So now the order that boost_filesystem gets unloaded may become random 
>>> (because we analyze the entire DAG for dependencies....).
>>> This may explain why in some cases we got the "opposite" order and 
>>> everything worked.
>>>  
>>>>
>>>>
>>>> I think we need to change unloading logic to first execute all FINI 
>>>> functions of all objects probably on the same thread that the INIT ones 
>>>> got 
>>>> executed and only then unload the objects.
>>>>
>>>
>>> This is an interesting point. Indeed the loading code first loads all 
>>> the necessary objects in multiple levels, and only then runs all the init 
>>> functions (in some hopefully correct object), the unloading code doesn't do 
>>> this. But I don't think there's an easy way to do this, if at all.... It's 
>>> not always true that objects get unloaded in one large batch. It can also 
>>> happen like I mentioned above in our own example - boost_test_framework 
>>> says it wants to unload boost_filesystem, BUT, it is not the last holder of 
>>> boost_filesystem because the application also dt_needed boost_filesystem 
>>> directly. So we can't unload boost_filesystem just yet... And need to delay 
>>> its unload.
>>>
>>>  
>>>
>>>> What do you think?
>>>>
>>>> On Thursday, November 28, 2019 at 3:19:57 PM UTC-5, Nadav Har'El wrote:
>>>>>
>>>>> On Thu, Nov 28, 2019 at 10:04 PM Waldek Kozaczuk <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>  
>>>>>>>>
>>>>>>>>
>>>>>>>> Please note I have removed this line from the tests Makefile:
>>>>>> -$(boost-tests:%=$(out)/tests/%): CXXFLAGS += 
>>>>>> -D_GLIBCXX_USE_CXX11_ABI=0 \
>>>>>> I do not fully understand the significance of it nor the reasons for 
>>>>>> it in the first place (why was it there). Could it cause this issue?
>>>>>>
>>>>>
>>>>> This line was needed because we try to combine newly combined test 
>>>>> code and some super-antique version of Boost from external/.
>>>>> It is explained in commit 6a3bff38a281e65ee715bab4fadef63e0918f7d3. 
>>>>> You shouldn't need it if you stop using the antique boost.
>>>>>  
>>>>>
>>>>>> BTW I also now noticed this linker warning (not sure if related):
>>>>>>   LIBOSV.SO
>>>>>>   STRIP loader.elf -> loader-stripped.elf 
>>>>>> strip: 
>>>>>> build/release.x64/loader.elf[.gnu.build.attributes.text._ZN5boost6system23dummy_exported_functionEv]:
>>>>>>  
>>>>>> Warning: version note missing - assuming version 3
>>>>>> strip:build/release.x64/loader-stripped.elf[.gnu.build.attributes.unlikely]:
>>>>>>  
>>>>>> error: failed to copy merged notes into output: file in wrong format
>>>>>>   LZ loader-stripped.elf
>>>>>>
>>>>>>
>>>>> I noticed this too a few days ago, but didn't spend any time to chase 
>>>>> it.
>>>>>  
>>>>>
>>>>>>  
>>>>>> Is the backtrace broken? The important line is 
>>>>>> #10 0x00001000000ed1e0 in ?? ()
>>>>>>
>>>>>> Which is inside the test executable. Did you try "osv syms"?
>>>>>>
>>>>>> I did. There were couple of libraries that had missing debug info 
>>>>>> missing. Installed still did not help and backtrace looks like this - 
>>>>>> broken: 
>>>>>>
>>>>>
>>>>> I'm still guessing the address 0x00001000000ed1e0 is in the actual 
>>>>> test itself, not any of the libraries.
>>>>> It is possible that "osv syms" gets confused because this bug happens 
>>>>> while the test executable is being unloaded, so maybe it's no longer in 
>>>>> the 
>>>>> list. You can comment out the code that removes the library from the list 
>>>>> that "osv syms" uses, and then maybe gdb will be able to know about it.
>>>>>
>>>>>  
>>>>>
>>>>>> It seems there is a C++ destructor in the test code (or Boost 
>>>>>>> framework) being run, which causes a crash. But I don't know what it 
>>>>>>> is...
>>>>>>>
>>>>>>> Not sure it is related but our usr.manifest.skel pulls ancient 
>>>>>> libgc_s.so from externals.
>>>>>>
>>>>>
>>>>> The reason why we have this file is explained in commit 
>>>>> be565320c082c00069614c850d29b42831b3dea6
>>>>>  
>>>>>
>>>>>> Changing it to pull from host creates other issues which I will send 
>>>>>> email about.  
>>>>>>
>>>>>
>>>>> :-(
>>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "OSv Development" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/osv-dev/861da7d5-7a27-4a0a-bb03-2d417c425e4b%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/osv-dev/861da7d5-7a27-4a0a-bb03-2d417c425e4b%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/09b2f258-9882-452a-8c60-ae25aab3c73d%40googlegroups.com.

Reply via email to