The problem is that order in which objects are unloaded (calling program::remove_object()) is completely non-deterministic and based on where the last reference through shared_ptr is removed and remove_object() called.
We could save the order in which init functions are called by pushing objects removed in init_library() on another stack to capture the reversed object. But when would we act on it? On Sunday, December 1, 2019 at 5:21:13 PM UTC-5, Waldek Kozaczuk wrote: > > Here is a little bit better-improved stacktrace after I used osv load_elf > t manually load unit_test object: > > osv load-elf /lib64/libboost_unit_test_framework.so.1.69.0 > 0x0000000000088000 > > (gdb) bt > #0 0x00000000403a3cf2 in processor::cli_hlt () at > arch/x64/processor.hh:247 > #1 arch::halt_no_interrupts () at arch/x64/arch.hh:48 > #2 osv::halt () at arch/x64/power.cc:26 > #3 0x0000000040239dae in abort (fmt=fmt@entry=0x4064155f "Aborted\n") at > runtime.cc:132 > #4 0x00000000402028ab in abort () at runtime.cc:98 > #5 0x000000004021981e in osv::generate_signal (siginfo=..., > ef=0xffff80000123c068) at libc/signal.cc:124 > #6 0x0000000040463f7f in osv::handle_mmap_fault (addr=<optimized out>, > sig=<optimized out>, ef=<optimized out>) at libc/signal.cc:139 > #7 0x000000004033da62 in mmu::vm_fault (addr=17592187015168, > addr@entry=17592187015648, ef=ef@entry=0xffff80000123c068) at > core/mmu.cc:1337 > #8 0x000000004039dc30 in page_fault (ef=0xffff80000123c068) at > arch/x64/mmu.cc:42 > #9 <signal handler called> > #10 0x00001000000ed1e0 in std::__cxx11::basic_string<char, > std::char_traits<char>, std::allocator<char> >::~basic_string() () > #11 0x000000004023a357 in __cxxabiv1::__cxa_finalize (dso=<optimized out>) > at runtime.cc:183 > #12 0x000010000041f5f7 in __do_global_dtors_aux () > #13 0x00002000001008f0 in ?? () > #14 0x0000000040352f34 in elf::object::run_fini_funcs > (this=0xffffa000015d0a00) at core/elf.cc:1105 > Backtrace stopped: frame did not save the PC > > It looks like it crashes when trying to call the destructor of a string > object. > > Also here is what ELF spec ( > https://refspecs.linuxfoundation.org/elf/elf.pdf) says about calling > termination functions (FINI*) in the "Initialization and Termination > Functions" section: > "Similarly, shared objects may have termination functions, which are > executed with the atexit(BA_OS) mechanism after the base process begins its > termination sequence. The *order in which the dynamic linker calls > termination functions is the exact reverse order of their corresponding > initialization functions*. If a shared object has a termination function, > but no initialization function, the termination function will execute in > the order it would have as if the shared object's initialization function > was present. The dynamic linker ensures that it will not execute any > initialization or termination functions more than once." > > Clearly our linker does not behave like this. Obviously I do not have 100% > proof that the incorrect order in which dynamic linker executes termination > functions leads to the crash we are seeing. But based on the printouts from > another test that succeeds it seems likely to be a culprit. > > On Sunday, December 1, 2019 at 7:36:20 AM UTC-5, Nadav Har'El wrote: >> >> >> On Sat, Nov 30, 2019 at 5:32 AM Waldek Kozaczuk <[email protected]> >> wrote: >> >>> I think I have solved the mystery. It looks like another bug in the >>> dynamic linker. This time related to an order in which objects are unloaded >>> and FINI* functions are executed. >>> >> >> I'm afraid I didn't quite follow the details below and what's wrong with >> the "wrong" order or what caused it. >> >> I wonder if it's related to to issue >> https://github.com/cloudius-systems/osv/issues/334 but I don't know how. >> >> >> >>> >>> As you can tell in the last test boost_filesystem, FINI function got >>> executed after the libboost_unit_test_framework FINI function got >>> called and unloaded. If you look at the first one, the order is opposite >>> and that why it works. >>> >> >> Why is this order wrong? >> >> If I understand correctly, our test loads (DT_NEED) >> libboost_unit_test_framework >> and that loads boost_filesystem >> So you'd expect the program::remove_object() code to first finalize >> unit_test_framework, and only then unload its dependency (boost_filesystem). >> So it sounds the order you call wrong, is right, no? If it's not right - >> then at least it was as intended ;-) And if this intent is wrong, it should >> be fixed (in remove_object()) >> >> But there is a complication: in modules/tests/Makefile I see our test >> depends on boost_filesystem *directly*. >> So now the order that boost_filesystem gets unloaded may become random >> (because we analyze the entire DAG for dependencies....). >> This may explain why in some cases we got the "opposite" order and >> everything worked. >> >>> >>> >>> I think we need to change unloading logic to first execute all FINI >>> functions of all objects probably on the same thread that the INIT ones got >>> executed and only then unload the objects. >>> >> >> This is an interesting point. Indeed the loading code first loads all the >> necessary objects in multiple levels, and only then runs all the init >> functions (in some hopefully correct object), the unloading code doesn't do >> this. But I don't think there's an easy way to do this, if at all.... It's >> not always true that objects get unloaded in one large batch. It can also >> happen like I mentioned above in our own example - boost_test_framework >> says it wants to unload boost_filesystem, BUT, it is not the last holder of >> boost_filesystem because the application also dt_needed boost_filesystem >> directly. So we can't unload boost_filesystem just yet... And need to delay >> its unload. >> >> >> >>> What do you think? >>> >>> On Thursday, November 28, 2019 at 3:19:57 PM UTC-5, Nadav Har'El wrote: >>>> >>>> On Thu, Nov 28, 2019 at 10:04 PM Waldek Kozaczuk <[email protected]> >>>> wrote: >>>> >>>>> >>>>> >>>>>>> >>>>>>> >>>>>>> Please note I have removed this line from the tests Makefile: >>>>> -$(boost-tests:%=$(out)/tests/%): CXXFLAGS += >>>>> -D_GLIBCXX_USE_CXX11_ABI=0 \ >>>>> I do not fully understand the significance of it nor the reasons for >>>>> it in the first place (why was it there). Could it cause this issue? >>>>> >>>> >>>> This line was needed because we try to combine newly combined test code >>>> and some super-antique version of Boost from external/. >>>> It is explained in commit 6a3bff38a281e65ee715bab4fadef63e0918f7d3. You >>>> shouldn't need it if you stop using the antique boost. >>>> >>>> >>>>> BTW I also now noticed this linker warning (not sure if related): >>>>> LIBOSV.SO >>>>> STRIP loader.elf -> loader-stripped.elf >>>>> strip: >>>>> build/release.x64/loader.elf[.gnu.build.attributes.text._ZN5boost6system23dummy_exported_functionEv]: >>>>> >>>>> Warning: version note missing - assuming version 3 >>>>> strip:build/release.x64/loader-stripped.elf[.gnu.build.attributes.unlikely]: >>>>> >>>>> error: failed to copy merged notes into output: file in wrong format >>>>> LZ loader-stripped.elf >>>>> >>>>> >>>> I noticed this too a few days ago, but didn't spend any time to chase >>>> it. >>>> >>>> >>>>> >>>>> Is the backtrace broken? The important line is >>>>> #10 0x00001000000ed1e0 in ?? () >>>>> >>>>> Which is inside the test executable. Did you try "osv syms"? >>>>> >>>>> I did. There were couple of libraries that had missing debug info >>>>> missing. Installed still did not help and backtrace looks like this - >>>>> broken: >>>>> >>>> >>>> I'm still guessing the address 0x00001000000ed1e0 is in the actual test >>>> itself, not any of the libraries. >>>> It is possible that "osv syms" gets confused because this bug happens >>>> while the test executable is being unloaded, so maybe it's no longer in >>>> the >>>> list. You can comment out the code that removes the library from the list >>>> that "osv syms" uses, and then maybe gdb will be able to know about it. >>>> >>>> >>>> >>>>> It seems there is a C++ destructor in the test code (or Boost >>>>>> framework) being run, which causes a crash. But I don't know what it >>>>>> is... >>>>>> >>>>>> Not sure it is related but our usr.manifest.skel pulls ancient >>>>> libgc_s.so from externals. >>>>> >>>> >>>> The reason why we have this file is explained in commit >>>> be565320c082c00069614c850d29b42831b3dea6 >>>> >>>> >>>>> Changing it to pull from host creates other issues which I will send >>>>> email about. >>>>> >>>> >>>> :-( >>>> >>> -- >>> You received this message because you are subscribed to the Google >>> Groups "OSv Development" group. >>> To unsubscribe from this group and stop receiving emails from it, send >>> an email to [email protected]. >>> To view this discussion on the web visit >>> https://groups.google.com/d/msgid/osv-dev/861da7d5-7a27-4a0a-bb03-2d417c425e4b%40googlegroups.com >>> >>> <https://groups.google.com/d/msgid/osv-dev/861da7d5-7a27-4a0a-bb03-2d417c425e4b%40googlegroups.com?utm_medium=email&utm_source=footer> >>> . >>> >> -- You received this message because you are subscribed to the Google Groups "OSv Development" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/osv-dev/2d079b97-aa7b-487b-b72c-15cdddb185c8%40googlegroups.com.
