Re: [osv-dev] Some boost based unit tests failing on Fedora 30 when linked with boost from host - 1.69.0

Waldek Kozaczuk Tue, 03 Dec 2019 20:56:37 -0800

After experimenting a lot and trying various things here is the small patch 
that seems to fix this issue:


diff --git a/core/elf.cc b/core/elf.cc
index 9a26a2e7..37e72519 100644
--- a/core/elf.cc
+++ b/core/elf.cc
@@ -1021,8 +1021,10 @@ void 
object::load_needed(std::vector<std::shared_ptr<object>>& loaded_objects)
 void object::unload_needed()
 {
-    _needed.clear();
     _used_by_resolve_plt_got.clear();
+    while (!_needed.empty()) {
+        _needed.pop_back();
+    }
 }
 
 ulong object::get_tls_size()

As you can see in essence this patch slightly changes the order in which 
dependant objects are unloaded. It reverses the order of unloading objects 
in DT_NEEDED list.

Obviously this also changes the order in which FINI functions are called 
(or objects finalized). But I am not sure the wrong order of executing the 
FINI functions was really the issue or order of unloading which affected 
the execution?

I did play to manually force certain order of executing INIT functions to 
make it what it should be per #334 and it did not seem to matter. The 
unloading order seems to be the key.

Also, I even try to reverse the order in DT_NEEDE like so:
 $(boost-tests:%=$(out)/tests/%): LIBS += \
-       -lboost_unit_test_framework \
-       -lboost_filesystem
+       -lboost_filesystem \
+       -lboost_unit_test_framework


and with new patch the all tests would also pass with both boost 1.69 and 
1.67

Please note that with boost 1.69 we have more elaborate chains of 
dependencies with new libraries like boost_chrono and boost_timer so that 
is possibly why we see these crashes only with boost 1.69.

On Tuesday, December 3, 2019 at 12:10:01 AM UTC-5, Waldek Kozaczuk wrote:
>
> So this WIP patch seems to make all tests pass including the boost ones 
> that used to fail:
>
> diff --git a/include/osv/elf.hh b/include/osv/elf.hh
> index 9a0415dc..ea597c08 100644
> --- a/include/osv/elf.hh
> +++ b/include/osv/elf.hh
> @@ -425,6 +425,7 @@ protected:
>      std::unique_ptr<char[]> _section_names_cache;
>      bool _is_executable;
>      bool is_core();
> +    bool _finalized = false;
>  
>      std::unordered_map<std::string,void*> _cached_symbols;
>  
> @@ -434,6 +435,7 @@ protected:
>      std::vector<std::shared_ptr<elf::object>> _needed;
>      std::unordered_set<std::shared_ptr<elf::object>> 
> _used_by_resolve_plt_got;
>      mutex _used_by_resolve_plt_got_mutex;
> +    std::vector<elf::object*> _fini_objects;
>      // Allow objects on program->_modules to be usable for the threads
>      // currently initializing them, but not yet visible for other threads.
>      // This simplifies the code (the initializer can use the regular 
> lookup
> @@ -457,6 +459,7 @@ private:
>      bool visible(void) const;
>  public:
>      void setprivate(bool);
> +    void add_fini_object(elf::object *_obj) { 
> _fini_objects.push_back(_obj); }
>  };
>  
>  class file : public object {
>
> diff --git a/core/elf.cc b/core/elf.cc
> index 909559a2..e4184e58 100644
> --- a/core/elf.cc
> +++ b/core/elf.cc
> @@ -1096,6 +1096,9 @@ void object::run_init_funcs(int argc, char** argv)
>  // Run the object's static destructors or similar finalization
>  void object::run_fini_funcs()
>  {
> +    if (_finalized) {
> +        return;
> +    }
>      if (dynamic_exists(DT_FINI_ARRAY)) {
>          auto funcs = dynamic_ptr<void (*)()>(DT_FINI_ARRAY);
>          auto nr = dynamic_val(DT_FINI_ARRAYSZ) / sizeof(*funcs);
> @@ -1113,6 +1116,11 @@ void object::run_fini_funcs()
>              reinterpret_cast<void(*)()>(func)();
>          }
>      }
> +    for (auto dep : _fini_objects) {
> +        dep->run_fini_funcs(); 
> +    }
> +    _finalized = true;
>  }
>  
>  void* object::tls_addr()
> @@ -1397,8 +1405,14 @@ void program::init_library(int argc, char** argv)
>              loaded_objects[i]->setprivate(true);
>          }
>          for (int i = size - 1; i >= 0; i--) {
>              loaded_objects[i]->run_init_funcs(argc, argv);
>          }
> +        if (size > 0) {
> +            for (unsigned i = 1; i < size; i++) {
> +                
> loaded_objects[0]->add_fini_object(loaded_objects[i].get());
> +            }
> +        }
>          for (unsigned i = 0; i < size; i++) {
>              loaded_objects[i]->setprivate(false);
>          }
>
> The essence of the patch is to capture the order in which objects should 
> be finalized which is the exact opposite of the one in which INIT functions 
> are called. 
>
> The good part is that it fixes the tests and seems to prove that the order 
> of finalization was the culprit of the crashes. The bad part is that I am 
> not convinced it is correct because I think it prematurely executes FINI 
> functions.
>
> Imagine a scenario like this with two chains of dependencies between 
> objects:
> A -> B -> C -> D -> E 
>                F -> D ->E
>
> where A gets removed but F stays in use longer. In which case we call FINI 
> for A, B, C, D, E even though D and E is still in use by F.
>
> I am not sure how to implement it yet, but in general, we should remember 
> the FINI order for all the objects, not just the top ones as my patch does. 
> And then somehow detect if dependant objects are about to be removed per 
> unload_needed() and only call FINI functions for those in the remembered 
> order. 
>
> So in the scenario above in run_fini_funcs() on behalf of A, call 
> run_fini_funcs() for B and C only in that order and stop here. Then only 
> when F is removed and its run_fini_funcs() called, call run_fini_funcs 
> for D and E in that order. The key is how would know ahead of   
> run_fini_funcs 
> that for example B and C are to removed but D and E not yet.
>
> How to accomplish that I am not sure.
>
> On Sunday, December 1, 2019 at 11:43:11 PM UTC-5, Waldek Kozaczuk wrote:
>>
>> The other test tst-rename.so passes most of the time but sometimes fails 
>> with similar stack trace. 
>>
>> I wonder if this code is deterministic:
>> void object::unload_needed()
>> {
>>     _needed.clear();
>>     _used_by_resolve_plt_got.clear();
>> }
>>
>> The _needed is a vector but _used_by_resolve_plt_got is an unordered set.
>>
>> On Sunday, December 1, 2019 at 5:30:55 PM UTC-5, Waldek Kozaczuk wrote:
>>>
>>> The problem is that order in which objects are unloaded (calling 
>>> program::remove_object()) is completely non-deterministic and based on 
>>> where the last reference through shared_ptr is removed and remove_object() 
>>> called.
>>>
>>> We could save the order in which init functions are called by pushing 
>>> objects removed in init_library() on another stack to capture the reversed 
>>> object. But when would we act on it?
>>>
>>> On Sunday, December 1, 2019 at 5:21:13 PM UTC-5, Waldek Kozaczuk wrote:
>>>>
>>>> Here is a little bit better-improved stacktrace after I used osv 
>>>> load_elf t manually load unit_test object:
>>>>
>>>> osv load-elf /lib64/libboost_unit_test_framework.so.1.69.0 
>>>> 0x0000000000088000
>>>>
>>>> (gdb) bt
>>>> #0  0x00000000403a3cf2 in processor::cli_hlt () at 
>>>> arch/x64/processor.hh:247
>>>> #1  arch::halt_no_interrupts () at arch/x64/arch.hh:48
>>>> #2  osv::halt () at arch/x64/power.cc:26
>>>> #3  0x0000000040239dae in abort (fmt=fmt@entry=0x4064155f "Aborted\n") 
>>>> at runtime.cc:132
>>>> #4  0x00000000402028ab in abort () at runtime.cc:98
>>>> #5  0x000000004021981e in osv::generate_signal (siginfo=..., 
>>>> ef=0xffff80000123c068) at libc/signal.cc:124
>>>> #6  0x0000000040463f7f in osv::handle_mmap_fault (addr=<optimized out>, 
>>>> sig=<optimized out>, ef=<optimized out>) at libc/signal.cc:139
>>>> #7  0x000000004033da62 in mmu::vm_fault (addr=17592187015168, 
>>>> addr@entry=17592187015648, ef=ef@entry=0xffff80000123c068) at 
>>>> core/mmu.cc:1337
>>>> #8  0x000000004039dc30 in page_fault (ef=0xffff80000123c068) at 
>>>> arch/x64/mmu.cc:42
>>>> #9  <signal handler called>
>>>> #10 0x00001000000ed1e0 in std::__cxx11::basic_string<char, 
>>>> std::char_traits<char>, std::allocator<char> >::~basic_string() ()
>>>> #11 0x000000004023a357 in __cxxabiv1::__cxa_finalize (dso=<optimized 
>>>> out>) at runtime.cc:183
>>>> #12 0x000010000041f5f7 in __do_global_dtors_aux ()
>>>> #13 0x00002000001008f0 in ?? ()
>>>> #14 0x0000000040352f34 in elf::object::run_fini_funcs 
>>>> (this=0xffffa000015d0a00) at core/elf.cc:1105
>>>> Backtrace stopped: frame did not save the PC
>>>>
>>>> It looks like it crashes when trying to call the destructor of a string 
>>>> object.
>>>>
>>>> Also here is what ELF spec (
>>>> https://refspecs.linuxfoundation.org/elf/elf.pdf) says about calling 
>>>> termination functions (FINI*) in the "Initialization and Termination 
>>>> Functions" section:
>>>> "Similarly, shared objects may have termination functions, which are 
>>>> executed with the atexit(BA_OS) mechanism after the base process begins 
>>>> its 
>>>> termination sequence. The *order in which the dynamic linker calls 
>>>> termination functions is the exact reverse order of their corresponding 
>>>> initialization functions*. If a shared object has a termination 
>>>> function, but no initialization function, the termination function will 
>>>> execute in the order it would have as if the shared object's 
>>>> initialization 
>>>> function was present. The dynamic linker ensures that it will not execute 
>>>> any initialization or termination functions more than once."
>>>>
>>>> Clearly our linker does not behave like this. Obviously I do not have 
>>>> 100% proof that the incorrect order in which dynamic linker executes 
>>>> termination functions leads to the crash we are seeing. But based on the 
>>>> printouts from another test that succeeds it seems likely to be a culprit.
>>>>
>>>> On Sunday, December 1, 2019 at 7:36:20 AM UTC-5, Nadav Har'El wrote:
>>>>>
>>>>>
>>>>> On Sat, Nov 30, 2019 at 5:32 AM Waldek Kozaczuk <[email protected]> 
>>>>> wrote:
>>>>>
>>>>>> I think I have solved the mystery. It looks like another bug in the 
>>>>>> dynamic linker. This time related to an order in which objects are 
>>>>>> unloaded 
>>>>>> and FINI* functions are executed.
>>>>>>
>>>>>
>>>>> I'm afraid I didn't quite follow the details below and what's wrong 
>>>>> with the "wrong" order or what caused it.
>>>>>
>>>>> I wonder if it's related to to issue 
>>>>> https://github.com/cloudius-systems/osv/issues/334 but I don't know 
>>>>> how.
>>>>>
>>>>>  
>>>>>
>>>>>>
>>>>>> As you can tell in the last test boost_filesystem, FINI function got 
>>>>>> executed after the libboost_unit_test_framework FINI function got 
>>>>>> called and unloaded. If you look at the first one, the order is opposite 
>>>>>> and that why it works.
>>>>>>
>>>>>
>>>>> Why is this order wrong?
>>>>>
>>>>> If I understand correctly, our test loads (DT_NEED) 
>>>>> libboost_unit_test_framework 
>>>>> and that loads boost_filesystem
>>>>> So you'd expect the program::remove_object() code to first finalize 
>>>>> unit_test_framework, and only then unload its dependency 
>>>>> (boost_filesystem).
>>>>> So it sounds the order you call wrong, is right, no? If it's not right 
>>>>> - then at least it was as intended ;-) And if this intent is wrong, it 
>>>>> should be fixed (in remove_object())
>>>>>
>>>>> But there is a complication: in modules/tests/Makefile I see our test 
>>>>> depends on boost_filesystem *directly*.
>>>>> So now the order that boost_filesystem gets unloaded may become random 
>>>>> (because we analyze the entire DAG for dependencies....).
>>>>> This may explain why in some cases we got the "opposite" order and 
>>>>> everything worked.
>>>>>  
>>>>>>
>>>>>>
>>>>>> I think we need to change unloading logic to first execute all FINI 
>>>>>> functions of all objects probably on the same thread that the INIT ones 
>>>>>> got 
>>>>>> executed and only then unload the objects.
>>>>>>
>>>>>
>>>>> This is an interesting point. Indeed the loading code first loads all 
>>>>> the necessary objects in multiple levels, and only then runs all the init 
>>>>> functions (in some hopefully correct object), the unloading code doesn't 
>>>>> do 
>>>>> this. But I don't think there's an easy way to do this, if at all.... 
>>>>> It's 
>>>>> not always true that objects get unloaded in one large batch. It can also 
>>>>> happen like I mentioned above in our own example - boost_test_framework 
>>>>> says it wants to unload boost_filesystem, BUT, it is not the last holder 
>>>>> of 
>>>>> boost_filesystem because the application also dt_needed boost_filesystem 
>>>>> directly. So we can't unload boost_filesystem just yet... And need to 
>>>>> delay 
>>>>> its unload.
>>>>>
>>>>>  
>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> On Thursday, November 28, 2019 at 3:19:57 PM UTC-5, Nadav Har'El 
>>>>>> wrote:
>>>>>>>
>>>>>>> On Thu, Nov 28, 2019 at 10:04 PM Waldek Kozaczuk <[email protected]> 
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>  
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Please note I have removed this line from the tests Makefile:
>>>>>>>> -$(boost-tests:%=$(out)/tests/%): CXXFLAGS += 
>>>>>>>> -D_GLIBCXX_USE_CXX11_ABI=0 \
>>>>>>>> I do not fully understand the significance of it nor the reasons 
>>>>>>>> for it in the first place (why was it there). Could it cause this 
>>>>>>>> issue?
>>>>>>>>
>>>>>>>
>>>>>>> This line was needed because we try to combine newly combined test 
>>>>>>> code and some super-antique version of Boost from external/.
>>>>>>> It is explained in commit 6a3bff38a281e65ee715bab4fadef63e0918f7d3. 
>>>>>>> You shouldn't need it if you stop using the antique boost.
>>>>>>>  
>>>>>>>
>>>>>>>> BTW I also now noticed this linker warning (not sure if related):
>>>>>>>>   LIBOSV.SO
>>>>>>>>   STRIP loader.elf -> loader-stripped.elf 
>>>>>>>> strip: 
>>>>>>>> build/release.x64/loader.elf[.gnu.build.attributes.text._ZN5boost6system23dummy_exported_functionEv]:
>>>>>>>>  
>>>>>>>> Warning: version note missing - assuming version 3
>>>>>>>> strip:build/release.x64/loader-stripped.elf[.gnu.build.attributes.unlikely]:
>>>>>>>>  
>>>>>>>> error: failed to copy merged notes into output: file in wrong format
>>>>>>>>   LZ loader-stripped.elf
>>>>>>>>
>>>>>>>>
>>>>>>> I noticed this too a few days ago, but didn't spend any time to 
>>>>>>> chase it.
>>>>>>>  
>>>>>>>
>>>>>>>>  
>>>>>>>> Is the backtrace broken? The important line is 
>>>>>>>> #10 0x00001000000ed1e0 in ?? ()
>>>>>>>>
>>>>>>>> Which is inside the test executable. Did you try "osv syms"?
>>>>>>>>
>>>>>>>> I did. There were couple of libraries that had missing debug info 
>>>>>>>> missing. Installed still did not help and backtrace looks like this - 
>>>>>>>> broken: 
>>>>>>>>
>>>>>>>
>>>>>>> I'm still guessing the address 0x00001000000ed1e0 is in the actual 
>>>>>>> test itself, not any of the libraries.
>>>>>>> It is possible that "osv syms" gets confused because this bug 
>>>>>>> happens while the test executable is being unloaded, so maybe it's no 
>>>>>>> longer in the list. You can comment out the code that removes the 
>>>>>>> library 
>>>>>>> from the list that "osv syms" uses, and then maybe gdb will be able to 
>>>>>>> know 
>>>>>>> about it.
>>>>>>>
>>>>>>>  
>>>>>>>
>>>>>>>> It seems there is a C++ destructor in the test code (or Boost 
>>>>>>>>> framework) being run, which causes a crash. But I don't know what it 
>>>>>>>>> is...
>>>>>>>>>
>>>>>>>>> Not sure it is related but our usr.manifest.skel pulls ancient 
>>>>>>>> libgc_s.so from externals.
>>>>>>>>
>>>>>>>
>>>>>>> The reason why we have this file is explained in commit 
>>>>>>> be565320c082c00069614c850d29b42831b3dea6
>>>>>>>  
>>>>>>>
>>>>>>>> Changing it to pull from host creates other issues which I will 
>>>>>>>> send email about.  
>>>>>>>>
>>>>>>>
>>>>>>> :-(
>>>>>>>
>>>>>> -- 
>>>>>> You received this message because you are subscribed to the Google 
>>>>>> Groups "OSv Development" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it, 
>>>>>> send an email to [email protected].
>>>>>> To view this discussion on the web visit 
>>>>>> https://groups.google.com/d/msgid/osv-dev/861da7d5-7a27-4a0a-bb03-2d417c425e4b%40googlegroups.com
>>>>>>  
>>>>>> <https://groups.google.com/d/msgid/osv-dev/861da7d5-7a27-4a0a-bb03-2d417c425e4b%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>> .
>>>>>>
>>>>>

-- 
You received this message because you are subscribed to the Google Groups "OSv 
Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/osv-dev/59cfa598-8ac6-44da-aba4-d194ca7dfbe7%40googlegroups.com.

Re: [osv-dev] Some boost based unit tests failing on Fedora 30 when linked with boost from host - 1.69.0

Reply via email to