On Wed, 4 Aug 2021 at 18:19, Maged Michael wrote:
>
> Sorry. I totally missed the rest of your message and the patch. My fuzzy 
> eyesight, which usually guesses correctly 90% of the time, mistook "Secondly" 
> on a line by itself for "Sincerely" :-)

:-)

> The noinlining was based on looking at generated code. That was for clang. It 
> was inlining the _M_last_use function for every instance of _M_release (e.g., 
> ~shared_ptr). This optimization with the noinline for _M_release_last_use 
> ended up reducing massive binary text sizes by 1.5% (several megabytes), 
> which was an extra benefit. Without the noinline we saw code size increase.

Wow, that is a convincing argument for making it not inline, thanks.

> IIUC, we can use the following. Right?
>
> __attribute__((__noinline__))

Right.

> I didn't understand the part about programmers doing #define noinline 1. I 
> don't see code in the patch that uses noinline.

This is a valid C++ program:

#define noinline 1
#include <memory>
int main() { }

But if anything in <memory> uses "noinline" then this valid program
will not compile, which is why we must use ((__noinline__)) instead of
((noinline)).
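
For example (just a sketch, not part of the patch), the reserved
spelling keeps a header immune to that kind of user macro:

#define noinline 1
#include <memory>
__attribute__((__noinline__)) void f();  // OK: __noinline__ is a reserved name
// __attribute__((noinline)) void g();   // error: expands to __attribute__((1))
int main() { }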



>
> How about something like this comment?
>
> // Noinline to avoid code size increase.

Great, thanks.

On Wed, 4 Aug 2021 at 18:34, Maged Michael wrote:
> Actually I take back what I said. Sorry. I think the logic in your patch is 
> correct. I missed some of the atomic decrements.
> But I'd be concerned about performance. If we make _M_release_last_use 
> noinline then we are adding overhead to the fast path of the original logic 
> (where both counts are 1).

Oh, I see. So the code duplication serves a purpose. We want the
_M_release_last_use() code to be non-inline for the new logic, because
in the common case we never call it (we either detect that both counts
are 1 and do the dispose & destroy without atomic decrements, or we do
a single decrement and don't dispose or destroy anything). But for the
old logic, we do want that code inlined into _M_release (or
_M_release_orig as it was called in your patch). Have I got that right
now?

What if we remove the __noinline__ from _M_release_last_use() so that
it can be inlined, but then add a noinline wrapper that can be called
when we don't want to inline it?

So:
      // Called by _M_release() when the use count reaches zero.
      void
      _M_release_last_use() noexcept
      {
        // unchanged from previous patch, but without the attribute.
        // ...
      }

      // As above, but 'noinline' to reduce code size on the cold path.
      __attribute__((__noinline__))
      void
      _M_release_last_use_cold() noexcept
      { _M_release_last_use(); }


And then:

  template<>
    inline void
    _Sp_counted_base<_S_atomic>::_M_release() noexcept
    {
      _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_use_count);
#if ! _GLIBCXX_TSAN
      constexpr bool __lock_free
        = __atomic_always_lock_free(sizeof(long long), 0)
        && __atomic_always_lock_free(sizeof(_Atomic_word), 0);
      constexpr bool __double_word
        = sizeof(long long) == 2 * sizeof(_Atomic_word);
      // The ref-count members follow the vptr, so are aligned to
      // alignof(void*).
      constexpr bool __aligned = __alignof(long long) <= alignof(void*);
      if _GLIBCXX17_CONSTEXPR (__lock_free && __double_word && __aligned)
        {
          constexpr long long __unique_ref
            = 1LL + (1LL << (__CHAR_BIT__ * sizeof(_Atomic_word)));
          auto __both_counts = reinterpret_cast<long long*>(&_M_use_count);

          _GLIBCXX_SYNCHRONIZATION_HAPPENS_BEFORE(&_M_weak_count);
          if (__atomic_load_n(__both_counts, __ATOMIC_ACQUIRE) == __unique_ref)
            {
              // Both counts are 1, so there are no weak references and
              // we are releasing the last strong reference. No other
              // threads can observe the effects of this _M_release()
              // call (e.g. calling use_count()) without a data race.
              *(long long*)(&_M_use_count) = 0;
              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_use_count);
              _GLIBCXX_SYNCHRONIZATION_HAPPENS_AFTER(&_M_weak_count);
              _M_dispose();
              _M_destroy();
              return;
            }
          if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
            {
              _M_release_last_use_cold();
              return;
            }
        }
      else
#endif
      if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
        {
          _M_release_last_use();
        }
    }


So we use the noinline version on the new logic's slow path (when the
counts aren't both 1 and the use count drops to zero), but the
can-inline version for the old logic in the else branch. Would that work?

We could also consider adding __attribute__((__cold__)) to the
_M_release_last_use_cold() function, and/or add [[__unlikely__]] to
the 'if' that calls it, but maybe that's overkill.
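
Something like this, just to sketch what I mean (untested):

      // As above, but also marked cold to keep it off the hot path.
      __attribute__((__cold__, __noinline__))
      void
      _M_release_last_use_cold() noexcept
      { _M_release_last_use(); }

and at the call site in the new logic:

          if (__gnu_cxx::__exchange_and_add_dispatch(&_M_use_count, -1) == 1)
            [[__unlikely__]]
            {
              _M_release_last_use_cold();
              return;
            }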

It seems like this will slightly pessimize the case where the last use
count is dropped but there are weak refs, as that will take the cold
path now. That's probably OK, given the benefits for the "both counts
are 1" case, and that it might reduce the code size for the "use count
> 1" case.

It's past 8pm here so I'll come back to this tomorrow and do some code
size measurements of my own, now that I have a clearer understanding
of what your original patch was optimizing for.
