Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Jonathan Wakely Thu, 05 Jun 2025 09:44:31 -0700

On Thu, 5 Jun 2025 at 15:42, Tomasz Kaminski <tkami...@redhat.com> wrote:
>
>
>
> On Thu, Jun 5, 2025 at 12:41 PM Jonathan Wakely <jwak...@redhat.com> wrote:
>>
>> On Thu, 5 Jun 2025 at 10:00, Tomasz Kaminski <tkami...@redhat.com> wrote:
>> >
>> >
>> >
>> > On Thu, Jun 5, 2025 at 12:06 AM Jonathan Wakely <jwak...@redhat.com> wrote:
>> >>
>> >> Rename __semaphore_base to __semaphore_impl, because it's not used as a
>> >> base class. Replace the three identical lambda expressions with a named
>> >> class, __semaphore_impl::_Can_acquire, which stores the most recent
>> >> value of the counter as a data member.
>> >>
>> >> Add a new __platform_semaphore_impl class template which uses
>> >> __platform_wait_t for the counter and uses more efficient atomic waits
>> >> for the acquire functions, when __platform_wait is available. For a
>> >> binary semaphore some members are further optimized because we know the
>> >> counter can only be zero or one.
>> >>
>> >> Also add a bare wait flag to __atomic_wait_address_v, for consistency
>> >> with __atomic_wait_address_until_v and __atomic_wait_address_for_v and
>> >> to allow semaphores to use it without the redundant overhead of tracking
>> >> waiters.
>> >>
>> >> libstdc++-v3/ChangeLog:
>> >>
>> >>         * include/bits/atomic_wait.h (__atomic_wait_address_v): Add bare
>> >>         wait flag.
>> >>         * include/bits/semaphore_base.h (__semaphore_base): Rename to
>> >>         __semaphore_impl. Replace local variable and predicate lambda
>> >>         with _Can_acquire struct.
>> >>         (__platform_semaphore_impl): New class template.
>> >>         (__semaphore_impl): Remove alias template.
>> >>         * include/std/semaphore:
>> >> ---
>> >>
>> >> This replaces the two patches:
>> >>
>> >> [PATCH 3/4] libstdc++: Optimize std::binary_semaphore
>> >> [PATCH 4/4] libstdc++: Optimize std::counting_semaphore for futex path
>> >>
>> >> The problem with those patches was that the "optimized" implementations
>> >> were used if _Max <= numeric_limits<__platform_wait_t>::max(), which is
>> >> always true for many non-futex targets because we do:
>> >>
>> >> # if ATOMIC_LONG_LOCK_FREE == 2
>> >>     using __platform_wait_t = unsigned long;
>> >> # else
>> >>     using __platform_wait_t = unsigned int;
>> >> # endif
>> >>
>> >> The earlier patches should have been checking either
>> >> _GLIBCXX_HAVE_PLATFORM_WAIT or the __platform_wait_uses_type variable
>> >> template to see if an efficient (i.e. not condition_variable-based)
>> >> implementation is actually supported. It's not useful to check if the
>> >> semaphore's maximum value fits in the __platform_wait_t type if that
>> >> type doesn't have fast wait/wake operations like a futex.
>> >>
>> >> The new proposal in this patch only selects the optimized
>> >> implementations if they can be implemented using __platform_wait.  The
>> >> __semaphore_impl always uses the portable atomic wait API (using an
>> >> accessor function and predicate). As Tomasz suggested, a new class
>> >> template provides the optimized implementation for both
>> >> counting_semaphore and binary_semaphore, differing only in
>> >> _M_get_current and _M_try_acquire.
>> >>
>> >> Tested x86_64-linux and sparc-solaris.
>> >
>> > LGTM, one major suggestion to encapsulate the __pred._M_cur == 0 into 
>> > named member
>> > function.
>> > I have also checked that  __atomic_wait_address_until takes __pred by 
>> > reference,
>> > and does not make copies, so encapsulating __cur there is OK.
>> >>
>> >>
>> >>  libstdc++-v3/include/bits/atomic_wait.h    |   6 +-
>> >>  libstdc++-v3/include/bits/semaphore_base.h | 241 +++++++++++++++------
>> >>  libstdc++-v3/include/std/semaphore         |   5 +-
>> >>  3 files changed, 186 insertions(+), 66 deletions(-)
>> >>
>> >> diff --git a/libstdc++-v3/include/bits/atomic_wait.h 
>> >> b/libstdc++-v3/include/bits/atomic_wait.h
>> >> index 815726c16ccb..d32fda1c2435 100644
>> >> --- a/libstdc++-v3/include/bits/atomic_wait.h
>> >> +++ b/libstdc++-v3/include/bits/atomic_wait.h
>> >> @@ -249,12 +249,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >>        // C++26 will return __val
>> >>      }
>> >>
>> >> +  // Wait on __addr while *__addr == __old is true.
>> >> +  // Cannot be used for proxy waits.
>> >>    inline void
>> >>    __atomic_wait_address_v(const __detail::__platform_wait_t* __addr,
>> >>                           __detail::__platform_wait_t __old,
>> >> -                         int __order)
>> >> +                         int __order, bool __bare_wait = false)
>> >>    {
>> >> -    __detail::__wait_args __args{ __addr, __old, __order };
>> >> +    __detail::__wait_args __args{ __addr, __old, __order, __bare_wait };
>> >>      // C++26 will not ignore the return value here
>> >>      __detail::__wait_impl(__addr, __args);
>> >>    }
>> >> diff --git a/libstdc++-v3/include/bits/semaphore_base.h 
>> >> b/libstdc++-v3/include/bits/semaphore_base.h
>> >> index 3f7a33ccd51a..a1311e8690c7 100644
>> >> --- a/libstdc++-v3/include/bits/semaphore_base.h
>> >> +++ b/libstdc++-v3/include/bits/semaphore_base.h
>> >> @@ -46,23 +46,20 @@ namespace std _GLIBCXX_VISIBILITY(default)
>> >>  {
>> >>  _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >>
>> >> -  template<bool _Platform_wait>
>> >> -  struct __semaphore_base
>> >> +  struct __semaphore_impl
>> >>    {
>> >> -    using __count_type = __conditional_t<_Platform_wait,
>> >> -                                        __detail::__platform_wait_t,
>> >> -                                        ptrdiff_t>;
>> >> +    using __count_type = ptrdiff_t;
>> >>
>> >>      static constexpr ptrdiff_t _S_max
>> >>        = __gnu_cxx::__int_traits<__count_type>::__max;
>> >>
>> >>      constexpr explicit
>> >> -    __semaphore_base(__count_type __count) noexcept
>> >> +    __semaphore_impl(__count_type __count) noexcept
>> >>      : _M_counter(__count)
>> >>      { }
>> >>
>> >> -    __semaphore_base(const __semaphore_base&) = delete;
>> >> -    __semaphore_base& operator=(const __semaphore_base&) = delete;
>> >> +    __semaphore_impl(const __semaphore_impl&) = delete;
>> >> +    __semaphore_impl& operator=(const __semaphore_impl&) = delete;
>> >>
>> >>      // Load the current counter value.
>> >>      _GLIBCXX_ALWAYS_INLINE __count_type
>> >> @@ -71,8 +68,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >>
>> >>      // Try to acquire the semaphore (i.e. decrement the counter).
>> >>      // Returns false if the current counter is zero, or if another thread
>> >> -    // decrements the value first. In the latter case, __cur is set to 
>> >> the
>> >> -    // new value.
>> >> +    // changes the value first. In the latter case, __cur is set to the 
>> >> new
>> >> +    // value.
>> >>      _GLIBCXX_ALWAYS_INLINE bool
>> >>      _M_do_try_acquire(__count_type& __cur) noexcept
>> >>      {
>> >> @@ -85,24 +82,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >>                                                     
>> >> memory_order::relaxed);
>> >>      }
>> >>
>> >> +    // Keep trying to acquire the semaphore in a loop until it succeeds.
>> >>      void
>> >>      _M_acquire() noexcept
>> >>      {
>> >> -      auto const __vfn = [this]{ return _M_get_current(); };
>> >> -      auto __val = __vfn();
>> >> -      auto const __pred = [&__val](__count_type __cur) {
>> >> -       if (__cur > 0)
>> >> -         {
>> >> -           __val = __cur;
>> >> -           return true;
>> >> -         }
>> >> -       return false;
>> >> -      };
>> >> -      while (!_M_do_try_acquire(__val))
>> >> -       if (__val == 0)
>> >> +      auto __vfn = [this]{ return _M_get_current(); };
>> >> +      _Can_acquire __pred{__vfn()};
>> >> +      while (!_M_do_try_acquire(__pred._M_val))
>> >> +       if (__pred._M_val == 0)
>> >>           std::__atomic_wait_address(&_M_counter, __pred, __vfn, true);
>> >>      }
>> >>
>> >> +    // Try to acquire the semaphore, retrying a small number of times
>> >> +    // in case of contention.
>> >>      bool
>> >>      _M_try_acquire() noexcept
>> >>      {
>> >> @@ -116,23 +108,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >>        bool
>> >>        _M_try_acquire_until(const chrono::time_point<_Clock, _Duration>& 
>> >> __atime) noexcept
>> >>        {
>> >> -       auto const __vfn = [this]{ return _M_get_current(); };
>> >> -       auto __val = __vfn();
>> >> -       auto const __pred = [&__val](__count_type __cur) {
>> >> -         if (__cur > 0)
>> >> -           {
>> >> -             __val = __cur;
>> >> -             return true;
>> >> -           }
>> >> -         return false;
>> >> -       };
>> >> -       while (!_M_do_try_acquire(__val))
>> >> -         if (__val == 0)
>> >> -           {
>> >> -             if (!std::__atomic_wait_address_until(&_M_counter, __pred,
>> >> -                                                   __vfn, __atime, true))
>> >> -               return false; // timed out
>> >> -           }
>> >> +       auto __vfn = [this]{ return _M_get_current(); };
>> >> +       _Can_acquire __pred{__vfn()};
>> >> +       while (!_M_do_try_acquire(__pred._M_val))
>> >> +         if (__pred._M_val == 0)
>> >> +           if (!std::__atomic_wait_address_until(&_M_counter, __pred,
>> >> +                                                 __vfn, __atime, true))
>> >> +             return false; // timed out
>> >>         return true;
>> >>        }
>> >>
>> >> @@ -140,23 +122,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >>        bool
>> >>        _M_try_acquire_for(const chrono::duration<_Rep, _Period>& __rtime) 
>> >> noexcept
>> >>        {
>> >> -       auto const __vfn = [this]{ return _M_get_current(); };
>> >> -       auto __val = __vfn();
>> >> -       auto const __pred = [&__val](__count_type __cur) {
>> >> -         if (__cur > 0)
>> >> -           {
>> >> -             __val = __cur;
>> >> -             return true;
>> >> -           }
>> >> -         return false;
>> >> -       };
>> >> -       while (!_M_do_try_acquire(__val))
>> >> -         if (__val == 0)
>> >> -           {
>> >> -             if (!std::__atomic_wait_address_for(&_M_counter, __pred,
>> >> -                                                 __vfn, __rtime, true))
>> >> -               return false; // timed out
>> >> -           }
>> >> +       auto __vfn = [this]{ return _M_get_current(); };
>> >> +       _Can_acquire __pred{__vfn()};
>> >> +       while (!_M_do_try_acquire(__pred._M_val))
>> >> +         if (__pred._M_val == 0)
>> >
>> > I think we could have some named member funciton here,
>> > like __pred.should_wait(). But not strictly necessary.
>>
>> I like the suggestion. How about naming the predicate struct differently:
>>
>>     struct _Available
>>     {
>>       __count_type _M_val; // Cache of the last value loaded from _M_counter.
>>
>>       // Argument should be the latest value of the counter.
>>       // Returns true (and caches the value) if it's non-zero and so can be
>>       // decremented to acquire the semaphore. Returns false otherwise.
>>       bool operator()(__count_type __cur) noexcept
>>       {
>>         if (__cur == 0)
>>           return false;
>>         _M_val = __cur;
>>         return true;
>>       }
>>
>>       // Returns true if the last-known value is non-zero.
>>       explicit operator bool() const noexcept { _M_val > 0; }
>>     };
>>
>> And then it would be used like this:
>>
>>     void
>>     _M_acquire() noexcept
>>     {
>>       auto __vfn = [this]{ return _M_get_current(); };
>>       _Available __is_available{__vfn()};
>>       while (!_M_do_try_acquire(__is_available._M_val))
>>         if (!__is_available)
>
> I think the bool conversion here is too cute, the two uses look "simialar".


I don't see a problem with that, they *are* similar. One tests the
last known value, the other updates the last-known value and then
tests it.

The operator() function could be implemented as:

bool operator()(count_type cur)
{
  _M_val = cur;
  return static_cast<bool>(*this);
}

That would be equivalent, but it's implemented slightly differently to
avoid a redundant store (don't bother to store zero, because it must
already be zero if the predicate was being tested by the waiting
function).

I did consider naming the operator bool() as operator()() instead, so
the callable could be used with one argument (to update the cache and
test it) and with no arguments (to test the previously cached value).

Then it would be used like this:

    void
    _M_acquire() noexcept
    {
      auto __vfn = [this]{ return _M_get_current(); };
      _Available __is_available{__vfn()};
      while (!_M_do_try_acquire(__is_available._M_val))
        if (!__is_available())
          std::__atomic_wait_address(&_M_counter, __is_available, __vfn, true);
    }

> Maye rename use a named member like "current()" or "last()", or 
> "is_available.currently()".
>>
>>           std::__atomic_wait_address(&_M_counter, __is_available, __vfn, 
>> true);
>>     }
>>
>> So __is_available(vfn()) tests whether it's available based on the new
>> value returned by vfn(), which is the predicate used by the atomic
>> waiting functions, and (bool)__is_available tests whether it's
>> available based on the cached value, which is what we use to decide
>> whether to call the waiting function.
>>
>> Too cute? I think it makes the logic quite readable.
>>

Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Reply via email to