Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Jonathan Wakely Thu, 05 Jun 2025 03:51:10 -0700

On Thu, 5 Jun 2025 at 10:00, Tomasz Kaminski <tkami...@redhat.com> wrote:
>
>
>
> On Thu, Jun 5, 2025 at 12:06 AM Jonathan Wakely <jwak...@redhat.com> wrote:
>>
>> Rename __semaphore_base to __semaphore_impl, because it's not used as a
>> base class. Replace the three identical lambda expressions with a named
>> class, __semaphore_impl::_Can_acquire, which stores the most recent
>> value of the counter as a data member.
>>
>> Add a new __platform_semaphore_impl class template which uses
>> __platform_wait_t for the counter and uses more efficient atomic waits
>> for the acquire functions, when __platform_wait is available. For a
>> binary semaphore some members are further optimized because we know the
>> counter can only be zero or one.
>>
>> Also add a bare wait flag to __atomic_wait_address_v, for consistency
>> with __atomic_wait_address_until_v and __atomic_wait_address_for_v and
>> to allow semaphores to use it without the redundant overhead of tracking
>> waiters.
>>
>> libstdc++-v3/ChangeLog:
>>
>>         * include/bits/atomic_wait.h (__atomic_wait_address_v): Add bare
>>         wait flag.
>>         * include/bits/semaphore_base.h (__semaphore_base): Rename to
>>         __semaphore_impl. Replace local variable and predicate lambda
>>         with _Can_acquire struct.
>>         (__platform_semaphore_impl): New class template.
>>         (__semaphore_impl): Remove alias template.
>>         * include/std/semaphore:
>> ---
>>
>> This replaces the two patches:
>>
>> [PATCH 3/4] libstdc++: Optimize std::binary_semaphore
>> [PATCH 4/4] libstdc++: Optimize std::counting_semaphore for futex path
>>
>> The problem with those patches was that the "optimized" implementations
>> were used if _Max <= numeric_limits<__platform_wait_t>::max(), which is
>> always true for many non-futex targets because we do:
>>
>> # if ATOMIC_LONG_LOCK_FREE == 2
>>     using __platform_wait_t = unsigned long;
>> # else
>>     using __platform_wait_t = unsigned int;
>> # endif
>>
>> The earlier patches should have been checking either
>> _GLIBCXX_HAVE_PLATFORM_WAIT or the __platform_wait_uses_type variable
>> template to see if an efficient (i.e. not condition_variable-based)
>> implementation is actually supported. It's not useful to check if the
>> semaphore's maximum value fits in the __platform_wait_t type if that
>> type doesn't have fast wait/wake operations like a futex.
>>
>> The new proposal in this patch only selects the optimized
>> implementations if they can be implemented using __platform_wait.  The
>> __semaphore_impl always uses the portable atomic wait API (using an
>> accessor function and predicate). As Tomasz suggested, a new class
>> template provides the optimized implementation for both
>> counting_semaphore and binary_semaphore, differing only in
>> _M_get_current and _M_try_acquire.
>>
>> Tested x86_64-linux and sparc-solaris.
>
> LGTM, one major suggestion to encapsulate the __pred._M_cur == 0 into named 
> member
> function.
> I have also checked that  __atomic_wait_address_until takes __pred by 
> reference,
> and does not make copies, so encapsulating __cur there is OK.
>>
>>
>>  libstdc++-v3/include/bits/atomic_wait.h    |   6 +-
>>  libstdc++-v3/include/bits/semaphore_base.h | 241 +++++++++++++++------
>>  libstdc++-v3/include/std/semaphore         |   5 +-
>>  3 files changed, 186 insertions(+), 66 deletions(-)
>>
>> diff --git a/libstdc++-v3/include/bits/atomic_wait.h 
>> b/libstdc++-v3/include/bits/atomic_wait.h
>> index 815726c16ccb..d32fda1c2435 100644
>> --- a/libstdc++-v3/include/bits/atomic_wait.h
>> +++ b/libstdc++-v3/include/bits/atomic_wait.h
>> @@ -249,12 +249,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>        // C++26 will return __val
>>      }
>>
>> +  // Wait on __addr while *__addr == __old is true.
>> +  // Cannot be used for proxy waits.
>>    inline void
>>    __atomic_wait_address_v(const __detail::__platform_wait_t* __addr,
>>                           __detail::__platform_wait_t __old,
>> -                         int __order)
>> +                         int __order, bool __bare_wait = false)
>>    {
>> -    __detail::__wait_args __args{ __addr, __old, __order };
>> +    __detail::__wait_args __args{ __addr, __old, __order, __bare_wait };
>>      // C++26 will not ignore the return value here
>>      __detail::__wait_impl(__addr, __args);
>>    }
>> diff --git a/libstdc++-v3/include/bits/semaphore_base.h 
>> b/libstdc++-v3/include/bits/semaphore_base.h
>> index 3f7a33ccd51a..a1311e8690c7 100644
>> --- a/libstdc++-v3/include/bits/semaphore_base.h
>> +++ b/libstdc++-v3/include/bits/semaphore_base.h
>> @@ -46,23 +46,20 @@ namespace std _GLIBCXX_VISIBILITY(default)
>>  {
>>  _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>
>> -  template<bool _Platform_wait>
>> -  struct __semaphore_base
>> +  struct __semaphore_impl
>>    {
>> -    using __count_type = __conditional_t<_Platform_wait,
>> -                                        __detail::__platform_wait_t,
>> -                                        ptrdiff_t>;
>> +    using __count_type = ptrdiff_t;
>>
>>      static constexpr ptrdiff_t _S_max
>>        = __gnu_cxx::__int_traits<__count_type>::__max;
>>
>>      constexpr explicit
>> -    __semaphore_base(__count_type __count) noexcept
>> +    __semaphore_impl(__count_type __count) noexcept
>>      : _M_counter(__count)
>>      { }
>>
>> -    __semaphore_base(const __semaphore_base&) = delete;
>> -    __semaphore_base& operator=(const __semaphore_base&) = delete;
>> +    __semaphore_impl(const __semaphore_impl&) = delete;
>> +    __semaphore_impl& operator=(const __semaphore_impl&) = delete;
>>
>>      // Load the current counter value.
>>      _GLIBCXX_ALWAYS_INLINE __count_type
>> @@ -71,8 +68,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>
>>      // Try to acquire the semaphore (i.e. decrement the counter).
>>      // Returns false if the current counter is zero, or if another thread
>> -    // decrements the value first. In the latter case, __cur is set to the
>> -    // new value.
>> +    // changes the value first. In the latter case, __cur is set to the new
>> +    // value.
>>      _GLIBCXX_ALWAYS_INLINE bool
>>      _M_do_try_acquire(__count_type& __cur) noexcept
>>      {
>> @@ -85,24 +82,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>                                                     memory_order::relaxed);
>>      }
>>
>> +    // Keep trying to acquire the semaphore in a loop until it succeeds.
>>      void
>>      _M_acquire() noexcept
>>      {
>> -      auto const __vfn = [this]{ return _M_get_current(); };
>> -      auto __val = __vfn();
>> -      auto const __pred = [&__val](__count_type __cur) {
>> -       if (__cur > 0)
>> -         {
>> -           __val = __cur;
>> -           return true;
>> -         }
>> -       return false;
>> -      };
>> -      while (!_M_do_try_acquire(__val))
>> -       if (__val == 0)
>> +      auto __vfn = [this]{ return _M_get_current(); };
>> +      _Can_acquire __pred{__vfn()};
>> +      while (!_M_do_try_acquire(__pred._M_val))
>> +       if (__pred._M_val == 0)
>>           std::__atomic_wait_address(&_M_counter, __pred, __vfn, true);
>>      }
>>
>> +    // Try to acquire the semaphore, retrying a small number of times
>> +    // in case of contention.
>>      bool
>>      _M_try_acquire() noexcept
>>      {
>> @@ -116,23 +108,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>        bool
>>        _M_try_acquire_until(const chrono::time_point<_Clock, _Duration>& 
>> __atime) noexcept
>>        {
>> -       auto const __vfn = [this]{ return _M_get_current(); };
>> -       auto __val = __vfn();
>> -       auto const __pred = [&__val](__count_type __cur) {
>> -         if (__cur > 0)
>> -           {
>> -             __val = __cur;
>> -             return true;
>> -           }
>> -         return false;
>> -       };
>> -       while (!_M_do_try_acquire(__val))
>> -         if (__val == 0)
>> -           {
>> -             if (!std::__atomic_wait_address_until(&_M_counter, __pred,
>> -                                                   __vfn, __atime, true))
>> -               return false; // timed out
>> -           }
>> +       auto __vfn = [this]{ return _M_get_current(); };
>> +       _Can_acquire __pred{__vfn()};
>> +       while (!_M_do_try_acquire(__pred._M_val))
>> +         if (__pred._M_val == 0)
>> +           if (!std::__atomic_wait_address_until(&_M_counter, __pred,
>> +                                                 __vfn, __atime, true))
>> +             return false; // timed out
>>         return true;
>>        }
>>
>> @@ -140,23 +122,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>>        bool
>>        _M_try_acquire_for(const chrono::duration<_Rep, _Period>& __rtime) 
>> noexcept
>>        {
>> -       auto const __vfn = [this]{ return _M_get_current(); };
>> -       auto __val = __vfn();
>> -       auto const __pred = [&__val](__count_type __cur) {
>> -         if (__cur > 0)
>> -           {
>> -             __val = __cur;
>> -             return true;
>> -           }
>> -         return false;
>> -       };
>> -       while (!_M_do_try_acquire(__val))
>> -         if (__val == 0)
>> -           {
>> -             if (!std::__atomic_wait_address_for(&_M_counter, __pred,
>> -                                                 __vfn, __rtime, true))
>> -               return false; // timed out
>> -           }
>> +       auto __vfn = [this]{ return _M_get_current(); };
>> +       _Can_acquire __pred{__vfn()};
>> +       while (!_M_do_try_acquire(__pred._M_val))
>> +         if (__pred._M_val == 0)
>
> I think we could have some named member funciton here,
> like __pred.should_wait(). But not strictly necessary.


I like the suggestion. How about naming the predicate struct differently:

    struct _Available
    {
      __count_type _M_val; // Cache of the last value loaded from _M_counter.

      // Argument should be the latest value of the counter.
      // Returns true (and caches the value) if it's non-zero and so can be
      // decremented to acquire the semaphore. Returns false otherwise.
      bool operator()(__count_type __cur) noexcept
      {
        if (__cur == 0)
          return false;
        _M_val = __cur;
        return true;
      }

      // Returns true if the last-known value is non-zero.
      explicit operator bool() const noexcept { _M_val > 0; }
    };

And then it would be used like this:

    void
    _M_acquire() noexcept
    {
      auto __vfn = [this]{ return _M_get_current(); };
      _Available __is_available{__vfn()};
      while (!_M_do_try_acquire(__is_available._M_val))
        if (!__is_available)
          std::__atomic_wait_address(&_M_counter, __is_available, __vfn, true);
    }

So __is_available(vfn()) tests whether it's available based on the new
value returned by vfn(), which is the predicate used by the atomic
waiting functions, and (bool)__is_available tests whether it's
available based on the cached value, which is what we use to decide
whether to call the waiting function.

Too cute? I think it makes the logic quite readable.

Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Reply via email to