Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Jonathan Wakely Thu, 05 Jun 2025 14:10:01 -0700

On Thu, 5 Jun 2025 at 18:58, Tomasz Kaminski <tkami...@redhat.com> wrote:
>
>
>
> On Thu, Jun 5, 2025 at 6:44 PM Jonathan Wakely <jwak...@redhat.com> wrote:
>>
>> On Thu, 5 Jun 2025 at 15:42, Tomasz Kaminski <tkami...@redhat.com> wrote:
>> >
>> >
>> >
>> > On Thu, Jun 5, 2025 at 12:41 PM Jonathan Wakely <jwak...@redhat.com> wrote:
>> >>
>> >> On Thu, 5 Jun 2025 at 10:00, Tomasz Kaminski <tkami...@redhat.com> wrote:
>> >> >
>> >> >
>> >> >
>> >> > On Thu, Jun 5, 2025 at 12:06 AM Jonathan Wakely <jwak...@redhat.com> 
>> >> > wrote:
>> >> >>
>> >> >> Rename __semaphore_base to __semaphore_impl, because it's not used as a
>> >> >> base class. Replace the three identical lambda expressions with a named
>> >> >> class, __semaphore_impl::_Can_acquire, which stores the most recent
>> >> >> value of the counter as a data member.
>> >> >>
>> >> >> Add a new __platform_semaphore_impl class template which uses
>> >> >> __platform_wait_t for the counter and uses more efficient atomic waits
>> >> >> for the acquire functions, when __platform_wait is available. For a
>> >> >> binary semaphore some members are further optimized because we know the
>> >> >> counter can only be zero or one.
>> >> >>
>> >> >> Also add a bare wait flag to __atomic_wait_address_v, for consistency
>> >> >> with __atomic_wait_address_until_v and __atomic_wait_address_for_v and
>> >> >> to allow semaphores to use it without the redundant overhead of 
>> >> >> tracking
>> >> >> waiters.
>> >> >>
>> >> >> libstdc++-v3/ChangeLog:
>> >> >>
>> >> >>         * include/bits/atomic_wait.h (__atomic_wait_address_v): Add 
>> >> >> bare
>> >> >>         wait flag.
>> >> >>         * include/bits/semaphore_base.h (__semaphore_base): Rename to
>> >> >>         __semaphore_impl. Replace local variable and predicate lambda
>> >> >>         with _Can_acquire struct.
>> >> >>         (__platform_semaphore_impl): New class template.
>> >> >>         (__semaphore_impl): Remove alias template.
>> >> >>         * include/std/semaphore:
>> >> >> ---
>> >> >>
>> >> >> This replaces the two patches:
>> >> >>
>> >> >> [PATCH 3/4] libstdc++: Optimize std::binary_semaphore
>> >> >> [PATCH 4/4] libstdc++: Optimize std::counting_semaphore for futex path
>> >> >>
>> >> >> The problem with those patches was that the "optimized" implementations
>> >> >> were used if _Max <= numeric_limits<__platform_wait_t>::max(), which is
>> >> >> always true for many non-futex targets because we do:
>> >> >>
>> >> >> # if ATOMIC_LONG_LOCK_FREE == 2
>> >> >>     using __platform_wait_t = unsigned long;
>> >> >> # else
>> >> >>     using __platform_wait_t = unsigned int;
>> >> >> # endif
>> >> >>
>> >> >> The earlier patches should have been checking either
>> >> >> _GLIBCXX_HAVE_PLATFORM_WAIT or the __platform_wait_uses_type variable
>> >> >> template to see if an efficient (i.e. not condition_variable-based)
>> >> >> implementation is actually supported. It's not useful to check if the
>> >> >> semaphore's maximum value fits in the __platform_wait_t type if that
>> >> >> type doesn't have fast wait/wake operations like a futex.
>> >> >>
>> >> >> The new proposal in this patch only selects the optimized
>> >> >> implementations if they can be implemented using __platform_wait.  The
>> >> >> __semaphore_impl always uses the portable atomic wait API (using an
>> >> >> accessor function and predicate). As Tomasz suggested, a new class
>> >> >> template provides the optimized implementation for both
>> >> >> counting_semaphore and binary_semaphore, differing only in
>> >> >> _M_get_current and _M_try_acquire.
>> >> >>
>> >> >> Tested x86_64-linux and sparc-solaris.
>> >> >
>> >> > LGTM, one major suggestion to encapsulate the __pred._M_cur == 0 into 
>> >> > named member
>> >> > function.
>> >> > I have also checked that  __atomic_wait_address_until takes __pred by 
>> >> > reference,
>> >> > and does not make copies, so encapsulating __cur there is OK.
>> >> >>
>> >> >>
>> >> >>  libstdc++-v3/include/bits/atomic_wait.h    |   6 +-
>> >> >>  libstdc++-v3/include/bits/semaphore_base.h | 241 +++++++++++++++------
>> >> >>  libstdc++-v3/include/std/semaphore         |   5 +-
>> >> >>  3 files changed, 186 insertions(+), 66 deletions(-)
>> >> >>
>> >> >> diff --git a/libstdc++-v3/include/bits/atomic_wait.h 
>> >> >> b/libstdc++-v3/include/bits/atomic_wait.h
>> >> >> index 815726c16ccb..d32fda1c2435 100644
>> >> >> --- a/libstdc++-v3/include/bits/atomic_wait.h
>> >> >> +++ b/libstdc++-v3/include/bits/atomic_wait.h
>> >> >> @@ -249,12 +249,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >> >>        // C++26 will return __val
>> >> >>      }
>> >> >>
>> >> >> +  // Wait on __addr while *__addr == __old is true.
>> >> >> +  // Cannot be used for proxy waits.
>> >> >>    inline void
>> >> >>    __atomic_wait_address_v(const __detail::__platform_wait_t* __addr,
>> >> >>                           __detail::__platform_wait_t __old,
>> >> >> -                         int __order)
>> >> >> +                         int __order, bool __bare_wait = false)
>> >> >>    {
>> >> >> -    __detail::__wait_args __args{ __addr, __old, __order };
>> >> >> +    __detail::__wait_args __args{ __addr, __old, __order, __bare_wait 
>> >> >> };
>> >> >>      // C++26 will not ignore the return value here
>> >> >>      __detail::__wait_impl(__addr, __args);
>> >> >>    }
>> >> >> diff --git a/libstdc++-v3/include/bits/semaphore_base.h 
>> >> >> b/libstdc++-v3/include/bits/semaphore_base.h
>> >> >> index 3f7a33ccd51a..a1311e8690c7 100644
>> >> >> --- a/libstdc++-v3/include/bits/semaphore_base.h
>> >> >> +++ b/libstdc++-v3/include/bits/semaphore_base.h
>> >> >> @@ -46,23 +46,20 @@ namespace std _GLIBCXX_VISIBILITY(default)
>> >> >>  {
>> >> >>  _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >> >>
>> >> >> -  template<bool _Platform_wait>
>> >> >> -  struct __semaphore_base
>> >> >> +  struct __semaphore_impl
>> >> >>    {
>> >> >> -    using __count_type = __conditional_t<_Platform_wait,
>> >> >> -                                        __detail::__platform_wait_t,
>> >> >> -                                        ptrdiff_t>;
>> >> >> +    using __count_type = ptrdiff_t;
>> >> >>
>> >> >>      static constexpr ptrdiff_t _S_max
>> >> >>        = __gnu_cxx::__int_traits<__count_type>::__max;
>> >> >>
>> >> >>      constexpr explicit
>> >> >> -    __semaphore_base(__count_type __count) noexcept
>> >> >> +    __semaphore_impl(__count_type __count) noexcept
>> >> >>      : _M_counter(__count)
>> >> >>      { }
>> >> >>
>> >> >> -    __semaphore_base(const __semaphore_base&) = delete;
>> >> >> -    __semaphore_base& operator=(const __semaphore_base&) = delete;
>> >> >> +    __semaphore_impl(const __semaphore_impl&) = delete;
>> >> >> +    __semaphore_impl& operator=(const __semaphore_impl&) = delete;
>> >> >>
>> >> >>      // Load the current counter value.
>> >> >>      _GLIBCXX_ALWAYS_INLINE __count_type
>> >> >> @@ -71,8 +68,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >> >>
>> >> >>      // Try to acquire the semaphore (i.e. decrement the counter).
>> >> >>      // Returns false if the current counter is zero, or if another 
>> >> >> thread
>> >> >> -    // decrements the value first. In the latter case, __cur is set 
>> >> >> to the
>> >> >> -    // new value.
>> >> >> +    // changes the value first. In the latter case, __cur is set to 
>> >> >> the new
>> >> >> +    // value.
>> >> >>      _GLIBCXX_ALWAYS_INLINE bool
>> >> >>      _M_do_try_acquire(__count_type& __cur) noexcept
>> >> >>      {
>> >> >> @@ -85,24 +82,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >> >>                                                     
>> >> >> memory_order::relaxed);
>> >> >>      }
>> >> >>
>> >> >> +    // Keep trying to acquire the semaphore in a loop until it 
>> >> >> succeeds.
>> >> >>      void
>> >> >>      _M_acquire() noexcept
>> >> >>      {
>> >> >> -      auto const __vfn = [this]{ return _M_get_current(); };
>> >> >> -      auto __val = __vfn();
>> >> >> -      auto const __pred = [&__val](__count_type __cur) {
>> >> >> -       if (__cur > 0)
>> >> >> -         {
>> >> >> -           __val = __cur;
>> >> >> -           return true;
>> >> >> -         }
>> >> >> -       return false;
>> >> >> -      };
>> >> >> -      while (!_M_do_try_acquire(__val))
>> >> >> -       if (__val == 0)
>> >> >> +      auto __vfn = [this]{ return _M_get_current(); };
>> >> >> +      _Can_acquire __pred{__vfn()};
>> >> >> +      while (!_M_do_try_acquire(__pred._M_val))
>> >> >> +       if (__pred._M_val == 0)
>> >> >>           std::__atomic_wait_address(&_M_counter, __pred, __vfn, true);
>> >> >>      }
>> >> >>
>> >> >> +    // Try to acquire the semaphore, retrying a small number of times
>> >> >> +    // in case of contention.
>> >> >>      bool
>> >> >>      _M_try_acquire() noexcept
>> >> >>      {
>> >> >> @@ -116,23 +108,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >> >>        bool
>> >> >>        _M_try_acquire_until(const chrono::time_point<_Clock, 
>> >> >> _Duration>& __atime) noexcept
>> >> >>        {
>> >> >> -       auto const __vfn = [this]{ return _M_get_current(); };
>> >> >> -       auto __val = __vfn();
>> >> >> -       auto const __pred = [&__val](__count_type __cur) {
>> >> >> -         if (__cur > 0)
>> >> >> -           {
>> >> >> -             __val = __cur;
>> >> >> -             return true;
>> >> >> -           }
>> >> >> -         return false;
>> >> >> -       };
>> >> >> -       while (!_M_do_try_acquire(__val))
>> >> >> -         if (__val == 0)
>> >> >> -           {
>> >> >> -             if (!std::__atomic_wait_address_until(&_M_counter, 
>> >> >> __pred,
>> >> >> -                                                   __vfn, __atime, 
>> >> >> true))
>> >> >> -               return false; // timed out
>> >> >> -           }
>> >> >> +       auto __vfn = [this]{ return _M_get_current(); };
>> >> >> +       _Can_acquire __pred{__vfn()};
>> >> >> +       while (!_M_do_try_acquire(__pred._M_val))
>> >> >> +         if (__pred._M_val == 0)
>> >> >> +           if (!std::__atomic_wait_address_until(&_M_counter, __pred,
>> >> >> +                                                 __vfn, __atime, 
>> >> >> true))
>> >> >> +             return false; // timed out
>> >> >>         return true;
>> >> >>        }
>> >> >>
>> >> >> @@ -140,23 +122,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
>> >> >>        bool
>> >> >>        _M_try_acquire_for(const chrono::duration<_Rep, _Period>& 
>> >> >> __rtime) noexcept
>> >> >>        {
>> >> >> -       auto const __vfn = [this]{ return _M_get_current(); };
>> >> >> -       auto __val = __vfn();
>> >> >> -       auto const __pred = [&__val](__count_type __cur) {
>> >> >> -         if (__cur > 0)
>> >> >> -           {
>> >> >> -             __val = __cur;
>> >> >> -             return true;
>> >> >> -           }
>> >> >> -         return false;
>> >> >> -       };
>> >> >> -       while (!_M_do_try_acquire(__val))
>> >> >> -         if (__val == 0)
>> >> >> -           {
>> >> >> -             if (!std::__atomic_wait_address_for(&_M_counter, __pred,
>> >> >> -                                                 __vfn, __rtime, 
>> >> >> true))
>> >> >> -               return false; // timed out
>> >> >> -           }
>> >> >> +       auto __vfn = [this]{ return _M_get_current(); };
>> >> >> +       _Can_acquire __pred{__vfn()};
>> >> >> +       while (!_M_do_try_acquire(__pred._M_val))
>> >> >> +         if (__pred._M_val == 0)
>> >> >
>> >> > I think we could have some named member funciton here,
>> >> > like __pred.should_wait(). But not strictly necessary.
>> >>
>> >> I like the suggestion. How about naming the predicate struct differently:
>> >>
>> >>     struct _Available
>> >>     {
>> >>       __count_type _M_val; // Cache of the last value loaded from 
>> >> _M_counter.
>> >>
>> >>       // Argument should be the latest value of the counter.
>> >>       // Returns true (and caches the value) if it's non-zero and so can 
>> >> be
>> >>       // decremented to acquire the semaphore. Returns false otherwise.
>> >>       bool operator()(__count_type __cur) noexcept
>> >>       {
>> >>         if (__cur == 0)
>> >>           return false;
>> >>         _M_val = __cur;
>> >>         return true;
>> >>       }
>> >>
>> >>       // Returns true if the last-known value is non-zero.
>> >>       explicit operator bool() const noexcept { _M_val > 0; }
>> >>     };
>> >>
>> >> And then it would be used like this:
>> >>
>> >>     void
>> >>     _M_acquire() noexcept
>> >>     {
>> >>       auto __vfn = [this]{ return _M_get_current(); };
>> >>       _Available __is_available{__vfn()};
>> >>       while (!_M_do_try_acquire(__is_available._M_val))
>> >>         if (!__is_available)
>> >
>> > I think the bool conversion here is too cute, the two uses look "simialar".
>>
>> I don't see a problem with that, they *are* similar. One tests the
>> last known value, the other updates the last-known value and then
>> tests it.
>>
>> The operator() function could be implemented as:
>>
>> bool operator()(count_type cur)
>> {
>>   _M_val = cur;
>>   return static_cast<bool>(*this);
>> }
>>
>> That would be equivalent, but it's implemented slightly differently to
>> avoid a redundant store (don't bother to store zero, because it must
>> already be zero if the predicate was being tested by the waiting
>> function).
>>
>> I did consider naming the operator bool() as operator()() instead, so
>> the callable could be used with one argument (to update the cache and
>> test it) and with no arguments (to test the previously cached value).
>
> I think the root of my concern was that, if is_available is used as a 
> variable, then it looks like
> we are calling __atomic_wait_address overload that accepts a value.


Ah yes, I see.

> Having operator()(),
> and using it uniformly as a functor/predicate alleviates this.

OK great.

One last question then:

>> Then it would be used like this:
>>
>>     void
>>     _M_acquire() noexcept
>>     {
>>       auto __vfn = [this]{ return _M_get_current(); };
>>       _Available __is_available{__vfn()};

Would this initialization be better if written like this:

    _Available __is_available{ ._M_val = __vfn() };

I think it makes the use of _M_val on the next line a bit clearer:

>>       while (!_M_do_try_acquire(__is_available._M_val))

Although maybe it's not a big deal, because the definition of
_Available and its _M_val member are only a few lines further down in
the same file, not far away in some other header.

>>         if (!__is_available())
>>           std::__atomic_wait_address(&_M_counter, __is_available, __vfn, 
>> true);
>>     }
>>
>> > Maye rename use a named member like "current()" or "last()", or 
>> > "is_available.currently()".
>> >>
>> >>           std::__atomic_wait_address(&_M_counter, __is_available, __vfn, 
>> >> true);
>> >>     }
>> >>
>> >> So __is_available(vfn()) tests whether it's available based on the new
>> >> value returned by vfn(), which is the predicate used by the atomic
>> >> waiting functions, and (bool)__is_available tests whether it's
>> >> available based on the cached value, which is what we use to decide
>> >> whether to call the waiting function.
>> >>
>> >> Too cute? I think it makes the logic quite readable.
>> >>
>>

Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Reply via email to