Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Tomasz Kaminski Thu, 05 Jun 2025 07:42:45 -0700

On Thu, Jun 5, 2025 at 12:41 PM Jonathan Wakely <jwak...@redhat.com> wrote:


> On Thu, 5 Jun 2025 at 10:00, Tomasz Kaminski <tkami...@redhat.com> wrote:
> >
> >
> >
> > On Thu, Jun 5, 2025 at 12:06 AM Jonathan Wakely <jwak...@redhat.com>
> wrote:
> >>
> >> Rename __semaphore_base to __semaphore_impl, because it's not used as a
> >> base class. Replace the three identical lambda expressions with a named
> >> class, __semaphore_impl::_Can_acquire, which stores the most recent
> >> value of the counter as a data member.
> >>
> >> Add a new __platform_semaphore_impl class template which uses
> >> __platform_wait_t for the counter and uses more efficient atomic waits
> >> for the acquire functions, when __platform_wait is available. For a
> >> binary semaphore some members are further optimized because we know the
> >> counter can only be zero or one.
> >>
> >> Also add a bare wait flag to __atomic_wait_address_v, for consistency
> >> with __atomic_wait_address_until_v and __atomic_wait_address_for_v and
> >> to allow semaphores to use it without the redundant overhead of tracking
> >> waiters.
> >>
> >> libstdc++-v3/ChangeLog:
> >>
> >>         * include/bits/atomic_wait.h (__atomic_wait_address_v): Add bare
> >>         wait flag.
> >>         * include/bits/semaphore_base.h (__semaphore_base): Rename to
> >>         __semaphore_impl. Replace local variable and predicate lambda
> >>         with _Can_acquire struct.
> >>         (__platform_semaphore_impl): New class template.
> >>         (__semaphore_impl): Remove alias template.
> >>         * include/std/semaphore:
> >> ---
> >>
> >> This replaces the two patches:
> >>
> >> [PATCH 3/4] libstdc++: Optimize std::binary_semaphore
> >> [PATCH 4/4] libstdc++: Optimize std::counting_semaphore for futex path
> >>
> >> The problem with those patches was that the "optimized" implementations
> >> were used if _Max <= numeric_limits<__platform_wait_t>::max(), which is
> >> always true for many non-futex targets because we do:
> >>
> >> # if ATOMIC_LONG_LOCK_FREE == 2
> >>     using __platform_wait_t = unsigned long;
> >> # else
> >>     using __platform_wait_t = unsigned int;
> >> # endif
> >>
> >> The earlier patches should have been checking either
> >> _GLIBCXX_HAVE_PLATFORM_WAIT or the __platform_wait_uses_type variable
> >> template to see if an efficient (i.e. not condition_variable-based)
> >> implementation is actually supported. It's not useful to check if the
> >> semaphore's maximum value fits in the __platform_wait_t type if that
> >> type doesn't have fast wait/wake operations like a futex.
> >>
> >> The new proposal in this patch only selects the optimized
> >> implementations if they can be implemented using __platform_wait.  The
> >> __semaphore_impl always uses the portable atomic wait API (using an
> >> accessor function and predicate). As Tomasz suggested, a new class
> >> template provides the optimized implementation for both
> >> counting_semaphore and binary_semaphore, differing only in
> >> _M_get_current and _M_try_acquire.
> >>
> >> Tested x86_64-linux and sparc-solaris.
> >
> > LGTM, one major suggestion to encapsulate the __pred._M_cur == 0 into
> named member
> > function.
> > I have also checked that  __atomic_wait_address_until takes __pred by
> reference,
> > and does not make copies, so encapsulating __cur there is OK.
> >>
> >>
> >>  libstdc++-v3/include/bits/atomic_wait.h    |   6 +-
> >>  libstdc++-v3/include/bits/semaphore_base.h | 241 +++++++++++++++------
> >>  libstdc++-v3/include/std/semaphore         |   5 +-
> >>  3 files changed, 186 insertions(+), 66 deletions(-)
> >>
> >> diff --git a/libstdc++-v3/include/bits/atomic_wait.h
> b/libstdc++-v3/include/bits/atomic_wait.h
> >> index 815726c16ccb..d32fda1c2435 100644
> >> --- a/libstdc++-v3/include/bits/atomic_wait.h
> >> +++ b/libstdc++-v3/include/bits/atomic_wait.h
> >> @@ -249,12 +249,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>        // C++26 will return __val
> >>      }
> >>
> >> +  // Wait on __addr while *__addr == __old is true.
> >> +  // Cannot be used for proxy waits.
> >>    inline void
> >>    __atomic_wait_address_v(const __detail::__platform_wait_t* __addr,
> >>                           __detail::__platform_wait_t __old,
> >> -                         int __order)
> >> +                         int __order, bool __bare_wait = false)
> >>    {
> >> -    __detail::__wait_args __args{ __addr, __old, __order };
> >> +    __detail::__wait_args __args{ __addr, __old, __order, __bare_wait
> };
> >>      // C++26 will not ignore the return value here
> >>      __detail::__wait_impl(__addr, __args);
> >>    }
> >> diff --git a/libstdc++-v3/include/bits/semaphore_base.h
> b/libstdc++-v3/include/bits/semaphore_base.h
> >> index 3f7a33ccd51a..a1311e8690c7 100644
> >> --- a/libstdc++-v3/include/bits/semaphore_base.h
> >> +++ b/libstdc++-v3/include/bits/semaphore_base.h
> >> @@ -46,23 +46,20 @@ namespace std _GLIBCXX_VISIBILITY(default)
> >>  {
> >>  _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>
> >> -  template<bool _Platform_wait>
> >> -  struct __semaphore_base
> >> +  struct __semaphore_impl
> >>    {
> >> -    using __count_type = __conditional_t<_Platform_wait,
> >> -                                        __detail::__platform_wait_t,
> >> -                                        ptrdiff_t>;
> >> +    using __count_type = ptrdiff_t;
> >>
> >>      static constexpr ptrdiff_t _S_max
> >>        = __gnu_cxx::__int_traits<__count_type>::__max;
> >>
> >>      constexpr explicit
> >> -    __semaphore_base(__count_type __count) noexcept
> >> +    __semaphore_impl(__count_type __count) noexcept
> >>      : _M_counter(__count)
> >>      { }
> >>
> >> -    __semaphore_base(const __semaphore_base&) = delete;
> >> -    __semaphore_base& operator=(const __semaphore_base&) = delete;
> >> +    __semaphore_impl(const __semaphore_impl&) = delete;
> >> +    __semaphore_impl& operator=(const __semaphore_impl&) = delete;
> >>
> >>      // Load the current counter value.
> >>      _GLIBCXX_ALWAYS_INLINE __count_type
> >> @@ -71,8 +68,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>
> >>      // Try to acquire the semaphore (i.e. decrement the counter).
> >>      // Returns false if the current counter is zero, or if another
> thread
> >> -    // decrements the value first. In the latter case, __cur is set to
> the
> >> -    // new value.
> >> +    // changes the value first. In the latter case, __cur is set to
> the new
> >> +    // value.
> >>      _GLIBCXX_ALWAYS_INLINE bool
> >>      _M_do_try_acquire(__count_type& __cur) noexcept
> >>      {
> >> @@ -85,24 +82,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>
>  memory_order::relaxed);
> >>      }
> >>
> >> +    // Keep trying to acquire the semaphore in a loop until it
> succeeds.
> >>      void
> >>      _M_acquire() noexcept
> >>      {
> >> -      auto const __vfn = [this]{ return _M_get_current(); };
> >> -      auto __val = __vfn();
> >> -      auto const __pred = [&__val](__count_type __cur) {
> >> -       if (__cur > 0)
> >> -         {
> >> -           __val = __cur;
> >> -           return true;
> >> -         }
> >> -       return false;
> >> -      };
> >> -      while (!_M_do_try_acquire(__val))
> >> -       if (__val == 0)
> >> +      auto __vfn = [this]{ return _M_get_current(); };
> >> +      _Can_acquire __pred{__vfn()};
> >> +      while (!_M_do_try_acquire(__pred._M_val))
> >> +       if (__pred._M_val == 0)
> >>           std::__atomic_wait_address(&_M_counter, __pred, __vfn, true);
> >>      }
> >>
> >> +    // Try to acquire the semaphore, retrying a small number of times
> >> +    // in case of contention.
> >>      bool
> >>      _M_try_acquire() noexcept
> >>      {
> >> @@ -116,23 +108,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>        bool
> >>        _M_try_acquire_until(const chrono::time_point<_Clock,
> _Duration>& __atime) noexcept
> >>        {
> >> -       auto const __vfn = [this]{ return _M_get_current(); };
> >> -       auto __val = __vfn();
> >> -       auto const __pred = [&__val](__count_type __cur) {
> >> -         if (__cur > 0)
> >> -           {
> >> -             __val = __cur;
> >> -             return true;
> >> -           }
> >> -         return false;
> >> -       };
> >> -       while (!_M_do_try_acquire(__val))
> >> -         if (__val == 0)
> >> -           {
> >> -             if (!std::__atomic_wait_address_until(&_M_counter, __pred,
> >> -                                                   __vfn, __atime,
> true))
> >> -               return false; // timed out
> >> -           }
> >> +       auto __vfn = [this]{ return _M_get_current(); };
> >> +       _Can_acquire __pred{__vfn()};
> >> +       while (!_M_do_try_acquire(__pred._M_val))
> >> +         if (__pred._M_val == 0)
> >> +           if (!std::__atomic_wait_address_until(&_M_counter, __pred,
> >> +                                                 __vfn, __atime, true))
> >> +             return false; // timed out
> >>         return true;
> >>        }
> >>
> >> @@ -140,23 +122,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>        bool
> >>        _M_try_acquire_for(const chrono::duration<_Rep, _Period>&
> __rtime) noexcept
> >>        {
> >> -       auto const __vfn = [this]{ return _M_get_current(); };
> >> -       auto __val = __vfn();
> >> -       auto const __pred = [&__val](__count_type __cur) {
> >> -         if (__cur > 0)
> >> -           {
> >> -             __val = __cur;
> >> -             return true;
> >> -           }
> >> -         return false;
> >> -       };
> >> -       while (!_M_do_try_acquire(__val))
> >> -         if (__val == 0)
> >> -           {
> >> -             if (!std::__atomic_wait_address_for(&_M_counter, __pred,
> >> -                                                 __vfn, __rtime, true))
> >> -               return false; // timed out
> >> -           }
> >> +       auto __vfn = [this]{ return _M_get_current(); };
> >> +       _Can_acquire __pred{__vfn()};
> >> +       while (!_M_do_try_acquire(__pred._M_val))
> >> +         if (__pred._M_val == 0)
> >
> > I think we could have some named member funciton here,
> > like __pred.should_wait(). But not strictly necessary.
>
> I like the suggestion. How about naming the predicate struct differently:
>
>     struct _Available
>     {
>       __count_type _M_val; // Cache of the last value loaded from
> _M_counter.
>
>       // Argument should be the latest value of the counter.
>       // Returns true (and caches the value) if it's non-zero and so can be
>       // decremented to acquire the semaphore. Returns false otherwise.
>       bool operator()(__count_type __cur) noexcept
>       {
>         if (__cur == 0)
>           return false;
>         _M_val = __cur;
>         return true;
>       }
>
>       // Returns true if the last-known value is non-zero.
>       explicit operator bool() const noexcept { _M_val > 0; }
>     };
>
> And then it would be used like this:
>
>     void
>     _M_acquire() noexcept
>     {
>       auto __vfn = [this]{ return _M_get_current(); };
>       _Available __is_available{__vfn()};
>       while (!_M_do_try_acquire(__is_available._M_val))
>         if (!__is_available)
>
I think the bool conversion here is too cute, the two uses look "simialar".
Maye rename use a named member like "current()" or "last()", or
"is_available.currently()".

>           std::__atomic_wait_address(&_M_counter, __is_available, __vfn,
> true);
>     }
>
> So __is_available(vfn()) tests whether it's available based on the new
> value returned by vfn(), which is the predicate used by the atomic
> waiting functions, and (bool)__is_available tests whether it's
> available based on the cached value, which is what we use to decide
> whether to call the waiting function.
>
> Too cute? I think it makes the logic quite readable.
>
>

Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Reply via email to