Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Tomasz Kaminski Thu, 05 Jun 2025 23:56:48 -0700

On Thu, Jun 5, 2025 at 11:09 PM Jonathan Wakely <jwak...@redhat.com> wrote:


> On Thu, 5 Jun 2025 at 18:58, Tomasz Kaminski <tkami...@redhat.com> wrote:
> >
> >
> >
> > On Thu, Jun 5, 2025 at 6:44 PM Jonathan Wakely <jwak...@redhat.com>
> wrote:
> >>
> >> On Thu, 5 Jun 2025 at 15:42, Tomasz Kaminski <tkami...@redhat.com>
> wrote:
> >> >
> >> >
> >> >
> >> > On Thu, Jun 5, 2025 at 12:41 PM Jonathan Wakely <jwak...@redhat.com>
> wrote:
> >> >>
> >> >> On Thu, 5 Jun 2025 at 10:00, Tomasz Kaminski <tkami...@redhat.com>
> wrote:
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Thu, Jun 5, 2025 at 12:06 AM Jonathan Wakely <
> jwak...@redhat.com> wrote:
> >> >> >>
> >> >> >> Rename __semaphore_base to __semaphore_impl, because it's not
> used as a
> >> >> >> base class. Replace the three identical lambda expressions with a
> named
> >> >> >> class, __semaphore_impl::_Can_acquire, which stores the most
> recent
> >> >> >> value of the counter as a data member.
> >> >> >>
> >> >> >> Add a new __platform_semaphore_impl class template which uses
> >> >> >> __platform_wait_t for the counter and uses more efficient atomic
> waits
> >> >> >> for the acquire functions, when __platform_wait is available. For
> a
> >> >> >> binary semaphore some members are further optimized because we
> know the
> >> >> >> counter can only be zero or one.
> >> >> >>
> >> >> >> Also add a bare wait flag to __atomic_wait_address_v, for
> consistency
> >> >> >> with __atomic_wait_address_until_v and
> __atomic_wait_address_for_v and
> >> >> >> to allow semaphores to use it without the redundant overhead of
> tracking
> >> >> >> waiters.
> >> >> >>
> >> >> >> libstdc++-v3/ChangeLog:
> >> >> >>
> >> >> >>         * include/bits/atomic_wait.h (__atomic_wait_address_v):
> Add bare
> >> >> >>         wait flag.
> >> >> >>         * include/bits/semaphore_base.h (__semaphore_base):
> Rename to
> >> >> >>         __semaphore_impl. Replace local variable and predicate
> lambda
> >> >> >>         with _Can_acquire struct.
> >> >> >>         (__platform_semaphore_impl): New class template.
> >> >> >>         (__semaphore_impl): Remove alias template.
> >> >> >>         * include/std/semaphore:
> >> >> >> ---
> >> >> >>
> >> >> >> This replaces the two patches:
> >> >> >>
> >> >> >> [PATCH 3/4] libstdc++: Optimize std::binary_semaphore
> >> >> >> [PATCH 4/4] libstdc++: Optimize std::counting_semaphore for futex
> path
> >> >> >>
> >> >> >> The problem with those patches was that the "optimized"
> implementations
> >> >> >> were used if _Max <= numeric_limits<__platform_wait_t>::max(),
> which is
> >> >> >> always true for many non-futex targets because we do:
> >> >> >>
> >> >> >> # if ATOMIC_LONG_LOCK_FREE == 2
> >> >> >>     using __platform_wait_t = unsigned long;
> >> >> >> # else
> >> >> >>     using __platform_wait_t = unsigned int;
> >> >> >> # endif
> >> >> >>
> >> >> >> The earlier patches should have been checking either
> >> >> >> _GLIBCXX_HAVE_PLATFORM_WAIT or the __platform_wait_uses_type
> variable
> >> >> >> template to see if an efficient (i.e. not
> condition_variable-based)
> >> >> >> implementation is actually supported. It's not useful to check if
> the
> >> >> >> semaphore's maximum value fits in the __platform_wait_t type if
> that
> >> >> >> type doesn't have fast wait/wake operations like a futex.
> >> >> >>
> >> >> >> The new proposal in this patch only selects the optimized
> >> >> >> implementations if they can be implemented using
> __platform_wait.  The
> >> >> >> __semaphore_impl always uses the portable atomic wait API (using
> an
> >> >> >> accessor function and predicate). As Tomasz suggested, a new class
> >> >> >> template provides the optimized implementation for both
> >> >> >> counting_semaphore and binary_semaphore, differing only in
> >> >> >> _M_get_current and _M_try_acquire.
> >> >> >>
> >> >> >> Tested x86_64-linux and sparc-solaris.
> >> >> >
> >> >> > LGTM, one major suggestion to encapsulate the __pred._M_cur == 0
> into named member
> >> >> > function.
> >> >> > I have also checked that  __atomic_wait_address_until takes __pred
> by reference,
> >> >> > and does not make copies, so encapsulating __cur there is OK.
> >> >> >>
> >> >> >>
> >> >> >>  libstdc++-v3/include/bits/atomic_wait.h    |   6 +-
> >> >> >>  libstdc++-v3/include/bits/semaphore_base.h | 241
> +++++++++++++++------
> >> >> >>  libstdc++-v3/include/std/semaphore         |   5 +-
> >> >> >>  3 files changed, 186 insertions(+), 66 deletions(-)
> >> >> >>
> >> >> >> diff --git a/libstdc++-v3/include/bits/atomic_wait.h
> b/libstdc++-v3/include/bits/atomic_wait.h
> >> >> >> index 815726c16ccb..d32fda1c2435 100644
> >> >> >> --- a/libstdc++-v3/include/bits/atomic_wait.h
> >> >> >> +++ b/libstdc++-v3/include/bits/atomic_wait.h
> >> >> >> @@ -249,12 +249,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >> >>        // C++26 will return __val
> >> >> >>      }
> >> >> >>
> >> >> >> +  // Wait on __addr while *__addr == __old is true.
> >> >> >> +  // Cannot be used for proxy waits.
> >> >> >>    inline void
> >> >> >>    __atomic_wait_address_v(const __detail::__platform_wait_t*
> __addr,
> >> >> >>                           __detail::__platform_wait_t __old,
> >> >> >> -                         int __order)
> >> >> >> +                         int __order, bool __bare_wait = false)
> >> >> >>    {
> >> >> >> -    __detail::__wait_args __args{ __addr, __old, __order };
> >> >> >> +    __detail::__wait_args __args{ __addr, __old, __order,
> __bare_wait };
> >> >> >>      // C++26 will not ignore the return value here
> >> >> >>      __detail::__wait_impl(__addr, __args);
> >> >> >>    }
> >> >> >> diff --git a/libstdc++-v3/include/bits/semaphore_base.h
> b/libstdc++-v3/include/bits/semaphore_base.h
> >> >> >> index 3f7a33ccd51a..a1311e8690c7 100644
> >> >> >> --- a/libstdc++-v3/include/bits/semaphore_base.h
> >> >> >> +++ b/libstdc++-v3/include/bits/semaphore_base.h
> >> >> >> @@ -46,23 +46,20 @@ namespace std _GLIBCXX_VISIBILITY(default)
> >> >> >>  {
> >> >> >>  _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >> >>
> >> >> >> -  template<bool _Platform_wait>
> >> >> >> -  struct __semaphore_base
> >> >> >> +  struct __semaphore_impl
> >> >> >>    {
> >> >> >> -    using __count_type = __conditional_t<_Platform_wait,
> >> >> >> -
> __detail::__platform_wait_t,
> >> >> >> -                                        ptrdiff_t>;
> >> >> >> +    using __count_type = ptrdiff_t;
> >> >> >>
> >> >> >>      static constexpr ptrdiff_t _S_max
> >> >> >>        = __gnu_cxx::__int_traits<__count_type>::__max;
> >> >> >>
> >> >> >>      constexpr explicit
> >> >> >> -    __semaphore_base(__count_type __count) noexcept
> >> >> >> +    __semaphore_impl(__count_type __count) noexcept
> >> >> >>      : _M_counter(__count)
> >> >> >>      { }
> >> >> >>
> >> >> >> -    __semaphore_base(const __semaphore_base&) = delete;
> >> >> >> -    __semaphore_base& operator=(const __semaphore_base&) =
> delete;
> >> >> >> +    __semaphore_impl(const __semaphore_impl&) = delete;
> >> >> >> +    __semaphore_impl& operator=(const __semaphore_impl&) =
> delete;
> >> >> >>
> >> >> >>      // Load the current counter value.
> >> >> >>      _GLIBCXX_ALWAYS_INLINE __count_type
> >> >> >> @@ -71,8 +68,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >> >>
> >> >> >>      // Try to acquire the semaphore (i.e. decrement the counter).
> >> >> >>      // Returns false if the current counter is zero, or if
> another thread
> >> >> >> -    // decrements the value first. In the latter case, __cur is
> set to the
> >> >> >> -    // new value.
> >> >> >> +    // changes the value first. In the latter case, __cur is set
> to the new
> >> >> >> +    // value.
> >> >> >>      _GLIBCXX_ALWAYS_INLINE bool
> >> >> >>      _M_do_try_acquire(__count_type& __cur) noexcept
> >> >> >>      {
> >> >> >> @@ -85,24 +82,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >> >>
>  memory_order::relaxed);
> >> >> >>      }
> >> >> >>
> >> >> >> +    // Keep trying to acquire the semaphore in a loop until it
> succeeds.
> >> >> >>      void
> >> >> >>      _M_acquire() noexcept
> >> >> >>      {
> >> >> >> -      auto const __vfn = [this]{ return _M_get_current(); };
> >> >> >> -      auto __val = __vfn();
> >> >> >> -      auto const __pred = [&__val](__count_type __cur) {
> >> >> >> -       if (__cur > 0)
> >> >> >> -         {
> >> >> >> -           __val = __cur;
> >> >> >> -           return true;
> >> >> >> -         }
> >> >> >> -       return false;
> >> >> >> -      };
> >> >> >> -      while (!_M_do_try_acquire(__val))
> >> >> >> -       if (__val == 0)
> >> >> >> +      auto __vfn = [this]{ return _M_get_current(); };
> >> >> >> +      _Can_acquire __pred{__vfn()};
> >> >> >> +      while (!_M_do_try_acquire(__pred._M_val))
> >> >> >> +       if (__pred._M_val == 0)
> >> >> >>           std::__atomic_wait_address(&_M_counter, __pred, __vfn,
> true);
> >> >> >>      }
> >> >> >>
> >> >> >> +    // Try to acquire the semaphore, retrying a small number of
> times
> >> >> >> +    // in case of contention.
> >> >> >>      bool
> >> >> >>      _M_try_acquire() noexcept
> >> >> >>      {
> >> >> >> @@ -116,23 +108,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >> >>        bool
> >> >> >>        _M_try_acquire_until(const chrono::time_point<_Clock,
> _Duration>& __atime) noexcept
> >> >> >>        {
> >> >> >> -       auto const __vfn = [this]{ return _M_get_current(); };
> >> >> >> -       auto __val = __vfn();
> >> >> >> -       auto const __pred = [&__val](__count_type __cur) {
> >> >> >> -         if (__cur > 0)
> >> >> >> -           {
> >> >> >> -             __val = __cur;
> >> >> >> -             return true;
> >> >> >> -           }
> >> >> >> -         return false;
> >> >> >> -       };
> >> >> >> -       while (!_M_do_try_acquire(__val))
> >> >> >> -         if (__val == 0)
> >> >> >> -           {
> >> >> >> -             if (!std::__atomic_wait_address_until(&_M_counter,
> __pred,
> >> >> >> -                                                   __vfn,
> __atime, true))
> >> >> >> -               return false; // timed out
> >> >> >> -           }
> >> >> >> +       auto __vfn = [this]{ return _M_get_current(); };
> >> >> >> +       _Can_acquire __pred{__vfn()};
> >> >> >> +       while (!_M_do_try_acquire(__pred._M_val))
> >> >> >> +         if (__pred._M_val == 0)
> >> >> >> +           if (!std::__atomic_wait_address_until(&_M_counter,
> __pred,
> >> >> >> +                                                 __vfn, __atime,
> true))
> >> >> >> +             return false; // timed out
> >> >> >>         return true;
> >> >> >>        }
> >> >> >>
> >> >> >> @@ -140,23 +122,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >> >>        bool
> >> >> >>        _M_try_acquire_for(const chrono::duration<_Rep, _Period>&
> __rtime) noexcept
> >> >> >>        {
> >> >> >> -       auto const __vfn = [this]{ return _M_get_current(); };
> >> >> >> -       auto __val = __vfn();
> >> >> >> -       auto const __pred = [&__val](__count_type __cur) {
> >> >> >> -         if (__cur > 0)
> >> >> >> -           {
> >> >> >> -             __val = __cur;
> >> >> >> -             return true;
> >> >> >> -           }
> >> >> >> -         return false;
> >> >> >> -       };
> >> >> >> -       while (!_M_do_try_acquire(__val))
> >> >> >> -         if (__val == 0)
> >> >> >> -           {
> >> >> >> -             if (!std::__atomic_wait_address_for(&_M_counter,
> __pred,
> >> >> >> -                                                 __vfn, __rtime,
> true))
> >> >> >> -               return false; // timed out
> >> >> >> -           }
> >> >> >> +       auto __vfn = [this]{ return _M_get_current(); };
> >> >> >> +       _Can_acquire __pred{__vfn()};
> >> >> >> +       while (!_M_do_try_acquire(__pred._M_val))
> >> >> >> +         if (__pred._M_val == 0)
> >> >> >
> >> >> > I think we could have some named member funciton here,
> >> >> > like __pred.should_wait(). But not strictly necessary.
> >> >>
> >> >> I like the suggestion. How about naming the predicate struct
> differently:
> >> >>
> >> >>     struct _Available
> >> >>     {
> >> >>       __count_type _M_val; // Cache of the last value loaded from
> _M_counter.
> >> >>
> >> >>       // Argument should be the latest value of the counter.
> >> >>       // Returns true (and caches the value) if it's non-zero and so
> can be
> >> >>       // decremented to acquire the semaphore. Returns false
> otherwise.
> >> >>       bool operator()(__count_type __cur) noexcept
> >> >>       {
> >> >>         if (__cur == 0)
> >> >>           return false;
> >> >>         _M_val = __cur;
> >> >>         return true;
> >> >>       }
> >> >>
> >> >>       // Returns true if the last-known value is non-zero.
> >> >>       explicit operator bool() const noexcept { _M_val > 0; }
> >> >>     };
> >> >>
> >> >> And then it would be used like this:
> >> >>
> >> >>     void
> >> >>     _M_acquire() noexcept
> >> >>     {
> >> >>       auto __vfn = [this]{ return _M_get_current(); };
> >> >>       _Available __is_available{__vfn()};
> >> >>       while (!_M_do_try_acquire(__is_available._M_val))
> >> >>         if (!__is_available)
> >> >
> >> > I think the bool conversion here is too cute, the two uses look
> "simialar".
> >>
> >> I don't see a problem with that, they *are* similar. One tests the
> >> last known value, the other updates the last-known value and then
> >> tests it.
> >>
> >> The operator() function could be implemented as:
> >>
> >> bool operator()(count_type cur)
> >> {
> >>   _M_val = cur;
> >>   return static_cast<bool>(*this);
> >> }
> >>
> >> That would be equivalent, but it's implemented slightly differently to
> >> avoid a redundant store (don't bother to store zero, because it must
> >> already be zero if the predicate was being tested by the waiting
> >> function).
> >>
> >> I did consider naming the operator bool() as operator()() instead, so
> >> the callable could be used with one argument (to update the cache and
> >> test it) and with no arguments (to test the previously cached value).
> >
> > I think the root of my concern was that, if is_available is used as a
> variable, then it looks like
> > we are calling __atomic_wait_address overload that accepts a value.
>
> Ah yes, I see.
>
> > Having operator()(),
> > and using it uniformly as a functor/predicate alleviates this.
>
> OK great.
>
> One last question then:
>
> >> Then it would be used like this:
> >>
> >>     void
> >>     _M_acquire() noexcept
> >>     {
> >>       auto __vfn = [this]{ return _M_get_current(); };
> >>       _Available __is_available{__vfn()};
>
> Would this initialization be better if written like this:
>
>     _Available __is_available{ ._M_val = __vfn() };
>
> I think it makes the use of _M_val on the next line a bit clearer:
>
> >>       while (!_M_do_try_acquire(__is_available._M_val))
>
> Although maybe it's not a big deal, because the definition of
> _Available and its _M_val member are only a few lines further down in
> the same file, not far away in some other header.
>
I think I am OK either way, no opinion here.

>
> >>         if (!__is_available())
> >>           std::__atomic_wait_address(&_M_counter, __is_available,
> __vfn, true);
> >>     }
> >>
> >> > Maye rename use a named member like "current()" or "last()", or
> "is_available.currently()".
> >> >>
> >> >>           std::__atomic_wait_address(&_M_counter, __is_available,
> __vfn, true);
> >> >>     }
> >> >>
> >> >> So __is_available(vfn()) tests whether it's available based on the
> new
> >> >> value returned by vfn(), which is the predicate used by the atomic
> >> >> waiting functions, and (bool)__is_available tests whether it's
> >> >> available based on the cached value, which is what we use to decide
> >> >> whether to call the waiting function.
> >> >>
> >> >> Too cute? I think it makes the logic quite readable.
> >> >>
> >>
>
>

Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Reply via email to