Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Tomasz Kaminski Thu, 05 Jun 2025 10:58:11 -0700

On Thu, Jun 5, 2025 at 6:44 PM Jonathan Wakely <jwak...@redhat.com> wrote:


> On Thu, 5 Jun 2025 at 15:42, Tomasz Kaminski <tkami...@redhat.com> wrote:
> >
> >
> >
> > On Thu, Jun 5, 2025 at 12:41 PM Jonathan Wakely <jwak...@redhat.com>
> wrote:
> >>
> >> On Thu, 5 Jun 2025 at 10:00, Tomasz Kaminski <tkami...@redhat.com>
> wrote:
> >> >
> >> >
> >> >
> >> > On Thu, Jun 5, 2025 at 12:06 AM Jonathan Wakely <jwak...@redhat.com>
> wrote:
> >> >>
> >> >> Rename __semaphore_base to __semaphore_impl, because it's not used
> as a
> >> >> base class. Replace the three identical lambda expressions with a
> named
> >> >> class, __semaphore_impl::_Can_acquire, which stores the most recent
> >> >> value of the counter as a data member.
> >> >>
> >> >> Add a new __platform_semaphore_impl class template which uses
> >> >> __platform_wait_t for the counter and uses more efficient atomic
> waits
> >> >> for the acquire functions, when __platform_wait is available. For a
> >> >> binary semaphore some members are further optimized because we know
> the
> >> >> counter can only be zero or one.
> >> >>
> >> >> Also add a bare wait flag to __atomic_wait_address_v, for consistency
> >> >> with __atomic_wait_address_until_v and __atomic_wait_address_for_v
> and
> >> >> to allow semaphores to use it without the redundant overhead of
> tracking
> >> >> waiters.
> >> >>
> >> >> libstdc++-v3/ChangeLog:
> >> >>
> >> >>         * include/bits/atomic_wait.h (__atomic_wait_address_v): Add
> bare
> >> >>         wait flag.
> >> >>         * include/bits/semaphore_base.h (__semaphore_base): Rename to
> >> >>         __semaphore_impl. Replace local variable and predicate lambda
> >> >>         with _Can_acquire struct.
> >> >>         (__platform_semaphore_impl): New class template.
> >> >>         (__semaphore_impl): Remove alias template.
> >> >>         * include/std/semaphore:
> >> >> ---
> >> >>
> >> >> This replaces the two patches:
> >> >>
> >> >> [PATCH 3/4] libstdc++: Optimize std::binary_semaphore
> >> >> [PATCH 4/4] libstdc++: Optimize std::counting_semaphore for futex
> path
> >> >>
> >> >> The problem with those patches was that the "optimized"
> implementations
> >> >> were used if _Max <= numeric_limits<__platform_wait_t>::max(), which
> is
> >> >> always true for many non-futex targets because we do:
> >> >>
> >> >> # if ATOMIC_LONG_LOCK_FREE == 2
> >> >>     using __platform_wait_t = unsigned long;
> >> >> # else
> >> >>     using __platform_wait_t = unsigned int;
> >> >> # endif
> >> >>
> >> >> The earlier patches should have been checking either
> >> >> _GLIBCXX_HAVE_PLATFORM_WAIT or the __platform_wait_uses_type variable
> >> >> template to see if an efficient (i.e. not condition_variable-based)
> >> >> implementation is actually supported. It's not useful to check if the
> >> >> semaphore's maximum value fits in the __platform_wait_t type if that
> >> >> type doesn't have fast wait/wake operations like a futex.
> >> >>
> >> >> The new proposal in this patch only selects the optimized
> >> >> implementations if they can be implemented using __platform_wait.
> The
> >> >> __semaphore_impl always uses the portable atomic wait API (using an
> >> >> accessor function and predicate). As Tomasz suggested, a new class
> >> >> template provides the optimized implementation for both
> >> >> counting_semaphore and binary_semaphore, differing only in
> >> >> _M_get_current and _M_try_acquire.
> >> >>
> >> >> Tested x86_64-linux and sparc-solaris.
> >> >
> >> > LGTM, one major suggestion to encapsulate the __pred._M_cur == 0 into
> named member
> >> > function.
> >> > I have also checked that  __atomic_wait_address_until takes __pred by
> reference,
> >> > and does not make copies, so encapsulating __cur there is OK.
> >> >>
> >> >>
> >> >>  libstdc++-v3/include/bits/atomic_wait.h    |   6 +-
> >> >>  libstdc++-v3/include/bits/semaphore_base.h | 241
> +++++++++++++++------
> >> >>  libstdc++-v3/include/std/semaphore         |   5 +-
> >> >>  3 files changed, 186 insertions(+), 66 deletions(-)
> >> >>
> >> >> diff --git a/libstdc++-v3/include/bits/atomic_wait.h
> b/libstdc++-v3/include/bits/atomic_wait.h
> >> >> index 815726c16ccb..d32fda1c2435 100644
> >> >> --- a/libstdc++-v3/include/bits/atomic_wait.h
> >> >> +++ b/libstdc++-v3/include/bits/atomic_wait.h
> >> >> @@ -249,12 +249,14 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >>        // C++26 will return __val
> >> >>      }
> >> >>
> >> >> +  // Wait on __addr while *__addr == __old is true.
> >> >> +  // Cannot be used for proxy waits.
> >> >>    inline void
> >> >>    __atomic_wait_address_v(const __detail::__platform_wait_t* __addr,
> >> >>                           __detail::__platform_wait_t __old,
> >> >> -                         int __order)
> >> >> +                         int __order, bool __bare_wait = false)
> >> >>    {
> >> >> -    __detail::__wait_args __args{ __addr, __old, __order };
> >> >> +    __detail::__wait_args __args{ __addr, __old, __order,
> __bare_wait };
> >> >>      // C++26 will not ignore the return value here
> >> >>      __detail::__wait_impl(__addr, __args);
> >> >>    }
> >> >> diff --git a/libstdc++-v3/include/bits/semaphore_base.h
> b/libstdc++-v3/include/bits/semaphore_base.h
> >> >> index 3f7a33ccd51a..a1311e8690c7 100644
> >> >> --- a/libstdc++-v3/include/bits/semaphore_base.h
> >> >> +++ b/libstdc++-v3/include/bits/semaphore_base.h
> >> >> @@ -46,23 +46,20 @@ namespace std _GLIBCXX_VISIBILITY(default)
> >> >>  {
> >> >>  _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >>
> >> >> -  template<bool _Platform_wait>
> >> >> -  struct __semaphore_base
> >> >> +  struct __semaphore_impl
> >> >>    {
> >> >> -    using __count_type = __conditional_t<_Platform_wait,
> >> >> -                                        __detail::__platform_wait_t,
> >> >> -                                        ptrdiff_t>;
> >> >> +    using __count_type = ptrdiff_t;
> >> >>
> >> >>      static constexpr ptrdiff_t _S_max
> >> >>        = __gnu_cxx::__int_traits<__count_type>::__max;
> >> >>
> >> >>      constexpr explicit
> >> >> -    __semaphore_base(__count_type __count) noexcept
> >> >> +    __semaphore_impl(__count_type __count) noexcept
> >> >>      : _M_counter(__count)
> >> >>      { }
> >> >>
> >> >> -    __semaphore_base(const __semaphore_base&) = delete;
> >> >> -    __semaphore_base& operator=(const __semaphore_base&) = delete;
> >> >> +    __semaphore_impl(const __semaphore_impl&) = delete;
> >> >> +    __semaphore_impl& operator=(const __semaphore_impl&) = delete;
> >> >>
> >> >>      // Load the current counter value.
> >> >>      _GLIBCXX_ALWAYS_INLINE __count_type
> >> >> @@ -71,8 +68,8 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >>
> >> >>      // Try to acquire the semaphore (i.e. decrement the counter).
> >> >>      // Returns false if the current counter is zero, or if another
> thread
> >> >> -    // decrements the value first. In the latter case, __cur is set
> to the
> >> >> -    // new value.
> >> >> +    // changes the value first. In the latter case, __cur is set to
> the new
> >> >> +    // value.
> >> >>      _GLIBCXX_ALWAYS_INLINE bool
> >> >>      _M_do_try_acquire(__count_type& __cur) noexcept
> >> >>      {
> >> >> @@ -85,24 +82,19 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >>
>  memory_order::relaxed);
> >> >>      }
> >> >>
> >> >> +    // Keep trying to acquire the semaphore in a loop until it
> succeeds.
> >> >>      void
> >> >>      _M_acquire() noexcept
> >> >>      {
> >> >> -      auto const __vfn = [this]{ return _M_get_current(); };
> >> >> -      auto __val = __vfn();
> >> >> -      auto const __pred = [&__val](__count_type __cur) {
> >> >> -       if (__cur > 0)
> >> >> -         {
> >> >> -           __val = __cur;
> >> >> -           return true;
> >> >> -         }
> >> >> -       return false;
> >> >> -      };
> >> >> -      while (!_M_do_try_acquire(__val))
> >> >> -       if (__val == 0)
> >> >> +      auto __vfn = [this]{ return _M_get_current(); };
> >> >> +      _Can_acquire __pred{__vfn()};
> >> >> +      while (!_M_do_try_acquire(__pred._M_val))
> >> >> +       if (__pred._M_val == 0)
> >> >>           std::__atomic_wait_address(&_M_counter, __pred, __vfn,
> true);
> >> >>      }
> >> >>
> >> >> +    // Try to acquire the semaphore, retrying a small number of
> times
> >> >> +    // in case of contention.
> >> >>      bool
> >> >>      _M_try_acquire() noexcept
> >> >>      {
> >> >> @@ -116,23 +108,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >>        bool
> >> >>        _M_try_acquire_until(const chrono::time_point<_Clock,
> _Duration>& __atime) noexcept
> >> >>        {
> >> >> -       auto const __vfn = [this]{ return _M_get_current(); };
> >> >> -       auto __val = __vfn();
> >> >> -       auto const __pred = [&__val](__count_type __cur) {
> >> >> -         if (__cur > 0)
> >> >> -           {
> >> >> -             __val = __cur;
> >> >> -             return true;
> >> >> -           }
> >> >> -         return false;
> >> >> -       };
> >> >> -       while (!_M_do_try_acquire(__val))
> >> >> -         if (__val == 0)
> >> >> -           {
> >> >> -             if (!std::__atomic_wait_address_until(&_M_counter,
> __pred,
> >> >> -                                                   __vfn, __atime,
> true))
> >> >> -               return false; // timed out
> >> >> -           }
> >> >> +       auto __vfn = [this]{ return _M_get_current(); };
> >> >> +       _Can_acquire __pred{__vfn()};
> >> >> +       while (!_M_do_try_acquire(__pred._M_val))
> >> >> +         if (__pred._M_val == 0)
> >> >> +           if (!std::__atomic_wait_address_until(&_M_counter,
> __pred,
> >> >> +                                                 __vfn, __atime,
> true))
> >> >> +             return false; // timed out
> >> >>         return true;
> >> >>        }
> >> >>
> >> >> @@ -140,23 +122,13 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >> >>        bool
> >> >>        _M_try_acquire_for(const chrono::duration<_Rep, _Period>&
> __rtime) noexcept
> >> >>        {
> >> >> -       auto const __vfn = [this]{ return _M_get_current(); };
> >> >> -       auto __val = __vfn();
> >> >> -       auto const __pred = [&__val](__count_type __cur) {
> >> >> -         if (__cur > 0)
> >> >> -           {
> >> >> -             __val = __cur;
> >> >> -             return true;
> >> >> -           }
> >> >> -         return false;
> >> >> -       };
> >> >> -       while (!_M_do_try_acquire(__val))
> >> >> -         if (__val == 0)
> >> >> -           {
> >> >> -             if (!std::__atomic_wait_address_for(&_M_counter,
> __pred,
> >> >> -                                                 __vfn, __rtime,
> true))
> >> >> -               return false; // timed out
> >> >> -           }
> >> >> +       auto __vfn = [this]{ return _M_get_current(); };
> >> >> +       _Can_acquire __pred{__vfn()};
> >> >> +       while (!_M_do_try_acquire(__pred._M_val))
> >> >> +         if (__pred._M_val == 0)
> >> >
> >> > I think we could have some named member funciton here,
> >> > like __pred.should_wait(). But not strictly necessary.
> >>
> >> I like the suggestion. How about naming the predicate struct
> differently:
> >>
> >>     struct _Available
> >>     {
> >>       __count_type _M_val; // Cache of the last value loaded from
> _M_counter.
> >>
> >>       // Argument should be the latest value of the counter.
> >>       // Returns true (and caches the value) if it's non-zero and so
> can be
> >>       // decremented to acquire the semaphore. Returns false otherwise.
> >>       bool operator()(__count_type __cur) noexcept
> >>       {
> >>         if (__cur == 0)
> >>           return false;
> >>         _M_val = __cur;
> >>         return true;
> >>       }
> >>
> >>       // Returns true if the last-known value is non-zero.
> >>       explicit operator bool() const noexcept { _M_val > 0; }
> >>     };
> >>
> >> And then it would be used like this:
> >>
> >>     void
> >>     _M_acquire() noexcept
> >>     {
> >>       auto __vfn = [this]{ return _M_get_current(); };
> >>       _Available __is_available{__vfn()};
> >>       while (!_M_do_try_acquire(__is_available._M_val))
> >>         if (!__is_available)
> >
> > I think the bool conversion here is too cute, the two uses look
> "simialar".
>
> I don't see a problem with that, they *are* similar. One tests the
> last known value, the other updates the last-known value and then
> tests it.
>
> The operator() function could be implemented as:
>
> bool operator()(count_type cur)
> {
>   _M_val = cur;
>   return static_cast<bool>(*this);
> }
>
> That would be equivalent, but it's implemented slightly differently to
> avoid a redundant store (don't bother to store zero, because it must
> already be zero if the predicate was being tested by the waiting
> function).
>
> I did consider naming the operator bool() as operator()() instead, so
> the callable could be used with one argument (to update the cache and
> test it) and with no arguments (to test the previously cached value).
>
I think the root of my concern was that, if is_available is used as a
variable, then it looks like
we are calling __atomic_wait_address overload that accepts a value. Having
operator()(),
and using it uniformly as a functor/predicate alleviates this.

>
> Then it would be used like this:
>
>     void
>     _M_acquire() noexcept
>     {
>       auto __vfn = [this]{ return _M_get_current(); };
>       _Available __is_available{__vfn()};
>       while (!_M_do_try_acquire(__is_available._M_val))
>         if (!__is_available())
>           std::__atomic_wait_address(&_M_counter, __is_available, __vfn,
> true);
>     }
>
> > Maye rename use a named member like "current()" or "last()", or
> "is_available.currently()".
> >>
> >>           std::__atomic_wait_address(&_M_counter, __is_available,
> __vfn, true);
> >>     }
> >>
> >> So __is_available(vfn()) tests whether it's available based on the new
> >> value returned by vfn(), which is the predicate used by the atomic
> >> waiting functions, and (bool)__is_available tests whether it's
> >> available based on the cached value, which is what we use to decide
> >> whether to call the waiting function.
> >>
> >> Too cute? I think it makes the logic quite readable.
> >>
>
>

Re: [PATCH 1/2] libstdc++: Optimize std::counting_semaphore for futex path

Reply via email to