Re: [PATCH v1 3/6] libstdc++: Improve low-rank layout_{left, right}::stride.

Tomasz Kaminski Wed, 30 Jul 2025 02:59:51 -0700

On Wed, Jul 30, 2025 at 10:56 AM Luc Grosheintz <luc.groshei...@gmail.com>
wrote:


>
>
> On 7/28/25 13:04, Tomasz Kaminski wrote:
> > On Mon, Jul 28, 2025 at 10:24 AM Tomasz Kaminski <tkami...@redhat.com>
> > wrote:
> >
> >>
> >>
> >>
> >> On Mon, Jul 28, 2025 at 10:03 AM Luc Grosheintz <
> luc.groshei...@gmail.com>
> >> wrote:
> >>
> >>>
> >>>
> >>> On 7/28/25 08:02, Tomasz Kaminski wrote:
> >>>> On Sun, Jul 27, 2025 at 2:47 PM Luc Grosheintz <
> >>> luc.groshei...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> The methods layout_{left,right}::mapping::stride are defined
> >>>>> as
> >>>>>
> >>>>>     \prod_{i = 0}^r E[i]
> >>>>>     \prod_{i = r+1}^n E[i]
> >>>>>
> >>>>> This is computed as the product of a pre-comupted static product and
> >>> the
> >>>>> product of the required dynamic extents.
> >>>>>
> >>>>> Disassembly shows that even for low-rank extents, i.e. rank == 1 and
> >>>>> rank == 2, with at least one  dynamic extent, the generated code
> loads
> >>>>> two values; and then runs the loop over at most one element, e.g.
> >>>>>
> >>>>>    220:  48 8b 0c f5 00 00 00   mov    rcx,QWORD PTR [rsi*8+0x0]
> >>>>>    227:  00
> >>>>>    228:  48 8b 04 f5 00 00 00   mov    rax,QWORD PTR [rsi*8+0x0]
> >>>>>    22f:  00
> >>>>>    230:  48 c1 e1 02            shl    rcx,0x2
> >>>>>    234:  74 1a                  je     250 <stride_left_d5+0x30>
> >>>>>    236:  48 01 f9               add    rcx,rdi
> >>>>>    239:  0f 1f 80 00 00 00 00   nop    DWORD PTR [rax+0x0]
> >>>>>    240:  48 63 17               movsxd rdx,DWORD PTR [rdi]
> >>>>>    243:  48 83 c7 04            add    rdi,0x4
> >>>>>    247:  48 0f af c2            imul   rax,rdx
> >>>>>    24b:  48 39 f9               cmp    rcx,rdi
> >>>>>    24e:  75 f0                  jne    240 <stride_left_d5+0x20>
> >>>>>    250:  c3                     ret
> >>>>>
> >>>>> If there's no dynamic extents, it simply loads the precomputed
> product
> >>>>> of static extents.
> >>>>>
> >>>>> For rank == 1 the answer is constant `1`; for rank == 2 it's either 1
> >>> or
> >>>>> extents.extent(k), with k == 0 for layout_left and k == 1 for
> >>>>> layout_right.
> >>>>>
> >>>>> Consider,
> >>>>>
> >>>>>     using Ed = std::extents<int, dyn>;
> >>>>>     int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t
> r)
> >>>>>     { return m.stride(r); }
> >>>>>
> >>>>>     using E3d = std::extents<int, 3, dyn>;
> >>>>>     int stride_left_3d(const std::layout_left::mapping<E3d>& m,
> size_t
> >>> r)
> >>>>>     { return m.stride(r); }
> >>>>>
> >>>>>     using Ed5 = std::extents<int, dyn, 5>;
> >>>>>     int stride_left_d5(const std::layout_left::mapping<Ed5>& m,
> size_t
> >>> r)
> >>>>>     { return m.stride(r); }
> >>>>>
> >>>>> The optimized code for these three cases is:
> >>>>>
> >>>>>     0000000000000060 <stride_left_d>:
> >>>>>     60:  b8 01 00 00 00         mov    eax,0x1
> >>>>>     65:  c3                     ret
> >>>>>
> >>>>>     0000000000000090 <stride_left_3d>:
> >>>>>     90:  48 83 fe 01            cmp    rsi,0x1
> >>>>>     94:  19 c0                  sbb    eax,eax
> >>>>>     96:  83 e0 fe               and    eax,0xfffffffe
> >>>>>     99:  83 c0 03               add    eax,0x3
> >>>>>     9c:  c3                     ret
> >>>>>
> >>>>>     00000000000000a0 <stride_left_d5>:
> >>>>>     a0:  b8 01 00 00 00         mov    eax,0x1
> >>>>>     a5:  48 85 f6               test   rsi,rsi
> >>>>>     a8:  74 02                  je     ac <stride_left_d5+0xc>
> >>>>>     aa:  8b 07                  mov    eax,DWORD PTR [rdi]
> >>>>>    i ac:  c3                     ret
> >>>>>
> >>>>> For rank == 1 it simply returns 1 (as expected). For rank == 2, it
> >>>>> either implements a branchless formula, or conditionally loads one
> >>>>> value. In all cases involving a dynamic extent this seems like it's
> >>>>> always doing clearly less work, both in terms of computation and
> loads.
> >>>>>
> >>>>> For rank == 2, it trades loading one value for a branchless sequence
> of
> >>>>> four instructions that don't require loading any values.
> >>>>>
> >>>> I will put this optimization into the __fwd_prod and __rev_pord
> >>> functions,
> >>>> so it will be applied for all uses. This will also avoid us creating
> >>> this
> >>>> caching
> >>>> tables for such small ranks.
> >>>
> >>> The problem is that we don't have the same amount of information in
> >>> the stride and __fwd_prod. The valid values for __r are 1, ..., rank;
> >>> for __fwd_prod it's inclusive, while in stride it's exclusive.
> Therefore,
> >>> we can do the optimization with one comparison less in stride than in
> >>> __fwd_prod.
> >>>
> >> Make sense, and I am OK having this optimization there. However, out of
> >> curiosity, if we put always_inline on the __fwd_prod, wouldn't compiler
> be
> >> able
> >> to eliminate __rank == __i. Also once we have separate call to __size,
> we
> >> could
> >> put assertions (just as a comment) that for __fwd_prod and __rev_prod
> __r
> >> is
> >> required to be smaller than rank().
> >>
> > I think I would pursue that direction, and have similar if constexpr in
> the
> > __size function:
> > __rank == 0 -> 1
> > __rank == 1 -> extent(0)
> > __rank == 2 -> extent(0) * extent(1)
> > __rank == rank_dynamic() -> __ext_prod with __sta_ext == 1
> > otherwise -> use __sta_extr.
>
> We know that __size is optimized very aggressively and effectively. On
> -O2 everything is inlined and no symbols are created, no static storage
> is needed to compute __size. This is regardless of rank.
>
Are you referring to the final binary here, or does the same happen even if
we compile an object file? I am also interested in impact of the linker.

>
> The following:
>
>    using E1 = std::extents<int, dyn>;
>    int size1(const std::mdspan<double, E1>& md)
>    { return md.size(); }
>
>    using E2 = std::extents<int, 3, 5, dyn, dyn, 7, dyn, 11>;
>    int size2(const std::mdspan<double, E2>& md)
>    { return md.size(); }
>
>    using E3 = std::extents<int, 3, dyn>;
>    int size3(const std::mdspan<double, E3>& md)
>    { return md.size(); }
>
> when compiled with -O2 -c size.o
>
> $ nm size.o
> 0000000000000000 T size1
> 0000000000000010 T size2
> 0000000000000030 T size3
>
> $ objdump -h size.o
> size.o:     file format elf64-x86-64
>
> Sections:
> Idx Name          Size      VMA               LMA               File off
> Algn
>    0 .text         00000037  0000000000000000  0000000000000000  00000040
> 2**4
>                    CONTENTS, ALLOC, LOAD, READONLY, CODE
>    1 .data         00000000  0000000000000000  0000000000000000  00000077
> 2**0
>                    CONTENTS, ALLOC, LOAD, DATA
>    2 .bss          00000000  0000000000000000  0000000000000000  00000077
> 2**0
>                    ALLOC
>    3 .comment      0000002b  0000000000000000  0000000000000000  00000077
> 2**0
>                    CONTENTS, READONLY
>    4 .note.GNU-stack 00000000  0000000000000000  0000000000000000
> 000000a2  2**0
>                    CONTENTS, READONLY
>    5 .note.gnu.property 00000030  0000000000000000  0000000000000000
> 000000a8  2**3
>                    CONTENTS, ALLOC, LOAD, READONLY, DATA
>    6 .eh_frame     00000058  0000000000000000  0000000000000000  000000d8
> 2**3
>                    CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA
>
> $ objdump -d size.o
> 0000000000000000 <size1>:
>     0:  8b 07                   mov    (%rdi),%eax
>     2:  c3                      ret
>     3:  66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
>     a:  00 00 00 00
>     e:  66 90                   xchg   %ax,%ax
>
> 0000000000000010 <size2>:
>    10:  48 63 07                movslq (%rdi),%rax
>    13:  48 63 57 04             movslq 0x4(%rdi),%rdx
>    17:  48 0f af c2             imul   %rdx,%rax
>    1b:  0f af 47 08             imul   0x8(%rdi),%eax
>    1f:  69 c0 83 04 00 00       imul   $0x483,%eax,%eax
>    25:  c3                      ret
>    26:  66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>    2d:  00 00 00
>
> 0000000000000030 <size3>:
>    30:  48 63 07                movslq (%rdi),%rax
>    33:  8d 04 40                lea    (%rax,%rax,2),%eax
>    36:  c3                      ret
>
> Hence, I'm not sure where I need to look to find the cost of not putting
> the constexpr-if in __size. If I can see why it's slower, I'm happy to
> add them. However, since I've looked and can't see a cost, I'm reluctant
> to add them because they require test coverage for each branch (otherwise
> they're never compiled and could contain complete non-sense).
>
Indeed, form you checks it looks like if constexpr are not necessary at all,
however I would like __size to still avoid referencing _Rev_prod/_Fwd_prod
at all.
As alternative, what you think about defining the size as follows:
     constexpr size_t __rank = _Extents::rank();
     constexpr size_t __sta_prod =__static_extents_prod(__exts);
     return __extents_prod(__exts, __sta_prod, 0u, __rank);
This will alleviate or my concerns about creating _Rev_prod/_Fws_prod
symbols,
and allows us to decuple __fwd_prod/_rev_prod functions from it, and give
them
stronger preconditions.


> >
> >>
> >>> I purposefully checked mdspan::size (part of the previous commit); and
> >>> on optimized builds it fully unrolls the loop and does everything
> >>> automatically, meaning it doesn't need help and we're not repeating
> >>> the same optimization several times.
> >>>
> >>> As for avoiding the tables for small ranks, we can refactor _RevProd
> >>> as follows:
> >>>
> >>>     template<size_t... _Extents>
> >>>       struct _RevProd
> >>>       {
> >>>          size_t _S_value(size_t i)
> >>>          { return _S_data[i]; }
> >>>
> >>>       private:
> >>>          array _S_data = consteval ...;
> >>>       }
> >>>
> >>> and use partial specialization to eliminate _S_data.
> >>>
> >>>>
> >>>>>
> >>>>> libstdc++-v3/ChangeLog:
> >>>>>
> >>>>>           * include/std/mdspan (layout_left::mapping::stride):
> Optimize
> >>>>>           for rank <= 2.
> >>>>>           (layout_right::mapping::stride): Ditto.
> >>>>>
> >>>>> Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com>
> >>>>> ---
> >>>>>    libstdc++-v3/include/std/mdspan | 14 ++++++++++++--
> >>>>>    1 file changed, 12 insertions(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/libstdc++-v3/include/std/mdspan
> >>>>> b/libstdc++-v3/include/std/mdspan
> >>>>> index 06ccf3e3827..f288af96cdb 100644
> >>>>> --- a/libstdc++-v3/include/std/mdspan
> >>>>> +++ b/libstdc++-v3/include/std/mdspan
> >>>>> @@ -652,7 +652,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>>>>          requires (extents_type::rank() > 0)
> >>>>>          {
> >>>>>           __glibcxx_assert(__i < extents_type::rank());
> >>>>> -       return __mdspan::__fwd_prod(_M_extents, __i);
> >>>>> +       if constexpr (extents_type::rank() == 1)
> >>>>> +         return 1;
> >>>>> +       else if constexpr (extents_type::rank() == 2)
> >>>>> +         return __i == 0 ? 1 : _M_extents.extent(0);
> >>>>> +       else
> >>>>> +         return __mdspan::__fwd_prod(_M_extents, __i);
> >>>>>          }
> >>>>>
> >>>>>          template<typename _OExtents>
> >>>>> @@ -797,7 +802,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION
> >>>>>          requires (extents_type::rank() > 0)
> >>>>>          {
> >>>>>           __glibcxx_assert(__i < extents_type::rank());
> >>>>> -       return __mdspan::__rev_prod(_M_extents, __i);
> >>>>> +       if constexpr (extents_type::rank() == 1)
> >>>>> +         return 1;
> >>>>> +       else if constexpr (extents_type::rank() == 2)
> >>>>> +         return __i == 0 ? _M_extents.extent(1) : 1;
> >>>>> +       else
> >>>>> +         return __mdspan::__rev_prod(_M_extents, __i);
> >>>>>          }
> >>>>>
> >>>>>          template<typename _OExtents>
> >>>>> --
> >>>>> 2.50.0
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >
>
>

Re: [PATCH v1 3/6] libstdc++: Improve low-rank layout_{left, right}::stride.

Reply via email to