On Wed, Jul 30, 2025 at 10:56 AM Luc Grosheintz <luc.groshei...@gmail.com> wrote:
> > > On 7/28/25 13:04, Tomasz Kaminski wrote: > > On Mon, Jul 28, 2025 at 10:24 AM Tomasz Kaminski <tkami...@redhat.com> > > wrote: > > > >> > >> > >> > >> On Mon, Jul 28, 2025 at 10:03 AM Luc Grosheintz < > luc.groshei...@gmail.com> > >> wrote: > >> > >>> > >>> > >>> On 7/28/25 08:02, Tomasz Kaminski wrote: > >>>> On Sun, Jul 27, 2025 at 2:47 PM Luc Grosheintz < > >>> luc.groshei...@gmail.com> > >>>> wrote: > >>>> > >>>>> The methods layout_{left,right}::mapping::stride are defined > >>>>> as > >>>>> > >>>>> \prod_{i = 0}^r E[i] > >>>>> \prod_{i = r+1}^n E[i] > >>>>> > >>>>> This is computed as the product of a pre-comupted static product and > >>> the > >>>>> product of the required dynamic extents. > >>>>> > >>>>> Disassembly shows that even for low-rank extents, i.e. rank == 1 and > >>>>> rank == 2, with at least one dynamic extent, the generated code > loads > >>>>> two values; and then runs the loop over at most one element, e.g. > >>>>> > >>>>> 220: 48 8b 0c f5 00 00 00 mov rcx,QWORD PTR [rsi*8+0x0] > >>>>> 227: 00 > >>>>> 228: 48 8b 04 f5 00 00 00 mov rax,QWORD PTR [rsi*8+0x0] > >>>>> 22f: 00 > >>>>> 230: 48 c1 e1 02 shl rcx,0x2 > >>>>> 234: 74 1a je 250 <stride_left_d5+0x30> > >>>>> 236: 48 01 f9 add rcx,rdi > >>>>> 239: 0f 1f 80 00 00 00 00 nop DWORD PTR [rax+0x0] > >>>>> 240: 48 63 17 movsxd rdx,DWORD PTR [rdi] > >>>>> 243: 48 83 c7 04 add rdi,0x4 > >>>>> 247: 48 0f af c2 imul rax,rdx > >>>>> 24b: 48 39 f9 cmp rcx,rdi > >>>>> 24e: 75 f0 jne 240 <stride_left_d5+0x20> > >>>>> 250: c3 ret > >>>>> > >>>>> If there's no dynamic extents, it simply loads the precomputed > product > >>>>> of static extents. > >>>>> > >>>>> For rank == 1 the answer is constant `1`; for rank == 2 it's either 1 > >>> or > >>>>> extents.extent(k), with k == 0 for layout_left and k == 1 for > >>>>> layout_right. > >>>>> > >>>>> Consider, > >>>>> > >>>>> using Ed = std::extents<int, dyn>; > >>>>> int stride_left_d(const std::layout_left::mapping<Ed>& m, size_t > r) > >>>>> { return m.stride(r); } > >>>>> > >>>>> using E3d = std::extents<int, 3, dyn>; > >>>>> int stride_left_3d(const std::layout_left::mapping<E3d>& m, > size_t > >>> r) > >>>>> { return m.stride(r); } > >>>>> > >>>>> using Ed5 = std::extents<int, dyn, 5>; > >>>>> int stride_left_d5(const std::layout_left::mapping<Ed5>& m, > size_t > >>> r) > >>>>> { return m.stride(r); } > >>>>> > >>>>> The optimized code for these three cases is: > >>>>> > >>>>> 0000000000000060 <stride_left_d>: > >>>>> 60: b8 01 00 00 00 mov eax,0x1 > >>>>> 65: c3 ret > >>>>> > >>>>> 0000000000000090 <stride_left_3d>: > >>>>> 90: 48 83 fe 01 cmp rsi,0x1 > >>>>> 94: 19 c0 sbb eax,eax > >>>>> 96: 83 e0 fe and eax,0xfffffffe > >>>>> 99: 83 c0 03 add eax,0x3 > >>>>> 9c: c3 ret > >>>>> > >>>>> 00000000000000a0 <stride_left_d5>: > >>>>> a0: b8 01 00 00 00 mov eax,0x1 > >>>>> a5: 48 85 f6 test rsi,rsi > >>>>> a8: 74 02 je ac <stride_left_d5+0xc> > >>>>> aa: 8b 07 mov eax,DWORD PTR [rdi] > >>>>> i ac: c3 ret > >>>>> > >>>>> For rank == 1 it simply returns 1 (as expected). For rank == 2, it > >>>>> either implements a branchless formula, or conditionally loads one > >>>>> value. In all cases involving a dynamic extent this seems like it's > >>>>> always doing clearly less work, both in terms of computation and > loads. > >>>>> > >>>>> For rank == 2, it trades loading one value for a branchless sequence > of > >>>>> four instructions that don't require loading any values. > >>>>> > >>>> I will put this optimization into the __fwd_prod and __rev_pord > >>> functions, > >>>> so it will be applied for all uses. This will also avoid us creating > >>> this > >>>> caching > >>>> tables for such small ranks. > >>> > >>> The problem is that we don't have the same amount of information in > >>> the stride and __fwd_prod. The valid values for __r are 1, ..., rank; > >>> for __fwd_prod it's inclusive, while in stride it's exclusive. > Therefore, > >>> we can do the optimization with one comparison less in stride than in > >>> __fwd_prod. > >>> > >> Make sense, and I am OK having this optimization there. However, out of > >> curiosity, if we put always_inline on the __fwd_prod, wouldn't compiler > be > >> able > >> to eliminate __rank == __i. Also once we have separate call to __size, > we > >> could > >> put assertions (just as a comment) that for __fwd_prod and __rev_prod > __r > >> is > >> required to be smaller than rank(). > >> > > I think I would pursue that direction, and have similar if constexpr in > the > > __size function: > > __rank == 0 -> 1 > > __rank == 1 -> extent(0) > > __rank == 2 -> extent(0) * extent(1) > > __rank == rank_dynamic() -> __ext_prod with __sta_ext == 1 > > otherwise -> use __sta_extr. > > We know that __size is optimized very aggressively and effectively. On > -O2 everything is inlined and no symbols are created, no static storage > is needed to compute __size. This is regardless of rank. > Are you referring to the final binary here, or does the same happen even if we compile an object file? I am also interested in impact of the linker. > > The following: > > using E1 = std::extents<int, dyn>; > int size1(const std::mdspan<double, E1>& md) > { return md.size(); } > > using E2 = std::extents<int, 3, 5, dyn, dyn, 7, dyn, 11>; > int size2(const std::mdspan<double, E2>& md) > { return md.size(); } > > using E3 = std::extents<int, 3, dyn>; > int size3(const std::mdspan<double, E3>& md) > { return md.size(); } > > when compiled with -O2 -c size.o > > $ nm size.o > 0000000000000000 T size1 > 0000000000000010 T size2 > 0000000000000030 T size3 > > $ objdump -h size.o > size.o: file format elf64-x86-64 > > Sections: > Idx Name Size VMA LMA File off > Algn > 0 .text 00000037 0000000000000000 0000000000000000 00000040 > 2**4 > CONTENTS, ALLOC, LOAD, READONLY, CODE > 1 .data 00000000 0000000000000000 0000000000000000 00000077 > 2**0 > CONTENTS, ALLOC, LOAD, DATA > 2 .bss 00000000 0000000000000000 0000000000000000 00000077 > 2**0 > ALLOC > 3 .comment 0000002b 0000000000000000 0000000000000000 00000077 > 2**0 > CONTENTS, READONLY > 4 .note.GNU-stack 00000000 0000000000000000 0000000000000000 > 000000a2 2**0 > CONTENTS, READONLY > 5 .note.gnu.property 00000030 0000000000000000 0000000000000000 > 000000a8 2**3 > CONTENTS, ALLOC, LOAD, READONLY, DATA > 6 .eh_frame 00000058 0000000000000000 0000000000000000 000000d8 > 2**3 > CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATA > > $ objdump -d size.o > 0000000000000000 <size1>: > 0: 8b 07 mov (%rdi),%eax > 2: c3 ret > 3: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1) > a: 00 00 00 00 > e: 66 90 xchg %ax,%ax > > 0000000000000010 <size2>: > 10: 48 63 07 movslq (%rdi),%rax > 13: 48 63 57 04 movslq 0x4(%rdi),%rdx > 17: 48 0f af c2 imul %rdx,%rax > 1b: 0f af 47 08 imul 0x8(%rdi),%eax > 1f: 69 c0 83 04 00 00 imul $0x483,%eax,%eax > 25: c3 ret > 26: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1) > 2d: 00 00 00 > > 0000000000000030 <size3>: > 30: 48 63 07 movslq (%rdi),%rax > 33: 8d 04 40 lea (%rax,%rax,2),%eax > 36: c3 ret > > Hence, I'm not sure where I need to look to find the cost of not putting > the constexpr-if in __size. If I can see why it's slower, I'm happy to > add them. However, since I've looked and can't see a cost, I'm reluctant > to add them because they require test coverage for each branch (otherwise > they're never compiled and could contain complete non-sense). > Indeed, form you checks it looks like if constexpr are not necessary at all, however I would like __size to still avoid referencing _Rev_prod/_Fwd_prod at all. As alternative, what you think about defining the size as follows: constexpr size_t __rank = _Extents::rank(); constexpr size_t __sta_prod =__static_extents_prod(__exts); return __extents_prod(__exts, __sta_prod, 0u, __rank); This will alleviate or my concerns about creating _Rev_prod/_Fws_prod symbols, and allows us to decuple __fwd_prod/_rev_prod functions from it, and give them stronger preconditions. > > > >> > >>> I purposefully checked mdspan::size (part of the previous commit); and > >>> on optimized builds it fully unrolls the loop and does everything > >>> automatically, meaning it doesn't need help and we're not repeating > >>> the same optimization several times. > >>> > >>> As for avoiding the tables for small ranks, we can refactor _RevProd > >>> as follows: > >>> > >>> template<size_t... _Extents> > >>> struct _RevProd > >>> { > >>> size_t _S_value(size_t i) > >>> { return _S_data[i]; } > >>> > >>> private: > >>> array _S_data = consteval ...; > >>> } > >>> > >>> and use partial specialization to eliminate _S_data. > >>> > >>>> > >>>>> > >>>>> libstdc++-v3/ChangeLog: > >>>>> > >>>>> * include/std/mdspan (layout_left::mapping::stride): > Optimize > >>>>> for rank <= 2. > >>>>> (layout_right::mapping::stride): Ditto. > >>>>> > >>>>> Signed-off-by: Luc Grosheintz <luc.groshei...@gmail.com> > >>>>> --- > >>>>> libstdc++-v3/include/std/mdspan | 14 ++++++++++++-- > >>>>> 1 file changed, 12 insertions(+), 2 deletions(-) > >>>>> > >>>>> diff --git a/libstdc++-v3/include/std/mdspan > >>>>> b/libstdc++-v3/include/std/mdspan > >>>>> index 06ccf3e3827..f288af96cdb 100644 > >>>>> --- a/libstdc++-v3/include/std/mdspan > >>>>> +++ b/libstdc++-v3/include/std/mdspan > >>>>> @@ -652,7 +652,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >>>>> requires (extents_type::rank() > 0) > >>>>> { > >>>>> __glibcxx_assert(__i < extents_type::rank()); > >>>>> - return __mdspan::__fwd_prod(_M_extents, __i); > >>>>> + if constexpr (extents_type::rank() == 1) > >>>>> + return 1; > >>>>> + else if constexpr (extents_type::rank() == 2) > >>>>> + return __i == 0 ? 1 : _M_extents.extent(0); > >>>>> + else > >>>>> + return __mdspan::__fwd_prod(_M_extents, __i); > >>>>> } > >>>>> > >>>>> template<typename _OExtents> > >>>>> @@ -797,7 +802,12 @@ _GLIBCXX_BEGIN_NAMESPACE_VERSION > >>>>> requires (extents_type::rank() > 0) > >>>>> { > >>>>> __glibcxx_assert(__i < extents_type::rank()); > >>>>> - return __mdspan::__rev_prod(_M_extents, __i); > >>>>> + if constexpr (extents_type::rank() == 1) > >>>>> + return 1; > >>>>> + else if constexpr (extents_type::rank() == 2) > >>>>> + return __i == 0 ? _M_extents.extent(1) : 1; > >>>>> + else > >>>>> + return __mdspan::__rev_prod(_M_extents, __i); > >>>>> } > >>>>> > >>>>> template<typename _OExtents> > >>>>> -- > >>>>> 2.50.0 > >>>>> > >>>>> > >>>> > >>> > >>> > > > >