https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109287
Bug ID: 109287
Summary: Optimizing sal shr pairs when inlining function
Product: gcc
Version: 12.2.0
URL: https://gcc.godbolt.org/z/aPTsjc1sM
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: milasudril at gmail dot com
Target Milestone: ---
Target: x86-64_linux_gnu

I was trying to construct a span type to be used for working with a tile-based image:

```
#include <cstdint>
#include <type_traits>
#include <cstddef>

template<class T, size_t TileSize>
class span_2d_tiled
{
public:
    using IndexType = size_t;

    static constexpr size_t tile_size()
    {
        return TileSize;
    }

    constexpr explicit span_2d_tiled(): span_2d_tiled{0u, 0u, nullptr} {}

    constexpr explicit span_2d_tiled(IndexType w, IndexType h, T* ptr):
        m_tilecount_x{1 + (w - 1)/TileSize},
        m_tilecount_y{1 + (h - 1)/TileSize},
        m_ptr{ptr}
    {}

    constexpr auto tilecount_x() const { return m_tilecount_x; }

    constexpr auto tilecount_y() const { return m_tilecount_y; }

    constexpr T& operator()(IndexType x, IndexType y) const
    {
        auto const x_tile = x/TileSize;
        auto const y_tile = y/TileSize;
        auto const x_offset = x%TileSize;
        auto const y_offset = y%TileSize;
        auto const tile_start = y_tile*m_tilecount_x + x_tile;

        return *(m_ptr + tile_start + y_offset*TileSize + x_offset);
    }

private:
    IndexType m_tilecount_x;
    IndexType m_tilecount_y;
    T* m_ptr;
};

template<size_t TileSize, class Func>
void visit_tiles(size_t x_count, size_t y_count, Func&& f)
{
    for(size_t k = 0; k != y_count; ++k)
    {
        for(size_t l = 0; l != x_count; ++l)
        {
            for(size_t y = 0; y != TileSize; ++y)
            {
                for(size_t x = 0; x != TileSize; ++x)
                {
                    f(l*TileSize + x, k*TileSize + y);
                }
            }
        }
    }
}

void do_stuff(float);

void call_do_stuff(span_2d_tiled<float, 16> foo)
{
    visit_tiles<decltype(foo)::tile_size()>(foo.tilecount_x(), foo.tilecount_y(), [foo](size_t x, size_t y){
        do_stuff(foo(x, y));
    });
}
```

Here, the user of this API wants to access individual pixels.
Thus, the coordinates are transformed before calling f. To do so, we multiply by TileSize and add the appropriate offset. In the callback, the pixel value is looked up. But now we must find out which tile it is in, and the offset within that tile, which means that the inverse transformation must be applied.

As can be seen in the Godbolt link, GCC does not fully understand what is going on here. However, the latest clang appears to do a much better job with the same settings. It also unrolls the inner loop, with a much better result than when I used

```
#pragma GCC unroll 16
```
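To make the forward and inverse transformations described above concrete, here is a minimal sketch of the round trip. The names `tile_coord`, `to_pixel`, `to_tile` and `round_trips` are hypothetical helpers introduced for illustration only; they do not appear in the original test case. The sketch assumes C++17 and only shows that the division/remainder pair undoes the multiply/add pair under the stated conditions, not how any compiler performs the simplification.

```cpp
#include <cstddef>

// Hypothetical helper type, not part of the original report.
struct tile_coord
{
    std::size_t tile;
    std::size_t offset;
};

// Forward transformation, as in visit_tiles: l*TileSize + x.
template<std::size_t TileSize>
constexpr std::size_t to_pixel(std::size_t tile, std::size_t offset)
{
    return tile*TileSize + offset;
}

// Inverse transformation, as in operator(): x/TileSize and x%TileSize.
template<std::size_t TileSize>
constexpr tile_coord to_tile(std::size_t pixel)
{
    return tile_coord{pixel/TileSize, pixel%TileSize};
}

// True when the inverse recovers the original (tile, offset) pair.
// This holds when offset < TileSize and tile*TileSize does not wrap
// around; unsigned arithmetic is modular, so the second condition matters.
template<std::size_t TileSize>
constexpr bool round_trips(std::size_t tile, std::size_t offset)
{
    auto const c = to_tile<TileSize>(to_pixel<TileSize>(tile, offset));
    return c.tile == tile && c.offset == offset;
}

// In-range offsets, as produced by the two innermost loops.
static_assert(round_trips<16>(3, 5));
static_assert(round_trips<16>(0, 15));

// An offset equal to TileSize spills into the next tile.
static_assert(!round_trips<16>(3, 16));
```

With TileSize = 16 the multiplication and division are shifts by 4, which is presumably the sal/shr pair the summary refers to. Folding them away requires the compiler to establish both conditions in the comment on `round_trips`.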