https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109287
Bug ID: 109287
Summary: Optimizing sal shr pairs when inlining function
Product: gcc
Version: 12.2.0
URL: https://gcc.godbolt.org/z/aPTsjc1sM
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: milasudril at gmail dot com
Target Milestone: ---
Target: x86-64_linux_gnu
I was trying to construct a span type for working with a tile-based image:
```
#include <cstdint>
#include <type_traits>
#include <cstddef>
template<class T, size_t TileSize>
class span_2d_tiled
{
public:
    using IndexType = size_t;
    static constexpr size_t tile_size()
    {
        return TileSize;
    }
    constexpr explicit span_2d_tiled(): span_2d_tiled{0u, 0u, nullptr} {}
    constexpr explicit span_2d_tiled(IndexType w, IndexType h, T* ptr):
        m_tilecount_x{1 + (w - 1)/TileSize},
        m_tilecount_y{1 + (h - 1)/TileSize},
        m_ptr{ptr}
    {}
    constexpr auto tilecount_x() const { return m_tilecount_x; }
    constexpr auto tilecount_y() const { return m_tilecount_y; }
    constexpr T& operator()(IndexType x, IndexType y) const
    {
        auto const x_tile = x/TileSize;
        auto const y_tile = y/TileSize;
        auto const x_offset = x%TileSize;
        auto const y_offset = y%TileSize;
        auto const tile_start = y_tile*m_tilecount_x + x_tile;
        return *(m_ptr + tile_start + y_offset*TileSize + x_offset);
    }
private:
    IndexType m_tilecount_x;
    IndexType m_tilecount_y;
    T* m_ptr;
};
template<size_t TileSize, class Func>
void visit_tiles(size_t x_count, size_t y_count, Func&& f)
{
    for(size_t k = 0; k != y_count; ++k)
    {
        for(size_t l = 0; l != x_count; ++l)
        {
            for(size_t y = 0; y != TileSize; ++y)
            {
                for(size_t x = 0; x != TileSize; ++x)
                {
                    f(l*TileSize + x, k*TileSize + y);
                }
            }
        }
    }
}
void do_stuff(float);
void call_do_stuff(span_2d_tiled<float, 16> foo)
{
    visit_tiles<decltype(foo)::tile_size()>(foo.tilecount_x(),
        foo.tilecount_y(), [foo](size_t x, size_t y){
            do_stuff(foo(x, y));
        });
}
```
Here, the user of this API wants to access individual pixels. Thus, the
coordinates are transformed before calling f: we multiply the tile index by
TileSize and add the offset within the tile. In the callback, the pixel value
is looked up. But now we must find out which tile the pixel is in, and the
offset within that tile, which means that the inverse transformation must be
applied. As can be seen in the Godbolt link, GCC does not appear to recognize
that the two transformations cancel once the callback has been inlined.
However, the latest clang appears to do a much better job with the same
settings. It also unrolls the inner loop, with a much better result than if I
used
```
#pragma GCC unroll 16
```
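To make the expected simplification concrete, here is a minimal sketch of the
identity I would like the optimizer to exploit. The helper
round_trip_is_identity is hypothetical and not part of the test case above. It
assumes that tile*TileSize + offset does not overflow size_t and that
offset < TileSize, which is what the loop bounds in visit_tiles guarantee.
```
#include <cstddef>

// Hypothetical helper, not part of the original test case: checks that
// dividing by TileSize and taking the remainder undoes the forward
// transformation tile*TileSize + offset for every offset < TileSize.
template<std::size_t TileSize>
constexpr bool round_trip_is_identity(std::size_t tile_count)
{
    for(std::size_t tile = 0; tile != tile_count; ++tile)
    {
        for(std::size_t offset = 0; offset != TileSize; ++offset)
        {
            // Forward transformation, as done in visit_tiles
            auto const pixel = tile*TileSize + offset;
            // Inverse transformation, as done in operator()
            if(pixel/TileSize != tile || pixel%TileSize != offset)
            { return false; }
        }
    }
    return true;
}

static_assert(round_trip_is_identity<16>(8),
    "division and remainder should recover tile index and offset");
```
With TileSize equal to 16, the multiplication and the division become a left
shift and a right shift by 4, so after inlining I would expect x_tile and
x_offset to be replaced by l and x directly (and likewise for y).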