https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124580
Bug ID: 124580
Summary: RISCV: Redundant memory loads in x264_pixel_sad function
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: guohuawen7 at gmail dot com
CC: chenzhongyao.hit at gmail dot com
Target Milestone: ---
In SPEC2017's 525.x264_r benchmark, the function x264_pixel_sad_x3_8x8 loads
the same fenc data three times, once per call to x264_pixel_sad_8x8:
void x264_pixel_sad_x3_8x8( uint8_t *fenc, uint8_t *pix0, uint8_t *pix1,
                            uint8_t *pix2, int i_stride, int scores[3] )
{
    scores[0] = x264_pixel_sad_8x8( fenc, FENC_STRIDE, pix0, i_stride );
    scores[1] = x264_pixel_sad_8x8( fenc, FENC_STRIDE, pix1, i_stride );
    scores[2] = x264_pixel_sad_8x8( fenc, FENC_STRIDE, pix2, i_stride );
}
The code GCC currently generates performs 24 loads of fenc (8 rows × 3 calls),
as can be seen at https://godbolt.org/z/oWEvh8des. Since the three SAD
calculations do not depend on each other, they could be fused so that each
fenc row is loaded only once (8 loads in total).
The optimization logic can be represented as follows:
orig:
  Loop 1 (8 iterations): load fenc[y], load pix0[y], SAD -> sum0
  Loop 2 (8 iterations): load fenc[y], load pix1[y], SAD -> sum1
  Loop 3 (8 iterations): load fenc[y], load pix2[y], SAD -> sum2
  Total fenc loads: 8 * 3 = 24

my goal:
  Loop 1 (8 iterations):
    load fenc[y]              // loaded once
    load pix0[y], SAD -> sum0
    load pix1[y], SAD -> sum1
    load pix2[y], SAD -> sum2
  Total fenc loads: 8 * 1 = 8
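
For illustration, the "my goal" scheme could be written by hand in C roughly as
below. This is only a sketch of the intended result, not GCC output and not
x264's actual source: the names sad_8x8_ref and sad_x3_8x8_fused are
hypothetical, FENC_STRIDE is assumed to be 16, and the sums are kept in locals
and stored at the end, which assumes that scores does not overlap fenc or the
pix buffers.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: FENC_STRIDE is 16; the report does not state its value. */
#define FENC_STRIDE 16

/* Hypothetical reference: one 8x8 SAD, as each of the three calls computes. */
static int sad_8x8_ref(const uint8_t *a, int a_stride,
                       const uint8_t *b, int b_stride)
{
    int sum = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            sum += abs(a[x] - b[x]);
        a += a_stride;
        b += b_stride;
    }
    return sum;
}

/* Hypothetical fused form: each fenc pixel is read once per row and reused
   for all three sums. Assumes scores does not alias the pixel buffers. */
static void sad_x3_8x8_fused(const uint8_t *fenc, const uint8_t *pix0,
                             const uint8_t *pix1, const uint8_t *pix2,
                             int i_stride, int scores[3])
{
    int sum0 = 0, sum1 = 0, sum2 = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            int f = fenc[x];            /* loaded once, used three times */
            sum0 += abs(f - pix0[x]);
            sum1 += abs(f - pix1[x]);
            sum2 += abs(f - pix2[x]);
        }
        fenc += FENC_STRIDE;
        pix0 += i_stride;
        pix1 += i_stride;
        pix2 += i_stride;
    }
    scores[0] = sum0;
    scores[1] = sum1;
    scores[2] = sum2;
}
```

With non-overlapping buffers, the fused function should produce the same three
scores as three separate calls to the reference SAD.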
I would like to implement this optimization. Could you advise which part of
GCC would be the most appropriate place to implement it?