On Thu, May 09, 2013 at 09:47:49PM +0200, Uros Bizjak wrote:
> On Thu, May 9, 2013 at 8:45 PM, Jakub Jelinek <[email protected]> wrote:
> > 2013-05-09 Jakub Jelinek <[email protected]>
> >
> > * config/i386/i386.md (rotateinv): New code attr.
> > (*<rotate_insn><mode>3_1, *<rotate_insn>si3_1_zext,
> > *<rotate_insn>qi3_1_slp): Emit rorl %eax instead of
> > roll $31, %eax, etc.
>
> OK.
Thanks. While we are talking about rotates: they seem to be quite expensive
on SandyBridge if the destination (== source1) is memory.
With -O3 -mavx:
unsigned int a[1024] = { 0, 1, 2, 3, 4, 5, 6 };
__attribute__((noinline, noclone)) void
foo (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    {
      int j = i & 31;
      a[i] = (a[i] << j) ^ (a[i] >> ((-j) & 31));
    }
}

int
main ()
{
  int i;
  for (i = 0; i < 1000000; i++)
    foo ();
  return 0;
}
the timings for vanilla gcc, gcc with the rotate recognizer patch, and the
latter hand-edited to use the rotate on a register are:
                 vanilla     rotate (mem)  rotate (reg)
Intel i7-2600    0m1.421s    0m1.914s      0m0.826s
AMD 8354         0m2.353s    0m0.988s      0m1.407s
The vanilla gcc loop is long:
.L2:
	movl	a(,%rax,4), %edi
	movl	%eax, %esi
	andl	$31, %esi
	movl	%esi, %ecx
	negl	%ecx
	movl	%edi, %edx
	shrl	%cl, %edx
	movl	%esi, %ecx
	sall	%cl, %edi
	xorl	%edi, %edx
	movl	%edx, a(,%rax,4)
	addq	$1, %rax
	cmpq	$1024, %rax
	jne	.L2
With rotate using mem:
.L2:
	roll	%cl, a(,%rcx,4)
	addq	$1, %rcx
	cmpq	$1024, %rcx
	jne	.L2
and with the hand-tweaked rotate:
.L2:
	movl	a(,%rcx,4), %eax
	roll	%cl, %eax
	addq	$1, %rcx
	cmpq	$1024, %rcx
	movl	%eax, a-4(,%rcx,4)
	jne	.L2
So, on SandyBridge (I haven't tried other Intel CPUs) it might be beneficial
to disallow the rotate alternative with the "=m", "0" constraints, while for
the above-mentioned AMD it is best to use it. I guess the above isn't
sufficient data to decide that; we'd need numbers from more CPUs and more
rotation sizes.
BTW, with -mavx2 we vectorize the loop without the rotate recognition patch,
but with the patch we don't vectorize it anymore; I'll need to adjust the
pattern recognizer to handle rotates. The question is whether, when the
vector rotate optab is missing, we can replace it with the probably faster
and shorter
  x r<< y  ->  (x << y) | (x >> (32 - y))
which is invalid for y = 0, but perhaps if the only thing the vector shift
would do for a shift count of 32 is return either 0 or x, it would still
work; otherwise we'd want to go for
  (x << y) | (x >> ((-y) & 31)).
The i?86/x86_64 vector shifts don't truncate the shift count, so perhaps
best would be a vector rotate expander that expands it into some special
insn like a vectorized y ? x >> (32 - y) : 0.
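Just to illustrate the scalar semantics of the two choices (a sketch only,
the function names are mine): the first variant is undefined for a zero
count, because the right-shift count becomes 32, while the negated-and-masked
variant is fine for any count in 0..31:

#include <assert.h>

/* Invalid for y == 0: x >> 32 is out of range for a 32-bit type
   (and a vector shift by 32 may yield 0 or x depending on the target).  */
static unsigned int
rotl_sub (unsigned int x, unsigned int y)
{
  return (x << y) | (x >> (32 - y));
}

/* Well defined for y in 0..31: (-y) & 31 maps 0 to 0 and y to 32 - y.  */
static unsigned int
rotl_negmask (unsigned int x, unsigned int y)
{
  return (x << y) | (x >> ((-y) & 31));
}

int
main ()
{
  unsigned int x = 0xdeadbeefU;
  assert (rotl_negmask (x, 0) == x);
  assert (rotl_negmask (x, 3) == rotl_sub (x, 3));
  return 0;
}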
One question is whether our LROTATE_EXPR/RROTATE_EXPR are only defined for
rotation counts 0 through bitsize - 1 (which is the case when we pattern
detect them; do we create *ROTATE_EXPR any other way?), or whether we allow
arbitrary rotation counts (in which case the rotation would presumably
always be done modulo the bitsize). Because if the rotation count can be
arbitrary, we'd need to vectorize it as
  (x << (y & 31)) | (x >> ((-y) & 31))
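As a scalar sketch of that fully general form (again, the function name is
mine): masking both shift counts keeps the shifts in the valid 0..31 range
and makes the rotate behave modulo the bit size for any count:

#include <assert.h>

static unsigned int
rotl_any (unsigned int x, unsigned int y)
{
  /* y & 31 and (-y) & 31 both stay in 0..31 and sum to 0 or 32, so this
     is well defined for arbitrary y and equals a rotate by y % 32.  */
  return (x << (y & 31)) | (x >> ((-y) & 31));
}

int
main ()
{
  unsigned int x = 0x12345678U;
  assert (rotl_any (x, 32) == x);                /* rotate by 32 == rotate by 0 */
  assert (rotl_any (x, 33) == rotl_any (x, 1));  /* counts reduce mod 32 */
  return 0;
}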
Jakub