> This contradicts
> 
> /* X86_TUNE_READ_MODIFY_WRITE: Enable use of read modify write instructions
>    such as "add $1, mem".  */
> DEF_TUNE (X86_TUNE_READ_MODIFY_WRITE, "read_modify_write",
>           ~(m_PENT | m_LAKEMONT))
> 
> which enables "andl $0, (%edx)" for PentiumPro.   "andl $0, (%edx)" works
> well on PentiumPro.

It is also enabled for Zen, but that does not mean that andl $0, (%edx)
is a good way of clearing memory when optimizing for speed.

jan@padlo:/tmp> cat t.c
int mem;
int
main()
{
        for (int i = 0; i < 1000000000; i++)
#ifdef AND
                asm volatile ("andl $0, %0":"=m"(mem));
#else
#ifdef SPLIT
                asm volatile ("xorl %%eax, %%eax; movl $0, %0":"=m"(mem)::"eax");
#else
                asm volatile ("movl $0, %0":"=m"(mem));
#endif
#endif
        return 0;
}
jan@padlo:/tmp> gcc  -O2 t.c ; time ./a.out

real    0m0.405s
user    0m0.403s
sys     0m0.002s
jan@padlo:/tmp> gcc  -O2 -DSPLIT t.c ; time ./a.out

real    0m0.406s
user    0m0.404s
sys     0m0.001s
jan@padlo:/tmp> gcc  -O2 -DAND t.c ; time ./a.out

real    0m2.824s
user    0m2.822s
sys     0m0.001s

andl is slower than movl because it introduces an unnecessary memory read.
I don't have a PentiumPro to test on, but there the -DSPLIT variant should be
a bit better, since the movl $0 instruction exceeds 7 bytes.
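
To make the byte counting concrete, here is a rough sketch of the encoding
lengths involved. It assumes 32-bit code with an absolute (disp32) address
and no prefixes. The helper names are made up for illustration and are not
anything in GCC, and the "more than 7 bytes" check just restates the manual's
wording rather than the actual cost-table threshold.

```c
#include <stdio.h>

/* Sizes of the encoding components, in bytes.  Assumption: 32-bit code,
   absolute disp32 addressing, no prefixes.  */
enum { OPCODE = 1, MODRM = 1, DISP32 = 4, IMM8 = 1, IMM32 = 4 };

/* movl $0, mem : C7 /0, ModRM, disp32, imm32  */
int len_mov_imm32_mem (void) { return OPCODE + MODRM + DISP32 + IMM32; }

/* andl $0, mem : 83 /4, ModRM, disp32, sign-extended imm8  */
int len_and_imm8_mem (void) { return OPCODE + MODRM + DISP32 + IMM8; }

/* xorl %eax, %eax : 31 /r, ModRM  */
int len_xor_reg_reg (void) { return OPCODE + MODRM; }

/* movl %eax, mem : 89 /r, ModRM, disp32.  (For %eax the assembler may
   pick the shorter A3 moffs32 form; the generic form is counted here.)  */
int len_mov_reg_mem (void) { return OPCODE + MODRM + DISP32; }

/* Hypothetical check mirroring "more than seven bytes long".  */
int exceeds_seven_bytes (int len) { return len > 7; }

/* Print each length and whether it crosses the 7-byte line.  */
void print_lengths (void)
{
  printf ("movl $0, mem    : %d bytes, long=%d\n", len_mov_imm32_mem (),
          exceeds_seven_bytes (len_mov_imm32_mem ()));
  printf ("andl $0, mem    : %d bytes, long=%d\n", len_and_imm8_mem (),
          exceeds_seven_bytes (len_and_imm8_mem ()));
  printf ("xorl %%eax, %%eax : %d bytes, long=%d\n", len_xor_reg_reg (),
          exceeds_seven_bytes (len_xor_reg_reg ()));
  printf ("movl %%eax, mem  : %d bytes, long=%d\n", len_mov_reg_mem (),
          exceeds_seven_bytes (len_mov_reg_mem ()));
}
```

Calling print_lengths () from main shows that only the store of a 32-bit
immediate to an absolute address crosses the line, while each half of a
xorl plus register store pair stays under it. With a register-indirect
address such as (%edx) there is no disp32, so none of these forms is long.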

Looking into the history of that knob, it was added by me in
https://gcc.gnu.org/pipermail/gcc-patches/1999-July/014219.html

to control the behaviour of the splitter that split the move if it was longer
than 7 bytes, which implemented the following recommendation of the
Intel optimization manual:

"Avoid instructions that contain four or more micro-ops or instructions that
are more than seven bytes long. If possible, use instructions that require one
micro-op"

So the comment on SPLIT_LONG_MOVES is a bit incorrect, since it does not
mention that the move needs to exceed the long_insn threshold.

I am not sure how much we need to care about PPro performance these days,
though.

Honza
