>
> Since read-modify-write is enabled for PentiumPro:
>
> /* X86_TUNE_READ_MODIFY_WRITE: Enable use of read modify write instructions
>    such as "add $1, mem".  */
> DEF_TUNE (X86_TUNE_READ_MODIFY_WRITE, "read_modify_write",
>           ~(m_PENT | m_LAKEMONT))
>
> should this
>
> /* Generate "and $0,mem" and "or $-1,mem", instead of "mov $0,mem" and
>    "mov $-1,mem" with shorter encoding for TARGET_SPLIT_LONG_MOVES with
>    TARGET_READ_MODIFY_WRITE or -Oz.  */
> #define TARGET_USE_AND0_ORM1_STORE \
>   ((TARGET_SPLIT_LONG_MOVES && TARGET_READ_MODIFY_WRITE) \
>    || (optimize_insn_for_size_p () && optimize_size > 1))

I really think we are mixing performance and code size optimizations.
I may be misremembering, but I believe that on PPro

    movl $0, (%edx)

is slower than

    xorl %eax, %eax
    movl %eax, (%edx)

due to hardware limitations on decoding instructions with long
encodings.  However,

    andl $0, (%edx)

is even slower than both of the above, since it is a read-modify-write
instruction while both variants above only write.  I do not think the
hardware special-cases this.

The situation is different when you actually do a read-modify-write.
If read_modify_write is set we produce:

    andl $1, (%edx)

while if it is unset we produce:

    movl (%edx), %eax
    andl $1, %eax
    movl %eax, (%edx)

which scheduled better on the original Pentium, provided an extra
register is available.

Honza