------- Comment #3 from piotr dot wyderski at gmail dot com 2010-01-17 20:46 -------
This is generic code, as it covers two bug reports. In fact, it will probably be used as the basis for two additional missed-optimization reports, so I thought it would be good to provide the code of the entire sandbox.
To be more specific: the vectors passed to combine() are constant, so the compiler should not re-evaluate the base addresses of the m_Data arrays on every iteration, as it does above:

    mov (%edi),%ecx
    ...
    mov 0x0(%ebp),%ecx
    ...
    mov (%esi),%ecx

A single base address fetch phase, followed by index-based addressing with an induction variable scaled by a factor of 16, would be better, e.g.:

    // esi = src1
    // edi = src2
    // ebx = dst
    // edx = induction variable

L0: cmpl %edx, max_index
    je L1
    movdqa (%esi,%edx,1),%xmm0
    por (%edi,%edx,1),%xmm0
    pxor %xmm1, %xmm0
    movdqa %xmm0, (%ebx,%edx,1)
    add $16, %edx
    jmp L0
L1:

which is how I would have written it by hand in assembler. An aggressively unrolled version (say, four-way) with prefetching for longer blocks would also be welcome.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=42779
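For illustration only, the single base-address fetch requested above can be spelled out at source level by copying the m_Data pointers into locals before the loop. This is a minimal sketch, not the sandbox code: the Block type, its m_Size member, the function name combine_hoisted and the mask parameter are all hypothetical, and plain 32-bit words stand in for the 128-bit SSE values. Whether GCC then emits the indexed addressing shown above is not guaranteed.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the sandbox's vector type; only the member
// name m_Data comes from the report, everything else is assumed.
struct Block {
    std::uint32_t* m_Data;
    std::size_t    m_Size;
};

// Computes dst[i] = (a[i] | b[i]) ^ mask, mirroring the por/pxor sequence.
// The three base addresses are read exactly once, before the loop, so the
// loop body only needs the induction variable i.
inline void combine_hoisted(const Block& a, const Block& b, Block& dst,
                            std::uint32_t mask) {
    const std::uint32_t* const src1 = a.m_Data;   // fetched once
    const std::uint32_t* const src2 = b.m_Data;   // fetched once
    std::uint32_t* const       out  = dst.m_Data; // fetched once
    const std::size_t          n    = dst.m_Size;

    for (std::size_t i = 0; i != n; ++i) {
        out[i] = (src1[i] | src2[i]) ^ mask;
    }
}
```

The sketch assumes all three blocks hold at least dst.m_Size elements.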