This is an optimization pass which leads to dramatically better code on
at least one SPEC benchmark. Ian, Roger, Diego, would one of you care
to review this?
My concern is that as formulated, conditional store elimination is not
always a win.
Transforming
    if (cond)
      *p = x;
into
    tmp = *p;
    if (cond)
      tmp = x;
    *p = tmp;
on its own, effectively transforms a conditional write to memory into an
unconditional write to memory.
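To make the difference concrete, here is a minimal sketch of the two forms
as standalone C functions. The function names are hypothetical and are not
taken from the patch; the point is only that the second form always loads
and always stores, whatever the value of cond.

```c
#include <stdbool.h>

/* Original form: memory is written only when cond is true. */
void store_conditional(int *p, int x, bool cond)
{
    if (cond)
        *p = x;
}

/* Transformed form: *p is always loaded and always stored back,
   even when cond is false and the value does not change. */
void store_unconditional(int *p, int x, bool cond)
{
    int tmp = *p;
    if (cond)
        tmp = x;
    *p = tmp;
}
```

Both functions leave the same value in *p; they differ only in how many
memory writes are issued.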
On many platforms, even x86, this is a pessimization. For example, the "Intel
Architecture Optimization Manual", available at
ftp://download.intel.com/design/PentiumII/manuals/24281603.PDF in section
3.5.5 "Write Allocation Effects", actually recommends the inverse
transformation. On page 3-21 they show how the "Sieve of Eratosthenes"
benchmark can be sped up on Pentium class processors by transforming the
line
    array[j] = 0;
into the equivalent
    if (array[j] != 0)
      array[j] = 0;
i.e. by introducing conditional stores.
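As a rough illustration, the inner loop of such a sieve could be written as
below. This is a sketch, not the code from the Intel manual: the helper name
clear_multiples, its parameters and the store counter are assumptions added
here so that the number of writes can be observed.

```c
#include <stddef.h>

/* Hypothetical helper: clears array[start], array[start + step], ...
   but only stores when the element is still non-zero.
   Assumes step > 0. Returns the number of stores actually performed. */
size_t clear_multiples(unsigned char *array, size_t n,
                       size_t start, size_t step)
{
    size_t stores = 0;
    for (size_t j = start; j < n; j += step) {
        if (array[j] != 0) {   /* conditional store */
            array[j] = 0;
            stores++;
        }
    }
    return stores;
}
```

Elements that an earlier pass has already cleared are read but not written
again, which is the effect the manual's recommendation relies on.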
The significant observation with Michael Matz's extremely impressive 26%
improvement on 456.hmmer is the interaction of this transformation with
other passes, which allows the conditional store to be hoisted out of a
critical loop. By reading the value into a "tmp" before the loop,
conditionally storing to the register tmp in the loop, then unconditionally
writing the result back afterwards, we dramatically reduce the number of
memory writes, rather than increase them as when this transformation is
applied in isolation.
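A small sketch of the loop case follows. It is not the 456.hmmer loop; the
running-maximum example and both function names are hypothetical, and it
assumes that p does not point into v, so keeping the value in a register
across the loop is safe.

```c
#include <stddef.h>

/* Before: a conditional store to *p on every iteration;
   up to n memory writes. */
void max_in_memory(int *p, const int *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (v[i] > *p)
            *p = v[i];
}

/* After: load once before the loop, update the register copy
   conditionally inside it, store once afterwards.
   Assumes p does not alias any element of v. */
void max_in_register(int *p, const int *v, size_t n)
{
    int tmp = *p;
    for (size_t i = 0; i < n; i++)
        if (v[i] > tmp)
            tmp = v[i];
    *p = tmp;
}
```

In the second form the loop body contains no stores at all, so the single
unconditional write after the loop replaces up to n conditional ones.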
I think the correct fix is not to apply this transformation everywhere, but
to correctly identify those loop cases where it helps and perform the loop
transformation there. That is, conditional induction variable identification,
hoisting and sinking need to be improved, instead of pessimizing code to a
simpler form that allows our existing flawed passes to trigger.
I do very much like the loop-restricted version of this transformation, and
its impressive impact on HMMER (whose author Sean Eddy is a good friend).
Perhaps Mark might give revised versions of this patch special dispensation
to be applied in stage 3. I'd not expect any correctness issues/bugs, just
performance trade-offs that need to be investigated. Perhaps we should even
apply this patch as is during stage 2, and allow the potential non-loop
performance degradations to be addressed as follow-up patches and therefore
regression fixes suitable for stage 3?
Congratulations again to Michael for this impressive performance
improvement.
Roger
--
Roger Sayle, Ph.D.
OpenEye Scientific Software,
Suite #D, 9 Bisbee Court,
Santa Fe, New Mexico, 87508.