On Fri, 15 Jun 2012, Richard Henderson wrote:

> ... but not actually fixing it.
> 
> I was hoping that the first patch might give the vectorizer enough
> info to solve the costing problem, but no such luck.  Nevertheless
> it would seem that not having the info present at all would be a
> bit of a hindrance when actually tweaking the vectorizer later...
> 
> FYI the "equivalence" of fabs/fmul in the integer simd costing is
> based upon the AMD K8 document I had handy.  I don't know that I've
> ever seen proper latencies and such published for the Intel CPUs.
> 
> The second patch implements something that I mentioned in the PR,
> that we ought to be decomposing expensive vector multiplies just
> like we do for scalar multiplies.  I'm really quite surprised that
> we hadn't noticed that v*4 wasn't being implemented via a shift...
> 
> The third patch implements something that Richi mentioned in the
> PR, that we ought not be trying to shuffle the elements of a
> const_vector around; do that beforehand.
> 
> ---
> 
> Some additional notes on the testcase in the PR:
> 
> The computation is 3 iterations of a hash function:
>       (x + 12345) * 914237 - 13.
> 
> The -13 folds into the subsequent +12345 well enough, and so the
> simple expansion of this results in 7 operations.  And that's 
> exactly what we get when unrolling and vectorizing.
> 
> However, for the non-vectorized version, combine smooshes together
> (2.5) iterations of the hash function, utilizing modulo arithmetic:
>       h1 = (x + 12345) * 914237 - 13
>       h2 = (h1 + 12345) * 914237 - 13
>       h3 = (h2 + 12345) * 914237 - 13
>          = x*764146064584710053 + 10318335160567660
>          = x*0x101597a5 + 0x9deb476c (mod 2**32)
> 
> which is of course only 2 operations (combine actually misses one
> and leaves an extra plus, for 3 operations).  This means that, even
> leaving aside everything above, the vectorized code is having
> to work much harder than the scalar code in the end.
> 
> After manually adjusting complete_hash_func with the above
> substitution, even the pre-sse4 vectorized version is suddenly
> faster than the unvectorized version (with 100000 iterations and
> these patches):
> 
> scalar:  4.69 sec
> sse2:    2.46 sec
> sse4:    1.39 sec
> 
> So, it's not *really* the costing inside the vectorizer at all,
> which raises the question of why we're not taking modulo arithmetic
> into account earlier in the optimization pipeline.

We certainly could, at the expense of losing the fact that the signed
expressions do not overflow.  A few years ago I started to add
infrastructure on the no-undefined-overflow branch that would allow
doing this without intermediate casting to unsigned - but work on it
stalled (and I won't be able to pick it up again in time for 4.8,
but hopefully for 4.9).  On the GIMPLE level it should be the reassoc
pass that eventually decides that using unsigned arithmetic is
profitable because it can then associate.

Richard.

> Bootstrapped and tested on x86_64, but I'll leave some time for
> comment before committing any of this.
> 
> 
> r~
> 
> 
> Richard Henderson (3):
>   Add rtx costs for sse integer ops
>   Use synth_mult for vector multiplies vs scalar constant
>   Handle const_vector in mulv4si3 for pre-sse4.1.
> 
>  gcc/config/i386/i386-protos.h |    1 +
>  gcc/config/i386/i386.c        |  126 +++++++++++-
>  gcc/config/i386/predicates.md |    7 +
>  gcc/config/i386/sse.md        |   72 ++------
>  gcc/expmed.c                  |  438 +++++++++++++++++++++++------------------
>  gcc/machmode.h                |    8 +-
>  6 files changed, 386 insertions(+), 266 deletions(-)
> 
> 

-- 
Richard Guenther <rguent...@suse.de>
SUSE / SUSE Labs
SUSE LINUX Products GmbH - Nuernberg - AG Nuernberg - HRB 16746
GF: Jeff Hawn, Jennifer Guild, Felix Imend
