On Wed, 9 Jan 2019, Jakub Jelinek wrote:
> On Wed, Jan 09, 2019 at 11:10:25AM -0500, David Malcolm wrote:
> > extern void vf1()
> > {
> >   #pragma vectorize enable
> >   for ( int i = 0 ; i < 32768 ; i++ )
> >     data [ i ] = std::sqrt ( data [ i ] ) ;
> > }
> >
> > Compiling on this x86_64 box with -fopt-info-vec-missed shows the
>
> > _7 = .SQRT (_1);
> > if (_1 u>= 0.0)
> > goto <bb 8>; [99.95%]
> > else
> > goto <bb 4>; [0.05%]
> >
> > <bb 8> [local count: 1062472912]:
> > goto <bb 5>; [100.00%]
> >
> > <bb 4> [local count: 531495]:
> > __builtin_sqrtf (_1);
> >
> > I'm not sure where that control flow came from: it isn't in
> > sqrt-test.cc.104t.stdarg
> > but is in
> > sqrt-test.cc.105t.cdce
> > so I think it's coming from the argument-range code in cdce.
> >
> > Arguably the location on the statement is wrong: it's on the loop
> > header, when it presumably should be on the std::sqrt call.
>
> See my other mail; it is the result of the -fmath-errno default.
> The inline emitted sqrt doesn't handle errno setting, so we emit
> essentially
>   x = sqrt (arg); if (__builtin_expect (arg < 0.0, 0)) sqrt (arg);
> where the former sqrt is inline using HW instructions and the latter
> is the library call.
>
> With some extra work we could vectorize it; e.g. if we make it handle
> OpenMP #pragma omp ordered simd efficiently, it would be the same thing
> - allow non-vectorizable portions of vectorized loops by doing there a
> scalar loop from 0 to vf-1 doing the non-vectorizable stuff + drop the
> limitation that the vectorized loop is a single bb. Essentially, in this
> case it would be
>   vec1 = vec_load (data + i);
>   vec2 = vec_sqrt (vec1);
>   if (__builtin_expect (any (vec2 < 0.0), 0))
>     {
>       for (int i = 0; i < vf; i++)
>         sqrt (vec2[i]);
>     }
>   vec_store (data + i, vec2);
> If that turns out to be way too hard, we could for the vectorization
> purposes hide that into the .SQRT internal fn, say add a fndecl argument to
> it if it should treat the exceptional cases some way so that the control
> flow isn't visible in the vectorized loop.

If we decide it's worth the trouble I'd rather do that in the epilogue
and thus make the any (vec2 < 0.0) a reduction.  Like

  smallest = min(smallest, vec1);

and after the loop do the errno thing on the smallest element.
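
As a rough source-level sketch of that shape (not what any pass emits
today): the loop body keeps only a min reduction over the inputs, and the
library sqrt is called once afterwards on the smallest input if it is
negative.  The function name sqrt_inplace_deferred_errno is made up for
illustration.  The reduction is done on the inputs because the sqrt
result for a negative input is NaN, not a negative value.  errno is saved
and restored around the loop purely to emulate an inline sqrt that leaves
errno alone; that part is an assumption of the sketch, not part of the
transform.

```cpp
#include <cerrno>
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical helper, for illustration only: take square roots in place
// and defer the errno handling to a single check after the loop.
void sqrt_inplace_deferred_errno(float *data, std::size_t n)
{
    float smallest = std::numeric_limits<float>::infinity();

    // Emulate an errno-free inline sqrt: whatever std::sqrt does to errno
    // inside the loop is undone by restoring the saved value afterwards.
    const int saved_errno = errno;

    for (std::size_t i = 0; i < n; i++) {
        const float x = data[i];
        // Min reduction over the inputs.  A NaN input compares false and
        // is skipped, which matches sqrt (NaN) not being a domain error.
        if (x < smallest)
            smallest = x;
        data[i] = std::sqrt(x);
    }

    errno = saved_errno;

    // Epilogue: "the errno thing" on the smallest element.  If any input
    // was negative, the smallest one is, so one library call suffices.
    if (smallest < 0.0f) {
        volatile float sink = std::sqrt(smallest);
        (void) sink;
    }
}
```

With inputs { 4, 9, -1, 16 } the stored results are 2, 3, NaN, 4 and,
on a libm that reports math errors through errno, errno ends up as EDOM
from the one call in the epilogue.  With no negative input the epilogue
call is skipped and errno is left as it was on entry.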
That said, this is a transform that is probably worthwhile even
on scalar code, possibly easiest to code-gen right from the start
in the call-dce pass.
Richard.