Jakub Jelinek <ja...@redhat.com> writes:
> On Mon, Apr 09, 2018 at 06:47:45PM +0100, Richard Sandiford wrote:
>> In this PR we used WIDEN_SUM_EXPR to vectorise:
>>   short i, y;
>>   int sum;
>>   [...]
>>   for (i = x; i > 0; i--)
>>     sum += y;
>> with 4 ints and 8 shorts per vector.  The problem was that we set
>> the VF based only on the ints, then calculated the number of vector
>> copies based on the shorts, giving 4/8.  Previously that led to
>> ncopies==0, but after r249897 we pick it up as an ICE.
>> In this particular case we could vectorise the reduction by setting
>> ncopies based on the output type rather than the input type, but it
>> doesn't seem worth adding a special "optimisation" for such a
>> pathological case.  I think it's really an instance of the more general
>> problem that we can't vectorise using combinations of (say) 64-bit and
>> 128-bit vectors on targets that support both.
> We badly need that; there are plenty of PRs where we generate really large
> vectorized loops because of it, e.g. on x86 where we can easily use 128-bit,
> 256-bit and 512-bit vectors.  But I'm afraid it is not stage4 material.

Yeah.  We also need it on AArch64 for a proper implementation of simd
clones for Advanced SIMD.

I think it's related to one of the most important missed optimisations
for SVE: when using mixed data sizes, it's usually better to store the
smaller data unpacked in wider lanes, and there's direct support for
loading and storing it that way.  In both the SVE and non-SVE cases,
we sometimes want the VF to be based on a wider element size rather
than the narrowest one.
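
As a rough sketch of the arithmetic behind the failure quoted above
(the helper names nunits and ncopies are hypothetical and are not the
vectoriser's real interfaces), assuming 128-bit vectors, 4-byte ints
and 2-byte shorts:

```c
/* Illustration only: hypothetical helpers, not GCC internals.  */

/* Number of ELEM_BYTES-sized elements in a VECTOR_BYTES-sized vector.  */
static unsigned nunits(unsigned vector_bytes, unsigned elem_bytes)
{
    return vector_bytes / elem_bytes;
}

/* Number of vector statements needed to cover VF scalar iterations
   when each vector holds UNITS elements.  Integer division truncates,
   so the result is 0 whenever VF < UNITS.  */
static unsigned ncopies(unsigned vf, unsigned units)
{
    return vf / units;
}
```

With VF = nunits(16, 4) = 4 taken from the ints and the copies counted
on the shorts, ncopies(4, nunits(16, 2)) is 4/8 = 0.  If the VF were 8
instead, the ints would need 2 copies and the shorts 1.  If the shorts
were held unpacked in int-sized lanes, there would be 4 per vector and
a VF of 4 would give 1 copy.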

FWIW, I have some patches queued for GCC 9 that should make it
easier to implement this (but no promises).  They're also supposed
to make it possible to compare the costs of different implementations
side-by-side, rather than always picking the first one that has
a lower cost than the scalar code.
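
To make the difference between the two selection strategies concrete,
here is a minimal sketch.  The struct, the function names and any cost
numbers fed to them are invented for illustration; they do not
describe the queued patches or the real cost model.

```c
#include <stddef.h>

/* Hypothetical candidate implementation with an abstract cost.  */
struct candidate {
    unsigned vector_bits;
    unsigned cost;
};

/* Return the index of the first candidate that is cheaper than the
   scalar loop, or -1 if there is none.  */
static int pick_first_cheaper(const struct candidate *c, size_t n,
                              unsigned scalar_cost)
{
    for (size_t i = 0; i < n; i++)
        if (c[i].cost < scalar_cost)
            return (int) i;
    return -1;
}

/* Compare all candidates and return the index of the cheapest one
   that still beats the scalar loop, or -1 if there is none.  */
static int pick_cheapest(const struct candidate *c, size_t n,
                         unsigned scalar_cost)
{
    int best = -1;
    unsigned best_cost = scalar_cost;
    for (size_t i = 0; i < n; i++)
        if (c[i].cost < best_cost) {
            best = (int) i;
            best_cost = c[i].cost;
        }
    return best;
}
```

For made-up costs of 90, 60 and 70 for 512-bit, 256-bit and 128-bit
candidates against a scalar cost of 100, the first function settles
on the 512-bit candidate, while the second picks the 256-bit one.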
