On Tue, Apr 10, 2018 at 12:40 PM, Richard Sandiford
<richard.sandif...@linaro.org> wrote:
> Jakub Jelinek <ja...@redhat.com> writes:
>> On Mon, Apr 09, 2018 at 06:47:45PM +0100, Richard Sandiford wrote:
>>> In this PR we used WIDEN_SUM_EXPR to vectorise:
>>>   short i, y;
>>>   int sum;
>>>   [...]
>>>   for (i = x; i > 0; i--)
>>>     sum += y;
>>> with 4 ints and 8 shorts per vector.  The problem was that we set
>>> the VF based only on the ints, then calculated the number of vector
>>> copies based on the shorts, giving 4/8.  Previously that led to
>>> ncopies==0, but after r249897 we pick it up as an ICE.
>>> In this particular case we could vectorise the reduction by setting
>>> ncopies based on the output type rather than the input type, but it
>>> doesn't seem worth adding a special "optimisation" for such a
>>> pathological case.  I think it's really an instance of the more general
>>> problem that we can't vectorise using combinations of (say) 64-bit and
>>> 128-bit vectors on targets that support both.
>> We badly need that, there are plenty of PRs where we generate really large
>> vectorized loop because of it e.g. on x86 where we can easily use 128-bit,
>> 256-bit and 512-bit vectors; but I'm afraid it is not a stage4 material.
> Yeah.  We also need it on AArch64 for a proper implementation of simd
> clones for Advanced SIMD.
> I think it's related to one of the most important missed optimisations
> for SVE: when using mixed data sizes, it's usually better to store the
> smaller data unpacked in wider lanes, and there's direct support for
> loading and storing it that way.  In both the SVE and non-SVE cases,
> we want the VF sometimes to be based on wider sizes rather than the
> narrowest one.
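
For anyone following along, the 4/8 arithmetic from the original
report can be sketched as below.  This is only an illustration:
ncopies_for and the enum names are hypothetical and are not taken
from the vectoriser sources, and the lane counts assume 128-bit
vectors as in the PR.

```c
/* Hypothetical helper, not GCC's actual code: number of vector copies
   needed to cover VF scalar iterations when each vector holds NUNITS
   elements.  Unsigned integer division truncates, so a VF smaller than
   NUNITS yields zero copies.  */
static unsigned int
ncopies_for (unsigned int vf, unsigned int nunits)
{
  return vf / nunits;
}

/* Assumed lane counts for a 128-bit vector: 4 ints or 8 shorts.  */
enum { INT_NUNITS = 4, SHORT_NUNITS = 8 };

/* VF chosen from the ints only, as described in the quoted report.  */
enum { VF_FROM_INTS = INT_NUNITS };
```

Here ncopies_for (VF_FROM_INTS, SHORT_NUNITS) evaluates to 0, which is
the 4/8 case, whereas ncopies_for (VF_FROM_INTS, INT_NUNITS) evaluates
to 1, which corresponds to basing ncopies on the output type instead.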

Unfortunately, it's not easy to remove the limitation in full, and in
general doing so widens the space we need to search for the best
vectorization even further...

> FWIW, I have some patches queued for GCC 9 that should make it
> easier to implement this (but no promises).  They're also supposed
> to make it possible to compare the costs of different implementations
> side-by-side, rather than always picking the first one that has
> a lower cost than the scalar code.
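
The difference between the two selection strategies in the quoted
paragraph can be sketched roughly as follows.  The struct, its fields
and both function names are hypothetical and do not mirror GCC's cost
model; the costs are plain integers purely for illustration.

```c
#include <stddef.h>

/* Hypothetical cost record for one candidate vectorisation.  */
struct candidate
{
  unsigned int vector_bits;  /* e.g. 128, 256 or 512 */
  unsigned int cost;         /* estimated cost of the vector loop */
};

/* First-fit: return the index of the first candidate that is cheaper
   than the scalar loop, or -1 if none is.  */
static int
pick_first_cheaper (const struct candidate *c, size_t n,
                    unsigned int scalar_cost)
{
  for (size_t i = 0; i < n; i++)
    if (c[i].cost < scalar_cost)
      return (int) i;
  return -1;
}

/* Side-by-side: return the index of the cheapest candidate that beats
   the scalar loop, or -1 if none does.  */
static int
pick_cheapest (const struct candidate *c, size_t n,
               unsigned int scalar_cost)
{
  int best = -1;
  for (size_t i = 0; i < n; i++)
    if (c[i].cost < scalar_cost
        && (best < 0 || c[i].cost < c[best].cost))
      best = (int) i;
  return best;
}
```

With made-up costs of 90, 60 and 75 for three candidates and a scalar
cost of 100, the first-fit version stops at the first candidate, while
the side-by-side version picks the second.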

I also have a similar patch in the works.


> Thanks,
> Richard
