https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97770

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #5)
> (In reply to Richard Biener from comment #4)
> > What's missing is middle-end folding support to narrow popcount to the
> > appropriate internal function call with byte/half-word width when target
> > support is available.  But I'm quite sure there's no scalar popcount
> > instruction operating on half-word or byte pieces of a GPR?
> > 
> > Alternatively the vectorizer can use patterns to do this.
> 
> Yes, but for 64-bit width, the vectorizer generates suboptimal code.
> 
> see #c3
> 
>   vector(2) long long unsigned int vect__4.6;
>   vector(2) long long unsigned int vect__4.5;
>   vector(2) long long unsigned int _8;
>   vector(2) long long unsigned int _26;
> 
>   ...
>   ...
> 
>   _8 = .POPCOUNT (vect__4.5_16);
>   _26 = .POPCOUNT (vect__4.6_9);
>   vect__5.7_22 = VEC_PACK_TRUNC_EXPR <_8, _26>; --- Why do we do this?
>   vector(4) int vect__5.7;
> 
> 
> It could directly generate:
> 
>   v4di = .POPCOUNT (v4di);

I guess the vectorized popcount IFN is defined as VnDI -> VnDI, but we
want VnSImode results.  Does that mean the instruction is wrongly modeled
in its vectorized form?
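
For readers following the dump, here is a scalar model of the two shapes being compared.  It is a sketch under stated assumptions, not GCC's implementation.  `popcount_pack_trunc` mimics `_8 = .POPCOUNT (...)`, `_26 = .POPCOUNT (...)` and then `VEC_PACK_TRUNC_EXPR <_8, _26>`: each 64-bit lane is truncated to 32 bits and the two inputs are concatenated, with the first operand's lanes first (the lane order is an assumption of the model).  `popcount_narrow_direct` is the VnDI -> VnSI form.  Both function names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of the quoted GIMPLE (assumption: op0 lanes land first).
   Two V2DI popcounts are computed at 64-bit width, then every lane is
   truncated to 32 bits and the two halves are concatenated into a V4SI. */
void popcount_pack_trunc (const uint64_t lo[2], const uint64_t hi[2],
                          int32_t out[4])
{
  uint64_t p0[2], p1[2];
  for (int i = 0; i < 2; i++)
    {
      p0[i] = (uint64_t) __builtin_popcountll (lo[i]);  /* like _8  */
      p1[i] = (uint64_t) __builtin_popcountll (hi[i]);  /* like _26 */
    }
  for (int i = 0; i < 2; i++)
    {
      out[i] = (int32_t) p0[i];      /* truncated lanes of operand 0 */
      out[i + 2] = (int32_t) p1[i];  /* truncated lanes of operand 1 */
    }
}

/* Narrowed form: popcount of each 64-bit lane written directly as a
   32-bit result, i.e. a VnDI -> VnSI operation with no separate pack. */
void popcount_narrow_direct (const uint64_t in[4], int32_t out[4])
{
  for (int i = 0; i < 4; i++)
    out[i] = __builtin_popcountll (in[i]);
}
```

At the scalar level both functions yield the same values, because a 64-bit popcount never exceeds 64 and so survives truncation to 32 bits.  The model cannot show how the vectorized instruction is modeled or whether the emitted vector code is correct; that is what the thread is asking about.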

Note that the vectorizer isn't very good at handling narrowing operations here.

If you can push the missing patterns, I can have a look.  Bonus points for
a correctness testcase (from the above, I think we're generating wrong code).
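
A correctness testcase could take roughly the following shape.  This is a hypothetical sketch: the loop is guessed from the dump in comment #5 and is not the testcase from comment #3, and every identifier (`in64`, `out32`, `N`, `foo`, `ref_popcount`, `check_popcount_narrowing`) is made up for illustration.  The vectorized loop is compared against a bit-by-bit reference that deliberately avoids the builtin.

```c
#include <stdint.h>

/* Hypothetical size, not a multiple of common vector lengths, so the
   scalar epilogue is exercised as well as the vector body. */
#define N 67

unsigned long long in64[N];
int out32[N];

/* Loop of the shape suggested by the dump: 64-bit popcount narrowed to int. */
__attribute__((noinline)) void
foo (void)
{
  for (int i = 0; i < N; i++)
    out32[i] = __builtin_popcountll (in64[i]);
}

/* Reference popcount that does not use the builtin. */
static int
ref_popcount (unsigned long long x)
{
  int c = 0;
  for (; x; x >>= 1)
    c += (int) (x & 1);
  return c;
}

/* Fill the inputs, run foo and return the number of mismatching lanes.
   A dg-do run version would call __builtin_abort () on a mismatch and be
   invoked from main. */
int
check_popcount_narrowing (void)
{
  unsigned long long x = 0x9e3779b97f4a7c15ull;
  for (int i = 0; i < N; i++)
    {
      x ^= x << 13;  /* xorshift64 step: varied bit patterns */
      x ^= x >> 7;
      x ^= x << 17;
      in64[i] = x;
    }
  in64[0] = 0;             /* edge cases: no bits, all bits, top bit only */
  in64[1] = ~0ull;
  in64[2] = 1ull << 63;
  foo ();
  int bad = 0;
  for (int i = 0; i < N; i++)
    if (out32[i] != ref_popcount (in64[i]))
      bad++;
  return bad;
}
```

Building this at -O3 with vectorization for the targets discussed above, then checking that `check_popcount_narrowing ()` returns 0, would cover both the vectorized body and the epilogue.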
