On Tue, Jan 28, 2020 at 6:51 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Tue, Jan 28, 2020 at 9:12 AM Uros Bizjak <ubiz...@gmail.com> wrote:
> >
> > On Tue, Jan 28, 2020 at 4:34 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > > > You could move
> > > >
> > > > (match_test "TARGET_AVX")
> > > >   (const_string "TI")
> > > >
> > > > up to bypass the cases below.
> > > >
> > >
> > > I don't think we can do that. There are two cases where we prefer
> > > movaps/movups:
> > >
> > > /* Use packed single precision instructions where possible.  I.e.
> > > movups instead of movupd.  */
> > > DEF_TUNE (X86_TUNE_SSE_PACKED_SINGLE_INSN_OPTIMAL,
> > > "sse_packed_single_insn_optimal",
> > >           m_BDVER | m_ZNVER)
> > >
> > > /* X86_TUNE_SSE_TYPELESS_STORES: Always movaps/movups for 128bit stores.  */
> > > DEF_TUNE (X86_TUNE_SSE_TYPELESS_STORES, "sse_typeless_stores",
> > >           m_AMD_MULTIPLE | m_CORE_ALL | m_GENERIC)
> > >
> > > We should always use movaps/movups for
> > > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL. It is wrong to bypass
> > > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL with TARGET_AVX, as
> > > m_BDVER | m_ZNVER support AVX.
> >
> > The only reason for TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL on AMD targets
> > is instruction size, as advised in e.g. Software Optimization Guide for
> > the AMD Family 15h Processors [1], section 7.1.2, where it is said:
> >
> > --quote--
> > 7.1.2 Reduce Instruction Size
> >
> > Optimization
> >
> > Reduce the size of instructions when possible.
> >
> > Rationale
> >
> > Using smaller instruction sizes improves instruction fetch throughput.
> > Specific examples include the following:
> >
> > * In SIMD code, use the single-precision (PS) form of instructions
> > instead of the double-precision (PD) form. For example, for register
> > to register moves, MOVAPS achieves the same result as MOVAPD, but uses
> > one less byte to encode the instruction and has no prefix byte. Other
> > examples in which single-precision forms can be substituted for
> > double-precision forms include MOVUPS, MOVNTPS, XORPS, ORPS, ANDPS,
> > and SHUFPS.
> > ...
> > --/quote--
> >
> > Please note that this optimization applies only to non-AVX forms, as
> > demonstrated by:
> >
> >    0:   0f 28 c8                movaps %xmm0,%xmm1
> >    3:   66 0f 28 c8             movapd %xmm0,%xmm1
> >    7:   c5 f8 28 d1             vmovaps %xmm1,%xmm2
> >    b:   c5 f9 28 d1             vmovapd %xmm1,%xmm2
> >
> > Also note that MOVDQA is missing in the above optimization. It is
> > harmful to substitute MOVDQA with MOVAPS, as it can (and does)
> > introduce a +1 cycle forwarding penalty between FLT (FPA/FPM) and INT
> > (VALU) FP clusters.
> >
> > Following the above optimization, it is obvious that
> > TARGET_SSE_PACKED_SINGLE_INSN_OPTIMAL handling was cargo-culted from
> > one pattern to another. Its use should be reviewed and fixed where not
> > appropriate.
> >
> > [1] https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf
> >
> > Uros.
>
> Here is the updated patch which moves TARGET_AVX before
> TARGET_SSE_TYPELESS_STORES. OK for master if there are
> no regressions?
>
> Thanks.


+       (match_test "TARGET_AVX")
+         (const_string "<sseinsnmode>")
        (and (match_test "<MODE_SIZE> == 16")

Only MODE_SIZE == 16 cases will be left here, since TARGET_AVX is
necessary for MODE_SIZE > 16. The <MODE_SIZE> == 16 test can be removed.
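
To illustrate why the size test becomes redundant, here is a rough
model of a first-match cond in Python. It is only a sketch: the
function name, its parameters and the single typeless-stores clause
are assumptions made for illustration and do not reproduce the real
pattern in sse.md.

```python
# Simplified, hypothetical model of a first-match "cond" that picks the
# insn "mode" attribute. The clause bodies are assumptions for
# illustration only, not the actual sse.md pattern.

def select_mode(mode_size, target_avx, typeless_stores, native_mode):
    """Return the modelled "mode" attribute for a vector move."""
    if mode_size not in (16, 32, 64):
        raise ValueError("unexpected vector mode size")
    # Modes wider than 16 bytes are only available with AVX.
    if mode_size > 16 and not target_avx:
        raise ValueError("modes wider than 16 bytes require AVX")

    # Clause moved to the top: with AVX, always use the native insn mode.
    if target_avx:
        return native_mode

    # Every clause below is reached only when target_avx is false, so
    # mode_size is necessarily 16 and an explicit size test adds nothing.
    assert mode_size == 16
    if typeless_stores:
        return "V4SF"
    return native_mode
```

For example, select_mode(16, False, True, "TI") falls through to the
typeless-stores clause, while any call with target_avx set returns the
native mode before the later clauses are ever looked at.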

OK with the above change.

Thanks,
Uros.

> --
> H.J.
