https://llvm.org/bugs/show_bug.cgi?id=31202

            Bug ID: 31202
           Summary: PMULLD should be avoided if possible on Silvermont
           Product: libraries
           Version: trunk
          Hardware: PC
                OS: Windows NT
            Status: NEW
          Severity: normal
          Priority: P
         Component: Backend: X86
          Assignee: [email protected]
          Reporter: [email protected]
                CC: [email protected]
    Classification: Unclassified

For the following case:
define <4 x i32> @foo(<4 x i8> %A) {
  %z = zext <4 x i8> %A to <4 x i32>
  %m = mul nuw nsw <4 x i32> %z, <i32 18778, i32 18778, i32 18778, i32 18778>
  ret <4 x i32> %m
}

The following code is generated for Silvermont (the pand implements the zero
extension of the <4 x i8> lanes; the multiply by the splat constant is
selected as pmulld):
  pand    .LCPI1_0, %xmm0
  pmulld  .LCPI1_1, %xmm0
  retl

On Silvermont:
PMULLD has a throughput of 1 instruction per 11 cycles.
PMULHUW/PMULHW/PMULLW have a throughput of 1 instruction per 2 cycles.

Note that both multiplicands fit in 16 bits (the zero-extended lanes are at
most 255, and 18778 < 2^16).
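
As a quick sanity check (my own illustration, not part of the report), the
full 32-bit product therefore splits exactly into the two 16-bit halves that
PMULLW and PMULHUW compute:

#include <assert.h>
#include <stdint.h>

int main(void) {
  for (uint32_t z = 0; z <= 255; ++z) {       /* the zero-extended i8 lanes */
    uint32_t p  = z * 18778u;                 /* full 32-bit product        */
    uint16_t lo = (uint16_t)p;                /* what PMULLW produces       */
    uint16_t hi = (uint16_t)(p >> 16);        /* what PMULHUW produces      */
    assert(p == (((uint32_t)hi << 16) | lo)); /* PUNPCKLWD recombines them  */
  }
  return 0;
}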

We would achieve higher throughput with the following sequence (sketched
below with intrinsics):
  pshufb
  pmullw
  pmulhw
  punpcklwd
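
For illustration, here is a sketch of that sequence using SSE intrinsics (the
function name and shuffle mask are my own, not from the report; PMULHUW is
used in place of PMULHW, which gives the same result here because both
operands are non-negative 16-bit values):

#include <immintrin.h>  /* SSE2 + SSSE3; compile with e.g. -mssse3 */

/* Multiply four 32-bit lanes, each known to fit in 16 bits, by 18778
   without using PMULLD. */
static __m128i mul_by_18778(__m128i z) {
  /* pshufb: gather the low word of each dword into the low 8 bytes;
     0x80 entries zero the corresponding destination byte. */
  const __m128i pack = _mm_set_epi8(
      (char)0x80, (char)0x80, (char)0x80, (char)0x80,
      (char)0x80, (char)0x80, (char)0x80, (char)0x80,
      13, 12, 9, 8, 5, 4, 1, 0);
  __m128i a  = _mm_shuffle_epi8(z, pack);
  __m128i c  = _mm_set1_epi16(18778);
  __m128i lo = _mm_mullo_epi16(a, c);  /* pmullw: low 16 bits of products   */
  __m128i hi = _mm_mulhi_epu16(a, c);  /* pmulhuw: high 16 bits of products */
  /* punpcklwd: interleave the lo/hi words into four 32-bit products. */
  return _mm_unpacklo_epi16(lo, hi);
}

Even charging 2 cycles apiece for the two multiplies, this sequence should
finish well under the 11 cycles PMULLD alone costs on Silvermont, assuming
the shuffle and unpack are comparatively cheap.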

This issue was root-caused by Farhana Aleen while analyzing internal
workloads that would regress if interleaving were enabled for Silvermont in
X86TTI (which is why commit 284779 did not enable interleaving for some
subtargets). It turns out that with interleaving, the vectorized IR prior to
codegen is decent for the chosen vectorization width. The issue reported here
is one of the major reasons for the slowdown (though fixing this issue alone
only reduces the regression).
