On Thu, Nov 15, 2018 at 20:13:38 -0500, Emilio G. Cota wrote:
> I'll generate now some more perf numbers that we could include in the
> commit logs.

SPEC numbers are a net perf decrease, unfortunately:

                     Softmmu speedup for SPEC06int (test set)
   1.1 +-+--+----+----+----+----+----+----+---+----+----+----+----+----+--+-+
       |                                                                    |
       |                                                      aft+++        |
  1.05 +-+........................................................|.......+-+
       |                                               +++        |         |
       |                       +++                      |         |         |
       |   +++                  |                       |         |         |
     1 +-++++++++++++++++****++++++++++++++++++++++++++++++++++++***+++++++-+
       |    |         |  *  * ****                     ****      *|*        |
       |   ***  +++   |  *  * * |*       +++           *| *      *|*        |
  0.95 +-+.*|*..***...|..*..*.*.|*..+++...|............*|.*.+++..*|*..+++.+-+
       |   *|*  *+*  *** *  * * |*   |    |  +++       *| * ***  *|*  ***   |
       |   *+*  * *  *|* *  * *++*   |  ****  |        *| * *+*  *|*  *+*   |
       |   * *  * *  *|* *  * *  * **** * |* ****      *++* * *  *+*  * *   |
   0.9 +-+.*.*..*.*..*+*.*..*.*..*.*.|*.*.|*.*|.*......*..*.*.*..*.*..*.*.+-+
       |   * *  * *  * * *  * *  * *++* *++* *++* +++  *  * * *  * *  * *   |
       |   * *  * *  * * *  * *  * *  * *  * *  *  |   *  * * *  * *  * *   |
  0.85 +-+.*.*..*.*..*.*.*..*.*..*.*..*.*..*.*..*..|...*..*.*.*..*.*..*.*.+-+
       |   * *  * *  * * *  * *  * *  * *  * *  *  |   *  * * *  * *  * *   |
       |   * *  * *  * * *  * *  * *  * *  * *  * **** *  * * *  * *  * *   |
       |   * *  * *  * * *  * *  * *  * *  * *  * *| * *  * * *  * *  * *   |
   0.8 +-+.*.*..*.*..*.*.*..*.*..*.*..*.*..*.*..*.*|.*.*..*.*.*..*.*..*.*.+-+
       |   * *  * *  * * *  * *  * *  * *  * *  * *| * *  * * *  * *  * *   |
       |   * *  * *  * * *  * *  * *  * *  * *  * *++* *  * * *  * *  * *   |
  0.75 +-+-***--***--***-****-****-****-****-****-****-****-***--***--***-+-+
        401.bzi403.g429445.g456.462.libq464.h471.omn4483.xalancbgeomean
  png: https://imgur.com/aO39gyP

Turns out that the additional instructions are the problem,
despite the much lower icache miss rate. For instance, here
are some numbers for h264ref running on the not-so-recent
Xeon E5-2643 (i.e. Sandy Bridge):

- Before:
 1,137,737,512,668      instructions              #    2.02  insns per cycle
   563,574,505,040      cycles
     5,663,616,681      L1-icache-load-misses
     164.091239774 seconds time elapsed

- After:
 1,216,600,582,476      instructions              #    2.06  insns per cycle
   591,888,969,223      cycles
     3,082,426,508      L1-icache-load-misses
     172.232292897 seconds time elapsed
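
To make the before/after comparison easier to read, here is a minimal sketch that computes the after/before ratios from the counters above. The values are copied from the perf output; the dictionary layout and the helper names (`ratios`, `ipc`) are only illustrative and are not part of any QEMU or perf tooling.

```python
# Counter values copied verbatim from the h264ref perf output above.
before = {
    "instructions": 1_137_737_512_668,
    "cycles": 563_574_505_040,
    "L1-icache-load-misses": 5_663_616_681,
    "seconds": 164.091239774,
}
after = {
    "instructions": 1_216_600_582_476,
    "cycles": 591_888_969_223,
    "L1-icache-load-misses": 3_082_426_508,
    "seconds": 172.232292897,
}


def ratios(old, new):
    """Return new/old for every counter (hypothetical helper)."""
    return {name: new[name] / old[name] for name in old}


def ipc(sample):
    """Instructions per cycle for one measurement (hypothetical helper)."""
    return sample["instructions"] / sample["cycles"]


r = ratios(before, after)

if __name__ == "__main__":
    for name, value in r.items():
        print(f"{name:24s} after/before = {value:.4f}")
    print(f"IPC before = {ipc(before):.2f}, after = {ipc(after):.2f}")
    # Speedup as plotted in the chart: old time divided by new time.
    print(f"speedup = {before['seconds'] / after['seconds']:.4f}")
```

With these inputs, instructions go up by about 6.9% and cycles by about 5.0%, while L1 icache misses drop by roughly 46%. Elapsed time grows by about 5%, so the slightly higher IPC does not make up for the extra instructions.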

It's possible that newer machines with larger reorder buffers
will be able to take better advantage of the higher instruction
locality, hiding the latency of having to execute more instructions.
I'll test on Skylake tomorrow.

Thanks,

                E.
