consumeCPU() doesn't consume the whole CPU, just parts of it - a few integer ports. It leaves the instruction cache, the instruction decoder, and most of the execution ports alone.

Given that the loop body is about 10 instructions, the function executes about 100 instructions for tokens = 10, and so it can run in parallel with preceding instructions from the caller.


Consider this pseudocode:


   int a[1'000'000'000];

   fill a with random values in the range 0..999'999'999

   loop {

       idx = a[idx]

       consumeCPU(n)

   }


The call to consumeCPU may not have any effect on the loop performance if n is low enough. The processor can be memory-bound and execute consumeCPU() while waiting for previous reads to complete.
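A runnable Java sketch of the pseudocode above, scaled down to 10^6 elements so it fits in memory, and with a hypothetical stand-in for Blackhole.consumeCPU so the snippet is self-contained:

```java
import java.util.Random;

public class ChaseSketch {
    // Hypothetical stand-in for Blackhole.consumeCPU: a short dependent
    // integer chain that the serialized loads can overlap with.
    static long sink;
    static void consumeCPU(long tokens) {
        long t = sink;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & 0xFFFFFFFFFFFFL;
        }
        if (t == 42) sink = t; // practically never taken; defeats DCE
    }

    // The pointer chase: each load's address depends on the previous load's
    // value, so the loads serialize, while consumeCPU is independent work.
    static int chase(int[] a, int idx, int iters) {
        for (int k = 0; k < iters; k++) {
            idx = a[idx];   // serialized, likely cache-missing load
            consumeCPU(10); // can execute in the shadow of the miss
        }
        return idx;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // scaled down from the 10^9 elements above
        int[] a = new int[n];
        Random r = new Random(42);
        for (int i = 0; i < n; i++) a[i] = r.nextInt(n); // values in 0..n-1
        System.out.println(chase(a, 0, 1_000));
    }
}
```

The point is structural: a[idx] forms a serial dependency chain of loads, while each consumeCPU(10) call is independent of the next load and can run while the miss is outstanding.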


On 25/03/2019 10.44, Francesco Nigro wrote:
Thanks Avi,

The way the CPU can parallelize work makes total sense to me, but I've probably missed some context about the operation used to amortize A (or B):
http://hg.openjdk.java.net/code-tools/jmh/file/5984e353dca7/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#l443

The point I'm not getting is why I can directly compare the results of a() and b() (both with amortization), but cannot use amortization() as a baseline to compare a() and b(): to me both approaches look incorrect, just to different degrees.

If Blackhole::consumeCPU is "trusty" (i.e. it has a nearly fixed cost, alone or combined with other operations), then it can always be used as a baseline for comparison; but if it is not, the risk is that although it can successfully amortize rawA() (or rawB()), it will interact with them, and with low token values the costs of the composed operations can't be compared in a way that says anything about rawA() vs rawB().

I'm sure that there must be something that I'm missing here.
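One way to make the risk concrete, with purely hypothetical numbers (illustrative, not measured): if the CPU fully overlaps consumeCPU with rawA/rawB, each composed benchmark measures something closer to a max than a sum, and subtracting the amortization() baseline erases the very difference you wanted to see:

```java
public class SubtractionPitfall {
    public static void main(String[] args) {
        // hypothetical per-call costs in nanoseconds (not measurements)
        double tAmort = 10.0; // consumeCPU(tokens) alone
        double tRawA  = 4.0;  // rawA() alone
        double tRawB  = 7.0;  // rawB() alone

        // if both raw operations overlap fully with the amortization work,
        // each composed benchmark measures roughly the max, not the sum:
        double tA = Math.max(tAmort, tRawA); // 10.0
        double tB = Math.max(tAmort, tRawB); // 10.0

        // subtracting the amortization() baseline then estimates both raw
        // costs as 0.0, hiding the real 4 ns vs 7 ns difference:
        System.out.println((tA - tAmort) + " vs " + (tB - tAmort));
    }
}
```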
FYI, here is the asm printed by JMH for tokens = 10; I can see that consumeCPU doesn't get inlined (probably the trip count is too unpredictable?):

....[Hottest Region 1].............................................................................. C2, level 4, org.openjdk.jmh.infra.Blackhole::consumeCPU, version 501 (88 bytes)

            Decoding compiled method 0x00007f15e9223bd0:
            Code:
            [Entry Point]
            [Verified Entry Point]
[Constants]
              # {method} {0x00007f15fcde2298} 'consumeCPU' '(J)V' in 'org/openjdk/jmh/infra/Blackhole'
              # parm0:    rsi:rsi   = long
              #      [sp+0x30]  (sp of caller)
0x00007f15e9223d20: mov    %eax,-0x14000(%rsp)
  2.31%  0x00007f15e9223d27: push   %rbp
0x00007f15e9223d28: sub    $0x20,%rsp  ;*synchronization entry
                                      ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@-1 (line 456)
  2.01%  0x00007f15e9223d2c: movabs $0x76fe0bb60,%r10  ;   {oop(a 'java/lang/Class' = 'org/openjdk/jmh/infra/Blackhole')}
  0.34%  0x00007f15e9223d36: mov    0x68(%r10),%r10    ;*getstatic consumedCPU
                                      ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@0 (line 456)
0x00007f15e9223d3a: test   %rsi,%rsi
         ╭ 0x00007f15e9223d3d: jle    0x00007f15e9223d75  ;*ifle
         │                                       ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
         │ 0x00007f15e9223d3f: movabs $0x5deece66d,%r11
  1.55%  │ 0x00007f15e9223d49: movabs $0xffffffffffff,%r8  ;*lload_2
         │                                       ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@14 (line 467)
  4.17%  │↗  0x00007f15e9223d53: mov    %r10,%r9
  4.22%  ││  0x00007f15e9223d56: imul   %r11,%r9
 24.75%  ││  0x00007f15e9223d5a: add    %rsi,%r9
 11.38%  ││  0x00007f15e9223d5d: dec    %rsi               ;*lsub
         ││                                        ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@36 (line 466)
  3.63%  ││  0x00007f15e9223d60: add    $0xb,%r9
  6.00%  ││  0x00007f15e9223d64: and    %r8,%r9
  9.23%  ││  0x00007f15e9223d67: add    %r9,%r10           ; OopMap{off=74}
         ││                                        ;*goto
         ││                                        ;*goto
         ││                                        ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
 11.94%  ││  0x00007f15e9223d6a: test   %eax,0x177e3290(%rip)        # 0x00007f1600a07000
         ││                                        ;*goto
         ││                                        ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
         ││                                        ;   {poll}
  4.40%  ││  0x00007f15e9223d70: test   %rsi,%rsi
         │╰  0x00007f15e9223d73: jg     0x00007f15e9223d53  ;*ifle
         │                                       ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
         ↘ 0x00007f15e9223d75: cmp    $0x2a,%r10
           ╭ 0x00007f15e9223d79: je     0x00007f15e9223d87  ;*ifne
           │                                       ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@47 (line 474)
  1.79%    │ 0x00007f15e9223d7b: add    $0x20,%rsp
  0.28%    │ 0x00007f15e9223d7f: pop    %rbp
           │ 0x00007f15e9223d80: test   %eax,0x177e327a(%rip)        # 0x00007f1600a07000
           │                                       ;   {poll_return}
           │ 0x00007f15e9223d86: retq
           ↘ 0x00007f15e9223d87: mov    $0xffffff65,%esi
0x00007f15e9223d8c: mov    $0x2a,%r11d
0x00007f15e9223d92: cmp    %r11,%r10
0x00007f15e9223d95: mov    $0xffffffff,%ebp
0x00007f15e9223d9a: jl     0x00007f15e9223da4
0x00007f15e9223d9c: setne  %bpl
0x00007f15e9223da0: movzbl %bpl,%ebp          ;*lcmp
....................................................................................................
 88.00%  <total for region 1>

Looking at the code, it seems that all the comments in Blackhole::consumeCPU are accurate (regarding the cmp and test), and it is indeed a very straightforward code -> asm translation (no loop unrolling or weird tricks by the JVM; maybe just an mfence dropped somewhere).
Just safepoint polls on each loop iteration and at the end of the method.
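For reference, the hot loop above maps almost line-for-line back to the Blackhole source linked earlier; here it is paraphrased as a pure function (the real method loads t from, and conditionally stores to, the static consumedCPU field - treat this as a sketch, not the exact source):

```java
public class ConsumeCPUSketch {
    // Paraphrase of Blackhole.consumeCPU's loop body. Each iteration matches
    // the asm above: imul 0x5deece66d, add the counter, add 0xb, mask, add.
    static long mix(long t, long tokens) {
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & 0xFFFFFFFFFFFFL;
        }
        return t;
    }

    public static void main(String[] args) {
        // The real method writes back to consumedCPU only when the result
        // is 42 (the `cmp $0x2a` above), which practically never happens,
        // so the loop has a potential side effect the JIT cannot eliminate.
        System.out.println(mix(0, 1)); // (0*M + 0xB + 1) & mask = 12
    }
}
```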

On Sunday, 24 March 2019 at 10:12:56 UTC+1, Avi Kivity wrote:

    Suppose you have micro-operations A and B that take t(A) and t(B)
    to run. Running repeat(n, A+B) can take n*(t(A) + t(B)), or
    n*max(t(A), t(B)), or n*(t(A) + t(B) + huge_delta), or something else.


    Sometimes the CPU can completely parallelize A and B so running
    them in parallel takes no extra time compared to just one.
    Sometimes running them in sequence causes one of the caches to
    overflow and efficiency decreases dramatically. And sometimes
    running both can undo some quirk and you end up with them taking
    less time.


    Summary: CPUs are complicated.


    On 24/03/2019 10.11, Francesco Nigro wrote:
    Hi folks,

    while reading the awesome
    https://shipilev.net/blog/2014/nanotrusting-nanotime/ I have
    some questions on the "Building Performance Models" part.
    Specifically, when you want to compare 2 operations (i.e. A and B)
    and you want to emulate the behaviour of a real application, you
    need to amortize the cost of such operations: in JMH this is
    achieved with Blackhole.consumeCPU(long tokens), but any
    microbenchmark tool (even outside the JVM) could/should provide
    something similar.

    That said, the code to measure is no longer just A or B: it is
    composed of 2 operations - amortization, then A (or B).
    For JMH that means:

    @Benchmark
    public int a() {
        Blackhole.consumeCPU(tokens);
        // suppose rawA() returns an int and is the original call for A
        return rawA();
    }

    In JMH it is up to the tool to avoid dead-code elimination when you
    return a value from a benchmarked method.

    The point of the article seems to be that, given that "performance
    is not composable", if you want to compare the costs of A and B
    (with amortization) you cannot create a third benchmark:

    @Benchmark
    public void amortization() {
        Blackhole.consumeCPU(tokens);
    }

    and subtract its results (e.g. throughput of calls) from the results
    of a() (or b()) to compare the costs of A and B.
    I don't understand the meaning of "performance is not composable"
    and I would appreciate your opinion on that, given that many people
    on this list have experience with benchmarking.

    Thanks,
    Franz
    --
    You received this message because you are subscribed to the
    Google Groups "mechanical-sympathy" group.
    To unsubscribe from this group and stop receiving emails from it,
    send an email to [email protected].
    For more options, visit https://groups.google.com/d/optout.
