consumeCPU() doesn't consume the whole CPU, just parts of it - a few integer ports. It leaves the instruction cache, the instruction decoder, and most of the execution ports alone.

Given that the loop body is about 10 instructions, the function executes about 100 instructions for tokens = 10, and so it can run in parallel with preceding instructions from the caller.


Consider this pseudocode:


   int a[1'000'000'000];

   fill a with random values in the range 0..999'999'999

   loop {

       idx = a[idx]

       consumeCPU(n)

   }


The call to consumeCPU may not have any effect on the loop performance if n is low enough. The processor can be memory-bound and execute consumeCPU() while waiting for previous reads to complete.
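A runnable Java sketch of the pseudocode above, scaled down to 10^6 elements so it fits in memory, and with a hypothetical stand-in for Blackhole.consumeCPU so the snippet is self-contained:

```java
import java.util.Random;

public class ChaseSketch {
    // Hypothetical stand-in for Blackhole.consumeCPU: a short dependent
    // integer chain that the serialized loads can overlap with.
    static long sink;
    static void consumeCPU(long tokens) {
        long t = sink;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & 0xFFFFFFFFFFFFL;
        }
        if (t == 42) sink = t; // practically never taken; defeats DCE
    }

    // The pointer chase: each load's address depends on the previous load's
    // value, so the loads serialize, while consumeCPU is independent work.
    static int chase(int[] a, int idx, int iters) {
        for (int k = 0; k < iters; k++) {
            idx = a[idx];   // serialized, likely cache-missing load
            consumeCPU(10); // can execute in the shadow of the miss
        }
        return idx;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // scaled down from the 10^9 elements above
        int[] a = new int[n];
        Random r = new Random(42);
        for (int i = 0; i < n; i++) a[i] = r.nextInt(n); // values in 0..n-1
        System.out.println(chase(a, 0, 1_000));
    }
}
```

The point is structural: a[idx] forms a serial dependency chain of loads, while each consumeCPU(10) call is independent of the next load and can run while the miss is outstanding.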


On 25/03/2019 10.44, Francesco Nigro wrote:
Thanks Avi,

The way the CPU can parallelize work makes total sense to me, but I've probably missed some context about the operation used to amortize A (or B):
http://hg.openjdk.java.net/code-tools/jmh/file/5984e353dca7/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#l443

The point I'm not getting is why I can directly compare the results of a() and b() (both with amortization), but cannot use amortization() as a baseline to compare a() and b(): to me both approaches look incorrect, just to different degrees.

If Blackhole::consumeCPU is "trusty" (i.e. it has a nearly fixed cost, alone or combined with other operations), then it can always be used as a baseline for comparison; but if it is not, the risk is that although it can successfully amortize rawA() (or rawB()), it will interact with them, and with low token values the costs of the composed operations can't be compared in a way that says anything about rawA() vs rawB().

I'm sure that there must be something that I'm missing here.
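One way to make the risk concrete, with purely hypothetical numbers (illustrative, not measured): if the CPU fully overlaps consumeCPU with rawA/rawB, each composed benchmark measures something closer to a max than a sum, and subtracting the amortization() baseline erases the very difference you wanted to see:

```java
public class SubtractionPitfall {
    public static void main(String[] args) {
        // hypothetical per-call costs in nanoseconds (not measurements)
        double tAmort = 10.0; // consumeCPU(tokens) alone
        double tRawA  = 4.0;  // rawA() alone
        double tRawB  = 7.0;  // rawB() alone

        // if both raw operations overlap fully with the amortization work,
        // each composed benchmark measures roughly the max, not the sum:
        double tA = Math.max(tAmort, tRawA); // 10.0
        double tB = Math.max(tAmort, tRawB); // 10.0

        // subtracting the amortization() baseline then estimates both raw
        // costs as 0.0, hiding the real 4 ns vs 7 ns difference:
        System.out.println((tA - tAmort) + " vs " + (tB - tAmort));
    }
}
```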
FYI, here is the asm printed by JMH for tokens = 10; I can see that consumeCPU doesn't get inlined (probably the trip count is too unpredictable?):

....[Hottest Region 1].............................................................................. C2, level 4, org.openjdk.jmh.infra.Blackhole::consumeCPU, version 501 (88 bytes)

            Decoding compiled method 0x00007f15e9223bd0:
            Code:
            [Entry Point]
            [Verified Entry Point]
[Constants]
              # {method} {0x00007f15fcde2298} 'consumeCPU' '(J)V' in 'org/openjdk/jmh/infra/Blackhole'
              # parm0:    rsi:rsi   = long
              #      [sp+0x30]  (sp of caller)
0x00007f15e9223d20: mov    %eax,-0x14000(%rsp)
  2.31%  0x00007f15e9223d27: push   %rbp
0x00007f15e9223d28: sub    $0x20,%rsp  ;*synchronization entry
                                      ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@-1 (line 456)
  2.01%  0x00007f15e9223d2c: movabs $0x76fe0bb60,%r10  ;   {oop(a 'java/lang/Class' = 'org/openjdk/jmh/infra/Blackhole')}
  0.34%  0x00007f15e9223d36: mov    0x68(%r10),%r10    ;*getstatic consumedCPU
                                      ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@0 (line 456)
0x00007f15e9223d3a: test   %rsi,%rsi
         ╭ 0x00007f15e9223d3d: jle    0x00007f15e9223d75  ;*ifle
         │                                       ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
         │ 0x00007f15e9223d3f: movabs $0x5deece66d,%r11
  1.55%  │ 0x00007f15e9223d49: movabs $0xffffffffffff,%r8  ;*lload_2
         │                                       ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@14 (line 467)
  4.17%  │↗  0x00007f15e9223d53: mov    %r10,%r9
  4.22%  ││  0x00007f15e9223d56: imul   %r11,%r9
 24.75%  ││  0x00007f15e9223d5a: add    %rsi,%r9
 11.38%  ││  0x00007f15e9223d5d: dec    %rsi               ;*lsub
         ││                                        ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@36 (line 466)
  3.63%  ││  0x00007f15e9223d60: add    $0xb,%r9
  6.00%  ││  0x00007f15e9223d64: and    %r8,%r9
  9.23%  ││  0x00007f15e9223d67: add    %r9,%r10           ; OopMap{off=74}
         ││                                        ;*goto
         ││                                        ;*goto
         ││                                        ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
 11.94%  ││  0x00007f15e9223d6a: test   %eax,0x177e3290(%rip)        # 0x00007f1600a07000
         ││                                        ;*goto
         ││                                        ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
         ││                                        ;   {poll}
  4.40%  ││  0x00007f15e9223d70: test   %rsi,%rsi
         │╰  0x00007f15e9223d73: jg     0x00007f15e9223d53  ;*ifle
         │                                       ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
         ↘ 0x00007f15e9223d75: cmp    $0x2a,%r10
           ╭ 0x00007f15e9223d79: je     0x00007f15e9223d87  ;*ifne
           │                                       ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@47 (line 474)
  1.79%    │ 0x00007f15e9223d7b: add    $0x20,%rsp
  0.28%    │ 0x00007f15e9223d7f: pop    %rbp
           │ 0x00007f15e9223d80: test   %eax,0x177e327a(%rip)        # 0x00007f1600a07000
           │                                       ;   {poll_return}
           │ 0x00007f15e9223d86: retq
           ↘ 0x00007f15e9223d87: mov    $0xffffff65,%esi
0x00007f15e9223d8c: mov    $0x2a,%r11d
0x00007f15e9223d92: cmp    %r11,%r10
0x00007f15e9223d95: mov    $0xffffffff,%ebp
0x00007f15e9223d9a: jl     0x00007f15e9223da4
0x00007f15e9223d9c: setne  %bpl
0x00007f15e9223da0: movzbl %bpl,%ebp          ;*lcmp
....................................................................................................
 88.00%  <total for region 1>

Looking at the code, it seems that all the comments in Blackhole::consumeCPU are accurate (regarding the cmp and test), and it is indeed a very straightforward code -> asm translation (no loop unrolling or weird tricks by the JVM; maybe just an mfence dropped somewhere).
Just safepoint polls on each loop iteration and at the end of the method.
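For reference, the hot loop above maps almost line-for-line back to the Blackhole source linked earlier; here it is paraphrased as a pure function (the real method loads t from, and conditionally stores to, the static consumedCPU field - treat this as a sketch, not the exact source):

```java
public class ConsumeCPUSketch {
    // Paraphrase of Blackhole.consumeCPU's loop body. Each iteration matches
    // the asm above: imul 0x5deece66d, add the counter, add 0xb, mask, add.
    static long mix(long t, long tokens) {
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & 0xFFFFFFFFFFFFL;
        }
        return t;
    }

    public static void main(String[] args) {
        // The real method writes back to consumedCPU only when the result
        // is 42 (the `cmp $0x2a` above), which practically never happens,
        // so the loop has a potential side effect the JIT cannot eliminate.
        System.out.println(mix(0, 1)); // (0*M + 0xB + 1) & mask = 12
    }
}
```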

On Sunday, 24 March 2019 at 10:12:56 UTC+1, Avi Kivity wrote:

    Suppose you have micro-operations A and B that take t(A) and t(B)
    to run. Running repeat(n, A+B) can take n*(t(A) + t(B)), or
    n*max(t(A), t(B)), or n*(t(A) + t(B) + huge_delta), or something else.


    Sometimes the CPU can completely parallelize A and B so running
    them in parallel takes no extra time compared to just one.
    Sometimes running them in sequence causes one of the caches to
    overflow and efficiency decreases dramatically. And sometimes
    running both can undo some quirk and you end up with them taking
    less time.


    Summary: CPUs are complicated.


    On 24/03/2019 10.11, Francesco Nigro wrote:
    Hi folks,

    while reading the awesome
    https://shipilev.net/blog/2014/nanotrusting-nanotime/ I have
    some questions on the "Building Performance Models" part.
    Specifically, when you want to compare 2 operations (i.e. A and B)
    and you want to emulate the behaviour of a real application, you
    need to amortize the cost of such operations: in JMH this is
    achieved with Blackhole.consumeCPU(long tokens), but any
    microbenchmark tool (even outside the JVM) could/should provide
    something similar.

    That said, the code to measure is no longer just A or B: it is
    composed of 2 operations - amortization, then A (or B).
    For JMH that means:

    @Benchmark
    public int a() {
        Blackhole.consumeCPU(tokens);
        // suppose rawA() returns an int and is the original call for A
        return rawA();
    }

    In JMH it is up to the tool to avoid dead-code elimination when you
    return a value from a benchmarked method.

    The point of the article seems to be that, given that "performance
    is not composable", if you want to compare the costs of A and B
    (with amortization) you cannot create a third benchmark:

    @Benchmark
    public void amortization() {
        Blackhole.consumeCPU(tokens);
    }

    and subtract its results (e.g. throughput of calls) from the results
    of a() (or b()) to compare the costs of A and B.
    I don't understand the meaning of "performance is not composable"
    and I would appreciate your opinion on that, given that many people
    on this list have experience with benchmarking.

    Thanks,
    Franz
    --
    You received this message because you are subscribed to the
    Google Groups "mechanical-sympathy" group.
    To unsubscribe from this group and stop receiving emails from it,
    send an email to [email protected].
    For more options, visit https://groups.google.com/d/optout.
