> The call to consumeCPU may not have any effect on the loop performance if
n is low enough
This is clear to me, but with a "high enough" n (to be quantified, depending
on the operation to be amortized and on the hardware and its utilisation) it
should produce a linear-ish time cost directly proportional to n. I suppose
the author's point is that it should be used to produce a deterministic
load, so that the cost of the operation being amortized fades starting from
an n to be determined experimentally.
What I'm not getting is why comparing the two amortized benchmark results
directly (the ones I sent in the first email) is OK, while doing it through
a third benchmark that just calls Blackhole::consumeCPU is not.
I suppose both approaches risk comparing apples to oranges; the latter is
just even more wrong because it introduces a third reference...
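To make the apples-to-oranges risk concrete, here is a toy arithmetic sketch (the costs are invented, purely illustrative): if the CPU fully overlaps consumeCPU with the operation under test, the combined benchmark measures roughly the max of the two costs rather than their sum, so subtracting a standalone consumeCPU baseline underestimates the operation.

```java
public class NonComposable {
    // Toy model with invented costs: if a() overlaps consumeCPU with rawA(),
    // its time is roughly the max of the two costs, not the sum.
    static long combined(long tA, long tCpu) {
        return Math.max(tA, tCpu); // full overlap: the cheaper op hides completely
    }

    // Subtracting the third benchmark's standalone cost from a()'s result:
    static long naiveEstimateOfA(long tA, long tCpu) {
        return combined(tA, tCpu) - tCpu;
    }

    public static void main(String[] args) {
        long tA = 10, tCpu = 8;                         // invented time units
        System.out.println(naiveEstimateOfA(tA, tCpu)); // prints 2, while rawA() costs 10
    }
}
```

With partial overlap the error shrinks but stays unknown, which is why the baseline cannot simply be subtracted.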
On Monday, March 25, 2019 at 12:54:01 UTC+1, Avi Kivity wrote:
>
> consumeCPU() doesn't consume the CPU, but just some parts of it - a few
> integer ports. It leaves alone the instruction cache, the instruction
> decoder, and most of the execution ports.
>
>
> Given that the loop has 10 instructions, the function runs about 100
> instructions, and so it can run in parallel with previous instructions from
> the caller.
>
>
> Consider this code
>
>
> int a[1'000'000'000];
>
> fill a with random values in the range 0..999'999'999
>
> loop {
>
> idx = a[idx]
>
> consumeCPU(n)
>
> }
>
>
> The call to consumeCPU may not have any effect on the loop performance if
> n is low enough. The processor can be memory-bound and execute consumeCPU()
> while waiting for previous reads to complete.
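A minimal runnable Java sketch of the pseudocode above, with the array scaled down to stay runnable and a hypothetical dependent-arithmetic loop standing in for consumeCPU (names and constants are my own, not JMH's):

```java
import java.util.Random;

public class PointerChase {
    // Stand-in for Blackhole.consumeCPU: a short dependent integer loop
    // (hypothetical sketch, not the JMH implementation).
    static long burn(long seed, long tokens) {
        long t = seed;
        for (long i = tokens; i > 0; i--) {
            t = t * 25214903917L + 11L; // any dependent arithmetic works here
        }
        return t;
    }

    // idx = a[idx] chains the loads: each iteration depends on the previous
    // one, so a cache miss stalls the chain while burn() can run in its shadow.
    static int chase(int[] a, int iters, long tokens) {
        int idx = 0;
        long sink = 0;
        for (int n = 0; n < iters; n++) {
            idx = a[idx];
            sink += burn(idx, tokens);
        }
        return (sink == 42) ? -1 : idx; // keep sink live without printing it
    }

    public static void main(String[] args) {
        int size = 1_000_000; // scaled down from 1'000'000'000 for the sketch
        int[] a = new int[size];
        Random r = new Random(12345);
        for (int i = 0; i < size; i++) a[i] = r.nextInt(size);
        System.out.println(chase(a, size, 10));
    }
}
```

With a small tokens value, burn() adds little or nothing to the loop time because it executes while the pending read is outstanding.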
>
>
> On 25/03/2019 10.44, Francesco Nigro wrote:
>
> Thanks Avi,
>
> The way the CPU can parallelize work makes total sense to me, but probably
> I've missed some context about the operation used to amortize A (or B):
>
> http://hg.openjdk.java.net/code-tools/jmh/file/5984e353dca7/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#l443
>
> The point I'm not getting is why I can compare the results of a() and b()
> directly (with amortization), but I cannot use amortization() to compare
> a() and b(): to me both look incorrect, just to different degrees.
>
> If Blackhole::consumeCPU is "trusty" (i.e. it has a nearly fixed cost,
> alone or combined with other operations), it means it can always be used
> for comparison; but if it is not, the risk is that although it successfully
> amortizes rawA() (or rawB()), it will interact with them, and with low
> token values the costs of the composed operations cannot be compared in a
> way that says anything about rawA() vs rawB().
>
> I'm sure there must be something I'm missing here.
> FYI, this is the asm printed by JMH for tokens = 10, and I can see that
> consumeCPU doesn't get inlined (probably too unpredictable?):
>
> ....[Hottest Region
> 1]..............................................................................
> C2, level 4, org.openjdk.jmh.infra.Blackhole::consumeCPU, version 501 (88
> bytes)
>
> </print_nmethod>
> Decoding compiled method 0x00007f15e9223bd0:
> Code:
> [Entry Point]
> [Verified Entry Point]
> [Constants]
> # {method} {0x00007f15fcde2298} 'consumeCPU'
> '(J)V' in 'org/openjdk/jmh/infra/Blackhole'
> # parm0: rsi:rsi = long
> # [sp+0x30] (sp of caller)
> 0x00007f15e9223d20: mov %eax,-0x14000(%rsp)
> 2.31% 0x00007f15e9223d27: push %rbp
> 0x00007f15e9223d28: sub $0x20,%rsp
> ;*synchronization entry
> ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@-1 (line 456)
> 2.01% 0x00007f15e9223d2c: movabs $0x76fe0bb60,%r10 ; {oop(a
> 'java/lang/Class' = 'org/openjdk/jmh/infra/Blackhole')}
> 0.34% 0x00007f15e9223d36: mov 0x68(%r10),%r10 ;*getstatic
> consumedCPU
> ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@0 (line 456)
> 0x00007f15e9223d3a: test %rsi,%rsi
> ╭ 0x00007f15e9223d3d: jle 0x00007f15e9223d75 ;*ifle
> │ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
> │ 0x00007f15e9223d3f: movabs $0x5deece66d,%r11
> 1.55% │ 0x00007f15e9223d49: movabs $0xffffffffffff,%r8 ;*lload_2
> │ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@14 (line 467)
> 4.17% │↗ 0x00007f15e9223d53: mov %r10,%r9
> 4.22% ││ 0x00007f15e9223d56: imul %r11,%r9
> 24.75% ││ 0x00007f15e9223d5a: add %rsi,%r9
> 11.38% ││ 0x00007f15e9223d5d: dec %rsi ;*lsub
> ││ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@36 (line 466)
> 3.63% ││ 0x00007f15e9223d60: add $0xb,%r9
> 6.00% ││ 0x00007f15e9223d64: and %r8,%r9
> 9.23% ││ 0x00007f15e9223d67: add %r9,%r10 ;
> OopMap{off=74}
> ││ ;*goto
> ││ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
> 11.94% ││ 0x00007f15e9223d6a: test %eax,0x177e3290(%rip) #
> 0x00007f1600a07000
> ││ ;*goto
> ││ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
> ││ ; {poll}
> 4.40% ││ 0x00007f15e9223d70: test %rsi,%rsi
> │╰ 0x00007f15e9223d73: jg 0x00007f15e9223d53 ;*ifle
> │ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
> ↘ 0x00007f15e9223d75: cmp $0x2a,%r10
> ╭ 0x00007f15e9223d79: je 0x00007f15e9223d87 ;*ifne
> │ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@47 (line 474)
> 1.79% │ 0x00007f15e9223d7b: add $0x20,%rsp
> 0.28% │ 0x00007f15e9223d7f: pop %rbp
> │ 0x00007f15e9223d80: test %eax,0x177e327a(%rip) #
> 0x00007f1600a07000
> │ ;
> {poll_return}
> │ 0x00007f15e9223d86: retq
> ↘ 0x00007f15e9223d87: mov $0xffffff65,%esi
> 0x00007f15e9223d8c: mov $0x2a,%r11d
> 0x00007f15e9223d92: cmp %r11,%r10
> 0x00007f15e9223d95: mov $0xffffffff,%ebp
> 0x00007f15e9223d9a: jl 0x00007f15e9223da4
> 0x00007f15e9223d9c: setne %bpl
> 0x00007f15e9223da0: movzbl %bpl,%ebp ;*lcmp
>
> ....................................................................................................
> 88.00% <total for region 1>
>
> Looking at the code, it seems that all the comments on
> Blackhole::consumeCPU are correct (regarding the cmp and test
> instructions), and it is indeed a very straightforward code -> ASM
> translation (no loop unrolling or weird tricks by the JVM, just an mfence
> dropped somewhere, maybe).
> Just safepoint polls on each loop iteration and at the end of the method.
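For reference, the hot loop in the asm above decompiles back to roughly the following (a hedged reconstruction using the constants visible in the imul/add/and instructions, omitting the static consumedCPU field and the compare-against-42 DCE guard, so not the literal JMH source):

```java
public class ConsumeCpuSketch {
    // Reconstruction of the loop in the asm dump: multiplier 0x5DEECE66D,
    // increment 0xB, 48-bit mask 0xFFFFFFFFFFFF, with the loop counter i
    // mixed in (the `add %rsi,%r9` instruction).
    static long consume(long seed, long tokens) {
        long t = seed;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & 0xFFFFFFFFFFFFL;
        }
        return t; // the real method compares t against 42 to defeat DCE
    }

    public static void main(String[] args) {
        System.out.println(consume(0, 10));
    }
}
```

Each iteration depends on the previous t, which is what makes the loop's cost roughly linear in tokens instead of collapsing under out-of-order execution.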
>
> On Sunday, March 24, 2019 at 10:12:56 UTC+1, Avi Kivity wrote:
>>
>> Suppose you have micro-operations A and B that take t(A) and t(B) to run.
>> Running repeat(n, A+B) can take n*(t(A) + t(B)), or n*max(t(A), t(B)), or
>> n*(t(A) + t(B) + huge_delta), or something else.
>>
>>
>> Sometimes the CPU can completely parallelize A and B so running them in
>> parallel takes no extra time compared to just one. Sometimes running them
>> in sequence causes one of the caches to overflow and efficiency decreases
>> dramatically. And sometimes running both can undo some quirk and you end up
>> with them taking less time.
>>
>>
>> Summary: CPUs are complicated.
>>
>>
>> On 24/03/2019 10.11, Francesco Nigro wrote:
>>
>> Hi folks,
>>
>> while reading the awesome
>> https://shipilev.net/blog/2014/nanotrusting-nanotime/ I have some
>> questions on the "Building Performance Models" part.
>> Specifically, when you want to compare two operations (i.e. A and B) and
>> you want to emulate the behaviour of a real application, you need to
>> amortize the cost of those operations: in JMH this is achieved with
>> Blackhole.consumeCPU(long tokens), but any microbenchmark tool (even off
>> the JVM) could/should provide something similar.
>>
>> That said, the code to measure is now not just A or B but is composed of
>> two operations: the amortization, then A() (or B()).
>> For JMH that means:
>>
>> @Benchmark
>> int a() {
>>     Blackhole.consumeCPU(tokens);
>>     // suppose rawA() returns an int and is the original call for A
>>     return rawA();
>> }
>>
>> In JMH it is up to the tool to avoid Dead Code Elimination when you
>> return a value from a benchmarked method.
>>
>> The point of the article seems to be that, given that "performance is not
>> composable", if you want to compare the costs of A and B (with
>> amortization) you cannot create a third benchmark:
>>
>> @Benchmark
>> void amortization() {
>>     Blackhole.consumeCPU(tokens);
>> }
>>
>> and use its results (e.g. throughput of calls) to subtract from the
>> results of a() (or b()) in order to compare the costs of A and B.
>> I don't understand the meaning of "performance is not composable" and I
>> would appreciate your opinion on it, given that many people on this list
>> have experience with benchmarking.
>>
>> Thanks,
>> Franz
>> --
>> You received this message because you are subscribed to the Google Groups
>> "mechanical-sympathy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>