Thanks Avi,
The way the CPU can parallelize work makes total sense to me, but I've
probably missed some context about the operation used to amortize A (or B):
http://hg.openjdk.java.net/code-tools/jmh/file/5984e353dca7/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#l443
The point I'm not getting is why I can directly compare the results of a()
and b() (with amortization), but I cannot use amortization() to compare a()
and b(): to me both approaches look incorrect, just to different degrees.
If Blackhole::consumeCPU is "trusty" (i.e. it has a nearly fixed cost, both
alone and when combined with another operation), then it can always be used
for comparison; but if it is not, the risk is that although it successfully
amortizes rawA() (or rawB()), it will interact with them, and with low token
counts the costs of the composed operations cannot be compared in any way
that says something about rawA() vs rawB() cost.
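To spell out why I find the subtraction suspect: if consumeCPU and rawA() overlap in the pipeline, the combined cost can be close to max(t(amortize), t(rawA)) rather than their sum, and subtracting the amortization() baseline then produces a nonsense estimate. A toy sketch with made-up numbers (all figures hypothetical, just arithmetic):

```java
public class NotComposable {

    // Estimate of t(rawA) obtained by naive baseline subtraction, under the
    // (hypothetical) assumption of perfect instruction-level overlap, where
    // the combined benchmark costs about max(...) rather than the sum.
    static double estimate(double tAmortize, double tRawA) {
        double tCombined = Math.max(tAmortize, tRawA); // full ILP overlap
        return tCombined - tAmortize;                  // naive subtraction
    }

    public static void main(String[] args) {
        // consumeCPU alone ~30 ns, rawA alone ~10 ns (made-up figures):
        System.out.println(estimate(30.0, 10.0)); // prints 0.0, not ~10.0
    }
}
```

With full overlap the subtraction attributes ~0 ns to rawA(); with partial overlap it lands anywhere between 0 and the true cost, which is exactly why the two composed benchmarks a() and b() don't let me recover rawA() vs rawB().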
I'm sure that there must be something that I'm missing here.
FYI, this is the asm printed by JMH for tokens = 10, and I can see that
consumeCPU doesn't get inlined (probably too unpredictable?):
....[Hottest Region 1]..............................................................................
C2, level 4, org.openjdk.jmh.infra.Blackhole::consumeCPU, version 501 (88 bytes)

Decoding compiled method 0x00007f15e9223bd0:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00007f15fcde2298} 'consumeCPU' '(J)V' in 'org/openjdk/jmh/infra/Blackhole'
  # parm0:    rsi:rsi   = long
  #           [sp+0x30]  (sp of caller)
           0x00007f15e9223d20: mov    %eax,-0x14000(%rsp)
  2.31%    0x00007f15e9223d27: push   %rbp
           0x00007f15e9223d28: sub    $0x20,%rsp          ;*synchronization entry
                                                          ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@-1 (line 456)
  2.01%    0x00007f15e9223d2c: movabs $0x76fe0bb60,%r10   ; {oop(a 'java/lang/Class' = 'org/openjdk/jmh/infra/Blackhole')}
  0.34%    0x00007f15e9223d36: mov    0x68(%r10),%r10     ;*getstatic consumedCPU
                                                          ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@0 (line 456)
           0x00007f15e9223d3a: test   %rsi,%rsi
       ╭   0x00007f15e9223d3d: jle    0x00007f15e9223d75  ;*ifle
       │                                                  ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
       │   0x00007f15e9223d3f: movabs $0x5deece66d,%r11
  1.55%│   0x00007f15e9223d49: movabs $0xffffffffffff,%r8 ;*lload_2
       │                                                  ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@14 (line 467)
  4.17%│↗  0x00007f15e9223d53: mov    %r10,%r9
  4.22%││  0x00007f15e9223d56: imul   %r11,%r9
 24.75%││  0x00007f15e9223d5a: add    %rsi,%r9
 11.38%││  0x00007f15e9223d5d: dec    %rsi                ;*lsub
       ││                                                 ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@36 (line 466)
  3.63%││  0x00007f15e9223d60: add    $0xb,%r9
  6.00%││  0x00007f15e9223d64: and    %r8,%r9
  9.23%││  0x00007f15e9223d67: add    %r9,%r10            ; OopMap{off=74}
       ││                                                 ;*goto
       ││                                                 ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
 11.94%││  0x00007f15e9223d6a: test   %eax,0x177e3290(%rip)  # 0x00007f1600a07000
       ││                                                 ;*goto
       ││                                                 ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
       ││                                                 ; {poll}
  4.40%││  0x00007f15e9223d70: test   %rsi,%rsi
       │╰  0x00007f15e9223d73: jg     0x00007f15e9223d53  ;*ifle
       │                                                  ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
       ↘   0x00007f15e9223d75: cmp    $0x2a,%r10
       ╭   0x00007f15e9223d79: je     0x00007f15e9223d87  ;*ifne
       │                                                  ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@47 (line 474)
  1.79%│   0x00007f15e9223d7b: add    $0x20,%rsp
  0.28%│   0x00007f15e9223d7f: pop    %rbp
       │   0x00007f15e9223d80: test   %eax,0x177e327a(%rip)  # 0x00007f1600a07000
       │                                                 ; {poll_return}
       │   0x00007f15e9223d86: retq
       ↘   0x00007f15e9223d87: mov    $0xffffff65,%esi
           0x00007f15e9223d8c: mov    $0x2a,%r11d
           0x00007f15e9223d92: cmp    %r11,%r10
           0x00007f15e9223d95: mov    $0xffffffff,%ebp
           0x00007f15e9223d9a: jl     0x00007f15e9223da4
           0x00007f15e9223d9c: setne  %bpl
           0x00007f15e9223da0: movzbl %bpl,%ebp           ;*lcmp
....................................................................................................
88.00%  <total for region 1>
Looking at the code, it seems that all the comments in Blackhole::consumeCPU
are correct (regarding cmp and test), and it is indeed a very straightforward
translation from code to ASM (no loop unrolling or weird tricks by the JVM,
except maybe an mfence dropped somewhere).
There are just safepoint polls on each loop iteration and at the end of the
method.
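For reference, here is a paraphrase of the loop from the linked Blackhole source (see the URL above for the authoritative version); its constants all show up in the asm: 0x5DEECE66D and 0xB in the imul/add, the 48-bit mask in the and, and the compare against 42 (0x2a) at the end.

```java
public class ConsumeCpuSketch {

    // Stand-in for the Blackhole.consumedCPU static field.
    static volatile long consumedCPU = System.nanoTime();

    // The LCG-style mixing loop: t * 0x5DEECE66D + 0xB + i, masked to 48
    // bits and accumulated into t -- the imul/add/dec/add/and/add sequence
    // visible in the hot region above.
    static long mix(long seed, long tokens) {
        long t = seed;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & 0xFFFFFFFFFFFFL;
        }
        return t;
    }

    public static void consumeCPU(long tokens) {
        long t = mix(consumedCPU, tokens);
        if (t == 42) {          // practically never true, but not provably so,
            consumedCPU += t;   // which keeps the JIT from eliminating the loop
        }
    }
}
```

The cmp $0x2a,%r10 / je pair is exactly the `t == 42` guard: the branch is never taken in practice, but the compiler cannot prove it dead, so the whole loop survives.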
On Sunday, March 24, 2019 at 10:12:56 UTC+1, Avi Kivity wrote:
>
> Suppose you have micro-operations A and B that take t(A) and t(B) to run.
> Running repeat(n, A+B) can take n*(t(A) + t(B)), or n*max(t(A), t(B)), or
> n*(t(A) + t(B) + huge_delta), or something else.
>
>
> Sometimes the CPU can completely parallelize A and B so running them in
> parallel takes no extra time compared to just one. Sometimes running them
> in sequence causes one of the caches to overflow and efficiency decreases
> dramatically. And sometimes running both can undo some quirk and you end up
> with them taking less time.
>
>
> Summary: CPUs are complicated.
>
>
> On 24/03/2019 10.11, Francesco Nigro wrote:
>
> Hi folks,
>
> while reading the awesome
> https://shipilev.net/blog/2014/nanotrusting-nanotime/ I have some
> questions on the "Building Performance Models" part.
> Specifically, when you want to compare two operations (i.e. A and B) and you
> want to emulate the behaviour of a real application, you need to amortize
> the cost of those operations:
> in JMH this is achieved with Blackhole.consumeCPU(long tokens), but any
> microbenchmark tool (even off the JVM) could/should provide something
> similar.
>
> That said, the code to measure is no longer just A or B, but is composed of
> two operations: amortization plus A() (or B()).
> For JMH that means:
>
> @Benchmark
> int a() {
>     Blackhole.consumeCPU(tokens);
>     // suppose rawA() returns an int and is the original call for A
>     return rawA();
> }
>
> In JMH it is up to the tool to avoid Dead Code Elimination when you return a
> value from a benchmark method.
>
> The point of the article seems to be that, since "performance is not
> composable", if you want to compare the costs of A and B (with amortization)
> you cannot create a third benchmark:
>
> @Benchmark
> void amortization() {
>     Blackhole.consumeCPU(tokens);
> }
>
> And subtract its results (e.g. throughput of calls) from the results of a()
> (or b()) to compare the costs of A and B.
> I don't understand what "performance is not composable" means, and I would
> appreciate your opinion on it, given that many people on this list have
> experience with benchmarking.
>
> Thanks,
> Franz
>
--
You received this message because you are subscribed to the Google Groups
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
For more options, visit https://groups.google.com/d/optout.