Thanks Avi,
The way the CPU can parallelize work makes total sense to me, but I've
probably missed some context about the operation used to amortize
A (or B):
http://hg.openjdk.java.net/code-tools/jmh/file/5984e353dca7/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#l443
The point I'm not getting is why I can directly compare the results
of a() and b() (both with amortization), but I cannot use amortization()
to compare a() and b(): to me both approaches look incorrect, just to
different degrees.
If Blackhole::consumeCPU is "trusty" (i.e. it has a nearly fixed cost,
alone or combined with another operation), then it can always be used
for comparison; but if it is not, the risk is that although it can
successfully amortize rawA() (or rawB()), it will interact with them,
and at low token counts the costs of the composed operations cannot be
compared in a way that says anything about rawA() vs rawB() cost.
I'm sure that there must be something that I'm missing here.
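To make the concern concrete, here is a toy calculation with made-up numbers (nothing measured, purely hypothetical; the class and variable names are mine) showing how subtracting the standalone amortization() result can even invert the ordering when consumeCPU overlaps differently with rawA() and rawB() in the pipeline:

```java
// Hypothetical timings (invented for illustration, not measured):
// if the consumeCPU loop overlaps in the pipeline with rawA()/rawB(),
// the combined cost is not the sum of the standalone costs.
public class SubtractionPitfall {
    public static void main(String[] args) {
        double amortizationAlone = 30.0; // ns/op for amortization() on its own
        double aCombined = 32.0;         // ns/op for a(): overlap hides most of rawA()
        double bCombined = 33.0;         // ns/op for b(): rawB() overlaps less

        double realA = 5.0; // ns/op rawA() would cost in isolation
        double realB = 4.0; // ns/op rawB() would cost in isolation

        // Naive subtraction of the standalone amortization() result:
        double inferredA = aCombined - amortizationAlone; // 2.0
        double inferredB = bCombined - amortizationAlone; // 3.0

        // The inferred ordering (A cheaper than B) is the opposite of the
        // real one (rawA is the more expensive operation here).
        System.out.println("inferred: A=" + inferredA + " B=" + inferredB);
        System.out.println("real:     A=" + realA + " B=" + realB);
    }
}
```

Of course the real timings could go either way; the point is only that the subtraction tells you nothing unless the amortization cost composes additively with the measured operation.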
FYI, here is the asm printed by JMH for tokens = 10; I can see that
consumeCPU doesn't get inlined (probably too unpredictable?):
....[Hottest Region 1]..............................................................................
C2, level 4, org.openjdk.jmh.infra.Blackhole::consumeCPU, version 501 (88 bytes)

</print_nmethod>
Decoding compiled method 0x00007f15e9223bd0:
Code:
[Entry Point]
[Verified Entry Point]
[Constants]
  # {method} {0x00007f15fcde2298} 'consumeCPU' '(J)V' in 'org/openjdk/jmh/infra/Blackhole'
  # parm0:    rsi:rsi   = long
  #           [sp+0x30]  (sp of caller)
          0x00007f15e9223d20: mov    %eax,-0x14000(%rsp)
  2.31%   0x00007f15e9223d27: push   %rbp
          0x00007f15e9223d28: sub    $0x20,%rsp            ;*synchronization entry
                                                           ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@-1 (line 456)
  2.01%   0x00007f15e9223d2c: movabs $0x76fe0bb60,%r10     ;   {oop(a 'java/lang/Class' = 'org/openjdk/jmh/infra/Blackhole')}
  0.34%   0x00007f15e9223d36: mov    0x68(%r10),%r10       ;*getstatic consumedCPU
                                                           ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@0 (line 456)
          0x00007f15e9223d3a: test   %rsi,%rsi
       ╭  0x00007f15e9223d3d: jle    0x00007f15e9223d75    ;*ifle
       │                                                   ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
       │  0x00007f15e9223d3f: movabs $0x5deece66d,%r11
  1.55%│  0x00007f15e9223d49: movabs $0xffffffffffff,%r8   ;*lload_2
       │                                                   ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@14 (line 467)
  4.17%│↗ 0x00007f15e9223d53: mov    %r10,%r9
  4.22%││ 0x00007f15e9223d56: imul   %r11,%r9
 24.75%││ 0x00007f15e9223d5a: add    %rsi,%r9
 11.38%││ 0x00007f15e9223d5d: dec    %rsi                  ;*lsub
       ││                                                  ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@36 (line 466)
  3.63%││ 0x00007f15e9223d60: add    $0xb,%r9
  6.00%││ 0x00007f15e9223d64: and    %r8,%r9
  9.23%││ 0x00007f15e9223d67: add    %r9,%r10              ; OopMap{off=74}
       ││                                                  ;*goto
       ││                                                  ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
 11.94%││ 0x00007f15e9223d6a: test   %eax,0x177e3290(%rip) # 0x00007f1600a07000
       ││                                                  ;*goto
       ││                                                  ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
       ││                                                  ;   {poll}
  4.40%││ 0x00007f15e9223d70: test   %rsi,%rsi
       │╰ 0x00007f15e9223d73: jg     0x00007f15e9223d53    ;*ifle
       │                                                   ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
       ↘  0x00007f15e9223d75: cmp    $0x2a,%r10
       ╭  0x00007f15e9223d79: je     0x00007f15e9223d87    ;*ifne
       │                                                   ; - org.openjdk.jmh.infra.Blackhole::consumeCPU@47 (line 474)
  1.79%│  0x00007f15e9223d7b: add    $0x20,%rsp
  0.28%│  0x00007f15e9223d7f: pop    %rbp
       │  0x00007f15e9223d80: test   %eax,0x177e327a(%rip) # 0x00007f1600a07000
       │                                                   ; {poll_return}
       │  0x00007f15e9223d86: retq
       ↘  0x00007f15e9223d87: mov    $0xffffff65,%esi
          0x00007f15e9223d8c: mov    $0x2a,%r11d
          0x00007f15e9223d92: cmp    %r11,%r10
          0x00007f15e9223d95: mov    $0xffffffff,%ebp
          0x00007f15e9223d9a: jl     0x00007f15e9223da4
          0x00007f15e9223d9c: setne  %bpl
          0x00007f15e9223da0: movzbl %bpl,%ebp             ;*lcmp
....................................................................................................
88.00%  <total for region 1>
Looking at the code, it seems that all the comments on
Blackhole::consumeCPU are correct (regarding the cmp and test
instructions), and it is indeed a very straightforward translation from
code to asm: no loop unrolling or weird tricks by the JVM, just an
mfence dropped somewhere, maybe.
There are only the safepoint polls, one on each loop iteration and one
at the end of the method.
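For reference, here is the hot loop from the linked Blackhole.java, paraphrased into a standalone, runnable sketch. The real method reads and writes the static volatile field Blackhole.consumedCPU (seeded with System.nanoTime()) instead of taking and returning t; I refactored it into a pure function so it can be executed on its own. The class and method names here are mine:

```java
// Paraphrase of the loop body of Blackhole.consumeCPU from the linked
// file, refactored into a pure function. The real method uses the static
// volatile field Blackhole.consumedCPU instead of the `t` parameter.
public class ConsumeCpuSketch {
    static long consume(long t, long tokens) {
        for (long i = tokens; i > 0; i--) {
            // This maps onto the imul/add/add/and sequence in the asm above:
            // r9 = t * 0x5DEECE66D; r9 += i; r9 += 0xB;
            // r9 &= 0xFFFFFFFFFFFF; t += r9
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        return t;
    }

    public static void main(String[] args) {
        // The `cmp $0x2a,%r10` in the asm is the `t == 42` side-effect
        // guard at the end of the real method.
        System.out.println("t after 10 tokens from 0: " + consume(0L, 10L));
    }
}
```

The constants 0x5DEECE66D and 0xB are the java.util.Random LCG multiplier and addend, which is why the loop body is one imul plus a few adds and an and: cheap, but with a loop-carried dependency on t that keeps it from being optimized away.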
On Sunday, March 24, 2019 at 10:12:56 UTC+1, Avi Kivity wrote:
Suppose you have micro-operations A and B that take t(A) and t(B)
to run. Running repeat(n, A+B) can take n*(t(A) + t(B)), or
n*max(t(A), t(B)), or n*(t(A) + t(B) + huge_delta), or something else.
Sometimes the CPU can completely parallelize A and B so running
them in parallel takes no extra time compared to just one.
Sometimes running them in sequence causes one of the caches to
overflow and efficiency decreases dramatically. And sometimes
running both can undo some quirk and you end up with them taking
less time.
Summary: CPUs are complicated.
On 24/03/2019 10.11, Francesco Nigro wrote:
Hi folks,
while reading the awesome
https://shipilev.net/blog/2014/nanotrusting-nanotime/ I have
some questions about the "Building Performance Models" part.
Specifically, when you want to compare two operations (e.g. A and B)
and you want to emulate the behaviour of a real application, you
need to amortize the cost of those operations: in JMH this is
achieved with Blackhole.consumeCPU(long tokens), but any
microbenchmark tool (even off the JVM) could/should provide
something similar.
That said, the code to measure is now not just A or B but is
composed of two operations: amortization, then A (or B).
For JMH that means:

@Benchmark
public int a() {
    Blackhole.consumeCPU(tokens);
    // suppose rawA() returns an int and is the original call for A
    return rawA();
}
In JMH it is up to the tool to avoid dead-code elimination when you
return a value from a benchmarked method.
The point of the article seems to be that, given that "performance
is not composable", if you want to compare the costs of A and B
(with amortization) you cannot create a third benchmark:
@Benchmark
public void amortization() {
    Blackhole.consumeCPU(tokens);
}
and subtract its results (e.g. throughput of calls) from the results
of a() (or b()) to compare the costs of A and B.
I don't understand the meaning of "performance is not composable",
and I would appreciate your opinion on that, given that many people
on this list have experience with benchmarking.
Thanks,
Franz
--
You received this message because you are subscribed to the
Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it,
send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.