> The call to consumeCPU may not have any effect on the loop performance if
n is low enough
This is clear to me, but with a "high enough" n (to be quantified, depending
on the operation to be amortized and on the hardware and its utilisation) it
should produce a linear-ish time cost directly proportional to n. I suppose
the author's point is that it should be used to produce a deterministic
load, so that the cost of the operation being amortized fades starting from
an n to be determined experimentally.
What I'm not getting is why comparing the two amortized benchmark results
directly (the ones I sent in the first email) is OK, while doing it through
a third benchmark that just calls Blackhole::consumeCPU is not.
I suppose both approaches risk comparing apples to oranges; the latter is
just even more wrong because it introduces a third reference...
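To make the apples-to-oranges risk concrete, here is a toy arithmetic sketch (the costs are invented, purely illustrative): if the CPU fully overlaps consumeCPU with the operation under test, the combined benchmark measures roughly the max of the two costs rather than their sum, so subtracting a standalone consumeCPU baseline underestimates the operation.

```java
public class NonComposable {
    // Toy model with invented costs: if a() overlaps consumeCPU with rawA(),
    // its time is roughly the max of the two costs, not the sum.
    static long combined(long tA, long tCpu) {
        return Math.max(tA, tCpu); // full overlap: the cheaper op hides completely
    }

    // Subtracting the third benchmark's standalone cost from a()'s result:
    static long naiveEstimateOfA(long tA, long tCpu) {
        return combined(tA, tCpu) - tCpu;
    }

    public static void main(String[] args) {
        long tA = 10, tCpu = 8;                         // invented time units
        System.out.println(naiveEstimateOfA(tA, tCpu)); // prints 2, while rawA() costs 10
    }
}
```

With partial overlap the error shrinks but stays unknown, which is why the baseline cannot simply be subtracted.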
On Monday, March 25, 2019 at 12:54:01 UTC+1, Avi Kivity wrote:
>
> consumeCPU() doesn't consume the CPU, but just some parts of it - a few
> integer ports. It leaves alone the instruction cache, the instruction
> decoder, and most of the execution ports.
>
>
> Given that the loop has 10 instructions, the function runs about 100
> instructions, and so it can run in parallel with previous instructions from
> the caller.
>
>
> Consider this code
>
>
> int a[1'000'000'000];
>
> fill a with random values in the range 0..999'999'999
>
> loop {
>
> idx = a[idx]
>
> consumeCPU(n)
>
> }
>
>
> The call to consumeCPU may not have any effect on the loop performance if
> n is low enough. The processor can be memory-bound and execute consumeCPU()
> while waiting for previous reads to complete.
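A minimal runnable Java sketch of the pseudocode above, with the array scaled down to stay runnable and a hypothetical dependent-arithmetic loop standing in for consumeCPU (names and constants are my own, not JMH's):

```java
import java.util.Random;

public class PointerChase {
    // Stand-in for Blackhole.consumeCPU: a short dependent integer loop
    // (hypothetical sketch, not the JMH implementation).
    static long burn(long seed, long tokens) {
        long t = seed;
        for (long i = tokens; i > 0; i--) {
            t = t * 25214903917L + 11L; // any dependent arithmetic works here
        }
        return t;
    }

    // idx = a[idx] chains the loads: each iteration depends on the previous
    // one, so a cache miss stalls the chain while burn() can run in its shadow.
    static int chase(int[] a, int iters, long tokens) {
        int idx = 0;
        long sink = 0;
        for (int n = 0; n < iters; n++) {
            idx = a[idx];
            sink += burn(idx, tokens);
        }
        return (sink == 42) ? -1 : idx; // keep sink live without printing it
    }

    public static void main(String[] args) {
        int size = 1_000_000; // scaled down from 1'000'000'000 for the sketch
        int[] a = new int[size];
        Random r = new Random(12345);
        for (int i = 0; i < size; i++) a[i] = r.nextInt(size);
        System.out.println(chase(a, size, 10));
    }
}
```

With a small tokens value, burn() adds little or nothing to the loop time because it executes while the pending read is outstanding.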
>
>
> On 25/03/2019 10.44, Francesco Nigro wrote:
>
> Thanks Avi,
>
> The way the CPU can parallelize work makes total sense to me, but probably
> I've missed some context about the operation used to amortize A (or B):
>
> http://hg.openjdk.java.net/code-tools/jmh/file/5984e353dca7/jmh-core/src/main/java/org/openjdk/jmh/infra/Blackhole.java#l443
>
> The point I'm not getting is why I can compare the results of a() and b()
> directly (with amortization), but I cannot use amortization() to compare
> a() and b(): to me both look incorrect, just to different degrees.
>
> If Blackhole::consumeCPU is "trusty" (i.e. it has a nearly fixed cost,
> alone or combined with other operations), it means it can always be used
> for comparison; but if it is not, the risk is that although it successfully
> amortizes rawA() (or rawB()), it will interact with them, and with low
> token values the costs of the composed operations cannot be compared in a
> way that says anything about rawA() vs rawB().
>
> I'm sure there must be something I'm missing here.
> FYI, this is the asm printed by JMH for tokens = 10, and I can see that
> consumeCPU doesn't get inlined (probably too unpredictable?):
>
> ....[Hottest Region
> 1]..............................................................................
> C2, level 4, org.openjdk.jmh.infra.Blackhole::consumeCPU, version 501 (88
> bytes)
>
> </print_nmethod>
> Decoding compiled method 0x00007f15e9223bd0:
> Code:
> [Entry Point]
> [Verified Entry Point]
> [Constants]
> # {method} {0x00007f15fcde2298} 'consumeCPU'
> '(J)V' in 'org/openjdk/jmh/infra/Blackhole'
> # parm0: rsi:rsi = long
> # [sp+0x30] (sp of caller)
> 0x00007f15e9223d20: mov %eax,-0x14000(%rsp)
> 2.31% 0x00007f15e9223d27: push %rbp
> 0x00007f15e9223d28: sub $0x20,%rsp
> ;*synchronization entry
> ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@-1 (line 456)
> 2.01% 0x00007f15e9223d2c: movabs $0x76fe0bb60,%r10 ; {oop(a
> 'java/lang/Class' = 'org/openjdk/jmh/infra/Blackhole')}
> 0.34% 0x00007f15e9223d36: mov 0x68(%r10),%r10 ;*getstatic
> consumedCPU
> ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@0 (line 456)
> 0x00007f15e9223d3a: test %rsi,%rsi
> ╭ 0x00007f15e9223d3d: jle 0x00007f15e9223d75 ;*ifle
> │ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
> │ 0x00007f15e9223d3f: movabs $0x5deece66d,%r11
> 1.55% │ 0x00007f15e9223d49: movabs $0xffffffffffff,%r8 ;*lload_2
> │ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@14 (line 467)
> 4.17% │↗ 0x00007f15e9223d53: mov %r10,%r9
> 4.22% ││ 0x00007f15e9223d56: imul %r11,%r9
> 24.75% ││ 0x00007f15e9223d5a: add %rsi,%r9
> 11.38% ││ 0x00007f15e9223d5d: dec %rsi ;*lsub
> ││ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@36 (line 466)
> 3.63% ││ 0x00007f15e9223d60: add $0xb,%r9
> 6.00% ││ 0x00007f15e9223d64: and %r8,%r9
> 9.23% ││ 0x00007f15e9223d67: add %r9,%r10 ;
> OopMap{off=74}
> ││ ;*goto
> ││ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
> 11.94% ││ 0x00007f15e9223d6a: test %eax,0x177e3290(%rip) #
> 0x00007f1600a07000
> ││ ;*goto
> ││ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@39 (line 466)
> ││ ; {poll}
> 4.40% ││ 0x00007f15e9223d70: test %rsi,%rsi
> │╰ 0x00007f15e9223d73: jg 0x00007f15e9223d53 ;*ifle
> │ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@11 (line 466)
> ↘ 0x00007f15e9223d75: cmp $0x2a,%r10
> ╭ 0x00007f15e9223d79: je 0x00007f15e9223d87 ;*ifne
> │ ; -
> org.openjdk.jmh.infra.Blackhole::consumeCPU@47 (line 474)
> 1.79% │ 0x00007f15e9223d7b: add $0x20,%rsp
> 0.28% │ 0x00007f15e9223d7f: pop %rbp
> │ 0x00007f15e9223d80: test %eax,0x177e327a(%rip) #
> 0x00007f1600a07000
> │ ;
> {poll_return}
> │ 0x00007f15e9223d86: retq
> ↘ 0x00007f15e9223d87: mov $0xffffff65,%esi
> 0x00007f15e9223d8c: mov $0x2a,%r11d
> 0x00007f15e9223d92: cmp %r11,%r10
> 0x00007f15e9223d95: mov $0xffffffff,%ebp
> 0x00007f15e9223d9a: jl 0x00007f15e9223da4
> 0x00007f15e9223d9c: setne %bpl
> 0x00007f15e9223da0: movzbl %bpl,%ebp ;*lcmp
>
> ....................................................................................................
> 88.00% <total for region 1>
>
> Looking at the code, it seems that all the comments on
> Blackhole::consumeCPU are correct (regarding the cmp and test
> instructions), and it is indeed a very straightforward code -> ASM
> translation (no loop unrolling or weird tricks by the JVM, just an mfence
> dropped somewhere, maybe).
> Just safepoint polls on each loop iteration and at the end of the method.
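For reference, the hot loop in the asm above decompiles back to roughly the following (a hedged reconstruction using the constants visible in the imul/add/and instructions, omitting the static consumedCPU field and the compare-against-42 DCE guard, so not the literal JMH source):

```java
public class ConsumeCpuSketch {
    // Reconstruction of the loop in the asm dump: multiplier 0x5DEECE66D,
    // increment 0xB, 48-bit mask 0xFFFFFFFFFFFF, with the loop counter i
    // mixed in (the `add %rsi,%r9` instruction).
    static long consume(long seed, long tokens) {
        long t = seed;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & 0xFFFFFFFFFFFFL;
        }
        return t; // the real method compares t against 42 to defeat DCE
    }

    public static void main(String[] args) {
        System.out.println(consume(0, 10));
    }
}
```

Each iteration depends on the previous t, which is what makes the loop's cost roughly linear in tokens instead of collapsing under out-of-order execution.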
>
> On Sunday, March 24, 2019 at 10:12:56 UTC+1, Avi Kivity wrote:
>>
>> Suppose you have micro-operations A and B that take t(A) and t(B) to run.
>> Running repeat(n, A+B) can take n*(t(A) + t(B)), or n*max(t(A), t(B)), or
>> n*(t(A) + t(B) + huge_delta), or something else.
>>
>>
>> Sometimes the CPU can completely parallelize A and B so running them in
>> parallel takes no extra time compared to just one. Sometimes running them
>> in sequence causes one of the caches to overflow and efficiency decreases
>> dramatically. And sometimes running both can undo some quirk and you end up
>> with them taking less time.
>>
>>
>> Summary: CPUs are complicated.
>>
>>
>> On 24/03/2019 10.11, Francesco Nigro wrote:
>>
>> Hi folks,
>>
>> while reading the awesome
>> https://shipilev.net/blog/2014/nanotrusting-nanotime/ I have some
>> questions on the "Building Performance Models" part.
>> Specifically, when you want to compare two operations (i.e. A and B) and
>> you want to emulate the behaviour of a real application, you need to
>> amortize the cost of those operations: in JMH this is achieved with
>> Blackhole.consumeCPU(long tokens), but any microbenchmark tool (even off
>> the JVM) could/should provide something similar.
>>
>> That said, the code to measure is now not just A or B but is composed of
>> two operations: the amortization, then A() (or B()).
>> For JMH that means:
>>
>> @Benchmark
>> int a() {
>>     Blackhole.consumeCPU(tokens);
>>     // suppose rawA() returns an int and is the original call for A
>>     return rawA();
>> }
>>
>> In JMH it is up to the tool to avoid Dead Code Elimination when you
>> return a value from a benchmarked method.
>>
>> The point of the article seems to be that, given that "performance is not
>> composable", if you want to compare the costs of A and B (with
>> amortization) you cannot create a third benchmark:
>>
>> @Benchmark
>> void amortization() {
>>     Blackhole.consumeCPU(tokens);
>> }
>>
>> and use its results (e.g. throughput of calls) to subtract from the
>> results of a() (or b()) in order to compare the costs of A and B.
>> I don't understand the meaning of "performance is not composable" and I
>> would appreciate your opinion on it, given that many people on this list
>> have experience with benchmarking.
>>
>> Thanks,
>> Franz
>> --
>> You received this message because you are subscribed to the Google Groups
>> "mechanical-sympathy" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>