On Wed, Jan 17, 2024 at 2:07 AM Matthew de Detrich
<matthew.dedetr...@aiven.io.invalid> wrote:
> As you can see from the results there are some noticeable improvements
> (i.e. 5-10% in some cases); however, I wouldn't take these results as
> complete gospel, as I had to do the benchmarks on my M1 laptop (I had
> it plugged into power and used TGPro to put the fans on max blast to
> reduce any variability; unfortunately I am currently overseas, so I
> don't have a dedicated machine to test on).

Thanks for running these, Matthew. Alas, I cannot reproduce the
difference. How many times did you run the benchmarks to reproduce the
noticeable results yourself?

Here are a few more ideas:

 - rather than turning the fans up, clock the CPU down and keep track
of the actual frequencies while running (e.g. using perf stat). There
might still be thermal reservoirs of unknown size and state even with
the highest fan settings, especially in laptops.
 - always run more than one fork; many benchmarks have (at least)
bimodal distributions of performance in steady state, varying by a few
percent. If you know there's a multimodal distribution, make a rough
estimate of how many forks you need to be sure you aren't comparing
one coin flip with another one that happened to land on the other side
(see the JMH sketch after this list).
 - if the error bars overlap a lot, that's a sign that you need to run
more benchmarks for validation (JMH reports a wide 99.9% CI, but make
sure you are confident that the interval itself makes sense; 3
measurements are really the bare minimum for it to mean anything at
all).
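
For reference, a minimal sketch of the kind of JMH setup I mean (the
class name and workload are made up for illustration, not taken from
pekko-http): several forks to catch multimodal steady states, and well
more than 3 measurement iterations so the reported 99.9% CI actually
means something:

    import java.util.concurrent.TimeUnit
    import org.openjdk.jmh.annotations._

    @State(Scope.Benchmark)
    @BenchmarkMode(Array(Mode.Throughput))
    @OutputTimeUnit(TimeUnit.SECONDS)
    @Fork(5)                                // several forks to catch multimodal steady states
    @Warmup(iterations = 10, time = 1)      // reach steady state before measuring
    @Measurement(iterations = 10, time = 1) // well above the bare minimum of 3
    class InliningComparisonBench {

      @Param(Array("128", "4096"))
      var chunkSize: Int = _

      private var input: Array[Byte] = _

      @Setup
      def setup(): Unit =
        input = Array.fill(chunkSize)(42.toByte)

      @Benchmark
      def sumChunk(): Long = {
        // placeholder workload; the real pekko-http benchmark body would go here
        var i = 0
        var acc = 0L
        while (i < input.length) { acc += input(i); i += 1 }
        acc
      }
    }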

Many of these issues are also taken care of by creating a (e.g.
nightly) long-running benchmark series where the random fluctuations
become quite apparent over days.

In general, for complex benchmarks like in pekko-http, I like to use
these rough guidelines for evaluating benchmark evidence:

 * < 5% difference needs exceptional statistical evidence and a
reasonable explanation for the behavior (e.g. you tried to optimize
something before and the improvements are exactly in the area that you
expected)
 * 5-10% difference needs very good statistical evidence and/or
explanations for the improvements
 * ...
 * > 10-15% difference that is consistently better in multiple runs
and environments is likely a real improvement

(When benchmarking single methods you might relax these thresholds,
though the measured improvement might then not materialize in more
realistic scenarios.)
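
To make the thresholds above concrete, here is a rough sketch of the
kind of comparison I mean (hypothetical helper and numbers, not taken
from any of the actual results):

    // Hypothetical helper for comparing two benchmark results, assuming
    // mean throughput plus the 99.9% CI half-width reported by JMH.
    final case class BenchResult(mean: Double, ciHalfWidth: Double)

    def judge(before: BenchResult, after: BenchResult): String = {
      val relDiff = (after.mean - before.mean) / before.mean * 100.0
      val overlap =
        (after.mean - after.ciHalfWidth) <= (before.mean + before.ciHalfWidth) &&
        (before.mean - before.ciHalfWidth) <= (after.mean + after.ciHalfWidth)

      if (overlap)
        f"$relDiff%+.1f%%, but CIs overlap -> run more forks/iterations first"
      else if (math.abs(relDiff) < 5.0)
        f"$relDiff%+.1f%% -> needs exceptional evidence and an explanation"
      else if (math.abs(relDiff) <= 10.0)
        f"$relDiff%+.1f%% -> needs very good evidence and/or an explanation"
      else
        f"$relDiff%+.1f%% -> likely real if it holds across runs and environments"
    }

    // e.g. judge(BenchResult(1000.0, 30.0), BenchResult(1070.0, 25.0))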

The StreamedServerProcessing result seems somewhat internally
inconsistent, since the same "chunked" configuration shows somewhat
different behavior at different chunk sizes, which is possible but
maybe not very likely?

Also, the results seem somewhat cherry-picked, as e.g. most of the
LineParserBenchmarks were slightly (up to 7%) better without inlining
(I'm referring to your results in all of the above).

Here are my quick results (also very weak evidence):
https://gist.github.com/jrudolph/bc97146dedf0290d059e5e44939fbdc0

Btw. the main improvement the inliner promises is that it can avoid a
particular kind of megamorphic call site in higher-order function use,
i.e. Scala collection usage. In most performance-critical situations
this has already been optimized manually wherever it turned up in
profiles. All the other expensive megamorphic call sites (dispatchers,
stream GraphInterpreter, routing DSL) are unfortunately not "static
enough" for an AOT inliner to pick them up with static analysis.
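
To illustrate that point, here is a simplified sketch (made-up code,
not from pekko-http) of the kind of higher-order collection call site
the inliner targets, next to the manual optimization you typically end
up with where it shows up in profiles:

    // Higher-order collection usage: `exists` takes a Function1, and if many
    // different lambdas flow through that call site on the hot path, the JIT
    // sees a megamorphic call and cannot inline the predicate.
    def containsLineEnd(bytes: Vector[Byte]): Boolean =
      bytes.exists(b => b == '\r'.toByte || b == '\n'.toByte)

    // Typical manual optimization: a while loop over an indexed structure,
    // no Function1 in sight, monomorphic calls only.
    def containsLineEndOptimized(bytes: Array[Byte]): Boolean = {
      var i = 0
      while (i < bytes.length) {
        val b = bytes(i)
        if (b == '\r'.toByte || b == '\n'.toByte) return true
        i += 1
      }
      false
    }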

Johannes
