Hello,

I think that creating a macro-benchmarking module would be a very good
idea. It would make performance-related changes much easier and safer
to make.

I have also used Peel, and can confirm that it would be a good fit for
this task.

> I've also been looking recently at some of the hot code and see about a
> ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap
> to bitshift and bitmask rather than divide and modulo. The trade-off is
> that to align on a power-of-2 we have holes in and require additional
> MemoryBuffers.

I've also noticed the performance problem that those divisions in
NormalizedKeySorter.compare/swap cause, and I have an idea for
eliminating them without the trade-off of aligning to a power of 2.
I've opened a Jira [1], where I explain it.
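
To make the quoted trade-off concrete, here is a minimal sketch of the two
ways to map a record index to a (segment, offset) pair. It is not Flink's
NormalizedKeySorter code and not the approach from the Jira: the class and
method names are hypothetical, and the 32 KiB segment and 12-byte index
entry are assumptions chosen only for illustration.

```java
// Hypothetical helper for illustration only; not part of Flink.
final class RecordIndexing {

    // Divide/modulo variant: works for any number of records per segment.
    static int segmentDiv(int recordIndex, int recordsPerSegment) {
        return recordIndex / recordsPerSegment;
    }

    static int offsetMod(int recordIndex, int recordsPerSegment, int recordSize) {
        return (recordIndex % recordsPerSegment) * recordSize;
    }

    // Shift/mask variant: only valid when recordsPerSegment == 1 << shift.
    static int segmentShift(int recordIndex, int shift) {
        return recordIndex >>> shift;
    }

    static int offsetMask(int recordIndex, int shift, int recordSize) {
        return (recordIndex & ((1 << shift) - 1)) * recordSize;
    }

    // Largest shift such that (1 << shift) records still fit in one segment.
    static int alignedShift(int segmentSize, int recordSize) {
        return 31 - Integer.numberOfLeadingZeros(segmentSize / recordSize);
    }

    public static void main(String[] args) {
        int segmentSize = 32 * 1024; // assumed segment size in bytes
        int recordSize = 12;         // assumed index entry size in bytes

        int exact = segmentSize / recordSize;
        int shift = alignedShift(segmentSize, recordSize);
        int aligned = 1 << shift;

        // The "holes": bytes per segment left unused after power-of-2 alignment.
        int unused = segmentSize - aligned * recordSize;

        System.out.println("records per segment (exact):   " + exact);
        System.out.println("records per segment (aligned): " + aligned);
        System.out.println("unused bytes per segment:      " + unused);
    }
}
```

Both variants return the same result whenever the number of records per
segment is a power of 2; the cost of getting there is the unused space,
which is why more buffers are needed for the same number of records.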

Best,
Gábor

[1] https://issues.apache.org/jira/browse/FLINK-3722




2016-04-06 18:56 GMT+02:00 Greg Hogan <c...@greghogan.com>:
> I'd like to discuss the creation of a macro-benchmarking module for Flink.
> This could be run during pre-release testing to detect performance
> regressions and during development when refactoring or performance tuning
> code on the hot path.
>
> Many users have published benchmarks and the Flink libraries already
> contain a modest selection of algorithms. Some benefits of creating a
> consolidated collection of macro-benchmarks include:
>
> - comprehensive code coverage: a diverse set of algorithms can stress every
> aspect of Flink (streaming, batch, sorts, joins, spilling, cluster, ...)
>
> - codify best practices: benchmarks should be relatively stable and
> repeatable
>
> - efficient: an automated system can run many more tests and generate more
> accurate results
>
> Macro-benchmarks would be useful in analyzing improved performance with the
> proposed specialized serializers and comparators [FLINK-3599] or making
> Flink NUMA-aware [FLINK-3163].
>
> I've also been looking recently at some of the hot code and see about a
> ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap
> to bitshift and bitmask rather than divide and modulo. The trade-off is
> that to align on a power-of-2 we have holes in and require additional
> MemoryBuffers. And I'm testing on a single data type, IntValue, and there
> may be different results for LongValue or StringValue or custom types or
> with different algorithms. And replacing multiply with a left shift reduces
> performance, demonstrating the need to test changes in isolation.
>
> There are many more ideas, e.g. NormalizedKeySorter writing keys before the
> pointer so that the offset computation is performed outside of the compare
> and sort methods. Also, SpanningRecordSerializer could skip to the next
> buffer rather than writing length across buffers. These changes might each
> be worth a few percent. Other changes might be less than a 1% speedup, but
> taken in aggregate will yield a noticeable performance increase.
>
> I like the idea of profile first, measure second, then create and discuss
> the pull request.
>
> As for the actual macro-benchmarking framework, it would be nice if the
> algorithms would also verify correctness alongside performance. The
> algorithm interface would be warmup (run only once) and execute, which
> would be run multiple times in an interleaved manner. The benchmarking
> duration should be tunable.
>
> The framework would be responsible for configuring, starting, and stopping
> the cluster, executing algorithms and recording performance, and comparing
> and analyzing results.
>
> Greg
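
The warmup/execute interface described in the quoted message could look
roughly like the sketch below. Everything in it is an assumption made for
illustration: the names MacroBenchmark and InterleavedRunner are
hypothetical, the tunable duration is reduced to a simple round count, and
execute() reporting correctness as a boolean is just one possible way to
verify results alongside timing. Cluster setup and result analysis are
left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical benchmark contract: one warmup, then repeated timed runs.
interface MacroBenchmark {
    String name();
    void warmup();      // run exactly once, before any timing
    boolean execute();  // one timed run; returns true if the result verified
}

final class InterleavedRunner {

    // Runs every benchmark once per round, so that runs of different
    // benchmarks are interleaved instead of executed back to back.
    static Map<String, List<Long>> run(List<? extends MacroBenchmark> benchmarks, int rounds) {
        Map<String, List<Long>> nanosByName = new LinkedHashMap<>();
        for (MacroBenchmark b : benchmarks) {
            b.warmup();
            nanosByName.put(b.name(), new ArrayList<>());
        }
        for (int round = 0; round < rounds; round++) {
            for (MacroBenchmark b : benchmarks) {
                long start = System.nanoTime();
                boolean correct = b.execute();
                long elapsed = System.nanoTime() - start;
                if (!correct) {
                    throw new IllegalStateException(b.name() + " produced an incorrect result");
                }
                nanosByName.get(b.name()).add(elapsed);
            }
        }
        return nanosByName;
    }

    public static void main(String[] args) {
        // Toy stand-in for a real algorithm: sort random ints, then verify order.
        MacroBenchmark sortInts = new MacroBenchmark() {
            public String name() { return "sort-ints"; }
            public void warmup() { execute(); }
            public boolean execute() {
                int[] data = new Random(42).ints(1_000_000).toArray();
                Arrays.sort(data);
                for (int i = 1; i < data.length; i++) {
                    if (data[i - 1] > data[i]) return false;
                }
                return true;
            }
        };
        System.out.println(run(Arrays.asList(sortInts), 5));
    }
}
```

Interleaving spreads slow drift (thermal throttling, background load)
across all benchmarks rather than letting it penalize whichever one
happens to run last.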
