
I think that creating a macro-benchmarking module would be a very good
idea. It would make doing performance-related changes much easier and

I have also used Peel, and can confirm that it would be a good fit for
this task.

> I've also been looking recently at some of the hot code and see about a
> ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap
> to bitshift and bitmask rather than divide and modulo. The trade-off is
> that to align on a power-of-2 we have holes in and require additional
> MemoryBuffers.

I've also noticed the performance problem that those divisions in
NormalizedKeySorter.compare/swap cause, and I have an idea for
eliminating them without the power-of-2 alignment trade-off. I've
opened a Jira [1] where I explain it.
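
For anyone following along, the trade-off in the quoted paragraph can be
sketched roughly as below. This is not the actual NormalizedKeySorter
code: the class and method names are hypothetical, and the numbers in the
note after it (32 KiB segments, 12-byte index entries) are only
illustrative assumptions.

```java
// Hypothetical helper, not Flink code: maps a logical record index to a
// (segment, byte offset) pair in two different ways.
final class IndexMathSketch {

    // General case: any number of records per segment.
    // Costs one integer divide and one modulo per lookup.
    static int[] locateDivMod(int index, int recordsPerSegment, int recordSize) {
        int segment = index / recordsPerSegment;
        int offset = (index % recordsPerSegment) * recordSize;
        return new int[] {segment, offset};
    }

    // Power-of-2 case: recordsPerSegment == 1 << shift, so the divide becomes
    // a right shift and the modulo becomes a mask (non-negative indices only).
    static int[] locateShiftMask(int index, int shift, int recordSize) {
        int mask = (1 << shift) - 1;
        int segment = index >>> shift;
        int offset = (index & mask) * recordSize;
        return new int[] {segment, offset};
    }

    // Largest power of two <= n, for n >= 1. Rounding the per-segment record
    // count down like this is what leaves unused "holes" in each segment.
    static int roundDownToPowerOfTwo(int n) {
        return Integer.highestOneBit(n);
    }
}
```

With those assumed numbers, a segment fits 32768 / 12 = 2730 entries, but
rounding down to 2048 leaves 682 entries' worth of space unused in every
segment, which is why more buffers are needed for the same data.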


[1] https://issues.apache.org/jira/browse/FLINK-3722

2016-04-06 18:56 GMT+02:00 Greg Hogan <c...@greghogan.com>:
> I'd like to discuss the creation of a macro-benchmarking module for Flink.
> This could be run during pre-release testing to detect performance
> regressions and during development when refactoring or performance tuning
> code on the hot path.
> Many users have published benchmarks and the Flink libraries already
> contain a modest selection of algorithms. Some benefits of creating a
> consolidated collection of macro-benchmarks include:
> - comprehensive code coverage: a diverse set of algorithms can stress every
> aspect of Flink (streaming, batch, sorts, joins, spilling, cluster, ...)
> - codify best practices: benchmarks should be relatively stable and
> repeatable
> - efficient: an automated system can run many more tests and generate more
> accurate results
> Macro-benchmarks would be useful in analyzing improved performance with the
> proposed specialized serializers and comparators [FLINK-3599] or making
> Flink NUMA-aware [FLINK-3163].
> I've also been looking recently at some of the hot code and see about a
> ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap
> to bitshift and bitmask rather than divide and modulo. The trade-off is
> that to align on a power-of-2 we have holes in and require additional
> MemoryBuffers. And I'm testing on a single data type, IntValue, and there
> may be different results for LongValue or StringValue or custom types or
> with different algorithms. And replacing multiply with a left shift reduces
> performance, demonstrating the need to test changes in isolation.
> There are many more ideas, e.g. NormalizedKeySorter writing keys before the
> pointer so that the offset computation is performed outside of the compare
> and sort methods. Also, SpanningRecordSerializer could skip to the next
> buffer rather than writing length across buffers. These changes might each
> be worth a few percent. Other changes might be less than a 1% speedup, but
> taken in aggregate will yield a noticeable performance increase.
> I like the idea of profile first, measure second, then create and discuss
> the pull request.
> As for the actual macro-benchmarking framework, it would be nice if the
> algorithms would also verify correctness alongside performance. The
> algorithm interface would be warmup (run only once) and execute, which
> would be run multiple times in an interleaved manner. The benchmarking
> duration should be tunable.
> The framework would be responsible for configuration of as well as starting
> and stopping the cluster, executing algorithms and recording performance,
> and comparing and analyzing results.
> Greg
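
The warmup/execute interface described in the last two quoted paragraphs
could look roughly like the sketch below. All names here are hypothetical
and nothing in it is an existing Flink or Peel API. A fixed number of
rounds stands in for the tunable duration, and cluster setup and result
analysis are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface; the names are illustrative only.
interface MacroBenchmark {
    String name();
    void warmup();      // run only once, not timed
    boolean execute();  // timed; returns true if the result was verified correct
}

final class InterleavedRunner {

    // Runs every warmup once, then cycles through all benchmarks `rounds`
    // times, so slow drift on the machine affects each benchmark equally.
    // Returns the elapsed nanoseconds of each execution, keyed by name.
    static Map<String, List<Long>> run(List<MacroBenchmark> benchmarks, int rounds) {
        Map<String, List<Long>> nanos = new LinkedHashMap<>();
        for (MacroBenchmark b : benchmarks) {
            b.warmup();
            nanos.put(b.name(), new ArrayList<>());
        }
        for (int round = 0; round < rounds; round++) {
            for (MacroBenchmark b : benchmarks) {
                long start = System.nanoTime();
                boolean ok = b.execute();
                long elapsed = System.nanoTime() - start;
                if (!ok) {
                    throw new IllegalStateException(b.name() + " produced an incorrect result");
                }
                nanos.get(b.name()).add(elapsed);
            }
        }
        return nanos;
    }
}
```

Having execute() report correctness means a change that speeds things up
by breaking the result fails the run instead of showing up as a win.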
