Hi Greg! The idea is very good, especially having these pre-built performance tests for release testing.
In your opinion, are the tests going to be self-contained, or will they need a
cluster (YARN, Mesos, Docker, etc.) to bring up a Flink cluster and run things?

Greetings,
Stephan

On Sat, Apr 9, 2016 at 12:41 PM, Gábor Gévay <gga...@gmail.com> wrote:
> Hello,
>
> I think that creating a macro-benchmarking module would be a very good
> idea. It would make doing performance-related changes much easier and
> safer.
>
> I have also used Peel, and can confirm that it would be a good fit for
> this task.
>
> > I've also been looking recently at some of the hot code and see about a
> > ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap
> > to bitshift and bitmask rather than divide and modulo. The trade-off is
> > that to align on a power-of-2 we have holes in the MemoryBuffers and
> > require additional ones.
>
> I've also noticed the performance problem that those divisions in
> NormalizedKeySorter.compare/swap cause, and have an idea about
> eliminating them without the align-on-power-of-2 trade-off. I've
> opened a Jira [1], where I explain it.
>
> Best,
> Gábor
>
> [1] https://issues.apache.org/jira/browse/FLINK-3722
>
>
> 2016-04-06 18:56 GMT+02:00 Greg Hogan <c...@greghogan.com>:
> > I'd like to discuss the creation of a macro-benchmarking module for Flink.
> > This could be run during pre-release testing to detect performance
> > regressions and during development when refactoring or performance tuning
> > code on the hot path.
> >
> > Many users have published benchmarks and the Flink libraries already
> > contain a modest selection of algorithms. Some benefits of creating a
> > consolidated collection of macro-benchmarks include:
> >
> > - comprehensive code coverage: a diverse set of algorithms can stress
> >   every aspect of Flink (streaming, batch, sorts, joins, spilling,
> >   cluster, ...)
> >
> > - codify best practices: benchmarks should be relatively stable and
> >   repeatable
> >
> > - efficient: an automated system can run many more tests and generate
> >   more accurate results
> >
> > Macro-benchmarks would be useful in analyzing improved performance with
> > the proposed specialized serializers and comparators [FLINK-3599] or
> > making Flink NUMA-aware [FLINK-3163].
> >
> > I've also been looking recently at some of the hot code and see about a
> > ~12-14% total improvement when modifying NormalizedKeySorter.compare/swap
> > to bitshift and bitmask rather than divide and modulo. The trade-off is
> > that to align on a power-of-2 we have holes in the MemoryBuffers and
> > require additional ones. And I'm testing on a single data type, IntValue,
> > and there may be different results for LongValue or StringValue or custom
> > types or with different algorithms. And replacing multiply with a left
> > shift reduces performance, demonstrating the need to test changes in
> > isolation.
> >
> > There are many more ideas, e.g. NormalizedKeySorter writing keys before
> > the pointer so that the offset computation is performed outside of the
> > compare and sort methods. Also, SpanningRecordSerializer could skip to
> > the next buffer rather than writing the length across buffers. These
> > changes might each be worth a few percent. Other changes might be less
> > than a 1% speedup, but taken in aggregate they will yield a noticeable
> > performance increase.
> >
> > I like the idea of profile first, measure second, then create and discuss
> > the pull request.
> >
> > As for the actual macro-benchmarking framework, it would be nice if the
> > algorithms would also verify correctness alongside performance. The
> > algorithm interface would be warmup (run only once) and execute, which
> > would be run multiple times in an interleaved manner. The benchmarking
> > duration should be tunable.
> >
> > The framework would be responsible for configuring, starting, and
> > stopping the cluster, executing algorithms and recording performance,
> > and comparing and analyzing results.
> >
> > Greg
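
To make the divide/modulo-vs-shift/mask trade-off from the quoted thread
concrete, here is a minimal sketch. The class and field names are made up for
illustration and this is not the actual NormalizedKeySorter code; it only shows
how rounding the per-segment record count up to a power of two turns the index
and offset computation into a shift and a mask, at the cost of unused slots at
the end of each segment.

    // Illustrative sketch only; not the actual NormalizedKeySorter code.
    public class SegmentOffsets {

        private final int recordsPerSegment; // records actually stored per segment
        private final int shift;             // log2 of the power-of-two capacity
        private final int mask;              // power-of-two capacity minus one

        public SegmentOffsets(int recordsPerSegment) {
            // assumes recordsPerSegment >= 2
            this.recordsPerSegment = recordsPerSegment;
            int capacity = Integer.highestOneBit(recordsPerSegment - 1) << 1;
            this.shift = Integer.numberOfTrailingZeros(capacity);
            this.mask = capacity - 1;
        }

        // dense layout: one integer division and one remainder per lookup
        int segmentIndexDiv(int record)  { return record / recordsPerSegment; }
        int segmentOffsetDiv(int record) { return record % recordsPerSegment; }

        // power-of-two layout: shift and mask only, but each segment has unused
        // slots ("holes") whenever recordsPerSegment is not a power of two
        int segmentIndexShift(int record)  { return record >>> shift; }
        int segmentOffsetShift(int record) { return record & mask; }
    }

Whether the saved divisions outweigh the extra MemoryBuffers is exactly the
kind of question the proposed macro-benchmarks should answer, per data type
(IntValue, LongValue, StringValue, custom types).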
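For the warmup/execute interface described above, a rough sketch of what the
contract and an interleaved driver could look like follows. MacroBenchmark and
BenchmarkRunner are placeholder names, not an existing Flink or Peel API.

    import java.util.List;

    // Placeholder contract: warmup runs once, execute runs repeatedly and
    // also reports whether the result passed a correctness check.
    public interface MacroBenchmark {

        /** Run exactly once before measurement, e.g. to generate input data. */
        void warmup() throws Exception;

        /** Run repeatedly; returns true if the result verified as correct. */
        boolean execute() throws Exception;
    }

    class BenchmarkRunner {

        /** Interleaves the benchmarks round-robin so that JIT, GC, and cluster
         *  noise are spread across all of them; the round count makes the
         *  overall benchmarking duration tunable. */
        static void run(List<MacroBenchmark> benchmarks, int rounds) throws Exception {
            for (MacroBenchmark benchmark : benchmarks) {
                benchmark.warmup();
            }
            for (int round = 0; round < rounds; round++) {
                for (MacroBenchmark benchmark : benchmarks) {
                    long start = System.nanoTime();
                    boolean correct = benchmark.execute();
                    long millis = (System.nanoTime() - start) / 1_000_000;
                    System.out.printf("%s round %d: %d ms, correct=%b%n",
                            benchmark.getClass().getSimpleName(), round, millis, correct);
                }
            }
        }
    }

Recording would presumably go somewhere more structured than stdout so the
framework can compare results across releases, but the interleaving and the
tunable round count are the parts that matter for repeatability.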