Yes, that was a very interesting read and describes very well how JMH works.
Ralph

> On Jul 7, 2019, at 4:30 AM, Remko Popma <[email protected]> wrote:
>
> This may be of interest:
> https://www.researchgate.net/publication/333825812_What's_Wrong_With_My_Benchmark_Results_Studying_Bad_Practices_in_JMH_Benchmarks
>
>> On Jul 6, 2019, at 5:49, Raman Gupta <[email protected]> wrote:
>>
>> I'm no JMH expert... but there is a lot more going on there than just iterations to avoid the extra time for JIT compilation on warmup. Check out https://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf for an idea of the kinds of problems naive timer-only benchmarks encounter, and the different ways in which JMH solves them. There are other good presentations about JMH at https://shipilev.net/ as well.
>>
>> Secondly, JMH isn't de-optimizing the JVM (a la client JVM) willy-nilly. It just provides an API that allows you to measure what you (hopefully) think you are measuring, e.g. blackholes to consume results that would otherwise be optimized away by the JVM because the result would otherwise be unused, and so forth. AFAIK, it doesn't apply these "de-optimizations" on its own -- it just provides an API for you to use as needed.
>>
>> Thirdly, I don't believe JMH does anything (by default) to optimize away GC or anything like that -- in fact it allows you to consider the effect of GC by running lots of iterations, so that the effect of GC *is* taken into account in the benchmark. It does have a (non-default) option to run a GC before each iteration for specific use cases (e.g. benchmarks which generate lots of garbage). It also has a GC profiler to understand the impact of GC on your results (see https://github.com/Valloric/jmh-playground/blob/master/src/jmh/java/org/openjdk/jmh/samples/JMHSample_35_Profilers.java).
>>
>> Regards,
>> Raman
>>
>>> On Fri, Jul 5, 2019 at 4:06 PM Ralph Goers <[email protected]> wrote:
>>>
>>> Actually, I would disagree.
>>> First, I don't think the first thousand or so iterations running slow would matter much with 1.5 million iterations. Let's say the first thousand take 10 times longer than the rest of the iterations. That is still only the equivalent of 10,000 iterations. That is a blip in 1.5 million.
>>>
>>> Second, measuring the impact of garbage collection should be a part of the benchmark if your code is generating garbage.
>>>
>>> Again, I am not saying JMH doesn't have its place, but I have to think something weird is going on when JMH shows the no-location benchmark to be about 4 times faster than the location benchmark, while the simplistic loop test shows the elapsed time when including the location information to be about twice that of not including it. I can tell that the call to StackWalker isn't being optimized away because it still shows up in YourKit as the most expensive operation. And when I turn on immediateFlush, flushing the file shows up in YourKit.
>>>
>>> Ralph
>>>
>>>> On Jul 5, 2019, at 12:18 PM, Matt Sicker <[email protected]> wrote:
>>>>
>>>> What I mean is that JMH is written to avoid including behind-the-scenes variability of the code execution that changes from startup until the code has been compiled to native code, garbage collection, synchronization points for threads (safepoints), etc. All these things become variable noise that affects benchmarks when the code you're testing executes in an especially short period of time (e.g., appending a log event is pretty fast, relatively speaking).
>>>>
>>>> Say you wrote a naive benchmark like this:
>>>>
>>>>     long start = System.nanoTime();
>>>>     for (int i = 0; i < 1_000_000; i++) {
>>>>         runBenchmark();
>>>>     }
>>>>     long end = System.nanoTime();
>>>>     double averageExecutionNanos = (end - start) / 1_000_000.0;
>>>>
>>>> This will give incorrect results because the first thousand or so iterations will be run by the first-gen compiler, then the next iterations will be run via code output from the second-gen compiler, which will be significantly faster. So then you might split that up into a warmup block and a measurement block. You'll still be including the execution time spent in garbage collection, thread synchronization, JIT compilation, etc., that isn't relevant to the benchmark for executing the code over time.
>>>>
>>>>> On Fri, 5 Jul 2019 at 01:25, Ralph Goers <[email protected]> wrote:
>>>>>
>>>>> Well, I can infer it from the examples but it doesn't come close to explaining some of the things we use it for. I would argue that logging a line whose value changes on every call, even if it is just an incremented integer, that writes to a file is more than a micro-benchmark, and I truly want to know how it is going to perform in real life, not with all the optimizations disabled. When I first started Log4j 2 I noticed right away how much slower it ran when Java ran in client mode, which was much more common back then. That is essentially what I think JMH is doing with some of these benchmarks. So while it may be useful in determining that one algorithm is better than another when the code has not been optimized, I wouldn't trust that you will see the same behavior when running in a "real" environment.
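The warmup/measurement split Matt describes above can be sketched in plain Java. Everything here is a placeholder illustration (the `runBenchmark` body and iteration counts are arbitrary, not anything from Log4j), and, as Matt notes, this still does not isolate GC, safepoints, or dead-code elimination -- which is exactly the gap JMH fills:

```java
// Minimal sketch of splitting a naive loop benchmark into a warmup block
// and a measurement block, per the discussion above. The workload and
// counts are placeholders; this does NOT control for GC, safepoints, or
// dead-code elimination the way JMH does.
public class SplitBenchmark {

    static long state; // consume results so the work is not trivially dead code

    static void runBenchmark() {
        state += System.identityHashCode(new Object());
    }

    static double measure(int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            runBenchmark(); // give the JIT a chance to compile the hot path first
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            runBenchmark();
        }
        long end = System.nanoTime();
        return (end - start) / (double) iterations; // average nanos per call
    }

    public static void main(String[] args) {
        System.out.println("avg ns/op: " + measure(100_000, 1_000_000));
    }
}
```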
>>>>>
>>>>> For example, in testing calculating the location I doubt the JIT compiler can optimize away the call to the native methods involved in grabbing the stack frames. If that is the case then it can't optimize away the methods to walk the stack frames, although it can certainly "clean them up". Likewise, it can't avoid writing log events to files, so I don't see how it can eliminate the work to call the methods to do it, especially if the log line isn't constant. That said, it isn't clear to me even after looking at the examples how much JIT stuff JMH is going to allow or prevent. I do see annotations that can be used to prevent inlining, but it isn't clear to me how JMH calls the benchmark methods. All https://www.oracle.com/technetwork/articles/java/architect-benchmarking-2266277.html says is that @State will prevent "certain" optimizations from occurring.
>>>>>
>>>>> Don't get me wrong. I think there are many places where JMH benchmarks are going to be useful. I'm just not sold on it being as useful for "semi-macro" benchmarks like we have in a lot of cases.
>>>>>
>>>>> Ralph
>>>>>
>>>>>> On Jul 4, 2019, at 9:21 PM, Remko Popma <[email protected]> wrote:
>>>>>>
>>>>>> The JMH samples (https://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/) are the best place to learn about the pitfalls of java micro benchmarks and how JMH can help you avoid them.
>>>>>>
>>>>>> Remko.
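The StackWalker work discussed in the thread is the Java 9+ `java.lang.StackWalker` API. A minimal sketch of what computing "location information" involves (this is an illustration of the API, not Log4j's actual `calcLocation`): walk past your own frame and report the first caller frame, which is why a deep call stack makes the operation more expensive.

```java
// Sketch of computing caller location with the Java 9+ StackWalker API.
// A real logger must skip ALL of its own infrastructure frames, not just
// one, which is why stack depth matters for performance.
public class LocationExample {

    static String currentLocation() {
        return StackWalker.getInstance()
                // skip(1) drops currentLocation() itself; the next frame
                // is the caller we want to report.
                .walk(frames -> frames.skip(1).findFirst())
                .map(f -> f.getClassName() + "." + f.getMethodName() + ":" + f.getLineNumber())
                .orElse("unknown");
    }

    public static void main(String[] args) {
        // Prints something like "LocationExample.main:17".
        System.out.println(currentLocation());
    }
}
```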
>>>>>>
>>>>>> (Shameless plug) Every java main() method deserves http://picocli.info
>>>>>>
>>>>>>> On Jul 5, 2019, at 11:25, Ralph Goers <[email protected]> wrote:
>>>>>>>
>>>>>>> I don't understand what you mean with your last message.
>>>>>>>
>>>>>>> I don't really know the details of how JMH works, but my understanding is it does something to prevent things like dead code elimination. I haven't been able to find anything very specific about what JMH actually does, though.
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>>> On Jul 4, 2019, at 11:15 AM, Matt Sicker <[email protected]> wrote:
>>>>>>>>
>>>>>>>> And the safepoint bias thing is alleviated, amongst other noise from the JVM that would otherwise invalidate the time measurements.
>>>>>>>>
>>>>>>>>> On Thu, Jul 4, 2019 at 13:14, Matt Sicker <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> I thought JMH ran your code enough for JIT to take effect (though that's also up to your warmup iteration setting), and the
>>>>>>>>>
>>>>>>>>> On Thu, Jul 4, 2019 at 00:21, Ralph Goers <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> That said, here are the results of the FileAppenderWithLocationBenchmark from master.
>>>>>>>>>>
>>>>>>>>>> Benchmark                                      Mode   Cnt       Score      Error  Units
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j1File   thrpt   20   27597.034 ±  350.465  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2File   thrpt   20   82966.450 ± 2023.826  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2RAF    thrpt   20   78820.010 ± 1452.810  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.logbackFile  thrpt   20  138656.655 ± 6329.118  ops/s
>>>>>>>>>>
>>>>>>>>>> and here are the results with all the changes I have made.
>>>>>>>>>>
>>>>>>>>>> Benchmark                                      Mode   Cnt       Score      Error  Units
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j1File   thrpt   20   47003.919 ± 1594.591  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2File   thrpt   20  185129.606 ± 2309.118  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2RAF    thrpt   20  183346.456 ± 3103.615  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.logbackFile  thrpt   20  130121.612 ± 2535.387  ops/s
>>>>>>>>>>
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>>> On Jul 3, 2019, at 8:35 PM, Ralph Goers <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Remko,
>>>>>>>>>>>
>>>>>>>>>>> I ran the existing JMH benchmarks for both no location and with location, and there is still a huge difference. The problem is I don't necessarily believe the results. Nor do I think the benchmark I have been using is totally accurate either. The problem is that, as I understand it, JMH limits or eliminates much of the optimization the Java runtime will perform. This is bad for Log4j because in some ways our code is written inefficiently and HotSpot will optimize it. OTOH, since the test is pretty much logging the same record with an incremented value from the same location, it is hard to say whether the code isn't being over-optimized. I suspect the real performance will be somewhere in the middle. What would be better is something like a RESTful service with plenty of logging, driven by JMeter, once with no location and once with it.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 2, 2019, at 3:47 PM, Remko Popma <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> If memory serves correctly, the `includeLocation` attribute was introduced because it was necessary for async loggers (and it seemed not feasible to get this information from the layout pattern).
>>>>>>>>>>>>
>>>>>>>>>>>> Async loggers need to know upfront whether they need to pass location information to the consumer thread.
>>>>>>>>>>>>
>>>>>>>>>>>> If obtaining location information becomes so cheap that it's essentially free, it is fine to ignore the `includeLocation` directive and include location information anyway.
>>>>>>>>>>>>
>>>>>>>>>>>> About the performance results: what's the difference between "Log4j 3.0 with class and method" and "Log4j 3.0 withLocation()"?
>>>>>>>>>>>>
>>>>>>>>>>>> Also, I am generally cautious of custom benchmarks; any reason not to use JMH?
>>>>>>>>>>>>
>>>>>>>>>>>> Overall impressive results, congratulations!
>>>>>>>>>>>>
>>>>>>>>>>>> Remko.
>>>>>>>>>>>>
>>>>>>>>>>>> (Shameless plug) Every java main() method deserves http://picocli.info
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 3, 2019, at 7:26, Ralph Goers <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> But what do you think about includeLocation? Should it override the result of requiresLocation, or should the fact that we determined location information is required override the includeLocation setting?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 2, 2019, at 2:34 PM, Matt Sicker <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These benchmarks are impressive! Moving the point of call site calculation is a great idea.
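The "requiresLocation"/call-site idea discussed in the thread can be illustrated with a toy sketch. Only the name `requiresLocation` comes from the thread; every class here is a hypothetical stand-in, not Log4j's actual code: components advertise whether they need caller location, and the logger walks the stack once, up front, only when some configured component will use it.

```java
import java.util.List;

// Toy illustration of the requiresLocation idea: components declare whether
// they need caller location, and the stack is walked once, eagerly, only if
// at least one component will use it. All classes here are hypothetical
// stand-ins, not Log4j internals.
public class RequiresLocationSketch {

    interface Component {
        boolean requiresLocation();
        void append(String message, StackWalker.StackFrame location); // location may be null
    }

    static final class PatternWithLocation implements Component {
        public boolean requiresLocation() { return true; }
        public void append(String message, StackWalker.StackFrame location) {
            System.out.println((location == null ? "?" : location.getMethodName()) + " - " + message);
        }
    }

    static void log(List<Component> components, String message) {
        // Compute location once, high up the stack, only when needed.
        boolean needed = components.stream().anyMatch(Component::requiresLocation);
        StackWalker.StackFrame loc = needed
                ? StackWalker.getInstance().walk(s -> s.skip(1).findFirst()).orElse(null)
                : null;
        for (Component c : components) {
            c.append(message, c.requiresLocation() ? loc : null);
        }
    }

    public static void main(String[] args) {
        log(List.of(new PatternWithLocation()), "hello");
    }
}
```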
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, 2 Jul 2019 at 16:22, Ralph Goers <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have made changes to the code in the master branch to measure and improve performance. I have committed some of the changes but not the last set. I have not made the latest changes to release-2.x because they will have to be implemented a bit differently to avoid compatibility issues. The table below summarizes what I am seeing, at least on my MacBook Pro. This test basically just writes 1.5 million log statements in a loop using an equivalent format for each logging framework. Note that JUL is faster with location information than without, primarily because a custom Handler had to be created to format the log record and it performs better than the one used to log without location information.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are several things I should note about this:
>>>>>>>>>>>>>> • The first step, which is in the PR, has StackLocator.calcLocation modified to use a faster algorithm. Streams are not particularly fast. This resulted in some gains not shown here but which can be seen in https://issues.apache.org/jira/browse/LOG4J2-2644
>>>>>>>>>>>>>> • Performance is proportional to the number of stack frames that need to be traversed. JUL only had 11 whereas Log4j had 35 before it called calcLocation. Our stack is very deep because AbstractLogger and LoggerConfig have been overly refactored, along with having to go through several intermediate objects.
>>>>>>>>>>>>>> • I introduced a requiresLocation() method throughout the stack. Any component that requires location information sets this value to true. Based on this, the call to calcLocation can be made significantly higher up the stack, just before the log event is created, but only when a configured component will use it - with a few exceptions. This code is not yet checked in.
>>>>>>>>>>>>>> • If for some reason the location information isn't calculated when the event is constructed, it can still be lazily populated later using the current, slower mechanism.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As you can see, these changes result in a pretty dramatic improvement, almost to the point that the cost of obtaining location information doesn't really matter - especially in Java 11. The results in Java 11 amount to 5us per logging call with no location vs 7.5us per logging call with location info. These results could probably be improved even more if AbstractLogger was "flattened" a bit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I should point out that the logic isn't perfect. I cannot determine whether location information is required or not for the RoutingAppender since it dynamically creates Appenders at run time. Instead, I added a "requiresLocation" attribute that can be set to indicate that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, in testing I have encountered a unit test failure in RandomAccessFileAppenderTest. It is failing because the location information is being included even though includeLocation is set to false. I am wondering if that setting needs to be respected, since the pattern specifies including it.
>>>>>>>>>>>>>> It seems odd to me to not include it in that case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Matt Sicker <[email protected]>
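Raman's point earlier in the thread about blackholes is worth making concrete. JMH's `Blackhole.consume` exists so that a benchmark's result is not dead code; a rough stand-in outside JMH (this volatile-sink idiom is an illustration of the idea only, not how the real Blackhole is implemented) is to write each result to a volatile field:

```java
// Rough stand-in for JMH's Blackhole.consume: a write to a volatile field
// is a side effect the JIT must preserve, so the computation feeding it
// cannot be eliminated as dead code. Illustration only -- the real JMH
// Blackhole is considerably more sophisticated.
public class SinkExample {

    private static volatile long sink;

    static long work(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += i * 31L; // some computation worth measuring
        }
        return acc;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            // Without consuming the result, the optimizer is free to treat
            // work(100) as dead code, making the timing meaningless.
            sink = work(100);
        }
        System.out.println("avg ns/op: " + (System.nanoTime() - start) / 1_000_000.0);
    }
}
```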
