Yes, that was a very interesting read and describes very well how JMH works.
Ralph

> On Jul 7, 2019, at 4:30 AM, Remko Popma <[email protected]> wrote:
>
> This may be of interest:
> https://www.researchgate.net/publication/333825812_What's_Wrong_With_My_Benchmark_Results_Studying_Bad_Practices_in_JMH_Benchmarks
>
>> On Jul 6, 2019, at 5:49, Raman Gupta <[email protected]> wrote:
>>
>> I'm no JMH expert... but there is a lot more going on there than just iterations to avoid the extra time for JIT compilation on warmup. Check out https://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf for an idea of the kinds of problems naive timer-only benchmarks encounter, and the different ways in which JMH solves them. There are other good presentations about JMH at https://shipilev.net/ as well.
>>
>> Secondly, JMH isn't de-optimizing the JVM (a la client JVM) willy-nilly. It just provides an API that allows you to measure what you (hopefully) think you are measuring, e.g. blackholes to consume results that would otherwise be optimized away by the JVM because the result would otherwise be unused, and so forth. AFAIK, it doesn't apply these "de-optimizations" on its own -- it just provides an API for you to use as needed.
>>
>> Thirdly, I don't believe JMH does anything (by default) to optimize away GC or anything like that -- in fact it allows you to consider the effect of GC by running lots of iterations, so that the effect of GC *is* taken into account in the benchmark. It does have a (non-default) option to run a GC before each iteration for specific use cases (e.g. benchmarks which generate lots of garbage). It also has a GC profiler to understand the impact of GC on your results (see https://github.com/Valloric/jmh-playground/blob/master/src/jmh/java/org/openjdk/jmh/samples/JMHSample_35_Profilers.java).
>>
>> Regards,
>> Raman
>>
>>> On Fri, Jul 5, 2019 at 4:06 PM Ralph Goers <[email protected]> wrote:
>>>
>>> Actually, I would disagree.
>>> First, I don't think the first thousand or so iterations running slow would matter much with 1.5 million iterations. Let's say the first thousand take 10 times longer than the rest of the iterations. That is still only the equivalent of 10,000 iterations. That is a blip in 1.5 million.
>>>
>>> Second, measuring the impact of garbage collection should be a part of the benchmark if your code is generating garbage.
>>>
>>> Again, I am not saying JMH doesn't have its place, but I have to think something weird is going on when JMH shows the no-location benchmark to be about 4 times faster than the location benchmark, while the simplistic loop test shows the elapsed time when including the location information to be about twice that of not including it. I can tell that the call to StackWalker isn't being optimized away because it still shows up in YourKit as the most expensive operation. And when I turn on immediateFlush, flushing the file shows up in YourKit.
>>>
>>> Ralph
>>>
>>>> On Jul 5, 2019, at 12:18 PM, Matt Sicker <[email protected]> wrote:
>>>>
>>>> What I mean is that JMH is written to avoid including behind-the-scenes variability of the code execution that changes from startup until the code has been compiled to native code, garbage collection, synchronization points for threads (safepoints), etc. All these things become variable noise that affects benchmarks when the code you're testing executes in an especially short period of time (e.g., appending a log event is pretty fast, relatively speaking).
>>>>
>>>> Say you wrote a naive benchmark like this:
>>>>
>>>>     long start = System.nanoTime();
>>>>     for (int i = 0; i < 1_000_000; i++) {
>>>>         runBenchmark();
>>>>     }
>>>>     long end = System.nanoTime();
>>>>     double averageExecutionNanos = (end - start) / 1_000_000.0;
>>>>
>>>> This will give incorrect results because the first thousand or so iterations will be run by the first-gen compiler, then the next iterations will be run via code output from the second-gen compiler, which will be significantly faster. So then you might split that up into a warmup block and a measurement block. You'll still be including the execution time spent in garbage collection, thread synchronization, JIT compilation, etc., that isn't relevant to the benchmark for executing the code over time.
>>>>
>>>>> On Fri, 5 Jul 2019 at 01:25, Ralph Goers <[email protected]> wrote:
>>>>>
>>>>> Well, I can infer it from the examples but it doesn't come close to explaining some of the things we use it for. I would argue that logging a line whose value changes on every call, even if it is just an incremented integer, that writes to a file is more than a micro-benchmark, and I truly want to know how it is going to perform in real life, not with all the optimizations disabled. When I first started Log4j 2 I noticed right away how much slower it ran when Java ran in client mode, which was much more common back then. That is essentially what I think JMH is doing with some of these benchmarks. So while it may be useful in determining that one algorithm is better than another when the code has not been optimized, I wouldn't trust that you will see the same behavior when running in a "real" environment.
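The warmup/measurement split Matt describes above can be sketched in plain Java. Everything here is a placeholder illustration (the `runBenchmark` body and iteration counts are arbitrary, not anything from Log4j), and, as Matt notes, this still does not isolate GC, safepoints, or dead-code elimination -- which is exactly the gap JMH fills:

```java
// Minimal sketch of splitting a naive loop benchmark into a warmup block
// and a measurement block, per the discussion above. The workload and
// counts are placeholders; this does NOT control for GC, safepoints, or
// dead-code elimination the way JMH does.
public class SplitBenchmark {

    static long state; // consume results so the work is not trivially dead code

    static void runBenchmark() {
        state += System.identityHashCode(new Object());
    }

    static double measure(int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            runBenchmark(); // give the JIT a chance to compile the hot path first
        }
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            runBenchmark();
        }
        long end = System.nanoTime();
        return (end - start) / (double) iterations; // average nanos per call
    }

    public static void main(String[] args) {
        System.out.println("avg ns/op: " + measure(100_000, 1_000_000));
    }
}
```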
>>>>>
>>>>> For example, in testing calculating the location I doubt the JIT compiler can optimize away the call to the native methods involved in grabbing the stack frames. If that is the case then it can't optimize away the methods to walk the stack frames, although it can certainly "clean them up". Likewise, it can't avoid writing log events to files, so I don't see how it can eliminate the work to call the methods to do it, especially if the log line isn't constant. That said, it isn't clear to me even after looking at the examples how much JIT stuff JMH is going to allow or prevent. I do see annotations that can be used to prevent inlining, but it isn't clear to me how JMH calls the benchmark methods. All https://www.oracle.com/technetwork/articles/java/architect-benchmarking-2266277.html says is that @State will prevent "certain" optimizations from occurring.
>>>>>
>>>>> Don't get me wrong. I think there are many places where JMH benchmarks are going to be useful. I'm just not sold on it being as useful for "semi-macro" benchmarks like we have in a lot of cases.
>>>>>
>>>>> Ralph
>>>>>
>>>>>> On Jul 4, 2019, at 9:21 PM, Remko Popma <[email protected]> wrote:
>>>>>>
>>>>>> The JMH samples (https://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/) are the best place to learn about the pitfalls of java micro benchmarks and how JMH can help you avoid them.
>>>>>>
>>>>>> Remko.
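The StackWalker work discussed in the thread is the Java 9+ `java.lang.StackWalker` API. A minimal sketch of what computing "location information" involves (this is an illustration of the API, not Log4j's actual `calcLocation`): walk past your own frame and report the first caller frame, which is why a deep call stack makes the operation more expensive.

```java
// Sketch of computing caller location with the Java 9+ StackWalker API.
// A real logger must skip ALL of its own infrastructure frames, not just
// one, which is why stack depth matters for performance.
public class LocationExample {

    static String currentLocation() {
        return StackWalker.getInstance()
                // skip(1) drops currentLocation() itself; the next frame
                // is the caller we want to report.
                .walk(frames -> frames.skip(1).findFirst())
                .map(f -> f.getClassName() + "." + f.getMethodName() + ":" + f.getLineNumber())
                .orElse("unknown");
    }

    public static void main(String[] args) {
        // Prints something like "LocationExample.main:17".
        System.out.println(currentLocation());
    }
}
```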
>>>>>>
>>>>>> (Shameless plug) Every java main() method deserves http://picocli.info
>>>>>>
>>>>>>> On Jul 5, 2019, at 11:25, Ralph Goers <[email protected]> wrote:
>>>>>>>
>>>>>>> I don't understand what you mean with your last message.
>>>>>>>
>>>>>>> I don't really know the details of how JMH works, but my understanding is it does something to prevent things like dead code elimination. I haven't been able to find anything very specific about what JMH actually does, though.
>>>>>>>
>>>>>>> Ralph
>>>>>>>
>>>>>>>> On Jul 4, 2019, at 11:15 AM, Matt Sicker <[email protected]> wrote:
>>>>>>>>
>>>>>>>> And the safepoint bias thing is alleviated, amongst other noise from the JVM that would otherwise invalidate the time measurements.
>>>>>>>>
>>>>>>>>> On Thu, Jul 4, 2019 at 13:14, Matt Sicker <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> I thought JMH ran your code enough for JIT to take effect (though that's also up to your warmup iteration setting), and the
>>>>>>>>>
>>>>>>>>> On Thu, Jul 4, 2019 at 00:21, Ralph Goers <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> That said, here are the results of the FileAppenderWithLocationBenchmark from master.
>>>>>>>>>>
>>>>>>>>>> Benchmark                                      Mode   Cnt       Score      Error  Units
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j1File   thrpt   20   27597.034 ±  350.465  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2File   thrpt   20   82966.450 ± 2023.826  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2RAF    thrpt   20   78820.010 ± 1452.810  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.logbackFile  thrpt   20  138656.655 ± 6329.118  ops/s
>>>>>>>>>>
>>>>>>>>>> and here are the results with all the changes I have made.
>>>>>>>>>>
>>>>>>>>>> Benchmark                                      Mode   Cnt       Score      Error  Units
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j1File   thrpt   20   47003.919 ± 1594.591  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2File   thrpt   20  185129.606 ± 2309.118  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2RAF    thrpt   20  183346.456 ± 3103.615  ops/s
>>>>>>>>>> FileAppenderWithLocationBenchmark.logbackFile  thrpt   20  130121.612 ± 2535.387  ops/s
>>>>>>>>>>
>>>>>>>>>> Ralph
>>>>>>>>>>
>>>>>>>>>>> On Jul 3, 2019, at 8:35 PM, Ralph Goers <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Remko,
>>>>>>>>>>>
>>>>>>>>>>> I ran the existing JMH benchmarks for both no location and with location, and there is still a huge difference. The problem is I don't necessarily believe the results. Nor do I think the benchmark I have been using is totally accurate either. The problem is that, as I understand it, JMH limits or eliminates much of the optimization the Java runtime will perform. This is bad for Log4j because in some ways our code is written inefficiently and HotSpot will optimize it. OTOH, since the test is pretty much logging the same record with an incremented value from the same location, it is hard to say whether the code isn't being over-optimized. I suspect the real performance will be somewhere in the middle. What would be better is something like a RESTful service with plenty of logging, driven by JMeter, once with no location and once with it.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>>> On Jul 2, 2019, at 3:47 PM, Remko Popma <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> If memory serves correctly, the `includeLocation` attribute was introduced because it was necessary for async loggers (and it seemed not feasible to get this information from the layout pattern).
>>>>>>>>>>>>
>>>>>>>>>>>> Async loggers need to know upfront whether they need to pass location information to the consumer thread.
>>>>>>>>>>>>
>>>>>>>>>>>> If obtaining location information becomes so cheap that it's essentially free, it is fine to ignore the `includeLocation` directive and include location information anyway.
>>>>>>>>>>>>
>>>>>>>>>>>> About the performance results: what's the difference between "Log4j 3.0 with class and method" and "Log4j 3.0 withLocation()"?
>>>>>>>>>>>>
>>>>>>>>>>>> Also, I am generally cautious of custom benchmarks; any reason not to use JMH?
>>>>>>>>>>>>
>>>>>>>>>>>> Overall impressive results, congratulations!
>>>>>>>>>>>>
>>>>>>>>>>>> Remko.
>>>>>>>>>>>>
>>>>>>>>>>>> (Shameless plug) Every java main() method deserves http://picocli.info
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 3, 2019, at 7:26, Ralph Goers <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> But what do you think about includeLocation? Should it override the result of requiresLocation, or should the fact that we determined location information is required override the includeLocation setting?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Jul 2, 2019, at 2:34 PM, Matt Sicker <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> These benchmarks are impressive! Moving the point of call site calculation is a great idea.
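The "requiresLocation"/call-site idea discussed in the thread can be illustrated with a toy sketch. Only the name `requiresLocation` comes from the thread; every class here is a hypothetical stand-in, not Log4j's actual code: components advertise whether they need caller location, and the logger walks the stack once, up front, only when some configured component will use it.

```java
import java.util.List;

// Toy illustration of the requiresLocation idea: components declare whether
// they need caller location, and the stack is walked once, eagerly, only if
// at least one component will use it. All classes here are hypothetical
// stand-ins, not Log4j internals.
public class RequiresLocationSketch {

    interface Component {
        boolean requiresLocation();
        void append(String message, StackWalker.StackFrame location); // location may be null
    }

    static final class PatternWithLocation implements Component {
        public boolean requiresLocation() { return true; }
        public void append(String message, StackWalker.StackFrame location) {
            System.out.println((location == null ? "?" : location.getMethodName()) + " - " + message);
        }
    }

    static void log(List<Component> components, String message) {
        // Compute location once, high up the stack, only when needed.
        boolean needed = components.stream().anyMatch(Component::requiresLocation);
        StackWalker.StackFrame loc = needed
                ? StackWalker.getInstance().walk(s -> s.skip(1).findFirst()).orElse(null)
                : null;
        for (Component c : components) {
            c.append(message, c.requiresLocation() ? loc : null);
        }
    }

    public static void main(String[] args) {
        log(List.of(new PatternWithLocation()), "hello");
    }
}
```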
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, 2 Jul 2019 at 16:22, Ralph Goers <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have made changes to the code in the master branch to measure and improve performance. I have committed some of the changes but not the last set. I have not made the latest changes to release-2.x because they will have to be implemented a bit differently to avoid compatibility issues. The table below summarizes what I am seeing, at least on my MacBook Pro. This test basically just writes 1.5 million log statements in a loop using an equivalent format for each logging framework. Note that JUL is faster with location information than without, primarily because a custom Handler had to be created to format the log record and it performs better than the one used to log without location information.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are several things I should note about this:
>>>>>>>>>>>>>> • The first step, which is in the PR, has StackLocator.calcLocation modified to use a faster algorithm. Streams are not particularly fast. This resulted in some gains not shown here but which can be seen in https://issues.apache.org/jira/browse/LOG4J2-2644
>>>>>>>>>>>>>> • Performance is proportional to the number of stack frames that need to be traversed. JUL only had 11 whereas Log4j had 35 before it called calcLocation. Our stack is very deep because AbstractLogger and LoggerConfig have been overly refactored, along with having to go through several intermediate objects.
>>>>>>>>>>>>>> • I introduced a requiresLocation() method throughout the stack. Any component that requires location information sets this value to true. Based on this, the call to calcLocation can be made significantly higher up the stack, just before the log event is created, but only when a configured component will use it - with a few exceptions. This code is not yet checked in.
>>>>>>>>>>>>>> • If for some reason the location information isn't calculated when the event is constructed, it can still be lazily populated later using the current, slower mechanism.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As you can see, these changes result in a pretty dramatic improvement, almost to the point that the cost of obtaining location information doesn't really matter - especially in Java 11. The results in Java 11 amount to 5us per logging call with no location vs 7.5us per logging call with location info. These results could probably be improved even more if AbstractLogger was "flattened" a bit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I should point out that the logic isn't perfect. I cannot determine whether location information is required or not for the RoutingAppender since it dynamically creates Appenders at run time. Instead, I added a "requiresLocation" attribute that can be set to indicate that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> However, in testing I have encountered a unit test failure in RandomAccessFileAppenderTest. It is failing because the location information is being included even though includeLocation is set to false. I am wondering if that setting needs to be respected, since the pattern specifies including it.
>>>>>>>>>>>>>> It seems odd to me to not include it in that case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Matt Sicker <[email protected]>
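Raman's point earlier in the thread about blackholes is worth making concrete. JMH's `Blackhole.consume` exists so that a benchmark's result is not dead code; a rough stand-in outside JMH (this volatile-sink idiom is an illustration of the idea only, not how the real Blackhole is implemented) is to write each result to a volatile field:

```java
// Rough stand-in for JMH's Blackhole.consume: a write to a volatile field
// is a side effect the JIT must preserve, so the computation feeding it
// cannot be eliminated as dead code. Illustration only -- the real JMH
// Blackhole is considerably more sophisticated.
public class SinkExample {

    private static volatile long sink;

    static long work(int n) {
        long acc = 0;
        for (int i = 0; i < n; i++) {
            acc += i * 31L; // some computation worth measuring
        }
        return acc;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            // Without consuming the result, the optimizer is free to treat
            // work(100) as dead code, making the timing meaningless.
            sink = work(100);
        }
        System.out.println("avg ns/op: " + (System.nanoTime() - start) / 1_000_000.0);
    }
}
```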
