This may be of interest:
https://www.researchgate.net/publication/333825812_What's_Wrong_With_My_Benchmark_Results_Studying_Bad_Practices_in_JMH_Benchmarks



> On Jul 6, 2019, at 5:49, Raman Gupta <[email protected]> wrote:
> 
> I'm no JMH expert... but, there is a lot more going on there than just
> iterations to avoid the extra time for JIT compilation on warmup.
> Check out https://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf
> for an idea of the kinds of problems naive timer-only benchmarks
> encounter, and the different ways in which JMH solves them. There are
> other good presentations about JMH at https://shipilev.net/ as well.
> 
> Secondly, JMH isn't de-optimizing the JVM (a la client JVM)
> willy-nilly. It just provides an API that allows you to measure what
> you (hopefully) think you are measuring, e.g. blackholes to consume
> results that the JVM would otherwise optimize away because they are
> unused, and so forth. AFAIK, it doesn't apply these "de-optimizations"
> on its own -- it just provides an API for you to use as needed.
> 
> Thirdly, I don't believe JMH does anything (by default) to optimize
> away GC or anything like that -- it in fact allows you to consider the
> effect of GC by running lots of iterations, so that the effect of GC
> *is* taken into account in the benchmark. It does have a (non-default)
> option to run a GC before each iteration for specific use cases (e.g.
> benchmarks which generate lots of garbage). It also has a GC profiler
> to understand the impact of GC on your results (see
> https://github.com/Valloric/jmh-playground/blob/master/src/jmh/java/org/openjdk/jmh/samples/JMHSample_35_Profilers.java).
> 
> Regards,
> Raman
> 
> 
>> On Fri, Jul 5, 2019 at 4:06 PM Ralph Goers <[email protected]> 
>> wrote:
>> 
>> Actually, I would disagree. First, I don’t think the first thousand or so 
>> iterations running slow would matter much with 1.5 million iterations. Let’s 
>> say the first thousand take 10 times longer than the rest of the iterations. 
>> That is still only the equivalent of 10,000 iterations. That is a blip in 
>> 1.5 million.
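Ralph's arithmetic can be sanity-checked in a few lines; the 1,000-iteration and 10x figures below are his stated assumptions, not measurements:

```java
public class WarmupBlip {
    public static void main(String[] args) {
        long totalIterations = 1_500_000L;  // the full benchmark loop
        long warmupIterations = 1_000L;     // assumed iterations before the JIT kicks in
        long slowdownFactor = 10L;          // assumed slowdown for those iterations

        // Time spent in the slow iterations, expressed as a count of
        // normal-speed iterations:
        long equivalentIterations = warmupIterations * slowdownFactor;

        // Their share of the whole run:
        double sharePercent = 100.0 * equivalentIterations / totalIterations;

        System.out.println("Equivalent normal iterations: " + equivalentIterations);
        System.out.printf("Share of total run: %.2f%%%n", sharePercent);
    }
}
```

At roughly 0.67% of the total run, the warmup cost is indeed a blip at this iteration count.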
>> Second, measuring the impact of garbage collection should be a part of the 
>> benchmark if your code is generating garbage.
>> 
>> Again, I am not saying JMH doesn’t have its place, but I have to think 
>> something weird is going on when JMH shows the no-location benchmark to 
>> be about 4 times faster than the location benchmark, while the simplistic 
>> loop test shows the elapsed time when including the location information to 
>> be only about twice that of not including it. I can tell that the call to 
>> StackWalker isn’t being optimized away because it still shows up in YourKit 
>> as the most expensive operation. And when I turn on immediateFlush, flushing 
>> the file shows up in YourKit as well.
>> 
>> Ralph
>> 
>>> On Jul 5, 2019, at 12:18 PM, Matt Sicker <[email protected]> wrote:
>>> 
>>> What I mean is that JMH is written to avoid including
>>> behind-the-scenes variability in code execution: the change from
>>> startup until the code has been compiled to native code, garbage
>>> collection, synchronization points for threads (safepoints), etc. All
>>> these things become variable noise that affects benchmarks when the
>>> code you're testing executes in an especially short period of time
>>> (e.g., appending a log event is pretty fast, relatively speaking).
>>> 
>>> Say you wrote a naive benchmark like this:
>>> 
>>> long start = System.nanoTime();
>>> for (int i = 0; i < 1_000_000; i++) {
>>>     runBenchmark();
>>> }
>>> long end = System.nanoTime();
>>> double averageExecutionNanos = (end - start) / 1_000_000.0;
>>> 
>>> This will give incorrect results because the first thousand or so
>>> iterations will be run in the interpreter or by the first-tier (C1)
>>> compiler, while later iterations will run code from the second-tier
>>> (C2) compiler, which is significantly faster. So then you might split that up
>>> into a warmup block and a measurement block. You'll still be including
>>> the execution time spent in garbage collection, thread
>>> synchronization, JIT compilation, etc., that isn't relevant to the
>>> benchmark for executing the code over time.
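That split might look like the following sketch; runBenchmark is a stand-in workload, the static sink field is a crude version of JMH's Blackhole, and even this improved version still measures the GC, safepoint, and residual-compilation noise Matt describes:

```java
public class NaiveBenchmark {
    private static long sink; // crude "blackhole": stops the JIT from discarding the work

    static double measure() {
        // Warmup block: give the JIT a chance to compile the hot path first.
        for (int i = 0; i < 100_000; i++) {
            runBenchmark();
        }

        // Measurement block: still includes GC pauses, safepoints, and any
        // remaining compilation, which JMH would account for separately.
        long start = System.nanoTime();
        for (int i = 0; i < 1_000_000; i++) {
            runBenchmark();
        }
        long end = System.nanoTime();
        return (end - start) / 1_000_000.0; // average nanos per iteration
    }

    private static void runBenchmark() {
        long sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += i * i;
        }
        sink = sum;
    }

    public static void main(String[] args) {
        System.out.println("avg ns/op: " + measure());
    }
}
```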
>>> 
>>>> On Fri, 5 Jul 2019 at 01:25, Ralph Goers <[email protected]> wrote:
>>>> 
>>>> Well, I can infer it from the examples but it doesn’t come close to 
>>>> explaining some of the things we use it for. I would argue that logging a 
>>>> line whose value changes on every call, even if it is just an incremented 
>>>> integer, that writes to a file is more than a micro-benchmark, and I truly 
>>>> want to know how it is going to perform in real life, not with all the 
>>>> optimizations disabled.  When I first started Log4j 2 I noticed right away 
>>>> how much slower it ran when Java ran in client mode, which was much more 
>>>> common back then. That is essentially what I think JMH is doing with some 
>>>> of these benchmarks. So while it may be useful in determining that one 
>>>> algorithm is better than another when the code has not been optimized I 
>>>> wouldn’t trust that you will see the same behavior when running in a 
>>>> “real” environment.
>>>> 
>>>> For example, in testing calculating the location I doubt the JIT compiler 
>>>> can optimize away the call to the native methods involved in grabbing the 
>>>> stack frames. If that is the case then it can’t optimize away the methods 
>>>> to walk the stack frames, although it can certainly “clean them up”. 
>>>> Likewise, it can’t avoid writing log events to files so I don’t see how it 
>>>> can eliminate the work to call the methods to do it, especially if the log 
>>>> line isn’t constant. That said, it isn’t clear to me even after looking at 
>>>> the examples how much JIT stuff JMH is going to allow or prevent. I do see 
>>>> annotations that can be used to prevent inlining but it isn’t clear to me 
>>>> how JMH calls the benchmark methods. All 
>>>> https://www.oracle.com/technetwork/articles/java/architect-benchmarking-2266277.html
>>>> says is that @State will prevent “certain” optimizations from occurring.
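For concreteness, the StackWalker usage under discussion looks roughly like this minimal sketch; the CallerLocator name and the single-frame skip are illustrative only, not Log4j's actual StackLocator:

```java
import java.util.Optional;

public class CallerLocator {
    static StackWalker.StackFrame locate() {
        // RETAIN_CLASS_REFERENCE is required to later call getDeclaringClass()
        // on the returned frame.
        StackWalker walker =
                StackWalker.getInstance(StackWalker.Option.RETAIN_CLASS_REFERENCE);
        // The first frame in the stream is this method itself; a real logger
        // would instead skip every frame belonging to the logging facade.
        Optional<StackWalker.StackFrame> caller =
                walker.walk(frames -> frames.skip(1).findFirst());
        return caller.orElseThrow();
    }

    public static void main(String[] args) {
        StackWalker.StackFrame frame = locate();
        System.out.println(frame.getClassName() + "." + frame.getMethodName()
                + ":" + frame.getLineNumber());
    }
}
```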
>>>> 
>>>> Don’t get me wrong. I think there are many places where JMH benchmarks are 
>>>> going to be useful. I’m just not sold on it being as useful for 
>>>> “semi-macro” benchmarks like we have in a lot of cases.
>>>> 
>>>> Ralph
>>>> 
>>>>> On Jul 4, 2019, at 9:21 PM, Remko Popma <[email protected]> wrote:
>>>>> 
>>>>> The JMH samples 
>>>>> (https://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/)
>>>>>  are the best place to learn about the pitfalls of java micro benchmarks 
>>>>> and how JMH can help you avoid them.
>>>>> 
>>>>> Remko.
>>>>> 
>>>>> (Shameless plug) Every java main() method deserves http://picocli.info
>>>>> 
>>>>>> On Jul 5, 2019, at 11:25, Ralph Goers <[email protected]> wrote:
>>>>>> 
>>>>>> I don’t understand what you mean with you last message.
>>>>>> 
>>>>>> I don’t really know the details of how JMH works but my understanding is 
>>>>>> it does something to prevent things like dead code elimination. I 
>>>>>> haven’t been able to find anything very specific about what JMH actually 
>>>>>> does though.
>>>>>> 
>>>>>> Ralph
>>>>>> 
>>>>>>> On Jul 4, 2019, at 11:15 AM, Matt Sicker <[email protected]> wrote:
>>>>>>> 
>>>>>>> And the safepoint bias issue is alleviated, along with other noise from the
>>>>>>> JVM that would otherwise invalidate the time measurements.
>>>>>>> 
>>>>>>>> On Thu, Jul 4, 2019 at 13:14, Matt Sicker <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> I thought JMH ran your code enough for JIT to take effect (though 
>>>>>>>> that’s
>>>>>>>> also up to your warmup iteration setting), and the
>>>>>>>> 
>>>>>>>> On Thu, Jul 4, 2019 at 00:21, Ralph Goers <[email protected]>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> That said, here are the results of the 
>>>>>>>>> FileAppenderWithLocationBenchmark
>>>>>>>>> from master.
>>>>>>>>> 
>>>>>>>>> Benchmark                                       Mode  Cnt       Score      Error  Units
>>>>>>>>> FileAppenderWithLocationBenchmark.log4j1File   thrpt   20   27597.034 ±   350.465  ops/s
>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2File   thrpt   20   82966.450 ±  2023.826  ops/s
>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2RAF    thrpt   20   78820.010 ±  1452.810  ops/s
>>>>>>>>> FileAppenderWithLocationBenchmark.logbackFile  thrpt   20  138656.655 ±  6329.118  ops/s
>>>>>>>>> 
>>>>>>>>> and here are the results with all the changes I have made.
>>>>>>>>> 
>>>>>>>>> Benchmark                                       Mode  Cnt       Score      Error  Units
>>>>>>>>> FileAppenderWithLocationBenchmark.log4j1File   thrpt   20   47003.919 ±  1594.591  ops/s
>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2File   thrpt   20  185129.606 ±  2309.118  ops/s
>>>>>>>>> FileAppenderWithLocationBenchmark.log4j2RAF    thrpt   20  183346.456 ±  3103.615  ops/s
>>>>>>>>> FileAppenderWithLocationBenchmark.logbackFile  thrpt   20  130121.612 ±  2535.387  ops/s
>>>>>>>>> 
>>>>>>>>> Ralph
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Jul 3, 2019, at 8:35 PM, Ralph Goers <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Remko,
>>>>>>>>>> 
>>>>>>>>>> I ran the existing JHM benchmarks for both no location and with
>>>>>>>>> location and there is still a huge difference. The problem is I don’t
>>>>>>>>> necessarily believe the results. Nor do I think the benchmark I have 
>>>>>>>>> been
>>>>>>>>> using is totally accurate either. The problem is that, as I 
>>>>>>>>> understand it,
>>>>>>>>> JMH limits or eliminates much of the optimization the Java runtime 
>>>>>>>>> will
>>>>>>>>> perform. This is bad for Log4j because in some ways our code is 
>>>>>>>>> written
>>>>>>>>> inefficiently and hotspot will optimize it. OTOH, since the test is 
>>>>>>>>> pretty
>>>>>>>>> much logging the same record with an incremented value from the same
>>>>>>>>> location it is hard to say whether the code isn’t being over 
>>>>>>>>> optimized. I
>>>>>>>>> suspect the real performance will be somewhere in the middle. What would be
>>>>>>>>> better is something like a RESTful service with plenty of logging driven by
>>>>>>>>> JMeter, once with no location and once with it.
>>>>>>>>>> 
>>>>>>>>>> Ralph
>>>>>>>>>> 
>>>>>>>>>>> On Jul 2, 2019, at 3:47 PM, Remko Popma <[email protected]> 
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> If memory serves correctly, the `includeLocation` was introduced
>>>>>>>>> because it was necessary for async loggers (and it seemed not 
>>>>>>>>> feasible to
>>>>>>>>> get this information from the layout pattern).
>>>>>>>>>>> 
>>>>>>>>>>> Async loggers need to know upfront whether they need to pass 
>>>>>>>>>>> location
>>>>>>>>> information to the consumer thread.
>>>>>>>>>>> 
>>>>>>>>>>> If obtaining location information becomes so cheap that it’s
>>>>>>>>> essentially free, it is fine to ignore the `includeLocation` 
>>>>>>>>> directive and
>>>>>>>>> include location information anyway.
>>>>>>>>>>> 
>>>>>>>>>>> About the performance results: what’s the difference between “Log4j
>>>>>>>>> 3.0 with class and method” and “Log4j 3.0 withLocation()”?
>>>>>>>>>>> 
>>>>>>>>>>> Also I am generally cautious of custom benchmarks; any reason not to
>>>>>>>>> use JMH?
>>>>>>>>>>> 
>>>>>>>>>>> Overall impressive results, congratulations!
>>>>>>>>>>> 
>>>>>>>>>>> Remko.
>>>>>>>>>>> 
>>>>>>>>>>> (Shameless plug) Every java main() method deserves 
>>>>>>>>>>> http://picocli.info
>>>>>>>>>>> 
>>>>>>>>>>>> On Jul 3, 2019, at 7:26, Ralph Goers <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> But what do you think about includeLocation. Should it override the
>>>>>>>>> result of requiresLocation or should the fact that we determined 
>>>>>>>>> location
>>>>>>>>> information is required override the includeLocation setting?
>>>>>>>>>>>> 
>>>>>>>>>>>> Ralph
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jul 2, 2019, at 2:34 PM, Matt Sicker <[email protected]> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> These benchmarks are impressive! Moving the point of call site
>>>>>>>>> calculation is a great idea.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, 2 Jul 2019 at 16:22, Ralph Goers <[email protected]> wrote:
>>>>>>>>>>>>> I have made changes to the code in the master branch to measure and
>>>>>>>>> improve performance. I have committed some of the changes but not the 
>>>>>>>>> last
>>>>>>>>> set. I have not made the latest changes to release-2.x because they 
>>>>>>>>> will
>>>>>>>>> have to be implemented a bit differently to avoid compatibility 
>>>>>>>>> issues. The
>>>>>>>>> table below summarizes what I am seeing, at least on my MacBook Pro.  
>>>>>>>>> This
>>>>>>>>> test just basically writes 1.5 million log statements in a loop using 
>>>>>>>>> an
>>>>>>>>> equivalent format for each logging framework. Note that JUL is faster 
>>>>>>>>> with
>>>>>>>>> location information than without primarily because a custom Handler 
>>>>>>>>> had to
>>>>>>>>> be created to format the log record and it performs better than the 
>>>>>>>>> one
>>>>>>>>> used to log without location information.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> There are several things I should note about this:
>>>>>>>>>>>>> • The first step, which is in the PR, has 
>>>>>>>>>>>>> StackLocator.calcLocation
>>>>>>>>> modified to use a faster algorithm. Streams are not particularly 
>>>>>>>>> fast. This
>>>>>>>>> resulted in some gains not shown here but can be seen in
>>>>>>>>> https://issues.apache.org/jira/browse/LOG4J2-2644
>>>>>>>>>>>>> • Performance is proportional to the number of stack frames that
>>>>>>>>> need to be traversed. JUL only had 11 whereas Log4j had 35 before it called
>>>>>>>>> calcLocation. Our stack is very deep because AbstractLogger and
>>>>>>>>> LoggerConfig have been overly refactored along with having to go through
>>>>>>>>> several intermediate objects.
>>>>>>>>>>>>> • I introduced a requiresLocation() method throughout the stack.
>>>>>>>>> Any component that requires location information sets this value to 
>>>>>>>>> true.
>>>>>>>>> Based on this, the call to calcLocation can be made significantly 
>>>>>>>>> higher up
>>>>>>>>> the stack just before the log event is created but only when a 
>>>>>>>>> configured
>>>>>>>>> component will use it - with a few exceptions. This code is not yet 
>>>>>>>>> checked
>>>>>>>>> in.
>>>>>>>>>>>>> • If for some reason the location information isn’t calculated 
>>>>>>>>>>>>> when
>>>>>>>>> the event is constructed it can still be lazily populated later using 
>>>>>>>>> the
>>>>>>>>> current, slower mechanism.
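The propagation scheme described in these bullets might be sketched as follows; LocationAware and PatternLayoutStub are hypothetical names for illustration only, not Log4j's actual API:

```java
import java.util.List;

// Hypothetical: each configured component reports whether it needs
// location data, so the logger can compute the caller's location
// eagerly, high in the stack, only when something will actually use it.
interface LocationAware {
    boolean requiresLocation();
}

class PatternLayoutStub implements LocationAware {
    private final boolean usesLocationConversions;

    PatternLayoutStub(boolean usesLocationConversions) {
        this.usesLocationConversions = usesLocationConversions;
    }

    @Override
    public boolean requiresLocation() {
        // True when the pattern contains location conversions such as %M or %L.
        return usesLocationConversions;
    }
}

public class LocationDecision {
    // True if any component in the pipeline needs location information.
    static boolean anyRequiresLocation(List<? extends LocationAware> components) {
        return components.stream().anyMatch(LocationAware::requiresLocation);
    }

    public static void main(String[] args) {
        List<PatternLayoutStub> layouts = List.of(
                new PatternLayoutStub(false),
                new PatternLayoutStub(true));
        System.out.println("compute location eagerly: " + anyRequiresLocation(layouts));
    }
}
```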
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As you can see these changes result in a pretty dramatic
>>>>>>>>> improvement, almost to the point that the cost of obtaining location
>>>>>>>>> information doesn’t really matter - especially in Java 11. The 
>>>>>>>>> results in
>>>>>>>>> Java 11 amount to 5us per logging call with no location vs 7.5us per
>>>>>>>>> logging call with location info. These results could probably be 
>>>>>>>>> improved
>>>>>>>>> even more if AbstractLogger was “flattened” a bit.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I should point out that the logic isn’t perfect. I cannot 
>>>>>>>>>>>>> determine
>>>>>>>>> whether location information is required or not for the 
>>>>>>>>> RoutingAppender
>>>>>>>>> since it dynamically creates Appenders at run time. Instead, I added a
>>>>>>>>> “requiresLocation” attribute that can be set to indicate that.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> However, in testing I have encountered a unit test failure in
>>>>>>>>> RandomAccessFileAppenderTest. It is failing because the location
>>>>>>>>> information is being included even though includeLocation is set to 
>>>>>>>>> false.
>>>>>>>>> I am wondering if that setting needs to be respected since the pattern
>>>>>>>>> specifies including it. It seems odd to me to not include it in that 
>>>>>>>>> case.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thoughts?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Ralph
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Matt Sicker <[email protected]>
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>> Matt Sicker <[email protected]>
>>>>>>>> 
>>>>>>> --
>>>>>>> Matt Sicker <[email protected]>
>>>>>> 
>>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Matt Sicker <[email protected]>
>> 
