I remember looking at PatternLayout performance; I reported my findings here, and hopefully they're still useful: https://issues.apache.org/jira/browse/LOG4J2-930
If %d is used in the pattern, does the FixedDateFormat get used?

> On Aug 28, 2021, at 4:33, Ralph Goers <[email protected]> wrote:
>
> All of that agrees with my observations as well.
>
> Ralph
>
>> On Aug 27, 2021, at 12:23 PM, Carter Kozak <[email protected]> wrote:
>>
>> I've identified a few things that seem impactful, but bear in mind that I haven't begun to drill down into them yet. I plan to file individual tickets and investigate in greater depth later on:
>>
>> 1. Formatting the date is expensive. The process of actually formatting a value is reasonable; however, using a precise clock appears to cause cache misses even when the pattern produces the same value (microseconds/nanoseconds aren't included in the result). Using the SystemMillisClock improves our throughput. I imagine some part of that improvement comes from currentTimeMillis() returning a primitive long rather than Clock.instant() returning an object.
>>
>> 2. Setting 'log4j2.enable.direct.encoders' to false improves performance; otherwise we see a lot of time spent in the StringBuilderEncoder, but I haven't had time to investigate that yet. Without direct encoders we allocate a great deal more byte arrays, but they're used across a small enough scope that C2 may be able to optimize them away.
>>
>> 3. Setting log4j2.formatMsgNoLookups=true (or using the message pattern '%m{nolookups}') saves us some time scanning through produced messages -- I've had this disabled at work for a while.
>>
>> 4. [in progress] If I implement the full layout directly in Java (writing directly to a StringBuilder based on the event in a custom Layout instead of using PatternLayout), combined with [1] and [2] above, performance is much better than Logback; however, using PatternLayout with the steps above we still fall just a bit short. I still need to read through PatternLayout and the individual pattern converters to understand why, but there's definitely some headroom hiding somewhere that we'll be able to reclaim.
>>
>> -ck
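For anyone who wants to try that same combination, here is a rough sketch of points 1-3 applied as system properties before Log4j initializes. The class name is made up, and the clock property key is my recollection of the newer-style name (older releases spell it log4j.Clock), so treat the exact keys as assumptions and check them against the system properties page for the version under test.

    // Rough sketch only: applies the tunables from points 1-3 above. Must run
    // before the first LogManager.getLogger() call so the properties are seen
    // at initialization. Property keys are assumptions -- verify against the
    // Log4j 2 system properties documentation for your version.
    public final class Log4jPerfTuning {

        public static void apply() {
            // 1. Millisecond-precision clock instead of the precise Clock.instant() path.
            System.setProperty("log4j2.clock", "SystemMillisClock");
            // 2. Bypass the StringBuilderEncoder direct-encoder path.
            System.setProperty("log4j2.enable.direct.encoders", "false");
            // 3. Skip message lookups (the global equivalent of %m{nolookups}).
            System.setProperty("log4j2.formatMsgNoLookups", "true");
        }

        private Log4jPerfTuning() {}
    }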
>>> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
>>>
>>> My pleasure, I enjoy digging into this sort of performance comparison; the toughest piece right now is balancing the investigation with active projects at work. I completely agree about the importance of getting this resolved quickly, and I intend to continue investigating.
>>>
>>> Re Loom: currently we have a few places which reuse shared objects, potentially large ones (think the StringBuilder instances on ReusableLogEvent/StringLayout/etc.), and those are bounded by the number of application threads you have (which is relatively small compared to the thread counts Loom claims, which are admittedly large). More importantly, today we tend to reuse almost every thread in a ThreadPoolExecutor of some kind, whereas the Loom model involves creating a new virtual thread for each task. That means a thread-local used as a cache will have a much lower hit rate, since reuse will only occur within a single task, and only if thread-locals are enabled at all. I hope that made sense.
>>>
>>> -ck
>>>
>>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
>>>> Carter, thanks so much for your great effort and attention on this issue. *I personally find this the most important thing the project needs to address with urgency right now.* Ceki is warming up for a new Logback (and SLF4J?) release. I am pretty sure he will do his part in PR, particularly in the context of benchmarking his new work against competitors. Once these results are out, whether they make sense or not, the internet will take them as the ultimate truth, and it will require a tremendous effort to change people's minds. In conclusion, please prioritize this appropriately and keep us posted. Don't hesitate to let us/me know if there is anything (drinks, coffee, lunch?) we can help with.
>>>>
>>>> Regarding your remarks about Loom, I don't think I follow you there. Our thread-locals are stateless, hence they should be perfectly fine while running in virtual threads too. (I am totally against blindly using TLs, but that is another discussion. I would rather have a recycler concept similar to the one we have in JsonTemplateLayout; I think there was already a mailing list thread about this.)
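To make that recycler idea concrete, here is a minimal sketch of the kind of pooling that does not depend on thread identity, so it behaves the same on platform and virtual threads. The class is invented for illustration and is not the JsonTemplateLayout recycler API.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Illustration only: a bounded pool of StringBuilders shared across threads,
    // instead of one StringBuilder cached per thread in a ThreadLocal.
    public final class StringBuilderRecycler {

        private final BlockingQueue<StringBuilder> pool;
        private final int maxRetainedCapacity;

        public StringBuilderRecycler(int poolSize, int maxRetainedCapacity) {
            this.pool = new ArrayBlockingQueue<>(poolSize);
            this.maxRetainedCapacity = maxRetainedCapacity;
        }

        /** Borrow a StringBuilder, allocating a fresh one when the pool is empty. */
        public StringBuilder acquire() {
            StringBuilder sb = pool.poll();
            return sb != null ? sb : new StringBuilder(256);
        }

        /** Return a StringBuilder; oversized or surplus instances are left for GC. */
        public void release(StringBuilder sb) {
            if (sb.capacity() <= maxRetainedCapacity) {
                sb.setLength(0);
                pool.offer(sb); // non-blocking: if the pool is full, just drop it
            }
        }
    }

Whether a shared pool like that actually beats thread-locals under Loom is exactly the sort of thing the benchmarks in this thread should answer.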
>>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <[email protected]> wrote:
>>>>
>>>>> Yes, I'm still looking into this.
>>>>>
>>>>> I agree that the zero-garbage code is difficult to follow, and Loom virtual threads will have less reuse, so our thread-locals will create more overhead than they're worth in many cases. Fingers crossed for Valhalla to land before Loom, but I haven't been following updates from either project very closely lately, and Loom seems farther along.
>>>>>
>>>>> I think the issue with multiple buffers these days is that we have Optane and NVMe devices which can be _nearly_ as fast as main memory, so copying contents can be problematic. In fully async mode (with some caveats around queue-full-policy configuration points and nested async appenders) we don't need locking because the appender interactions are single-threaded.
>>>>>
>>>>> For what it's worth, in my testing I didn't see much difference between the 8 kB buffer and the 256 kB buffer on a system with a relatively fast NVMe drive; however, I did discover and resolve LOG4J2-3150 (incorrect default buffer size in the RandomAccessFileAppender).
>>>>>
>>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
>>>>>> While that is true, it does mean that we are locking higher up in the call stack than we need to be.
>>>>>>
>>>>>> Ralph
>>>>>>
>>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <[email protected]> wrote:
>>>>>>>
>>>>>>> I'm fairly certain the JIT will optimize out the locking operations on the nested synchronized calls after a while. I'm not sure how soon into the performance tests that would happen.
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers <[email protected]> wrote:
>>>>>>>>
>>>>>>>> So are you continuing to look at this? Frankly, the work on the zero-gc stuff made this very complicated. I am not sure I can even follow the logic anymore. I also noticed that many of the methods dealing with byte buffers are synchronized. I am not sure why, as only the method that moves data really needs to be.
>>>>>>>>
>>>>>>>> Using multiple buffers would make sense if you are filling one from the various patterns while it is unlocked and then only blocking when writing to the output stream buffer.
>>>>>>>>
>>>>>>>> For what it's worth, Ceki updated his performance page again yesterday to change the wording, but I am pretty sure he didn't change the Log4j2 configuration to use a 256K buffer to match Logback. And the page still makes it look like he is testing the performance of asynchronous processing when it is really testing the performance of the file appenders with a blocking queue in front of them.
>>>>>>>>
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> I also tried that with similar results, which leads me to believe we're spending too much time copying between buffers.
>>>>>>>>>
>>>>>>>>> We've proven that log4j can get LogEvents to the appender very quickly and efficiently.
>>>>>>>>>
>>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder. AbstractStringBuilder used to store a char array in Java 8; however, with JEP 254 (compact strings) it was updated, similarly to String, to use a byte array plus a byte representing the encoding.
>>>>>>>>>
>>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination (heap) ByteBuffer in StringBuilderEncoder using the following steps:
>>>>>>>>> 1. Copy the StringBuilder value into a CharBuffer (this would have been a direct copy from char[] to char[] in Java 8, but requires conversion from byte[] + coder now).
>>>>>>>>> 2. Encode data from the CharBuffer into the StringBuilderEncoder ByteBuffer.
>>>>>>>>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer into the ByteBufferDestination ByteBuffer (which is also always a heap buffer as far as I can tell).
>>>>>>>>> 4. Write the ByteBufferDestination to the FileOutputStream/RandomAccessFile by extracting the heap byte array.
>>>>>>>>>
>>>>>>>>> We're copying bytes around a lot, and I think older JVMs would have optimized out a bit of the copying because the types exactly matched. It's unclear why we need to copy through two separate ByteBuffers on the way from StringBuilderEncoder into ByteBufferDestination.
>>>>>>>>>
>>>>>>>>> I've generally had better I/O performance using direct byte buffers than heap buffers, although that experience largely comes from work on web servers; I haven't tested whether it holds true for file I/O. If we operate directly upon a FileChannel, we can write direct buffers straight to files. The DirectByteBuffer memory is guaranteed to be contiguous, so it should have better performance around copying and manipulation as well.
>>>>>>>>>
>>>>>>>>> -ck
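To sketch what that last idea might look like, here is a minimal, single-threaded illustration of encoding a StringBuilder straight into a reused direct ByteBuffer and writing it to a FileChannel. The class name and buffer size are invented for the example, and it glosses over the locking and error handling a real appender manager needs, but it shows which of the copy steps above would disappear.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustration only, not the Log4j implementation: encode the layout's
    // StringBuilder directly into a reused direct ByteBuffer and hand it to a
    // FileChannel, skipping the intermediate heap-buffer copies in steps 1-3.
    final class DirectChannelWriter implements AutoCloseable {

        private final FileChannel channel;
        private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);

        DirectChannelWriter(Path path) throws IOException {
            this.channel = FileChannel.open(path,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        void write(StringBuilder text) throws IOException {
            CharBuffer chars = CharBuffer.wrap(text); // no char[] copy of the StringBuilder
            encoder.reset();
            boolean done = false;
            while (!done) {
                // Encode as much as fits, drain the direct buffer to the channel, repeat.
                done = !encoder.encode(chars, buffer, true).isOverflow();
                buffer.flip();
                while (buffer.hasRemaining()) {
                    channel.write(buffer);
                }
                buffer.clear();
            }
            // UTF-8 holds no internal encoder state, so encoder.flush() is omitted here.
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }

Whether that actually beats the current StringBuilderEncoder path on file I/O is exactly what needs measuring.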
>>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
>>>>>>>>>> Ralph, maybe a dumb idea but... Can't we simply write to /dev/null to avoid the hardware's influence on the test results?
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Did you add the no-op appender to Ceki's project? Or are you using our NullAppender? I have my doubts about using that NullAppender as it does nothing - not even render the Layout, so it may get completely optimized away.
>>>>>>>>>>>
>>>>>>>>>>> I'd still like to know what kind of disks Remko did his original testing on. The page says he got 18.495 million messages per second. Each message would be about 130 characters long from what I can tell. That means the destination has to be able to support at least about 2.4 GB per second.
>>>>>>>>>>>
>>>>>>>>>>> In Ceki's test the message is only 23 characters long (vs 100 in Remko's). That means each rendered message is only going to be about 60 characters long. For me the async test is writing at about 1700 ops/ms, which equates to 102 MB/s, yet I am still seeing the queue backed up. So how we are writing must be terribly inefficient. Logback is getting about 2000 ops/ms, which is still only 120 MB/s.
>>>>>>>>>>>
>>>>>>>>>>> Your test below would be generating about 197 MB/s.
>>>>>>>>>>>
>>>>>>>>>>> One thing I did notice that made a big difference is that Ceki has Logback configured to use a buffer size of 256K while Log4j2 uses the default of 8K. Setting Log4j2 to 256K considerably improved the performance.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If we would move the appender performance aside, am I right to conclude that the entire complex async framework built atop Disruptor with all its whistles and bells is yielding a diminishing performance gain compared to a simple off-the-shelf blocking queue?
>>>>>>>>>>>>
>>>>>>>>>>>> I haven't seen any data that would give me such an impression -- the file appender itself is slower than the thread producing events, which appears to limit our throughput. Even then, log4j2 does much better in my testing on a Linux workstation with 20+ producer threads compared to Logback. Any time the queue is constantly full, performance will suffer (and this applies to standard blocking queues as well as to the Disruptor; however, the Disruptor will fare worse when the queue is full). The results using a single thread are roughly reproducible on the non-async test, so I think the problem is somewhere between the layout and the appender/manager.
>>>>>>>>>>>>
>>>>>>>>>>>> Results running locally with no-op appender implementations:
>>>>>>>>>>>>
>>>>>>>>>>>> 1 thread:
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
>>>>>>>>>>>>
>>>>>>>>>>>> 20 threads:
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
>>>>>
>>>>> -ck
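For what it's worth, Ralph's MB/s figures line up with a simple ops/ms times bytes-per-line estimate, assuming his roughly 60-byte rendered line for Ceki's test and 130 bytes for Remko's:

    3285.743 ops/ms  ~= 3,285,743 lines/s x ~60 B   ~= 197 MB/s   (the "about 197 MB/s" above)
    2000 ops/ms      ~= 2,000,000 lines/s x ~60 B   ~= 120 MB/s
    1700 ops/ms      ~= 1,700,000 lines/s x ~60 B   ~= 102 MB/s
    18.495 million lines/s          x ~130 B        ~= 2.4 GB/s   (Remko's original numbers)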
