I remember looking at PatternLayout performance; I reported my findings here, and hopefully they're still useful: https://issues.apache.org/jira/browse/LOG4J2-930
If %d is used in the pattern, does the FixedDateFormat get used?

> On Aug 28, 2021, at 4:33, Ralph Goers <[email protected]> wrote:
>
> All of that agrees with my observations as well.
>
> Ralph
>
>> On Aug 27, 2021, at 12:23 PM, Carter Kozak <[email protected]> wrote:
>>
>> I've identified a few things that seem impactful, but bear in mind that I haven't begun to drill down into them yet. I plan to file individual tickets and investigate in greater depth later on:
>>
>> 1. Formatting the date is expensive. The process of actually formatting a value is reasonable; however, using a precise clock appears to cause cache misses even when the pattern produces the same value (microseconds/nanoseconds aren't included in the result). Using the SystemMillisClock improves our throughput. I imagine some part of that improvement comes from currentTimeMillis() returning a primitive long rather than Clock.instant() returning an object.
>>
>> 2. Setting 'log4j2.enable.direct.encoders' to false improves performance; otherwise we see a lot of time spent in the StringBuilderEncoder, but I haven't had time to investigate that yet. Without direct encoders we allocate a great deal more byte arrays, but they're used across a small enough scope that C2 may be able to optimize them away.
>>
>> 3. Setting log4j2.formatMsgNoLookups=true (or using the message pattern '%m{nolookups}') saves us some time scanning through produced messages -- I've had this disabled at work for a while.
>>
>> 4. [in progress] If I implement the full layout directly in Java (writing directly to a StringBuilder based on the event in a custom Layout instead of using PatternLayout), combined with [1] and [2] above, performance is much better than Logback; however, using PatternLayout with the steps above we still fall just a bit short. I still need to read through PatternLayout and the individual pattern converters to understand why, but there's definitely some headroom hiding somewhere that we'll be able to reclaim.
>>
>> -ck
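For anyone who wants to try that same combination, here is a rough sketch of points 1-3 applied as system properties before Log4j initializes. The class name is made up, and the clock property key is my recollection of the newer-style name (older releases spell it log4j.Clock), so treat the exact keys as assumptions and check them against the system properties page for the version under test.

    // Rough sketch only: applies the tunables from points 1-3 above. Must run
    // before the first LogManager.getLogger() call so the properties are seen
    // at initialization. Property keys are assumptions -- verify against the
    // Log4j 2 system properties documentation for your version.
    public final class Log4jPerfTuning {

        public static void apply() {
            // 1. Millisecond-precision clock instead of the precise Clock.instant() path.
            System.setProperty("log4j2.clock", "SystemMillisClock");
            // 2. Bypass the StringBuilderEncoder direct-encoder path.
            System.setProperty("log4j2.enable.direct.encoders", "false");
            // 3. Skip message lookups (the global equivalent of %m{nolookups}).
            System.setProperty("log4j2.formatMsgNoLookups", "true");
        }

        private Log4jPerfTuning() {}
    }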
>>> On Fri, Aug 27, 2021, at 11:23, Carter Kozak wrote:
>>>
>>> My pleasure, I enjoy digging into this sort of performance comparison; the toughest piece right now is balancing the investigation with active projects at work. I completely agree about the importance of getting this resolved quickly, and I intend to continue investigating.
>>>
>>> Re Loom: currently we have a few places which reuse shared objects, potentially large ones (think the StringBuilder instances on ReusableLogEvent/StringLayout/etc.), and those are bounded by the number of application threads you have (which is relatively small compared to the thread counts Loom claims, which are admittedly large). More importantly, today we tend to reuse almost every thread in a ThreadPoolExecutor of some kind, whereas the Loom model involves creating a new virtual thread for each task. That means a thread-local used as a cache will have a much lower hit rate, since reuse will only occur within a single task, and only if thread-locals are enabled at all. I hope that made sense.
>>>
>>> -ck
>>>
>>> On Fri, Aug 27, 2021, at 08:18, Volkan Yazıcı wrote:
>>>> Carter, thanks so much for your great effort and attention on this issue. *I personally find this the most important thing the project needs to address with urgency right now.* Ceki is warming up for a new Logback (and SLF4J?) release. I am pretty sure he will do his part in PR, particularly in the context of benchmarking his new work against competitors. Once these results are out, whether they make sense or not, the internet will take them as the ultimate truth, and it will require a tremendous effort to change people's minds. In conclusion, please prioritize this appropriately and keep us posted. Don't hesitate to let us/me know if there is anything (drinks, coffee, lunch?) we can help with.
>>>>
>>>> Regarding your remarks about Loom, I don't think I follow you there. Our thread-locals are stateless, hence they should be perfectly fine while running in virtual threads too. (I am totally against blindly using TLs, but that is another discussion. I would rather have a recycler concept similar to the one we have in JsonTemplateLayout; I think there was already a mailing list thread about this.)
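To make that recycler idea concrete, here is a minimal sketch of the kind of pooling that does not depend on thread identity, so it behaves the same on platform and virtual threads. The class is invented for illustration and is not the JsonTemplateLayout recycler API.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Illustration only: a bounded pool of StringBuilders shared across threads,
    // instead of one StringBuilder cached per thread in a ThreadLocal.
    public final class StringBuilderRecycler {

        private final BlockingQueue<StringBuilder> pool;
        private final int maxRetainedCapacity;

        public StringBuilderRecycler(int poolSize, int maxRetainedCapacity) {
            this.pool = new ArrayBlockingQueue<>(poolSize);
            this.maxRetainedCapacity = maxRetainedCapacity;
        }

        /** Borrow a StringBuilder, allocating a fresh one when the pool is empty. */
        public StringBuilder acquire() {
            StringBuilder sb = pool.poll();
            return sb != null ? sb : new StringBuilder(256);
        }

        /** Return a StringBuilder; oversized or surplus instances are left for GC. */
        public void release(StringBuilder sb) {
            if (sb.capacity() <= maxRetainedCapacity) {
                sb.setLength(0);
                pool.offer(sb); // non-blocking: if the pool is full, just drop it
            }
        }
    }

Whether a shared pool like that actually beats thread-locals under Loom is exactly the sort of thing the benchmarks in this thread should answer.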
>>>> On Thu, Aug 26, 2021 at 8:37 PM Carter Kozak <[email protected]> wrote:
>>>>
>>>>> Yes, I'm still looking into this.
>>>>>
>>>>> I agree that the zero-garbage code is difficult to follow, and Loom virtual threads will have less reuse, so our thread-locals will create more overhead than they're worth in many cases. Fingers crossed for Valhalla to land before Loom, but I haven't been following updates from either project very closely lately, and Loom seems farther along.
>>>>>
>>>>> I think the issue with multiple buffers these days is that we have Optane and NVMe devices which can be _nearly_ as fast as main memory, so copying contents can be problematic. In fully async mode (with some caveats around queue-full-policy configuration points and nested async appenders) we don't need locking because the appender interactions are single-threaded.
>>>>>
>>>>> For what it's worth, in my testing I didn't see much difference between the 8 kB buffer and the 256 kB buffer on a system with a relatively fast NVMe drive; however, I did discover and resolve LOG4J2-3150 (incorrect default buffer size in the RandomAccessFileAppender).
>>>>>
>>>>> On Thu, Aug 26, 2021, at 14:21, Ralph Goers wrote:
>>>>>> While that is true, it does mean that we are locking higher up in the call stack than we need to be.
>>>>>>
>>>>>> Ralph
>>>>>>
>>>>>>> On Aug 26, 2021, at 11:13 AM, Tim Perry <[email protected]> wrote:
>>>>>>>
>>>>>>> I'm fairly certain the JIT will optimize out the locking operations on the nested synchronized calls after a while. I'm not sure how soon into the performance tests that would happen.
>>>>>>>
>>>>>>> Tim
>>>>>>>
>>>>>>>> On Aug 26, 2021, at 10:55 AM, Ralph Goers <[email protected]> wrote:
>>>>>>>>
>>>>>>>> So are you continuing to look at this? Frankly, the work on the zero-gc stuff made this very complicated. I am not sure I can even follow the logic anymore. I also noticed that many of the methods dealing with byte buffers are synchronized. I am not sure why, as only the method that moves data really needs to be.
>>>>>>>>
>>>>>>>> Using multiple buffers would make sense if you are filling one from the various patterns while it is unlocked and then only blocking when writing to the output stream buffer.
>>>>>>>>
>>>>>>>> For what it's worth, Ceki updated his performance page again yesterday to change the wording, but I am pretty sure he didn't change the Log4j2 configuration to use a 256K buffer to match Logback. And the page still makes it look like he is testing the performance of asynchronous processing when it is really testing the performance of the file appenders with a blocking queue in front of them.
>>>>>>>>
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>> On Aug 26, 2021, at 7:29 AM, Carter Kozak <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> I also tried that with similar results, which leads me to believe we're spending too much time copying between buffers.
>>>>>>>>>
>>>>>>>>> We've proven that log4j can get LogEvents to the appender very quickly and efficiently.
>>>>>>>>>
>>>>>>>>> Once we hit PatternLayout we write data to a StringBuilder. AbstractStringBuilder used to store a char array in Java 8; however, with JEP 254 (compact strings) it was updated, similarly to String, to use a byte array plus a byte representing the encoding.
>>>>>>>>>
>>>>>>>>> We encode the StringBuilder value into a ByteBufferDestination (heap) ByteBuffer in StringBuilderEncoder using the following steps:
>>>>>>>>> 1. Copy the StringBuilder value into a CharBuffer (this would have been a direct copy from char[] to char[] in Java 8, but requires conversion from byte[] + coder now).
>>>>>>>>> 2. Encode data from the CharBuffer into the StringBuilderEncoder ByteBuffer.
>>>>>>>>> 3. Copy the contents of the StringBuilderEncoder ByteBuffer into the ByteBufferDestination ByteBuffer (which is also always a heap buffer as far as I can tell).
>>>>>>>>> 4. Write the ByteBufferDestination to the FileOutputStream/RandomAccessFile by extracting the heap byte array.
>>>>>>>>>
>>>>>>>>> We're copying bytes around a lot, and I think older JVMs would have optimized out a bit of the copying because the types exactly matched. It's unclear why we need to copy through two separate ByteBuffers on the way from StringBuilderEncoder into ByteBufferDestination.
>>>>>>>>>
>>>>>>>>> I've generally had better I/O performance using direct byte buffers than heap buffers, although that experience largely comes from work on web servers; I haven't tested whether it holds true for file I/O. If we operate directly upon a FileChannel, we can write direct buffers straight to files. The DirectByteBuffer memory is guaranteed to be contiguous, so it should have better performance around copying and manipulation as well.
>>>>>>>>>
>>>>>>>>> -ck
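To sketch what that last idea might look like, here is a minimal, single-threaded illustration of encoding a StringBuilder straight into a reused direct ByteBuffer and writing it to a FileChannel. The class name and buffer size are invented for the example, and it glosses over the locking and error handling a real appender manager needs, but it shows which of the copy steps above would disappear.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.CharsetEncoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Illustration only, not the Log4j implementation: encode the layout's
    // StringBuilder directly into a reused direct ByteBuffer and hand it to a
    // FileChannel, skipping the intermediate heap-buffer copies in steps 1-3.
    final class DirectChannelWriter implements AutoCloseable {

        private final FileChannel channel;
        private final CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(256 * 1024);

        DirectChannelWriter(Path path) throws IOException {
            this.channel = FileChannel.open(path,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        void write(StringBuilder text) throws IOException {
            CharBuffer chars = CharBuffer.wrap(text); // no char[] copy of the StringBuilder
            encoder.reset();
            boolean done = false;
            while (!done) {
                // Encode as much as fits, drain the direct buffer to the channel, repeat.
                done = !encoder.encode(chars, buffer, true).isOverflow();
                buffer.flip();
                while (buffer.hasRemaining()) {
                    channel.write(buffer);
                }
                buffer.clear();
            }
            // UTF-8 holds no internal encoder state, so encoder.flush() is omitted here.
        }

        @Override
        public void close() throws IOException {
            channel.close();
        }
    }

Whether that actually beats the current StringBuilderEncoder path on file I/O is exactly what needs measuring.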
>>>>>>>>>> On Thu, Aug 26, 2021, at 10:00, Volkan Yazıcı wrote:
>>>>>>>>>> Ralph, maybe a dumb idea but... Can't we simply write to /dev/null to avoid the hardware's influence on the test results?
>>>>>>>>>>
>>>>>>>>>> On Wed, Aug 25, 2021 at 8:05 PM Ralph Goers <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Did you add the no-op appender to Ceki's project? Or are you using our NullAppender? I have my doubts about using that NullAppender as it does nothing - not even render the Layout, so it may get completely optimized away.
>>>>>>>>>>>
>>>>>>>>>>> I'd still like to know what kind of disks Remko did his original testing on. The page says he got 18.495 million messages per second. Each message would be about 130 characters long from what I can tell. That means the destination has to be able to support at least about 2.4 GB per second.
>>>>>>>>>>>
>>>>>>>>>>> In Ceki's test the message is only 23 characters long (vs 100 in Remko's). That means each rendered message is only going to be about 60 characters long. For me the async test is writing at about 1700 ops/ms, which equates to 102 MB/s, yet I am still seeing the queue backed up. So how we are writing must be terribly inefficient. Logback is getting about 2000 ops/ms, which is still only 120 MB/s.
>>>>>>>>>>>
>>>>>>>>>>> Your test below would be generating about 197 MB/s.
>>>>>>>>>>>
>>>>>>>>>>> One thing I did notice that made a big difference is that Ceki has Logback configured to use a buffer size of 256K while Log4j2 uses the default of 8K. Setting Log4j2 to 256K considerably improved the performance.
>>>>>>>>>>>
>>>>>>>>>>> Ralph
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 25, 2021, at 8:42 AM, Carter Kozak <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If we would move the appender performance aside, am I right to conclude that the entire complex async framework built atop Disruptor with all its whistles and bells is yielding a diminishing performance gain compared to a simple off-the-shelf blocking queue?
>>>>>>>>>>>>
>>>>>>>>>>>> I haven't seen any data that would give me such an impression -- the file appender itself is slower than the thread producing events, which appears to limit our throughput. Even then, log4j2 does much better in my testing on a Linux workstation with 20+ producer threads compared to Logback. Any time the queue is constantly full, performance will suffer (and this applies to standard blocking queues as well as to the Disruptor; however, the Disruptor will fare worse when the queue is full). The results using a single thread are roughly reproducible on the non-async test, so I think the problem is somewhere between the layout and the appender/manager.
>>>>>>>>>>>>
>>>>>>>>>>>> Results running locally with no-op appender implementations:
>>>>>>>>>>>>
>>>>>>>>>>>> 1 thread:
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  3285.743 ± 427.963  ops/ms
>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4  2383.532 ± 136.034  ops/ms
>>>>>>>>>>>>
>>>>>>>>>>>> 20 threads:
>>>>>>>>>>>>
>>>>>>>>>>>> Benchmark                                        Mode  Cnt     Score     Error  Units
>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.log4j2AsyncFile  thrpt    4  1177.746 ± 919.560  ops/ms
>>>>>>>>>>>> AsyncWithFileAppenderBenchmark.logbackFile      thrpt    4   602.614 ±  47.485  ops/ms
>>>>>
>>>>> -ck
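For what it's worth, Ralph's MB/s figures line up with a simple ops/ms times bytes-per-line estimate, assuming his roughly 60-byte rendered line for Ceki's test and 130 bytes for Remko's:

    3285.743 ops/ms  ~= 3,285,743 lines/s x ~60 B   ~= 197 MB/s   (the "about 197 MB/s" above)
    2000 ops/ms      ~= 2,000,000 lines/s x ~60 B   ~= 120 MB/s
    1700 ops/ms      ~= 1,700,000 lines/s x ~60 B   ~= 102 MB/s
    18.495 million lines/s          x ~130 B        ~= 2.4 GB/s   (Remko's original numbers)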
