Leon, can you definitely rule out the possibility that there are any objects being logged which themselves call Log4j from their toString method?
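For context, a minimal sketch of the pattern being asked about, using a hypothetical Order class (not from the reported system): an object passed as a log parameter whose own toString() re-enters Log4j. If such a message is rendered on the async logging thread while the ring buffer is full, that re-entry is what can wedge the consumer, as in LOG4J2-1518 mentioned below.

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    // Hypothetical class, for illustration only.
    public class Order {
        private static final Logger LOG = LogManager.getLogger(Order.class);
        private final String id;

        public Order(String id) {
            this.id = id;
        }

        @Override
        public String toString() {
            // Re-enters Log4j while a message containing this Order is being
            // formatted, possibly on the async logging thread itself.
            LOG.debug("rendering order {}", id);
            return "Order[" + id + "]";
        }
    }

A call such as LOG.info("processed {}", order) can then trigger this toString() during event rendering on the consumer thread.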
On Tue, Mar 16, 2021 at 7:28 AM Leon Finker <leon...@gmail.com> wrote:
> We are using the default TimeoutBlockingWaitStrategy. I doubt it will
> reproduce again. We have been using 2.13.1 since October 2020 and this
> never happened till now (we've been using log4j2 since the beginning, on
> various versions). We run about 200 of these specific service instances
> and never observed it before. But it would require the async queue to be
> filled up, which almost never happens. The only other time I've seen a
> somewhat similar deadlock with the disruptor consumer thread was
> https://issues.apache.org/jira/browse/LOG4J2-1518, but that was a clear
> issue with toString logging as well, on the producer side.
>
> What I don't understand is that the default async queue full policy is
> supposed to be synchronous, so I'm not sure why it would block on a full
> queue. Also, I see a few more changes in the default policy since my
> 2.13.1 version (but it doesn't look like they would help):
>
> https://github.com/apache/logging-log4j2/blame/master/log4j-core/src/main/java/org/apache/logging/log4j/core/async/DefaultAsyncQueueFullPolicy.java
> https://issues.apache.org/jira/browse/LOG4J2-2895
>
> I don't think my thread is either the background thread or
> Log4J2Thread... so I can't understand the blocking on a full queue in my
> case.
>
> On Mon, Mar 15, 2021 at 1:39 PM Anthony Maire <maire.anth...@gmail.com> wrote:
> >
> > My bad, I read things a little too quickly and I hadn't seen the
> > separation between the application threads and the background logging
> > thread in the original message. There are two issues here:
> >
> > - The one from Leon: the application thread cannot make progress
> > because the disruptor buffer is full, but the background logging thread
> > is doing nothing, waiting to be signaled by one of the producers. Can
> > you tell us which waitStrategy you are using for the disruptor? If it
> > is the blocking wait strategy, there were some bugs in it and another
> > strategy should be preferred, cf.
> > https://issues.apache.org/jira/browse/LOG4J2-1221. However, in this
> > case I would expect the background thread to be in WAITING state. If it
> > is still RUNNABLE but doing nothing, that would suggest you are using a
> > spinning wait strategy, and it would be the sign of a pretty serious
> > disruptor bug. If you are able to reproduce it and take a thread dump
> > while the system is stuck, and not during your shutdown procedure, that
> > would help a lot.
> >
> > - The one from Carter, where we sometimes need to log from the
> > background thread (according to the disruptor GitHub issue): with the
> > default strategy that waits for a slot in the disruptor to become free,
> > this just cannot work by design, since the only thread that can free a
> > slot is blocked waiting for one. You can either:
> >   - change the queue-full strategy to either log synchronously or drop
> >   messages, so that the background thread can always make progress, but
> >   this has some drawbacks (messages lost or out of order); or
> >   - not log asynchronously from the background thread: if
> >   parameter.toString() calls log4j, I think that is a pretty bad
> >   design, but if you don't have a choice then you need some workaround,
> >   like mixing async and sync logging in your log4j config so that these
> >   toString()-related logs are always done synchronously.
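For reference, the wait strategy and queue-full behavior discussed above are selected through system properties. A minimal sketch using the 2.x property names as documented in the async loggers manual (worth verifying against the exact Log4j version in use):

    -DAsyncLogger.WaitStrategy=Timeout        (Block, Timeout, Sleep or Yield)
    -Dlog4j2.AsyncQueueFullPolicy=Discard     (Default, Discard, or a custom class name)
    -Dlog4j2.DiscardThreshold=INFO            (with Discard: drop events below this level when full)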
> > On Mon, Mar 15, 2021 at 2:21 PM, Carter Kozak <cko...@ckozak.net> wrote:
> > >
> > > > The issue here is that you are trying to log from the logging thread.
> > >
> > > That doesn't match my understanding of the problem, but it's entirely
> > > possible I've missed something. The
> > > "Log4j2-TF-1-AsyncLogger[AsyncContext@5cb0d902]-1" thread in the
> > > example above is waiting for events to become available on the
> > > disruptor ringbuffer and is not in the process of logging anything
> > > itself, as far as I can tell.
> > >
> > > > The main question here is: why do you need to log from the logging
> > > > thread?
> > >
> > > Sometimes the background thread must log, although in many cases we
> > > attempt to avoid it. Consider message parameters which must be
> > > extracted: calling Object.toString may result in a logger invocation.
> > > We have code in place to prevent recursive logging, which can avoid
> > > some of these calls, as well as code to log synchronously in some of
> > > these cases to avoid potentially queueing between different ring
> > > buffers on the same system, although in many cases we want to allow
> > > these calls to log because they provide valuable and expected
> > > information.
> > >
> > > Best,
> > > -Carter
> > >
> > > On Mon, Mar 15, 2021, at 09:08, Anthony Maire wrote:
> > > > Hello
> > > >
> > > > The issue here is that you are trying to log from the logging
> > > > thread. Since this thread is the only one that consumes events from
> > > > the disruptor, if the buffer is full, it deadlocks itself.
> > > > It is not really a disruptor bug; it is an inappropriate usage of it
> > > > which leads to this behavior.
> > > >
> > > > In this thread, only log4j-related processing should occur (loggers,
> > > > appenders, filters and so on), so using a standard log4j logger for
> > > > this is pretty weird. Using the statusLogger is more appropriate
> > > > most of the time (and as far as I know this logger doesn't go
> > > > through the disruptor; please correct me if I'm wrong).
> > > >
> > > > The main question here is: why do you need to log from the logging
> > > > thread?
> > > >
> > > > Best regards,
> > > > Anthony
> > > >
> > > > On Mon, Mar 15, 2021 at 1:50 PM, Carter Kozak <cko...@ckozak.net> wrote:
> > > > >
> > > > > This sounds a lot like an issue I discussed with the
> > > > > lmax-disruptor folks here:
> > > > > https://github.com/LMAX-Exchange/disruptor/issues/307
> > > > > I ended up switching my services to log synchronously rather than
> > > > > enqueue when the buffer is full, which can produce events out of
> > > > > order; however, we log into a structured pipeline that indexes
> > > > > based on event timestamp. The funny thing is that I haven't seen
> > > > > any occurrences of this behavior since switching the queue-full
> > > > > policy to synchronous rather than enqueue, so I'm inclined to
> > > > > believe there's a disruptor bug somewhere, but I could be wrong.
> > > > >
> > > > > On Sun, Mar 14, 2021, at 20:50, Leon Finker wrote:
> > > > > > Yes, unfortunately all logging stopped and nothing was seen in
> > > > > > the log file for about 2 hrs until the service was restarted.
> > > > > > But there is nothing pointing to a deadlock in the thread dump
> > > > > > we got so far :( However, the thread dump captured was part of
> > > > > > our standard kill -3 if the process isn't exiting normally, so
> > > > > > it may not show the correct picture during shutdown.
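The workaround Anthony describes above (mixing async and sync logging so that toString()-related logs are always synchronous) would look roughly like the sketch below. All names are illustrative, and note that this style uses <AsyncLogger> configuration elements with the default context selector, rather than the all-async AsyncLoggerContextSelector from the original report further down:

    <?xml version="1.0" encoding="UTF-8"?>
    <Configuration status="warn">
      <Appenders>
        <File name="File" fileName="app.log">
          <PatternLayout pattern="%d %p %c [%t] %m%n"/>
        </File>
      </Appenders>
      <Loggers>
        <!-- hot application packages log through the disruptor -->
        <AsyncLogger name="com.example.app" level="info" additivity="false">
          <AppenderRef ref="File"/>
        </AsyncLogger>
        <!-- everything else, e.g. classes whose toString() may log, stays synchronous -->
        <Root level="info">
          <AppenderRef ref="File"/>
        </Root>
      </Loggers>
    </Configuration>

Loggers not matched by the AsyncLogger element fall through to the synchronous Root, so anything logged from a toString() during event rendering never has to re-enter the ring buffer.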
> > > > > On Sun, Mar 14, 2021 at 3:32 AM Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > > > >
> > > > > > One other thing. My expectation is that once the buffer is full
> > > > > > you should continue to log, but at the rate your underlying
> > > > > > system can handle. This wouldn't be a deadlock, but if the rate
> > > > > > at which you are logging is higher than the rate the system can
> > > > > > write, the result will be that the application slows down.
> > > > > >
> > > > > > If you are truly seeing a deadlock then it is possible there is
> > > > > > a problem in the disruptor, but I have very little experience
> > > > > > with that and would need Remko to weigh in.
> > > > > >
> > > > > > Ralph
> > > > > >
> > > > > > > On Mar 14, 2021, at 12:26 AM, Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > > > > >
> > > > > > > Yes, it rings bells and really should have a slot on our FAQ
> > > > > > > page (although the question really isn't asked all that
> > > > > > > frequently).
> > > > > > >
> > > > > > > First, it means that you are logging faster than your system
> > > > > > > can handle. The proper response is to figure out why you are
> > > > > > > logging so much and try to reduce it, or to use an alternative
> > > > > > > destination that is faster. For example, if you are logging to
> > > > > > > a file you may need to improve the hardware or increase the
> > > > > > > buffer size.
> > > > > > >
> > > > > > > As you observed, your buffer has filled up. The default
> > > > > > > behavior is for threads to wait until there is an empty slot.
> > > > > > >
> > > > > > > Remko is the real expert on this but I will take a stab at
> > > > > > > answering your question. The threads below show that they are
> > > > > > > in handleRingBufferFull. That method has three options to
> > > > > > > choose from, based on how you have configured the
> > > > > > > asyncQueueFullPolicy. You basically have three choices:
> > > > > > > 1. Wait for an open slot (the default).
> > > > > > > 2. Resort to logging synchronously.
> > > > > > > 3. Discard log events.
> > > > > > >
> > > > > > > You should look at
> > > > > > > https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/async/AsyncQueueFullPolicy.html.
> > > > > > > Note that option 2 above (synchronous) is only used by Log4j
> > > > > > > in certain circumstances and there isn't an option to specify
> > > > > > > it, so the only real options are the default behavior,
> > > > > > > discarding, or creating a custom AsyncQueueFullPolicy.
> > > > > > >
> > > > > > > Ralph
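As a sketch of the last option Ralph mentions, a custom policy implements the AsyncQueueFullPolicy interface linked above and is selected by setting the log4j2.AsyncQueueFullPolicy system property to its fully-qualified class name. The class and package below are hypothetical; it routes events from the background thread synchronously (which is also what DefaultAsyncQueueFullPolicy is described as doing) and keeps the default wait-for-a-slot behavior for application threads:

    import org.apache.logging.log4j.Level;
    import org.apache.logging.log4j.core.async.AsyncQueueFullPolicy;
    import org.apache.logging.log4j.core.async.EventRoute;

    // Hypothetical example, not part of Log4j: never let the consumer
    // thread block on the ring buffer it is supposed to drain.
    public class SyncOnBackgroundThreadPolicy implements AsyncQueueFullPolicy {
        @Override
        public EventRoute getRoute(final long backgroundThreadId, final Level level) {
            if (Thread.currentThread().getId() == backgroundThreadId) {
                return EventRoute.SYNCHRONOUS; // consumer keeps making progress
            }
            return EventRoute.ENQUEUE; // producers wait for a free slot, as by default
        }
    }

Selected with something like -Dlog4j2.AsyncQueueFullPolicy=com.example.SyncOnBackgroundThreadPolicy (hypothetical package name).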
> > > > > > > > On Mar 13, 2021, at 10:32 PM, Leon Finker <leon...@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >> Hi,
> > > > > > > >>
> > > > > > > >> Using log4j2 2.13.1 with disruptor 3.4.2. Linux CentOS, JRE 11.0.5.
> > > > > > > >> Using the following JRE args:
> > > > > > > >> -DAsyncLogger.RingBufferSize=32768
> > > > > > > >> -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
> > > > > > > >> -Dlog4j.configurationFile=...
> > > > > > > >>
> > > > > > > >> The disruptor queue has filled up, and we've observed a
> > > > > > > >> deadlock with the interesting threads stuck like the
> > > > > > > >> following. The disruptor processing thread seems to be
> > > > > > > >> waiting to process events, and all the other threads are
> > > > > > > >> locked up trying to publish events.
> > > > > > > >>
> > > > > > > >> "Log4j2-TF-1-AsyncLogger[AsyncContext@5cb0d902]-1" #23 daemon prio=5
> > > > > > > >> os_prio=0 cpu=40099128.73ms elapsed=44102.95s tid=0x00007f4c2ca34000
> > > > > > > >> nid=0x611a runnable [0x00007f4bfdcfc000]
> > > > > > > >>    java.lang.Thread.State: RUNNABLE
> > > > > > > >>         at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
> > > > > > > >>         at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
> > > > > > > >>         at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
> > > > > > > >>         at java.lang.Thread.run(java.base@11.0.5/Thread.java:834)
> > > > > > > >>
> > > > > > > >>    Locked ownable synchronizers:
> > > > > > > >>         - None
> > > > > > > >>
> > > > > > > >> "xxx" #268 prio=5 os_prio=0 cpu=7051742.32ms elapsed=43967.04s
> > > > > > > >> tid=0x00007f4c2e6ea800 nid=0x649d runnable [0x00007f49f77dd000]
> > > > > > > >>    java.lang.Thread.State: TIMED_WAITING (parking)
> > > > > > > >>         at jdk.internal.misc.Unsafe.park(java.base@11.0.5/Native Method)
> > > > > > > >>         at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.5/LockSupport.java:357)
> > > > > > > >>         at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
> > > > > > > >>         at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
> > > > > > > >>         at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:465)
> > > > > > > >>         at com.lmax.disruptor.dsl.Disruptor.publishEvent(Disruptor.java:331)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLoggerDisruptor.enqueueLogMessageWhenQueueFull(AsyncLoggerDisruptor.java:230)
> > > > > > > >>         - locked <0x0000000701017288> (a java.lang.Object)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.handleRingBufferFull(AsyncLogger.java:246)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.publish(AsyncLogger.java:230)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.logWithThreadLocalTranslator(AsyncLogger.java:225)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.access$000(AsyncLogger.java:67)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger$1.log(AsyncLogger.java:152)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.log(AsyncLogger.java:136)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2198)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2152)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2135)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2016)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1875)
> > > > > > > >>         at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:179)
> > > > > > > >>         ...
> > > > > > > >>         at java.lang.Thread.run(java.base@11.0.5/Thread.java:834)
> > > > > > > >>
> > > > > > > >>    Locked ownable synchronizers:
> > > > > > > >>         - None
> > > > > > > >>
> > > > > > > >> and all other threads are stuck like this:
> > > > > > > >>
> > > > > > > >> "xx-nioW-16" #273 prio=5 os_prio=0 cpu=7439.15ms elapsed=43966.67s
> > > > > > > >> tid=0x00007f4c2e4c7000 nid=0x64a9 waiting for monitor entry
> > > > > > > >> [0x00007f49f6ad2000]
> > > > > > > >>    java.lang.Thread.State: BLOCKED (on object monitor)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLoggerDisruptor.enqueueLogMessageWhenQueueFull(AsyncLoggerDisruptor.java:230)
> > > > > > > >>         - waiting to lock <0x0000000701017288> (a java.lang.Object)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.handleRingBufferFull(AsyncLogger.java:246)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.publish(AsyncLogger.java:230)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.logWithThreadLocalTranslator(AsyncLogger.java:225)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.access$000(AsyncLogger.java:67)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger$1.log(AsyncLogger.java:152)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.log(AsyncLogger.java:136)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2198)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2152)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2135)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2022)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1891)
> > > > > > > >>         at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:194)
> > > > > > > >>         ...
> > > > > > > >>
> > > > > > > >> Does this ring any bells?
> > > > > > > >>
> > > > > > > >> Thank you in advance!