Leon, can you definitely rule out the possibility that there are any objects being logged which themselves call Log4j from their toString method?
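For context, a minimal sketch of the pattern being asked about, using a hypothetical Order class (not from the reported system): an object passed as a log parameter whose own toString() re-enters Log4j. If such a message is rendered on the async logging thread while the ring buffer is full, that re-entry is what can wedge the consumer, as in LOG4J2-1518 mentioned below.

    import org.apache.logging.log4j.LogManager;
    import org.apache.logging.log4j.Logger;

    // Hypothetical class, for illustration only.
    public class Order {
        private static final Logger LOG = LogManager.getLogger(Order.class);
        private final String id;

        public Order(String id) {
            this.id = id;
        }

        @Override
        public String toString() {
            // Re-enters Log4j while a message containing this Order is being
            // formatted, possibly on the async logging thread itself.
            LOG.debug("rendering order {}", id);
            return "Order[" + id + "]";
        }
    }

A call such as LOG.info("processed {}", order) can then trigger this toString() during event rendering on the consumer thread.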
On Tue, Mar 16, 2021 at 7:28 AM Leon Finker <leon...@gmail.com> wrote:
> We are using the default TimeoutBlockingWaitStrategy. I doubt it will
> reproduce again. We have been using 2.13.1 since October 2020 and this
> never happened till now (we've been using log4j2 since the beginning, on
> various versions). We run about 200 of these specific service instances
> and never observed it before. But it would require the async queue to be
> filled up, which almost never happens. The only other time I've seen a
> somewhat similar deadlock with the disruptor consumer thread was
> https://issues.apache.org/jira/browse/LOG4J2-1518, but that was a clear
> issue with toString logging as well, on the producer side.
>
> What I don't understand is that the default async queue full policy is
> supposed to be synchronous, so I'm not sure why it would block on a full
> queue. Also, I see a few more changes in the default policy since my
> 2.13.1 version (but it doesn't look like they would help):
>
> https://github.com/apache/logging-log4j2/blame/master/log4j-core/src/main/java/org/apache/logging/log4j/core/async/DefaultAsyncQueueFullPolicy.java
> https://issues.apache.org/jira/browse/LOG4J2-2895
>
> I don't think my thread is either the background thread or
> Log4J2Thread... so I can't understand the blocking on a full queue in my
> case.
>
> On Mon, Mar 15, 2021 at 1:39 PM Anthony Maire <maire.anth...@gmail.com> wrote:
> >
> > My bad, I read things a little too quickly and I hadn't seen the
> > separation between the application threads and the background logging
> > thread in the original message. There are two issues here:
> >
> > - The one from Leon: the application thread cannot make progress
> > because the disruptor buffer is full, but the background logging thread
> > is doing nothing, waiting to be signaled by one of the producers. Can
> > you tell us which waitStrategy you are using for the disruptor? If it
> > is the blocking wait strategy, there were some bugs in it and another
> > strategy should be preferred, cf.
> > https://issues.apache.org/jira/browse/LOG4J2-1221. However, in this
> > case I would expect the background thread to be in WAITING state. If it
> > is still RUNNABLE but doing nothing, that would suggest you are using a
> > spinning wait strategy, and it would be the sign of a pretty serious
> > disruptor bug. If you are able to reproduce it and take a thread dump
> > while the system is stuck, and not during your shutdown procedure, that
> > would help a lot.
> >
> > - The one from Carter, where we sometimes need to log from the
> > background thread (according to the disruptor GitHub issue): with the
> > default strategy that waits for a slot in the disruptor to become free,
> > this just cannot work by design, since the only thread that can free a
> > slot is blocked waiting for one. You can either:
> >   - change the queue-full strategy to either log synchronously or drop
> >   messages, so that the background thread can always make progress, but
> >   this has some drawbacks (messages lost or out of order); or
> >   - not log asynchronously from the background thread: if
> >   parameter.toString() calls log4j, I think that is a pretty bad
> >   design, but if you don't have a choice then you need some workaround,
> >   like mixing async and sync logging in your log4j config so that these
> >   toString()-related logs are always done synchronously.
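For reference, the wait strategy and queue-full behavior discussed above are selected through system properties. A minimal sketch using the 2.x property names as documented in the async loggers manual (worth verifying against the exact Log4j version in use):

    -DAsyncLogger.WaitStrategy=Timeout        (Block, Timeout, Sleep or Yield)
    -Dlog4j2.AsyncQueueFullPolicy=Discard     (Default, Discard, or a custom class name)
    -Dlog4j2.DiscardThreshold=INFO            (with Discard: drop events below this level when full)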
> > On Mon, Mar 15, 2021 at 2:21 PM, Carter Kozak <cko...@ckozak.net> wrote:
> > >
> > > > The issue here is that you are trying to log from the logging thread.
> > >
> > > That doesn't match my understanding of the problem, but it's entirely
> > > possible I've missed something. The
> > > "Log4j2-TF-1-AsyncLogger[AsyncContext@5cb0d902]-1" thread in the
> > > example above is waiting for events to become available on the
> > > disruptor ringbuffer and is not in the process of logging anything
> > > itself, as far as I can tell.
> > >
> > > > The main question here is: why do you need to log from the logging
> > > > thread?
> > >
> > > Sometimes the background thread must log, although in many cases we
> > > attempt to avoid it. Consider message parameters which must be
> > > extracted: calling Object.toString may result in a logger invocation.
> > > We have code in place to prevent recursive logging, which can avoid
> > > some of these calls, as well as code to log synchronously in some of
> > > these cases to avoid potentially queueing between different ring
> > > buffers on the same system, although in many cases we want to allow
> > > these calls to log because they provide valuable and expected
> > > information.
> > >
> > > Best,
> > > -Carter
> > >
> > > On Mon, Mar 15, 2021, at 09:08, Anthony Maire wrote:
> > > > Hello
> > > >
> > > > The issue here is that you are trying to log from the logging
> > > > thread. Since this thread is the only one that consumes events from
> > > > the disruptor, if the buffer is full, it deadlocks itself.
> > > > It is not really a disruptor bug; it is an inappropriate usage of it
> > > > which leads to this behavior.
> > > >
> > > > In this thread, only log4j-related processing should occur (loggers,
> > > > appenders, filters and so on), so using a standard log4j logger for
> > > > this is pretty weird. Using the statusLogger is more appropriate
> > > > most of the time (and as far as I know this logger doesn't go
> > > > through the disruptor; please correct me if I'm wrong).
> > > >
> > > > The main question here is: why do you need to log from the logging
> > > > thread?
> > > >
> > > > Best regards,
> > > > Anthony
> > > >
> > > > On Mon, Mar 15, 2021 at 1:50 PM, Carter Kozak <cko...@ckozak.net> wrote:
> > > > >
> > > > > This sounds a lot like an issue I discussed with the
> > > > > lmax-disruptor folks here:
> > > > > https://github.com/LMAX-Exchange/disruptor/issues/307
> > > > > I ended up switching my services to log synchronously rather than
> > > > > enqueue when the buffer is full, which can produce events out of
> > > > > order; however, we log into a structured pipeline that indexes
> > > > > based on event timestamp. The funny thing is that I haven't seen
> > > > > any occurrences of this behavior since switching the queue-full
> > > > > policy to synchronous rather than enqueue, so I'm inclined to
> > > > > believe there's a disruptor bug somewhere, but I could be wrong.
> > > > >
> > > > > On Sun, Mar 14, 2021, at 20:50, Leon Finker wrote:
> > > > > > Yes, unfortunately all logging stopped and nothing was seen in
> > > > > > the log file for about 2 hrs until the service was restarted.
> > > > > > But there is nothing pointing to a deadlock in the thread dump
> > > > > > we got so far :( However, the thread dump captured was part of
> > > > > > our standard kill -3 if the process isn't exiting normally, so
> > > > > > it may not show the correct picture during shutdown.
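The workaround Anthony describes above (mixing async and sync logging so that toString()-related logs are always synchronous) would look roughly like the sketch below. All names are illustrative, and note that this style uses <AsyncLogger> configuration elements with the default context selector, rather than the all-async AsyncLoggerContextSelector from the original report further down:

    <?xml version="1.0" encoding="UTF-8"?>
    <Configuration status="warn">
      <Appenders>
        <File name="File" fileName="app.log">
          <PatternLayout pattern="%d %p %c [%t] %m%n"/>
        </File>
      </Appenders>
      <Loggers>
        <!-- hot application packages log through the disruptor -->
        <AsyncLogger name="com.example.app" level="info" additivity="false">
          <AppenderRef ref="File"/>
        </AsyncLogger>
        <!-- everything else, e.g. classes whose toString() may log, stays synchronous -->
        <Root level="info">
          <AppenderRef ref="File"/>
        </Root>
      </Loggers>
    </Configuration>

Loggers not matched by the AsyncLogger element fall through to the synchronous Root, so anything logged from a toString() during event rendering never has to re-enter the ring buffer.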
> > > > > On Sun, Mar 14, 2021 at 3:32 AM Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > > > >
> > > > > > One other thing. My expectation is that once the buffer is full
> > > > > > you should continue to log, but at the rate your underlying
> > > > > > system can handle. This wouldn't be a deadlock, but if the rate
> > > > > > at which you are logging is higher than the rate the system can
> > > > > > write, the result will be that the application slows down.
> > > > > >
> > > > > > If you are truly seeing a deadlock then it is possible there is
> > > > > > a problem in the disruptor, but I have very little experience
> > > > > > with that and would need Remko to weigh in.
> > > > > >
> > > > > > Ralph
> > > > > >
> > > > > > > On Mar 14, 2021, at 12:26 AM, Ralph Goers <ralph.go...@dslextreme.com> wrote:
> > > > > > >
> > > > > > > Yes, it rings bells and really should have a slot on our FAQ
> > > > > > > page (although the question really isn't asked all that
> > > > > > > frequently).
> > > > > > >
> > > > > > > First, it means that you are logging faster than your system
> > > > > > > can handle. The proper response is to figure out why you are
> > > > > > > logging so much and try to reduce it, or to use an alternative
> > > > > > > destination that is faster. For example, if you are logging to
> > > > > > > a file you may need to improve the hardware or increase the
> > > > > > > buffer size.
> > > > > > >
> > > > > > > As you observed, your buffer has filled up. The default
> > > > > > > behavior is for threads to wait until there is an empty slot.
> > > > > > >
> > > > > > > Remko is the real expert on this but I will take a stab at
> > > > > > > answering your question. The threads below show that they are
> > > > > > > in handleRingBufferFull. That method has three options to
> > > > > > > choose from, based on how you have configured the
> > > > > > > asyncQueueFullPolicy. You basically have three choices:
> > > > > > > 1. Wait for an open slot (the default).
> > > > > > > 2. Resort to logging synchronously.
> > > > > > > 3. Discard log events.
> > > > > > >
> > > > > > > You should look at
> > > > > > > https://logging.apache.org/log4j/2.x/log4j-core/apidocs/org/apache/logging/log4j/core/async/AsyncQueueFullPolicy.html.
> > > > > > > Note that option 2 above (synchronous) is only used by Log4j
> > > > > > > in certain circumstances and there isn't an option to specify
> > > > > > > it, so the only real options are the default behavior,
> > > > > > > discarding, or creating a custom AsyncQueueFullPolicy.
> > > > > > >
> > > > > > > Ralph
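As a sketch of the last option Ralph mentions, a custom policy implements the AsyncQueueFullPolicy interface linked above and is selected by setting the log4j2.AsyncQueueFullPolicy system property to its fully-qualified class name. The class and package below are hypothetical; it routes events from the background thread synchronously (which is also what DefaultAsyncQueueFullPolicy is described as doing) and keeps the default wait-for-a-slot behavior for application threads:

    import org.apache.logging.log4j.Level;
    import org.apache.logging.log4j.core.async.AsyncQueueFullPolicy;
    import org.apache.logging.log4j.core.async.EventRoute;

    // Hypothetical example, not part of Log4j: never let the consumer
    // thread block on the ring buffer it is supposed to drain.
    public class SyncOnBackgroundThreadPolicy implements AsyncQueueFullPolicy {
        @Override
        public EventRoute getRoute(final long backgroundThreadId, final Level level) {
            if (Thread.currentThread().getId() == backgroundThreadId) {
                return EventRoute.SYNCHRONOUS; // consumer keeps making progress
            }
            return EventRoute.ENQUEUE; // producers wait for a free slot, as by default
        }
    }

Selected with something like -Dlog4j2.AsyncQueueFullPolicy=com.example.SyncOnBackgroundThreadPolicy (hypothetical package name).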
> > > > > > > > On Mar 13, 2021, at 10:32 PM, Leon Finker <leon...@gmail.com> wrote:
> > > > > > > >>
> > > > > > > >> Hi,
> > > > > > > >>
> > > > > > > >> Using log4j2 2.13.1 with disruptor 3.4.2. Linux CentOS, JRE 11.0.5.
> > > > > > > >> Using the following JRE args:
> > > > > > > >> -DAsyncLogger.RingBufferSize=32768
> > > > > > > >> -DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
> > > > > > > >> -Dlog4j.configurationFile=...
> > > > > > > >>
> > > > > > > >> The disruptor queue has filled up, and we've observed a
> > > > > > > >> deadlock with the interesting threads stuck like the
> > > > > > > >> following. The disruptor processing thread seems to be
> > > > > > > >> waiting to process events, and all the other threads are
> > > > > > > >> locked up trying to publish events.
> > > > > > > >>
> > > > > > > >> "Log4j2-TF-1-AsyncLogger[AsyncContext@5cb0d902]-1" #23 daemon prio=5
> > > > > > > >> os_prio=0 cpu=40099128.73ms elapsed=44102.95s tid=0x00007f4c2ca34000
> > > > > > > >> nid=0x611a runnable [0x00007f4bfdcfc000]
> > > > > > > >>    java.lang.Thread.State: RUNNABLE
> > > > > > > >>         at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
> > > > > > > >>         at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
> > > > > > > >>         at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
> > > > > > > >>         at java.lang.Thread.run(java.base@11.0.5/Thread.java:834)
> > > > > > > >>
> > > > > > > >>    Locked ownable synchronizers:
> > > > > > > >>         - None
> > > > > > > >>
> > > > > > > >> "xxx" #268 prio=5 os_prio=0 cpu=7051742.32ms elapsed=43967.04s
> > > > > > > >> tid=0x00007f4c2e6ea800 nid=0x649d runnable [0x00007f49f77dd000]
> > > > > > > >>    java.lang.Thread.State: TIMED_WAITING (parking)
> > > > > > > >>         at jdk.internal.misc.Unsafe.park(java.base@11.0.5/Native Method)
> > > > > > > >>         at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.5/LockSupport.java:357)
> > > > > > > >>         at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
> > > > > > > >>         at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
> > > > > > > >>         at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:465)
> > > > > > > >>         at com.lmax.disruptor.dsl.Disruptor.publishEvent(Disruptor.java:331)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLoggerDisruptor.enqueueLogMessageWhenQueueFull(AsyncLoggerDisruptor.java:230)
> > > > > > > >>         - locked <0x0000000701017288> (a java.lang.Object)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.handleRingBufferFull(AsyncLogger.java:246)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.publish(AsyncLogger.java:230)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.logWithThreadLocalTranslator(AsyncLogger.java:225)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.access$000(AsyncLogger.java:67)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger$1.log(AsyncLogger.java:152)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.log(AsyncLogger.java:136)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2198)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2152)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2135)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2016)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1875)
> > > > > > > >>         at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:179)
> > > > > > > >>         ...
> > > > > > > >>         at java.lang.Thread.run(java.base@11.0.5/Thread.java:834)
> > > > > > > >>
> > > > > > > >>    Locked ownable synchronizers:
> > > > > > > >>         - None
> > > > > > > >>
> > > > > > > >> and all other threads are stuck like this:
> > > > > > > >>
> > > > > > > >> "xx-nioW-16" #273 prio=5 os_prio=0 cpu=7439.15ms elapsed=43966.67s
> > > > > > > >> tid=0x00007f4c2e4c7000 nid=0x64a9 waiting for monitor entry
> > > > > > > >> [0x00007f49f6ad2000]
> > > > > > > >>    java.lang.Thread.State: BLOCKED (on object monitor)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLoggerDisruptor.enqueueLogMessageWhenQueueFull(AsyncLoggerDisruptor.java:230)
> > > > > > > >>         - waiting to lock <0x0000000701017288> (a java.lang.Object)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.handleRingBufferFull(AsyncLogger.java:246)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.publish(AsyncLogger.java:230)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.logWithThreadLocalTranslator(AsyncLogger.java:225)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.access$000(AsyncLogger.java:67)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger$1.log(AsyncLogger.java:152)
> > > > > > > >>         at org.apache.logging.log4j.core.async.AsyncLogger.log(AsyncLogger.java:136)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2198)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2152)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2135)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2022)
> > > > > > > >>         at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1891)
> > > > > > > >>         at org.apache.logging.slf4j.Log4jLogger.info(Log4jLogger.java:194)
> > > > > > > >>         ...
> > > > > > > >>
> > > > > > > >> Does this ring any bells?
> > > > > > > >>
> > > > > > > >> Thank you in advance!