[
https://issues.apache.org/jira/browse/ARTEMIS-3809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531237#comment-17531237
]
David Bennion edited comment on ARTEMIS-3809 at 5/3/22 2:47 PM:
----------------------------------------------------------------
Thanks Justin.
{quote}Since the {{wait}} happens in a {{while}} loop it could be looping on
the same {{LargeMessageControllerImpl}} instance. So the lock ID in itself is
not sufficient to say that execution of that thread has not proceeded in anyway
between thread dumps.
{quote}
Ok. Yes, I get this in principle. The behavior we observed is that all of the
other messages timed out in the queues. We have 60-second timeouts on all of
these messages. I pulled a thread dump this morning and this machine is still
hung in the same place and with the same object hash. We ran a large batch of
messages yesterday, and after this issue a large number came back as timed
out. So I am virtually certain that we are hung on a single message in a
single place; otherwise the message would have eventually timed out, the
consumer would have become unblocked, and it wouldn't be hanging in this same
location. I have not yet been able to get this machine into a debugger to
inspect it. (There is some work involved in that due to network segmentation
and security constraints, but I'll work my way around it.)
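For my own sanity I sketched the pattern you're describing. This is a
stdlib-only illustration with hypothetical names, not the actual Artemis code:
the waiting thread can cycle through {{wait}} on the same lock many times, so
the same lock ID across thread dumps doesn't by itself prove a lack of
progress.

```java
// Sketch of a guarded wait-in-a-loop (hypothetical names, not Artemis code).
// The waiter may re-enter wait() on the SAME lock object repeatedly, so two
// thread dumps showing the same lock ID are consistent with either a hang or
// normal looping.
public class WaitLoopSketch {
    private final Object lock = new Object();
    private boolean complete = false;

    // Waits up to timeoutMillis for completion; returns true if it completed.
    public boolean waitCompletion(long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (lock) {
            while (!complete) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // timed out without completing
                }
                lock.wait(remaining); // may loop here many times on one lock
            }
            return true;
        }
    }

    public void markComplete() {
        synchronized (lock) {
            complete = true;
            lock.notifyAll(); // the notify the hung consumer never receives
        }
    }

    public static void main(String[] args) throws Exception {
        WaitLoopSketch s = new WaitLoopSketch();
        new Thread(() -> {
            try { Thread.sleep(200); } catch (InterruptedException ignored) {}
            s.markComplete();
        }).start();
        System.out.println(s.waitCompletion(5000)); // prints true
    }
}
```

What we are seeing looks like the case where {{markComplete}} (the notify) is
never invoked and the effective timeout is unbounded.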
{quote}The default call timeout on the {{ServerLocator}} should be {{30000}}
milliseconds. Therefore, if you are not setting it to {{0}} I don't see how a
value of {{0}} could be used here.
{quote}
OK, that surprises me. I will search again to validate that what I said is
true. I cannot see any references in our code other than the one we just put
in. But is it possible that some other configuration call does this as a side
effect? I can clearly see that the current value must be either 0 or
astronomically large.
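In case it helps rule out a stray override: my understanding (an assumption on
my part, not something I have verified against the docs) is that the call
timeout can also be pinned directly in the connection URI, since URI
parameters map onto the {{ServerLocator}} setters ({{callTimeout}} in
milliseconds). Something like:
{noformat}
vm://0?callTimeout=30000
tcp://localhost:61616?callTimeout=30000{noformat}
If the hang persisted with an explicit 30000 there, that would suggest the
blocking wait isn't using the call timeout at all.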
{quote}The main thing I'm interested in at this point is {{TRACE}} logging on
{{{}org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl{}}}.
This will show what packets are arriving to the client. Specifically I want to
see packets of the type {{SESS_RECEIVE_LARGE_MSG}} and
{{{}SESS_RECEIVE_CONTINUATION{}}}.
{quote}
Got it. Unless I am mistaken, I won't see any of those logs from
RemotingConnectionImpl, because I have only been able to make this happen on
InVM. For instance, I see these logs:
{noformat}
05/02/2022 15:19:09.440 TRACE (.InVMConnector)) [InVMConnection ]
InVMConnection [serverID=0, id=1e5a49e7-ca54-11ec-99a3-0050568ed7da]::packet
sent done
05/02/2022 15:19:09.440 TRACE (0.0-2081182161)) [InVMConnection ]
InVMConnection [serverID=0, id=1e5a49e7-ca54-11ec-99a3-0050568ed7da]::Sending
inVM packet
05/02/2022 15:19:09.440 TRACE (0.0-2081182161)) [InVMConnection ]
InVMConnection [serverID=0, id=1e5a49e7-ca54-11ec-99a3-0050568ed7da]::packet
sent done
05/02/2022 15:19:09.440 TRACE (.InVMConnector)) [InVMConnection ]
InVMConnection [serverID=0, id=1e5a49e7-ca54-11ec-99a3-0050568ed7da]::Sending
inVM packet
05/02/2022 15:19:09.440 TRACE (.InVMConnector)) [InVMConnection ]
InVMConnection [serverID=0, id=1e5a49e7-ca54-11ec-99a3-0050568ed7da]::packet
sent done
05/02/2022 15:19:09.441 TRACE (0.0-2081182161)) [InVMConnection ]
InVMConnection [serverID=0, id=1e5a49e7-ca54-11ec-99a3-0050568ed7da]::Sending
inVM packet
05/02/2022 15:19:09.441 TRACE (0.0-2081182161)) [InVMConnection ]
InVMConnection [serverID=0, id=1e5a49e7-ca54-11ec-99a3-0050568ed7da]::packet
sent done
05/02/2022 15:19:09.441 TRACE (.InVMConnector)) [InVMConnection ]
InVMConnection [serverID=0, id=1e5a49e7-ca54-11ec-99a3-0050568ed7da]::Sending
inVM packet{noformat}
But these logs just repeat over and over like this and don't seem to contain
anything interesting.
{quote}Ultimately the best way to proceed is for you to provide a way to
reproduce the problem. That way I can put a debugger on it and really see
what's happening.
{quote}
I get it. I will see what I can do. It is hard enough to get this one to
happen even with all our code connected.
> LargeMessageControllerImpl hangs the message consume
> ----------------------------------------------------
>
> Key: ARTEMIS-3809
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3809
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: Broker
> Affects Versions: 2.21.0
> Environment: OS: Windows Server 2019
> JVM: OpenJDK 64-Bit Server VM Temurin-17.0.1+12
> Max Memory (-Xmx): 6GB
> Allocated to JVM: 4.168GB
> Currently in use: 3.398GB (heap 3.391GB, non-heap 0.123GB)
> Reporter: David Bennion
> Priority: Major
> Labels: test-stability
>
> I wondered if this might be a recurrence of issue ARTEMIS-2293 but this
> happens on 2.21.0 and I can see the code change in
> LargeMessageControllerImpl.
> Using the default min-large-message-size of 100K.
> Many messages are passing through the broker when this happens. I would
> anticipate that most of the messages are smaller than 100K, but clearly some
> of them must exceed it. After some number of messages, a particular consumer
> ceases to consume messages.
> After the system became "hung", I was able to get a stack trace and identify
> that the system is stuck in an Object.wait() for a notify that appears never
> to come.
> Here is the trace I was able to capture:
> {code:java}
> Thread-2 (ActiveMQ-client-global-threads) id=78 state=TIMED_WAITING
> - waiting on <0x43523a75> (a
> org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
> - locked <0x43523a75> (a
> org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl)
> at [email protected]/java.lang.Object.wait(Native Method)
> at
> org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.waitCompletion(LargeMessageControllerImpl.java:294)
> at
> org.apache.activemq.artemis.core.client.impl.LargeMessageControllerImpl.saveBuffer(LargeMessageControllerImpl.java:268)
> at
> org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.checkBuffer(ClientLargeMessageImpl.java:157)
> at
> org.apache.activemq.artemis.core.client.impl.ClientLargeMessageImpl.getBodyBuffer(ClientLargeMessageImpl.java:89)
> at mypackage.MessageListener.handleMessage(MessageListener.java:46)
> {code}
>
> The app can run either as a single node using the InVM transporter or as a
> cluster using TCP. To my knowledge, I have only seen this issue occur on
> InVM.
> I am no expert in this code, but I can tell from the call stack that 0 must
> be the value of timeWait passed into waitCompletion(). But from what I can
> discern of the code changes in 2.21.0, it should be adjusting the
> readTimeout to the timeout of the message (I think?) so that the read
> eventually gives up rather than remaining blocked forever.
> We have persistenceEnabled = false, which leads me to believe that the only
> disk activity for messages should be related to large messages(?).
> On a machine and in a context where this was consistently happening, I
> adjusted the min-large-message-size upwards and the problem went away. This
> makes sense for my application, but ultimately, if a message crosses the
> threshold to become large, it appears to hang the consumer indefinitely.
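> As a concrete example of the workaround: we raised the threshold in the
> client connection URI (minLargeMessageSize is my understanding of the
> relevant parameter name, in bytes; treat this as an assumption rather than
> verified configuration):
> {noformat}
> vm://0?minLargeMessageSize=512000{noformat}
> With the threshold above our largest payloads, no message is converted to a
> large message and the hang does not occur.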
--
This message was sent by Atlassian Jira
(v8.20.7#820007)