[
https://issues.apache.org/jira/browse/ARTEMIS-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17330426#comment-17330426
]
Christian Danner commented on ARTEMIS-3260:
-------------------------------------------
After we encountered this issue in one of our test installations, I immediately
tried out the latest 2.18.0 snapshot I could find
(2.18.0-20210420.204200-86-bin); however, the behavior was the same.
What bothers me is that I do not know what causes the journal to enter a state
where this problem starts to manifest itself, which is also the reason why it is
quite hard to reproduce. All I could do so far is analyze the situations where
the broker resolves this state after a restart (when it does, the messages are
still delivered once more, but no longer on follow-up restarts).
I added some additional log output (using our own custom logger) to the
PagingStoreImpl and QueueImpl classes to find out the difference between a
restart where the problem persists and one where the problem is resolved.
Attached are the log files of two such situations (restart_log_1_no_recovery
contains the output of a restart where the problem persists afterwards, and
restart_log_2_recovery contains the output of a restart attempt where the
broker was finally able to resolve the problem).
I also attached the updated classes so you can check where the log output was
added (the line numbers in the logs are slightly off due to the changes to the
original implementation).
Finally, I'm also attaching the complete data directory (journal, bindings) of a
test installation that is in a state where this problem can (sometimes) be
reproduced - not sure if this is helpful.
Unfortunately I currently see no way to implement a test case that is able to
reproduce this, as it would also have to simulate a broker restart and - as I
said - the behavior is completely non-deterministic. Sometimes the issue is
resolved on the first startup; at other times the messages keep re-appearing for
20 restarts in a row...
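For what it's worth, the skeleton of such a test would have to look roughly like
the sketch below. This is only an illustration of the restart loop, assuming the
broker is driven via EmbeddedActiveMQ as in our setup; the iteration count is
arbitrary and the produce/consume/verify steps are omitted because we do not know
what actually triggers the corrupted state.
{code:java}
import org.apache.activemq.artemis.core.server.embedded.EmbeddedActiveMQ;

// Rough sketch of a restart-based reproducer (illustration only, not a working test).
public class RestartReproducerSketch {
   public static void main(String[] args) throws Exception {
      for (int i = 0; i < 20; i++) {
         EmbeddedActiveMQ embedded = new EmbeddedActiveMQ();
         embedded.setConfigResourcePath("broker.xml"); // reuse the attached broker.xml
         embedded.start();

         // ... produce to and consume from the queue, assert it is empty,
         // and assert that no previously acknowledged messages reappear ...

         embedded.stop();
      }
   }
}
{code}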
> Already consumed messages are redelivered after server restart (possible
> Journal corruption)
> --------------------------------------------------------------------------------------------
>
> Key: ARTEMIS-3260
> URL: https://issues.apache.org/jira/browse/ARTEMIS-3260
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Components: AMQP, Broker
> Affects Versions: 2.17.0
> Environment: Embedded Apache Artemis 2.17.0
> Windows Server 2016 Standard (10.0.14393)
> Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
> Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)
> Reporter: Christian Danner
> Priority: Blocker
> Attachments: PagingStoreImpl.java, QueueImpl.java,
> activemq_artemis.log, broker.xml, broker.zip,
> restart_log_1_no_recovery-2021-04-22_08.46.52.log,
> restart_log_2_recovery-2021-04-22_08.48.49.log
>
>
> After upgrading from Artemis 2.15.0 to 2.17.0, we are experiencing situations
> (non-deterministic but quite regular) where the embedded Apache Artemis
> instance re-delivers messages to a client after a server restart.
> Those messages were already processed successfully before the restart, and the
> web console shows a message count of 0 prior to restarting the process.
> It is also important to note that once those stuck messages (which seemingly
> appear out of nowhere) have been reprocessed, newly added messages are
> processed just fine - so what we are seeing is the following:
> # At some point in time messages A,B,C were routed to Queue Q and
> successfully consumed
> # Q is empty (web console)
> # perform broker restart
> # client consumes A,B,C from Q again
> # Q is empty (web console)
> # another client sends X,Y,Z to Q
> # client consumes X,Y,Z
> # Q is empty (web console)
> # perform broker restart
> # client consumes A,B,C from Q again!
> This happens again and again on each broker restart, up to the point where the
> broker finally manages to recover from this situation by detecting an invalid
> (negative) address size, as indicated by the following log output:
> {quote}2021-04-22 21:04:51.897 WARN
> org.apache.activemq.artemis.core.paging.impl.PagingStoreImpl.addSize(PagingStoreImpl.java:753)
> [Thread-1
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@26bb92e2)]
> [ARTEMIS] AMQ222214: Destination incoming.message has an inconsistent
> and negative address size=-3,379.
> {quote}
> {quote}2021-04-22 21:04:51.897 WARN
> org.apache.activemq.artemis.core.paging.impl.PagingStoreImpl.addSize(PagingStoreImpl.java:753)
> [Thread-1
> (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@26bb92e2)]
> [ARTEMIS] AMQ222214: Destination incoming.message has an inconsistent
> and negative address size=-3,451.
> {quote}
>
> The full log file of such a situation (where the broker managed to recover)
> is attached together with the broker.xml file that we use as a template to
> configure the embedded instance programmatically.
> The broker runs embedded with the client, which consumes messages via AMQP
> using the Apache Qpid JMS library (JMS 2.0, v0.57.0) - there is only ever a
> single thread consuming from a queue, and we use transactions to explicitly
> commit or roll back received messages, with prefetch disabled
> (jms.prefetchPolicy.all=0).
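> For context, the consumption pattern looks roughly like the following sketch
> (simplified, not our actual code; the connection URL is a placeholder and the
> queue name is taken from the log output above):
> {code:java}
> import javax.jms.Connection;
> import javax.jms.ConnectionFactory;
> import javax.jms.Message;
> import javax.jms.MessageConsumer;
> import javax.jms.Session;
>
> import org.apache.qpid.jms.JmsConnectionFactory;
>
> // Simplified sketch: a single consuming thread, a transacted session,
> // and prefetch disabled via jms.prefetchPolicy.all=0.
> public class ConsumerSketch {
>    public static void main(String[] args) throws Exception {
>       ConnectionFactory factory =
>          new JmsConnectionFactory("amqp://localhost:5672?jms.prefetchPolicy.all=0");
>       try (Connection connection = factory.createConnection()) {
>          connection.start();
>          Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
>          MessageConsumer consumer =
>             session.createConsumer(session.createQueue("incoming.message"));
>          Message message;
>          while ((message = consumer.receive(1000)) != null) {
>             // ... process the message ...
>             session.commit(); // or session.rollback() if processing fails
>          }
>       }
>    }
> }
> {code}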
> Further investigation / debugging has shown that when messages are
> redelivered, the above log outputs concerning the negative address size are
> absent, and the reason is that the value returned by
> messageReference.getMessageMemoryEstimate() is different for the exact same
> message in line 1022 of class
> org.apache.activemq.artemis.core.server.impl.QueueImpl.
> This difference stems from a different value being calculated in the
> AMQPStandardMessage class (getMemoryEstimate()), and the difference is equal
> to the value returned by
> unmarshalledApplicationPropertiesMemoryEstimateFromData(), so I assume that
> the applicationProperties are sometimes not taken into account (I still have
> to verify this).
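> To illustrate why that difference matters: the address size is incremented with
> the estimate obtained when the message is added and decremented with the
> estimate obtained when it is acknowledged, so any mismatch between the two
> leaves a residue in the counter. The following toy sketch shows the suspected
> effect (not Artemis code; the numbers are made up):
> {code:java}
> // Toy illustration of the suspected accounting mismatch (not Artemis code).
> // Assume the estimate taken when the message is routed does not yet include
> // the decoded application properties, while the estimate taken at
> // acknowledgement does.
> public class EstimateMismatchSketch {
>    public static void main(String[] args) {
>       long addressSize = 0;
>
>       long estimateAtRoute = 1000;             // application properties not yet unmarshalled
>       long applicationPropertiesEstimate = 72; // added once the properties are decoded
>       long estimateAtAck = estimateAtRoute + applicationPropertiesEstimate;
>
>       addressSize += estimateAtRoute; // size added when the message is routed
>       addressSize -= estimateAtAck;   // size removed when the message is acknowledged
>
>       // The counter ends up negative by exactly the application-properties
>       // estimate, matching the AMQ222214 warnings about an inconsistent
>       // negative address size.
>       System.out.println("residual address size = " + addressSize); // prints -72
>    }
> }
> {code}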
> I can also provide the complete broker journal for such a situation, which we
> currently use for debugging, if this helps to analyze the issue (approx. 25 MB
> of files, ~100 kB compressed).
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)