[
https://issues.apache.org/jira/browse/AMQ-6115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096370#comment-15096370
]
Klaus Pittig commented on AMQ-6115:
-----------------------------------
Just to complete the record, here is the relevant information from the
corresponding mailing list thread:
http://activemq.2283324.n4.nabble.com/How-to-avoid-blocking-of-queue-browsing-after-ActiveMQ-checkpoint-call-td4705696.html
Tim Bain:
{quote}
I believe you are correct: browsing a persistent queue uses bytes from the
memory store, because those bytes must be read from the persistence store
into the memory store before they can be handed off to browsers or
consumers. If all available bytes in the memory store are already in use,
the messages can't be paged into the memory store, and so the operation
that required them to be paged in will hang/fail.
You can work around the problem by increasing your memory store size via
trial-and-error until the problem goes away. Note that the broker itself
needs some amount of memory, so you can't give the whole heap over to the
memory store or you'll risk getting OOMs, which means you may need to
increase the heap size as well. You can estimate how much memory the
broker needs aside from the memory store by subtracting the bytes used for
the memory store (539 MB) from the total heap bytes used as measured via
JConsole or similar tools. I'd double (or more) that number to be safe, if
it was me; the last thing I want to deal with in a production application
(ActiveMQ or anything else) is running out of memory because I tried to cut
the memory limits too close just to save a little RAM.
All of that is how to work around the fact that before you try to browse
your queue, something else has already consumed all available bytes in the
memory store. If you want to dig into why that's happening, we'd need to
try to figure out what those bytes are being used for and whether it's
possible to change configuration values to reduce the usage so it fits into
your current limit. There will definitely be more effort required than
simply increasing the memory limit (and max heap size), but we can try if
you're not able to increase the limits enough to fix the problem.
If you want to go down that path, one thread to pull on is your observation
that you "can browse/consume some Queues _until_ the #checkpoint call
after 30 seconds." I assume from your reference to checkpointing that
you're using KahaDB as your persistence store. Can you post the KahaDB
portion of your config?
Your statements here and in your StackOverflow post (
http://stackoverflow.com/questions/34679854/how-to-avoid-blocking-of-queue-browsing-after-activemq-checkpoint-call)
indicate that you think that the problem is that memory isn't getting
garbage collected after the operation that needed it (i.e. the checkpoint)
completes, but it's also possible that the checkpoint operation isn't
completing because it can't get enough messages read into the memory
store. Have you confirmed via the thread dump that there is not a
checkpoint operation still in progress? Also, how large are your journal
files that are getting checkpointed? If they're large enough that all
messages for one file won't fit into the memory store, you might be able to
prevent the problem by using smaller files.
{quote}
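For reference: the "memory store" limit Tim mentions is the broker's
systemUsage memoryUsage setting. A minimal sketch of the workaround he
describes (the 1 gb value is purely illustrative, not a recommendation):
{code:xml}
<systemUsage>
  <systemUsage>
    <memoryUsage>
      <!-- illustrative value only: tune by trial-and-error as described
           above, and raise the JVM max heap (-Xmx) accordingly -->
      <memoryUsage limit="1 gb"/>
    </memoryUsage>
  </systemUsage>
</systemUsage>
{code}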
a.) Regarding your last answer (thanks for your effort by the way):
I'm aware of the relationship between the heap and the systemUsage memoryLimit,
and we make sure there are no inconsistent settings.
The primary requirement is a stable system that runs 'forever' without any
memory issues at any time, independent of the load/throughput.
No one really wants to deal with memory settings that sit right at the edge of
their limits.
You're right: the memory is completely consumed. And I can't guarantee that the
checkpoint/cleanup ever finishes completely, so the system can stall without
giving the GC a chance to release any memory.
It's the expiry check that causes this. The persistent stores themselves seem
to be managed as expected (no issues, no inconsistency, no loss);
our situation is independent of the storage (reproducible for both LevelDB and
KahaDB). For KahaDB we have used 16 MB journal files for years (this saves a
huge amount of the space required for pending messages that are not consumed
for some days due to offline situations on the client side).
Anyway, here is the current configuration you requested:
{code:xml}
<persistenceAdapter>
  <kahaDB directory="${activemq.base}/data/kahadb"
          enableIndexWriteAsync="true"
          journalMaxFileLength="16mb"
          indexWriteBatchSize="10000"
          indexCacheSize="10000" />
  <!--
  <levelDB directory="${activemq.base}/data/leveldb" logSize="33554432" />
  -->
</persistenceAdapter>
{code}
b.) A proposal concerning AMQ-6115:
From my point of view, it's worth discussing the single memoryLimit parameter
that is used for both the regular browse/consume threads and the
checkpoint/cleanup threads.
There should always be enough space to browse/consume any queue, at least with
prefetch 1, i.e. for one of the next pending messages.
Maybe, in this case, two well-balanced memoryLimit parameters that prioritize
consumption over checkpoint/cleanup would allow better regulation (a purely
hypothetical sketch follows below). Or something along those lines.
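To illustrate the proposal only: memoryLimit exists today, but
consumeMemoryLimit and checkpointMemoryLimit are invented names that do not
exist in ActiveMQ:
{code:xml}
<!-- hypothetical: split the single limit into one budget reserved for
     browse/consume and one cap for checkpoint/cleanup paging;
     neither attribute below exists in ActiveMQ -->
<policyEntry queue=">" memoryLimit="128mb"
             consumeMemoryLimit="96mb"
             checkpointMemoryLimit="32mb"/>
{code}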
c.) Our results and an acceptable solution so far:
After a thorough investigation (without changing the ActiveMQ source code), the
conclusion for now is that we have to accept the limitations imposed by the
single memoryLimit parameter used for both the #checkpoint/cleanup process and
for browsing/consuming queues.
**1.) Memory**
There is no problem if we use a much higher memoryLimit (together with a
higher max heap) to support both the per-destination message caching during
the #checkpoint/cleanup workflow and our requirement to browse/consume
messages.
But more memory is not an option in our scenario; we have to work with a 1024m
max heap and a 500m memoryLimit.
Besides this, constantly raising the memoryLimit just because there are more
persistent queues holding hundreds or thousands of pending messages (combined
with certain offline/inactive consumer scenarios) deserves a detailed
discussion of its own, IMHO.
**2.) Persistent Adapters**
We ruled out the persistence adapters as the cause, because the behaviour does
not change when we switch between different types of persistent stores
(KahaDB, LevelDB, JDBC/PostgreSQL).
During the debugging sessions with KahaDB we also see regular checkpoint
handling; the storage is managed as expected.
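For completeness, a sketch of the JDBC variant we tested; the postgres-ds bean
id and the connection values are placeholders, not our actual settings:
{code:xml}
<persistenceAdapter>
  <jdbcPersistenceAdapter dataSource="#postgres-ds"/>
</persistenceAdapter>

<!-- placeholder datasource for the sketch above -->
<bean id="postgres-ds" class="org.postgresql.ds.PGPoolingDataSource">
  <property name="url" value="jdbc:postgresql://localhost:5432/activemq"/>
  <property name="user" value="activemq"/>
  <property name="password" value="activemq"/>
</bean>
{code}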
**3.) Destination Policy / Expiration Check**
Our problem disappears completely if we disable caching and the expiration
check; the expiration check is the actual cause of the problem.
The corresponding properties are documented, and there is a nice blog article
about message priorities whose description fits our scenario quite well:
- http://activemq.apache.org/how-can-i-support-priority-queues.html
- http://blog.christianposta.com/activemq/activemq-message-priorities-how-it-works/
We simply added useCache="false" and expireMessagesPeriod="0" to the
policyEntry:
{code:xml}
<destinationPolicy>
  <policyMap>
    <policyEntries>
      <policyEntry queue=">" producerFlowControl="false"
                   optimizedDispatch="true" memoryLimit="128mb"
                   timeBeforeDispatchStarts="1000"
                   useCache="false" expireMessagesPeriod="0">
        <dispatchPolicy>
          <strictOrderDispatchPolicy />
        </dispatchPolicy>
        <pendingQueuePolicy>
          <storeCursor />
        </pendingQueuePolicy>
      </policyEntry>
    </policyEntries>
  </policyMap>
</destinationPolicy>
{code}
The consequences of no longer using in-memory caching and of never checking
for message expiration are clear.
Since we use neither message expiration nor message priorities, and the
current message dispatching is fast enough for us, this trade-off is
acceptable given the system limitations.
One should also think about well-defined prefetch limits to control memory
consumption during specific workflows. Message sizes in our scenario range
from 2 bytes up to approx. 100 KB, so more individual policyEntries and client
consumer configurations could help optimize system behaviour with respect to
performance and memory usage (see
http://activemq.apache.org/per-destination-policies.html and the sketch below).
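For illustration, a broker-side prefetch override is a one-line policy entry;
the value 1 is just an example (clients can also set it per destination, e.g.
by appending ?consumer.prefetchSize=1 to the destination name):
{code:xml}
<!-- example only: force a prefetch of 1 for all queues on the broker side -->
<policyEntry queue=">" queuePrefetch="1"/>
{code}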
> No more browse/consume possible after #checkpoint run
> -----------------------------------------------------
>
> Key: AMQ-6115
> URL: https://issues.apache.org/jira/browse/AMQ-6115
> Project: ActiveMQ
> Issue Type: Bug
> Components: activemq-leveldb-store, Broker, KahaDB
> Affects Versions: 5.5.1, 5.11.2, 5.13.0
> Environment: OS=Linux,MacOS,Windows, Java=1.7,1.8, Xmx=1024m,
> SystemUsage Memory Limit 500 MB, Temp Limit 1 GB, Storage 80 GB
> Reporter: Klaus Pittig
> Attachments: Bildschirmfoto 2016-01-08 um 12.09.34.png,
> Bildschirmfoto 2016-01-08 um 13.29.08.png
>
>
> We are currently facing a problem when using ActiveMQ with a large number of
> persistent queues (250), each holding 1000 persistent TextMessages of 10 KB.
> Our scenario requires these messages to remain in storage for a long time
> (days), until they are consumed (large amounts of data are staged for
> distribution to many consumers, which may be offline for some days).
> This issue is independent of the JVM, OS and persistence adapter (KahaDB,
> LevelDB), given enough free space and memory.
> We tested this behaviour with ActiveMQ 5.11.2, 5.13.0 and 5.5.1.
> After the persistence store is filled with these messages (we use a simple
> unit test that always produces the same message) and the broker is restarted,
> we can browse/consume some queues _until_ the #checkpoint call after 30
> seconds.
> This call causes the broker to use all available memory and never release it
> for other tasks such as queue browsing/consuming. Internally the MessageCursor
> seems to decide that there is not enough memory and stops delivering queue
> content to browsers/consumers.
> => Is there a way to avoid this behaviour or fix it?
> The expectation is that we can consume/browse any queue under all
> circumstances.
> Besides the above-mentioned settings, we use the following broker settings
> (btw: changing the memoryLimit to a lower value like 1mb does not change the
> situation):
> {code:xml}
> <destinationPolicy>
>   <policyMap>
>     <policyEntries>
>       <policyEntry queue=">" producerFlowControl="false"
>                    optimizedDispatch="true" memoryLimit="128mb">
>         <dispatchPolicy>
>           <strictOrderDispatchPolicy />
>         </dispatchPolicy>
>         <pendingQueuePolicy>
>           <storeCursor/>
>         </pendingQueuePolicy>
>       </policyEntry>
>     </policyEntries>
>   </policyMap>
> </destinationPolicy>
> <systemUsage>
>   <systemUsage sendFailIfNoSpace="true">
>     <memoryUsage>
>       <memoryUsage limit="500 mb"/>
>     </memoryUsage>
>     <storeUsage>
>       <storeUsage limit="80000 mb"/>
>     </storeUsage>
>     <tempUsage>
>       <tempUsage limit="1000 mb"/>
>     </tempUsage>
>   </systemUsage>
> </systemUsage>
> {code}
> Setting the *cursorMemoryHighWaterMark* in the destinationPolicy to a higher
> value like *150* or *600* (depending on the difference between the
> memoryUsage limit and the available heap space) relieves the situation a bit
> as a workaround, but from my point of view this is not really an option for
> production systems.
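> For illustration, that workaround amounts to a policy entry like the
> following sketch (the value is an example only; the attribute is a
> percentage of the destination's memory limit):
> {code:xml}
> <!-- example only: let the cursor use up to 150% of the destination limit -->
> <policyEntry queue=">" memoryLimit="128mb" cursorMemoryHighWaterMark="150"/>
> {code}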
> Attached is some information from Oracle Mission Control and JProfiler
> showing the ActiveMQTextMessage instances that are never released from
> memory.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)