[
https://issues.apache.org/jira/browse/BROOKLYN-375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15663322#comment-15663322
]
ASF GitHub Bot commented on BROOKLYN-375:
-----------------------------------------
Github user aledsage commented on a diff in the pull request:
https://github.com/apache/brooklyn-docs/pull/122#discussion_r87766901
--- Diff: guide/ops/troubleshooting/memory-usage.md ---
@@ -0,0 +1,138 @@
+---
+layout: website-normal
+title: "Troubleshooting: Monitoring Memory Usage"
+toc: /guide/toc.json
+---
+
+## Memory Usage
+
+Brooklyn tries to keep in memory as much history of its activity as possible,
+for displaying through the UI, so it is normal for it to consume as much memory
+as it can. It uses "soft references" so these objects will be cleared if needed,
+but **it is not a sign of anything unusual if Brooklyn is using all its available memory**.
+
+The number of active tasks, CPU usage, thread counts, and
+retention of soft reference objects are much better indications of load.
+This information can be found by looking in the log for lines containing
+`brooklyn gc`, such as:
+
+ 2016-09-16 16:19:43,337 DEBUG o.a.b.c.m.i.BrooklynGarbageCollector
[brooklyn-gc]: brooklyn gc (before) - using 910 MB / 3.76 GB memory; 98%
soft-reference maybe retention (of 362); 35 threads; tasks: 0 active, 2
unfinished; 31 remembered, 1013 total submitted)
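+
+To pull out the most recent of these lines, something like the following can be
+used (a sketch; the log file name depends on your logging configuration, with
+`brooklyn.debug.log` a common default):
+
+    # show the five most recent "brooklyn gc" status lines
+    grep "brooklyn gc" brooklyn.debug.log | tail -n 5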
+
+The soft-reference figure is indicative: the lower it is, the more the JVM
+has decided to discard items that were desirable to keep but optional.
+It only tracks some soft references (those wrapped in `Maybe`),
+and of course if there are very many such items the JVM will have to discard
+some, so a lower figure does not necessarily mean a problem.
+Typically, however, if there is no `OutOfMemoryError` (OOME) reported,
+there is no problem.
+
+
+## Problem Indicators and Resolutions
+
+Two things that *do* normally indicate a problem with memory are:
+
+* `OutOfMemoryError` exceptions being thrown
+* Memory usage high *and* CPU high, where the CPU is spent doing full garbage collection
+
+One possible cause is the JVM using a poorly chosen GC strategy,
+as described in [Oracle Java bug 6912889](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6912889).
+This can be confirmed by running the "analyzing soft reference usage" technique below;
+memory should shrink dramatically, then increase until the problem recurs.
+This can be fixed by passing `-XX:SoftRefLRUPolicyMSPerMB=1` to the JVM,
+as described in [Brooklyn issue 375](https://issues.apache.org/jira/browse/BROOKLYN-375).
+
+Other common JVM options include `-Xms256m -Xmx1g -XX:MaxPermSize=256m`
+(depending on JVM provider and version) to set the right balance of memory allocation.
+In some cases a larger `-Xmx` value may simply be the fix
+(but this should not be the case unless many or large blueprints are being used).
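+
+For example, one common way to pass such options is via the `JAVA_OPTS`
+environment variable before launching (a sketch; whether this is honored
+depends on how you start Brooklyn):
+
+    # sketch: adjust heap sizes to suit your deployment
+    export JAVA_OPTS="-Xms256m -Xmx1g -XX:SoftRefLRUPolicyMSPerMB=1"
+    bin/brooklyn launch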
+
+If the problem is not with soft references but with real memory usage,
+the culprit is likely a memory leak, typically in blueprint design.
+An early warning of this situation is the "soft-reference maybe retention" level decreasing.
+In these situations, follow the steps described below under "Investigating Leaks".
+
+
+## Analyzing Soft Reference Usage
+
+If you are concerned about memory usage, or doing evaluation on test environments,
+the following method (in the Groovy console) can be invoked to force the system to
+reclaim as much memory as possible, including *all* soft references:
+
+    org.apache.brooklyn.util.javalang.MemoryUsageTracker.forceClearSoftReferences()
+
+In good situations, memory usage should return to a small level.
+This call can be disruptive to the system, however, so use it with care.
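+
+One way to observe the effect is to watch heap occupancy from another terminal
+while invoking it (a sketch; `<pid>` is the Brooklyn process id):
+
+    # print heap utilization percentages every second;
+    # the old generation column (O) should drop sharply after the call
+    jstat -gcutil <pid> 1000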
+
+The above method can also be configured to run automatically when memory usage
+is detected to hit a certain level. That can be useful if external policies are
+being used to warn on high memory usage, and you want to keep some headroom.
+Many JVM authorities discourage interfering with the garbage collector, however,
+so use this with care and study the particular JVM you are using.
+See the class `BrooklynGarbageCollector` for more information.
+
+
+## Investigating Leaks
+
+If a memory leak is found, the first place to look should be the WARN/ERROR logs.
+Many common causes of leaks, such as runaway tasks and cyclic dependent configuration,
+will show their own log errors prior to the memory error.
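+
+A quick way to scan for these (a sketch, assuming the default log file name and
+the log line format shown earlier):
+
+    # most recent WARN/ERROR lines in the debug log
+    grep -E " (WARN|ERROR) " brooklyn.debug.log | tail -n 20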
+
+You should also note the task counts in the `brooklyn gc` messages described above;
+if there is an exceptional number of tasks, or tasks are not clearing,
+other log messages will describe what is happening, and the in-product task
+view can indicate issues.
+
+Sometimes slow leaks can occur if blueprints do not clean up entities or locations.
+These can be diagnosed by noting the number of files written to the persistence location,
+if persistence is being used. Deploying then destroying a blueprint should not leave
+anything behind in the persistence directory.
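+
+A simple check is to count the persisted files before deploying and again after
+destroying (a sketch; the path is an assumption, substitute your configured
+persistence directory):
+
+    # run before deploy and after destroy; the two counts should match
+    find ~/.brooklyn/brooklyn-persisted-state -type f | wc -l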
+
+Where problems have been encountered in the past, we have resolved them and/or
+worked to improve logging and early identification.
+Please report any issues so that we can improve this further.
+In many cases we can also give advice on what other log `grep` patterns can be useful.
+
+
+### Standard Java Techniques
+
+Useful standard Java techniques for tracking memory leaks include:
+
+* `jstack <pid>` to see what tasks are running
+* `jmap -histo:live <pid>` to see what objects are using memory (see below)
+* Memory profilers such as VisualVM or Eclipse MAT, either connected to a running system or
+  against a heap dump generated on an OOME
+
+More information is available on [the Oracle Java web site](https://docs.oracle.com/javase/7/docs/webnotes/tsg/TSG-VM/html/memleaks.html).
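+
+For example, to list the top consumers in the live heap (a sketch; `<pid>` is
+the Brooklyn process id, and note that `:live` itself forces a full GC):
+
+    # header plus the twenty biggest classes by retained instances/bytes
+    jmap -histo:live <pid> | head -n 23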
+
+Note that some of the above techniques will often include soft and weak references
+that are irrelevant to the problem (and will be cleared on an OOME).
+Objects that may be cached in that way include:
+
+* `BasicConfigKey` (used for the web server and many blueprints)
+* `DslComponent` and `*Task` (used for Brooklyn activities and dependent configuration)
+* `jclouds` items including `ImageImpl` (to cache data on cloud service providers)
+
+On the other hand, any of the above may also indicate a leak.
+Taking snapshots after a `forceClearSoftReferences()` (above) invocation and comparing those
+is one technique to filter out noise. Another is to wait until there is an OOME
+and look just after, because that will clear all non-essential data from memory.
+(The `forceClearSoftReferences()` method actually works by triggering an OOME, in as safe
+a way as possible.)
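+
+For example, class histograms taken before and after clearing soft references
+can be compared (a sketch; `<pid>` is the Brooklyn process id):
+
+    jmap -histo:live <pid> > histo-before.txt
+    # ... invoke forceClearSoftReferences() in the Groovy console ...
+    jmap -histo:live <pid> > histo-after.txt
+    # classes still prominent afterwards are candidates for a real leak
+    diff histo-before.txt histo-after.txt | head -n 40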
+
+If leaked items are found, a profiler will normally let you see their content
+and walk backwards along their references to find out why they are being retained.
+
+
+### Summary of Techniques
+
+The following sequence of techniques is a common approach to investigating and fixing memory issues:
+
+* Note the log lines about `brooklyn gc`, including memory and tasks
+* Do not assume high memory usage alone is an error, as soft reference caches are deliberate;
+  use `forceClearSoftReferences()` to clear these
--- End diff --
@ahgittin (cc @neykov) I thought we were not going to recommend using
`forceClearSoftReferences()` in any kind of production environment. Can we
put in a caveat here about not using it in production? I'd be extremely cautious
about encouraging real users to call this until devs have been using it in
anger themselves a lot.
With the use of the (much safer) `-XX:SoftRefLRUPolicyMSPerMB=1`, I'd
expect the need for calling this would be greatly reduced.
> Brooklyn intermittently uses high CPU levels and becomes unresponsive
> ---------------------------------------------------------------------
>
> Key: BROOKLYN-375
> URL: https://issues.apache.org/jira/browse/BROOKLYN-375
> Project: Brooklyn
> Issue Type: Bug
> Environment: OSX Sierra, Java 1.7
> Reporter: Duncan Godwin
>
> Intermittently whilst launching a clocker swarm within brooklyn, it uses high
> CPU levels and becomes unresponsive. This was noted when testing failover by
> manually stopping some nodes with `shutdown -h`.
> [jstack 1|https://gist.github.com/drigodwin/c5946d23ed11350f393d9ba9b80a2a2d]
> [jstack 2|https://gist.github.com/drigodwin/5619b02c0c1d53ceb0c99234d8f0dd96]
> [jclouds.debug.log|https://gist.github.com/drigodwin/365d39d216e6a56c634a5020496ef8f1]