[
https://issues.apache.org/jira/browse/OAK-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592089#comment-13592089
]
Ian Boston commented on OAK-364:
--------------------------------
Some observations on monitoring in servers under load.
None of this should divert the team from doing what they think is right; I am
just sharing experience.
(i.e. all 100% non-binding)
When a server is under high load with multiple threads generating stats it can
be hard to interpret times calculated by one thread since there may be many
other threads performing the same operation at the same time. A slow timing can
mask high throughput.
Sometimes it's more useful to have counters from which a caller (or something
known to be single threaded) can calculate rates.
e.g. simply increment a counter every time an item is added to a queue, and
increment another counter when an item is removed. The number in the queue is
the difference between the two. The average rate of additions is the change in
the added counter between two known times, divided by the elapsed time (same
for removals). It doesn't matter if the sampling period is 2s or 3600s.
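A minimal sketch of the two-counter idea, in Java. The class and method names
(QueueCounters, itemAdded, itemRemoved, depth) are made up for illustration and
are not part of Oak or Jackrabbit. The counters only ever increase; anything
derived from them is computed by whoever reads them.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical pair of monotonic counters for one queue (not an Oak API). */
final class QueueCounters {
    private final AtomicLong added = new AtomicLong();
    private final AtomicLong removed = new AtomicLong();

    /** Call once for every item put on the queue. */
    void itemAdded() { added.incrementAndGet(); }

    /** Call once for every item taken off the queue. */
    void itemRemoved() { removed.incrementAndGet(); }

    long addedCount() { return added.get(); }

    long removedCount() { return removed.get(); }

    /**
     * Approximate number of items currently queued. The two counters are
     * read separately, so the value can be slightly stale under load.
     * Reading 'removed' first keeps the result from going negative,
     * provided every removal is preceded by its matching addition.
     */
    long depth() {
        long r = removed.get();
        long a = added.get();
        return a - r;
    }
}
```

The hot path costs one atomic increment per event, and no thread has to time
anything. For very heavily contended counters java.util.concurrent.atomic.LongAdder
may be a better fit than AtomicLong.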
The problem with internally calculated values (like the RepositoryStats in
Jackrabbit) is that it's not easy to graph them or derive further stats if the
monitoring interval doesn't match the internal sampling period, and since the
latency of monitoring over JMX is highly variable, the two almost never match.
A set of counters with a timestamp is much more reliable.
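To illustrate why a counter plus a timestamp is robust against variable polling
latency, here is a rough Java sketch. CounterSample and ratePerSecondSince are
hypothetical names, not an existing API. The assumption is that the timestamp
is taken on the server at the moment the counter is read, so the monitoring
client can work out a rate from any two samples, however far apart they
arrive.

```java
/** Hypothetical immutable sample: a counter value and when it was read. */
final class CounterSample {
    final long timestampMillis; // server-side clock at the time of the read
    final long count;           // monotonic counter value at that time

    CounterSample(long timestampMillis, long count) {
        this.timestampMillis = timestampMillis;
        this.count = count;
    }

    /** Average events per second between an earlier sample and this one. */
    double ratePerSecondSince(CounterSample earlier) {
        long elapsedMillis = timestampMillis - earlier.timestampMillis;
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("samples must be in time order");
        }
        return (count - earlier.count) * 1000.0 / elapsedMillis;
    }
}
```

For example, a sample of 100 at t=0ms followed by a sample of 300 at t=2000ms
gives 100 events per second; the same calculation works unchanged if the
second sample arrives after 3600s instead.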
IMHO, if there has to be a tradeoff between how sophisticated the internal
monitoring is and the number of things monitored, I would lean towards
monitoring more things in as simple a way as possible.
What would be useful to monitor:
* Counters of invocations in critical areas (low cost/high cost reads, writes,
  etc.)
* Additions/removals from queues, especially if a queue can overflow or cause
  an OOM error.
* Counters of any operation that might cause a pause or wait, e.g. MongoDB
  master re-election on a shard, which will cause a pause of up to 30s.
* Hits and misses are useful for tuning, so any stat that comes from a cache
  is desirable. (Most external caching libs have good stats built in.)
HTH
> Runtime performance metrics
> ---------------------------
>
> Key: OAK-364
> URL: https://issues.apache.org/jira/browse/OAK-364
> Project: Jackrabbit Oak
> Issue Type: New Feature
> Components: core
> Reporter: Jukka Zitting
>
> As we start looking more at performance benchmarks and more generally once
> Oak starts to get deployed in production environments, it would be really
> useful if we provided a collection of useful performance metrics in a way
> that's easy to access.
> For example it would be good to have metrics on at least the following:
> * Time per MicroKernel read
> * Hit/Miss ratio of the NodeState cache
> * Time per MicroKernel commit/merge
> * Time per each commit hook
> * Time per query
> * etc.