[ 
https://issues.apache.org/jira/browse/OAK-364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13592089#comment-13592089
 ] 

Ian Boston commented on OAK-364:
--------------------------------


Some observations on monitoring in servers under load.
None of this should divert the team from doing what they think is right; I am 
just sharing experience (i.e. all 100% non-binding).

When a server is under high load with multiple threads generating stats it can 
be hard to interpret times calculated by one thread since there may be many 
other threads performing the same operation at the same time. A slow timing can 
mask high throughput.

Sometimes it's more useful to have counters from which a caller (or something 
known to be single-threaded) can calculate rates.

E.g. simply increment a counter every time an item is added to a queue, and 
increment another counter when an item is removed. The number in the queue is 
the difference between the two. The average rate of additions is the change in 
the added counter between two known times, divided by the elapsed time (same 
for removals). It doesn't matter if the sampling period is 2s or 3600s.
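
A minimal sketch of the two-counter idea in Java. The class and method names 
(QueueCounters, onAdd, onRemove, ratePerSecond) are hypothetical and are not 
part of Oak or Jackrabbit; the caller is assumed to take the two samples and 
their timestamps itself.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: two monotonic counters around a queue. */
final class QueueCounters {
    private final AtomicLong added = new AtomicLong();
    private final AtomicLong removed = new AtomicLong();

    /** Call whenever an item is put on the queue. */
    void onAdd() { added.incrementAndGet(); }

    /** Call whenever an item is taken off the queue. */
    void onRemove() { removed.incrementAndGet(); }

    long added() { return added.get(); }

    long removed() { return removed.get(); }

    /**
     * Approximate queue depth. The two reads are not atomic together, so
     * "removed" is read first to keep the result from going negative.
     */
    long depth() {
        long r = removed.get();
        long a = added.get();
        return a - r;
    }

    /** Average events per second between two samples of the same counter. */
    static double ratePerSecond(long countStart, long countEnd,
                                long millisStart, long millisEnd) {
        long elapsedMillis = millisEnd - millisStart;
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("samples must be taken at increasing times");
        }
        return (countEnd - countStart) * 1000.0 / elapsedMillis;
    }
}
```

The incrementing threads only pay for an atomic add; all arithmetic is left to 
whoever reads the counters, at whatever interval suits them.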

The problem with internally calculated values (like the RepositoryStats in 
Jackrabbit) is that it's not easy to graph them or derive further stats if the 
monitoring interval doesn't match the internal sampling period, and since the 
latency of monitoring over JMX is highly variable, the two almost never match. 
A set of counters with a timestamp is much more reliable.
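
One way to picture "counters with a timestamp" is a snapshot taken on the 
server side, so that the latency of the monitoring call does not affect the 
derived rate. This is only a sketch: CounterSnapshot and TimestampedCounter 
are hypothetical names, and how such a snapshot would be exposed (JMX or 
otherwise) is left open.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical immutable sample: a counter value plus the time it was read. */
final class CounterSnapshot {
    final long timestampMillis;
    final long count;

    CounterSnapshot(long timestampMillis, long count) {
        this.timestampMillis = timestampMillis;
        this.count = count;
    }

    /** Average events per second since an earlier snapshot of the same counter. */
    double ratePerSecondSince(CounterSnapshot earlier) {
        long elapsedMillis = timestampMillis - earlier.timestampMillis;
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("earlier snapshot must be older");
        }
        return (count - earlier.count) * 1000.0 / elapsedMillis;
    }
}

/** Hypothetical counter that hands out timestamped snapshots. */
final class TimestampedCounter {
    private final AtomicLong count = new AtomicLong();

    void increment() { count.incrementAndGet(); }

    /** The timestamp is taken where the counter lives, not by the monitoring client. */
    CounterSnapshot snapshot() {
        return new CounterSnapshot(System.currentTimeMillis(), count.get());
    }
}
```

A client that polls irregularly (say after 2s, then after 45s) still gets a 
correct average for each interval, because the rate is computed from the 
timestamps carried in the snapshots rather than from the polling schedule.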

IMHO, if there has to be a tradeoff between how sophisticated the internal 
monitoring is and the number of things monitored, I would lean towards 
monitoring more things in as simple a way as possible.

What would be useful to monitor:
* Counters of invocations in critical areas (low cost/high cost reads, writes, 
  etc.)
* Additions/removals from queues, especially if a queue can overflow or cause 
  an OOM error.
* Counters of any operation that might cause a pause or wait, e.g. a MongoDB 
  master re-election on a shard, which will cause a pause of up to 30s.
* Hits and misses are useful for tuning, so any stat that comes from a cache 
  is desirable. (Most external caching libs have good stats built in.)
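
For a cache without built-in stats, hit and miss counters could look roughly 
like the sketch below. CountingCache is a hypothetical wrapper, not Oak's 
NodeState cache; it assumes the loader never returns null and it has no 
eviction. Only raw counts are exposed, so the reader derives the hit ratio 
over whatever interval it samples.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Function;

/** Hypothetical cache wrapper that counts hits and misses. */
final class CountingCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final LongAdder hits = new LongAdder();
    private final LongAdder misses = new LongAdder();

    /** Returns the cached value, loading it on a miss. The loader must not return null. */
    V get(K key, Function<K, V> loader) {
        V value = entries.get(key);
        if (value != null) {
            hits.increment();
            return value;
        }
        misses.increment();
        value = loader.apply(key);
        // If another thread loaded the same key first, keep its value.
        V raced = entries.putIfAbsent(key, value);
        return raced != null ? raced : value;
    }

    long hits() { return hits.sum(); }

    long misses() { return misses.sum(); }
}
```

LongAdder is used here because many threads are expected to update the 
counters while they are read only occasionally.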

HTH

                
> Runtime performance metrics
> ---------------------------
>
>                 Key: OAK-364
>                 URL: https://issues.apache.org/jira/browse/OAK-364
>             Project: Jackrabbit Oak
>          Issue Type: New Feature
>          Components: core
>            Reporter: Jukka Zitting
>
> As we start looking more at performance benchmarks and more generally once 
> Oak starts to get deployed in production environments, it would be really 
> useful if we provided a collection of useful performance metrics in a way 
> that's easy to access.
> For example it would be good to have metrics on at least the following:
> * Time per MicroKernel read
> * Hit/Miss ratio of the NodeState cache
> * Time per MicroKernel commit/merge
> * Time per each commit hook
> * Time per query
> * etc.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
