This is an automated email from the ASF dual-hosted git repository.
hsaputra pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/bookkeeper.git
The following commit(s) were added to refs/heads/master by this push:
new f6903b8 BP-44 USE metrics
f6903b8 is described below
commit f6903b8e8e752b642012d1fafc4f411bfce74e9a
Author: Jack Vanlightly <[email protected]>
AuthorDate: Wed Oct 20 18:33:39 2021 +0200
BP-44 USE metrics
BP-44: USE metrics. A proposal for improving BookKeeper metrics so that
operators can employ the USE method for diagnosing performance issues.
Reviewers: Henry Saputra <[email protected]>, Andrey Yegorov <None>,
Enrico Olivelli <[email protected]>
This closes #2835 from Vanlightly/BP-44-use-metrics and squashes the
following commits:
8d9baabf6 [Jack Vanlightly] Added link to USE method and listed each term
of USE
5a0f67dcf [Jack Vanlightly] BP-44 USE metrics
a9b576da2 [Yunze Xu] Release semaphore when addEntry accepts the same
entries (#2832)
148bf2242 [Yun Tang] Ensure to release cache during
KeyValueStorageRocksDB#closec (#2821)
4dc426000 [gaozhangmin] Heap memory leak problem when ledger replication
failed (#2794)
a522fa33b [Raúl Gracia] Issue 2815: Upgrade to log4j2 to get rid of
CVE-2019-17571 (#2816)
04650521b [Nicolò Boschi] Upgrade httpclient from 4.5.5 to 4.5.13 (#2793)
594a05685 [Raúl Gracia] Issue 2795: Bookkeeper upgrade using Bookie ID may
fail due to cookie mismatch (#2796)
354cf37c9 [Raúl Gracia] Upgraded dependencies with CVEs (#2792)
e413c7094 [Raúl Gracia] Issue 2728: Entry Log GC may get blocked when using
entryLogPerLedgerEnabled option (#2779)
883231ec8 [pradeepbn] Building bookkeeper with gradle on java11
---
site/bps/BP-44-use-metrics.md | 164 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 164 insertions(+)
diff --git a/site/bps/BP-44-use-metrics.md b/site/bps/BP-44-use-metrics.md
new file mode 100644
index 0000000..e09be5a
--- /dev/null
+++ b/site/bps/BP-44-use-metrics.md
@@ -0,0 +1,164 @@
+---
+title: "BP-44: USE metrics"
+issue: https://github.com/apache/bookkeeper/issues/2834
+state: "Under Discussion"
+release: "N/A"
+---
+
+### Motivation
+Based on our experience (at Splunk) running many BookKeeper clusters in production, from very small to very large deployments (in terms of number of bookies, size of VMs and load), we have identified a number of shortcomings in the current BookKeeper metrics that make it harder than it should be to identify performance bottlenecks. The [USE method](https://www.brendangregg.com/usemethod.html) (Utilization, Saturation, Errors) is an effective strategy for diagnosing where bottlenecks [...]
+
+Finally, many of the metrics are aggregated, such as those of the journal and DbLedgerStorage. When these components are configured with multiple directories, it is currently not possible to inspect the metrics of a single journal or DbLedgerStorage instance, which makes one bad volume hard to identify.
+
+This BP proposes a number of improvements to help identify performance issues in production. The objective is for a human operator to rapidly see whether a bottleneck exists, and where, by using utilization and saturation metrics.
+
+#### Utilization
+Ultimately all work in a bookie is performed by threads. Some operations are performed by only a single thread while others are spread out over multiple threads. Some threads play a core role in the critical path of reads and writes, and when fully utilized they indicate a bottleneck. We propose that each thread reports the time that it is busy, which will allow time utilization to be calculated. Because all file I/O is synchronous, thread utilization is a very close approximation of resource u [...]
+
+For example, if the Force Write thread reports 100% time utilization, we’ll know that it has become a bottleneck and that we should look further into the force write metrics.
+
+Likewise, some operations in the critical path are executed across multiple threads. For example, with DbLedgerStorage, flushes are in the critical path for writes and the work is performed across at least two threads (the Sync thread and the DbStorage thread). Knowing the time utilization of the Sync thread and the DbStorage thread is useful, but knowing the time utilization of flushes as a whole (regardless of thread) is even more useful in this particular case. Once flushes are being pe [...]
+
+So utilization metrics of both threads and operations can be extremely useful, and combining the two often gives even more insight.
+
+#### Saturation
+Metrics that indicate saturation include rejected requests and full queues/caches. We need to ensure that there are no blindspots for saturation metrics.
+
+Once we start getting saturation indicators, we have strong signals that a bottleneck exists. It may be a high-level metric (rejected requests) that only tells us we have a problem, or a component-level metric (journal queue length) that tells us the locality of the bottleneck. We also need to know things like queue size bounds in order to detect when a component is fully saturated, so emitting metrics for certain configuration values is useful.
+
+#### Component and Thread Labels
+Work is parallelized by thread pools and by using multiple directories that
allow for multiple journal instances and multiple ledger storage instances. We
have seen in production cases where a single bad volume has a large negative
impact on BookKeeper performance. Being able to pick apart metrics by the
thread and component instance can be extremely helpful in diagnosing issues
that only affect one journal/ledger storage instance or a subset of threads.
+
+#### Sub-Operation Time
+In some cases a single operation can consist of multiple smaller operations.
For example, a DbLedgerStorage flush involves flushing two RocksDB instances
and entry log files. The only way to currently confirm whether RocksDB has
gotten into a bad state is by profiling on the machine/pod.
+
+Adding extra metrics to cover sub-operations can help to diagnose or rule-out
an issue quickly.
+
+#### Read Cache Thrashing
+Read cache thrashing can have a huge negative impact on bookie performance. We
have most of the metrics already for detecting when it happens except for one
problem: we don’t report read and write cache hits/misses separately.
+
+Once we separate out cache hits/misses, we can calculate the actual read cache hit-to-miss rate, compare it to the readahead batch sizes, and detect when the miss rate is too high (indicating thrashing).
+
+### The Current State of Affairs
+
+#### Utilization
+The OpStatsLogger gives us visibility into some time utilization metrics via the “_sum” suffix metric of latency-based OpStatsLogger metrics. We can’t see the utilization of each thread, just the thread pool as a whole. Knowing the number of threads in the pool allows us to calculate the % time utilization; for example, a four-thread pool reporting 2.8 seconds of busy time per wall-clock second is at 70% utilization. Given that each cluster may have a different thread pool size, this is inconvenient.
+
+Operations such as journal flushes, journal force writes and DbLedgerStorage flushes already record latencies with OpStatsLoggers, which allows us to calculate the utilization of I/O operations as long as we know the number of journal/ledger storage directories (again, inconvenient).
+
+Blindspots:
+- Write/read/high priority thread pool
+- Utilization of each thread
+- Journal
+ - Drill-down by journal instance
+ - Utilization of the threads (they do other work such as taking work from queues and placing work on queues)
+- Ledger Storage
+ - Drill-down by ledger storage instance
+ - Where is time spent on reads? Locations index, read from entry log, readahead?
+ - Where is time spent on flushes? Ledger index, locations index, entry log files?
+
+#### Saturation
+There are some existing indicators of saturation, such as the thread pool task queue lengths, the journal queue length, the force write queue length and DbLedgerStorage rejected writes. But there are some blindspots:
+
+- Requests rejected due to full thread pool task queues are logged as request failures. More helpful would be a specific metric for rejected read and write requests.
+- Queue lengths are reported in aggregate; we need to be able to drill down by thread/journal/ledger storage instance.
+- We don’t have metrics for things like maximum queue/cache sizes, so we can’t detect when a queue or a cache is full.
+- Netty can become saturated; tracking the number of unwritable channels and the number of bytes pending for unwritable channels can help diagnose Netty saturation.
+
+
+### Public Interfaces
+N/A
+### Proposed Changes
+
+#### Thread scoped OpStatsLogger and Counter
+Add a new “thread scoped” variant of the OpStatsLogger and the Counter, where each adds the labels:
+- threadPool (or just the thread name for lone threads such as a journal thread)
+- thread (the thread ordinal; for example, in a read thread pool with four threads, the threads are numbered 0-3)
+
+The following methods are added to the StatsLogger interface:
+```
+OpStatsLogger getThreadScopedOpStatsLogger(String name)
+Counter getThreadScopedCounter(String name)
+```
+
+The OpStatsLogger and Counter interfaces remain unchanged.
+
+Internally, a thread scoped OpStatsLogger registers one OpStatsLogger per mutating thread. For example, with four read threads, the following metrics for successful task executions would be created:
+
+```
+bookkeeper_server_BookieReadThreadPool_task_execution_sum{success="true", thread="0", threadPool="BookieReadThreadPool"}
+bookkeeper_server_BookieReadThreadPool_task_execution_sum{success="true", thread="1", threadPool="BookieReadThreadPool"}
+bookkeeper_server_BookieReadThreadPool_task_execution_sum{success="true", thread="2", threadPool="BookieReadThreadPool"}
+bookkeeper_server_BookieReadThreadPool_task_execution_sum{success="true", thread="3", threadPool="BookieReadThreadPool"}
+```
+
+In order for thread scoped metrics to know which thread pool and thread ordinal each thread’s metrics should be labelled with, a thread registry is maintained. After each thread is started, the first work it performs is registering itself with a thread registry object that maps thread id -> {threadPool, thread}. In some cases this work can be performed outside of the thread itself, such as in a thread factory.
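+
+As an illustration only (the class and method names here are hypothetical, not a final API), such a registry can be little more than a concurrent map keyed by thread id:
+
+```
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+// Illustrative sketch of the thread registry described above.
+public final class ThreadRegistry {
+
+    // The labels attached to every metric mutated by a registered thread.
+    public static final class ThreadLabels {
+        final String threadPool;
+        final int thread; // ordinal within the pool
+
+        ThreadLabels(String threadPool, int thread) {
+            this.threadPool = threadPool;
+            this.thread = thread;
+        }
+    }
+
+    private static final Map<Long, ThreadLabels> REGISTRY = new ConcurrentHashMap<>();
+
+    // Called by each thread as its first piece of work, or by a thread
+    // factory on the thread's behalf.
+    public static void register(String threadPool, int ordinal) {
+        REGISTRY.put(Thread.currentThread().getId(), new ThreadLabels(threadPool, ordinal));
+    }
+
+    // Returns null for unregistered threads; callers fall back to a default metric.
+    public static ThreadLabels lookup() {
+        return REGISTRY.get(Thread.currentThread().getId());
+    }
+
+    private ThreadRegistry() {}
+}
+```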
+
+Each thread scoped metric registers its per-thread metrics lazily. For example, the first time a thread scoped counter is incremented by a given thread, the thread labels are retrieved from the thread registry and a counter for that thread is created with the right labels and registered with the provider. In the case that the calling thread is unregistered, a default metric is used. Per-thread OpStatsLoggers/Counters are not visible externally and are st [...]
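+
+A minimal sketch of this lazy registration, building on the registry sketch above; names are again illustrative and the registration with the metrics provider is elided:
+
+```
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+import java.util.concurrent.atomic.LongAdder;
+
+// Sketch of a thread scoped counter: one underlying counter per calling
+// thread, created on first use and labelled with that thread's labels.
+public final class ThreadScopedCounter {
+    private final String name;
+    // One underlying counter per thread id; not visible externally.
+    private final Map<Long, LongAdder> perThread = new ConcurrentHashMap<>();
+    // Used when the calling thread never registered itself.
+    private final LongAdder defaultCounter = new LongAdder();
+
+    public ThreadScopedCounter(String name) {
+        this.name = name;
+    }
+
+    public void inc() {
+        ThreadRegistry.ThreadLabels labels = ThreadRegistry.lookup();
+        if (labels == null) {
+            defaultCounter.increment(); // unregistered thread: default metric
+            return;
+        }
+        perThread.computeIfAbsent(Thread.currentThread().getId(), id -> {
+            LongAdder counter = new LongAdder();
+            // A real implementation would register `counter` with the metrics
+            // provider here, under `name` plus the {threadPool, thread} labels.
+            return counter;
+        }).increment();
+    }
+}
+```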
+
+With these new variants, we can now replace calls to `getCounter` with `getThreadScopedCounter` and `getOpStatsLogger` with `getThreadScopedOpStatsLogger` for the metrics where we want to be able to drill down by thread pool and thread.
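+
+As a sketch of what a call site might look like after such a change (the wrapping class is illustrative; `getThreadScopedCounter` is the method proposed above):
+
+```
+import org.apache.bookkeeper.stats.Counter;
+import org.apache.bookkeeper.stats.StatsLogger;
+
+public class ReadRejectionStats {
+    private final Counter readRejected;
+
+    public ReadRejectionStats(StatsLogger statsLogger) {
+        // Before: statsLogger.getCounter("READ_ENTRY_REJECTED");
+        // After: one handle, with per-thread counters created lazily behind it.
+        this.readRejected = statsLogger.getThreadScopedCounter("READ_ENTRY_REJECTED");
+    }
+
+    public void onReadRejected() {
+        readRejected.inc(); // labelled with the calling thread's pool and ordinal
+    }
+}
+```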
+
+#### How to Report Time Spent (For utilization calculation)
+We want to be able to report the time spent by each thread as a whole, but also on operations and sub-operations. The simplest way to do this is to measure elapsed time by taking a `System.nanoTime()` timestamp before the work and then logging the elapsed time via an OpStatsLogger or a Counter.
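+
+A minimal sketch of this measurement pattern, assuming a counter that accumulates busy nanoseconds (the class and names are illustrative):
+
+```
+import org.apache.bookkeeper.stats.Counter;
+
+public class TimedFlush {
+    private final Counter flushThreadTime; // accumulated busy time in nanoseconds
+
+    public TimedFlush(Counter flushThreadTime) {
+        this.flushThreadTime = flushThreadTime;
+    }
+
+    public void flush() {
+        long start = System.nanoTime();
+        try {
+            doFlush(); // the actual synchronous work
+        } finally {
+            // Add the elapsed time to the counter; the per-second rate of
+            // this counter approximates time utilization.
+            flushThreadTime.add(System.nanoTime() - start);
+        }
+    }
+
+    private void doFlush() { /* ... */ }
+}
+```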
+
+One challenge with this simple method is operations that take a long time, like a journal fsync or a ledger flush on a heavily overloaded bookie. Long execution times can cause per-second calculations to drop to 0 and then spike to very high levels (above 100% utilization); for example, a flush that takes 3 seconds contributes nothing to two windows and then 3 seconds of busy time to a third. This can be mitigated by using larger windows (such as 1 minute rather than 1 second), and heavily loaded systems don’t tend to fluctuate much from second to second.
+
+##### Discarded time measurement alternatives
+- Use a sampling technique: too costly and complex.
+- Use the proc filesystem: not portable, and it yields only thread-level utilization metrics.
+
+#### Per Component Instance labels
+All journal metrics get the additional label of `journalIndex`.
+All DbLedgerStorage metrics get the additional label of `ledgerDir`.
+
+This does not prevent aggregating metrics of the journal or DbLedgerStorage as
a whole but does allow for drill-down.
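+
+As a hypothetical illustration (the metric names and directory paths are examples only), a per-instance drill-down could then look like:
+
+```
+bookie_flush_entrylog_count{ledgerDir="/data/bk/ledgers0"}
+bookie_flush_entrylog_count{ledgerDir="/data/bk/ledgers1"}
+bookie_journal_JOURNAL_SYNC_count{journalIndex="0"}
+bookie_journal_JOURNAL_SYNC_count{journalIndex="1"}
+```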
+
+### New Metrics List
+
+The following new metrics are to be added to cover saturation/utilization blindspots.
+
+Saturation:
+- bookkeeper_server_ADD_ENTRY_REJECTED (thread scoped counter)
+- bookkeeper_server_READ_ENTRY_REJECTED (thread scoped counter)
+
+Thread time utilization:
+- bookie_sync_thread_time (thread scoped counter)
+- bookie_journal_journal_thread_time (thread scoped counter, journalIndex label)
+- bookie_journal_force_write_thread_time (thread scoped counter, journalIndex label)
+- bookie_journal_callback_thread_time (thread scoped counter, journalIndex label)
+- bookie_db_storage_thread_time (thread scoped counter, ledgerDir label)
+
+Operation time spent, with counters for high-frequency ops and OpStatsLoggers for less frequent ops:
+- bookie_read_locations_index_time (thread scoped counter, ledgerDir label)
+- bookie_read_entrylog_time (thread scoped counter, ledgerDir label)
+- bookie_readahead_time (thread scoped counter, ledgerDir label)
+- bookie_flush_entrylog (thread scoped OpStatsLogger, ledgerDir label)
+- bookie_flush_locations_index (thread scoped OpStatsLogger, ledgerDir label)
+- bookie_flush_ledger_index (thread scoped OpStatsLogger, ledgerDir label)
+
+Queue/cache max sizes, number of dirs, thread counts (useful for calculations such as current queue length vs max size, and useful for operators to see on dashboards). All are gauges based on config values or calculated values, such as the write cache being 25% of direct memory (a registration sketch follows the list below):
+- bookkeeper_server_<OrderedExecutor name>_threads
+- bookkeeper_server_<OrderedExecutor name>_max_queue_size
+- bookie_JOURNAL_DIRS
+- bookie_ledger_num_dirs
+- bookie_JOURNAL_QUEUE_MAX_SIZE
+- bookie_write_cache_max_size
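+
+A minimal sketch of exposing one such config-derived value through the existing `StatsLogger#registerGauge` API (the wrapping class and wiring are illustrative):
+
+```
+import org.apache.bookkeeper.stats.Gauge;
+import org.apache.bookkeeper.stats.StatsLogger;
+
+public class ConfigGauges {
+    public static void register(StatsLogger statsLogger, final int numJournalDirs) {
+        statsLogger.registerGauge("JOURNAL_DIRS", new Gauge<Integer>() {
+            @Override
+            public Integer getDefaultValue() {
+                return 0;
+            }
+
+            @Override
+            public Integer getSample() {
+                // Static config value: the number of configured journal directories.
+                return numJournalDirs;
+            }
+        });
+    }
+}
+```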
+
+Other:
+- bookie_write_cache_hits (thread scoped counter, ledgerDir label)
+- bookie_write_cache_misses (thread scoped counter, ledgerDir label)
+
+#### Breaking Changes
+Some proposed changes are breaking:
+1. Remove the thread ordinal from OrderedExecutor based metric names and use the label “thread” instead. This makes the metrics easier to work with in modern observability tooling and is in line with the rest of the proposed changes.
+2. Replace the cache hit and cache miss OpStatsLoggers with Counters. OpStatsLoggers are relatively expensive and cost far more CPU cycles than the actual accesses to the caches themselves. Latency for cache hits/misses is extremely low, and recording these latencies is not worth the cost; Counters are far cheaper.
+
+### Compatibility, Deprecation, and Migration Plan
+Two changes are breaking and will cause existing dashboards or alerts based on the affected metrics to stop working. The affected release should document these changes in order to warn operators of this impact.
+
+### Test Plan
+These changes have been tested extensively using deployments of various sizes (including multi-journal and multi-ledger-storage configurations) and various loads to ensure that the utilization and saturation metrics are accurate and useful.
+
+### Rejected Alternatives
+
+Time reporting alternatives discussed above.
\ No newline at end of file