chewbranca commented on code in PR #5602:
URL: https://github.com/apache/couchdb/pull/5602#discussion_r2221252151


##########
src/couch_stats/CSRT.md:
##########
@@ -0,0 +1,893 @@
+# Couch Stats Resource Tracker (CSRT)
+
+CSRT (Couch Stats Resource Tracker) is a real time stats tracking system that
+tracks the quantity of resources induced at the process level in a live
+queryable manner, and also generates process lifetime reports containing
+statistics on the total resource load of a request, as a function of things
+like dbs/docs opened, view and changes rows read, changes returned vs
+processed, JavaScript filter usage, duration, and more. This system is a
+paradigm shift in CouchDB visibility and introspection, allowing for
+expressive real time querying capabilities to introspect, understand, and
+aggregate CouchDB internal resource usage, as well as powerful filtering
+facilities for conditionally generating reports on "heavy usage" or
+"long/slow" requests. CSRT also extends `recon:proc_window` with
+`csrt:proc_window`, allowing for the same style of battle hardened
+introspection as Recon's excellent `proc_window`, but with the sample window
+over any of the CSRT tracked CouchDB stats!
+
+CSRT does this by piggy-backing off of the existing metrics tracked by way of
+`couch_stats:increment_counter` at the time the local process induces those
+metrics inc calls: CSRT updates an ets entry containing the context
+information for the local process, such that global aggregate queries can be
+performed against the ets table, as well as generating the process resource
+usage report at the conclusion of the process's lifecycle. The ability to do
+aggregate querying in real time, in addition to the process lifecycle reports
+for post facto analysis over time, is a cornerstone of CSRT that is the
+result of a series of iterations until a robust and scalable approach was
+built.
+
+The real time querying is achieved by way of a global ets table with
+`read_concurrency`, `write_concurrency`, and `decentralized_counters` enabled.
+Great care was taken to ensure that _zero_ concurrent writes to the same key
+occur in this model, and this entire system is predicated on the fact that
+incremental updates by way of `ets:update_counter` provide *really* fast and
+efficient updates in an atomic and isolated fashion when coupled with
+decentralized counters and write concurrency. Each process that calls
+`couch_stats:increment_counter` tracks its local context in CSRT as well,
+with zero concurrent writes from any other processes. Outside of the context
+setup and teardown logic, _only_ operations to `ets:update_counter` are
+performed: one per process invocation of `couch_stats:increment_counter`, and
+one for coordinators to update worker deltas in a single batch, resulting in
+a 1:1 ratio of ets calls to real time stats updates for the primary
+workloads.
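+The update pattern above can be sketched in isolation. The table options are
+the ones named above, but the tuple layout and field positions here are
+purely illustrative and not CSRT's actual `#rctx{}` record:
+
+```erlang
+%% Illustrative sketch: a single batched ets:update_counter/3 call
+%% atomically bumps several counter fields for one process's key.
+Tab = ets:new(demo_csrt, [set, public,
+    {read_concurrency, true},
+    {write_concurrency, true},
+    {decentralized_counters, true}]),
+PidRef = {self(), make_ref()},
+%% hypothetical layout: {Key, IoqCalls, DocsRead, RowsRead}
+true = ets:insert(Tab, {PidRef, 0, 0, 0}),
+%% one call per couch_stats:increment_counter invocation, e.g.
+%% +2 ioq_calls, +1 docs_read, +1 rows_read, all in one batch:
+[2, 1, 1] = ets:update_counter(Tab, PidRef, [{2, 2}, {3, 1}, {4, 1}]).
+```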
+
+The primary achievement of CSRT is the core framework itself for concurrent
+process local stats tracking and real time RPC delta accumulation in a
+scalable manner that allows for real time aggregate querying and process
+lifecycle reports. It took several versions to find a scalable and robust
+approach that induced minimal impact on maximum system throughput. Now that
+the framework is in place, it can be extended to track any further desired
+process local uses of `couch_stats:increment_counter`. That said, the
+currently selected set of stats to track was heavily influenced by the
+challenges in retroactively understanding the quantity of resources induced
+by a query like `/db/_changes?since=$SEQ`, or similarly, `/db/_find`.
+
+CSRT started as an extension of the Mango execution stats logic to `_changes`
+feeds to get proper visibility into the quantity of docs read and filtered
+per changes request, but then the focus inverted with the realization that we
+should instead use the existing stats tracking mechanisms that have already
+been deemed critical information to track, which then also allows for the
+real time tracking and aggregate query capabilities. The Mango execution
+stats can be ported into CSRT itself and just become one subset of the stats
+tracked as a whole, and similarly, any additional desired stats tracking can
+be easily added and will be picked up in the RPC deltas and process lifetime
+reports.
+
+# CSRT Config Keys
+
+## -define(CSRT, "csrt").
+
+> config:get("csrt").
+
+Primary CSRT config namespace: contains core settings for enabling different
+layers of functionality in CSRT, along with global config settings for limiting
+data volume generation.
+
+## -define(CSRT_MATCHERS_ENABLED, "csrt_logger.matchers_enabled").
+
+> config:get("csrt_logger.matchers_enabled").
+
+Config toggles for enabling specific builtin logger matchers, see the dedicated
+section below on `# CSRT Default Matchers`.
+
+## -define(CSRT_MATCHERS_THRESHOLD, "csrt_logger.matchers_threshold").
+
+> config:get("csrt_logger.matchers_threshold").
+
+Config settings for defining the primary `Threshold` value of the builtin
+logger matchers, see the dedicated section below on `# CSRT Default Matchers`.
+
+## -define(CSRT_MATCHERS_DBNAMES, "csrt_logger.dbnames_io").
+
+> config:get("csrt_logger.dbnames_io").
+
+Config section for setting `$db_name = $threshold`, resulting in
+instantiating a "dbname_io" logger matcher for each `$db_name` that will
+generate a CSRT lifecycle report for any contexts that induced more
+operations on _any_ one field of
+`ioq_calls|get_kv_node|get_kp_node|docs_read|rows_read` than `$threshold`,
+on database `$db_name`.
+
+This is basically a simple matcher for finding heavy IO requests on a
+particular database, in a manner amenable to key/value pair specifications in
+this .ini file until a more sophisticated declarative model exists. In
+particular, it's not easy to sequentially generate matchspecs by way of
+`ets:fun2ms/1`, and so an alternative mechanism for either dynamically
+assembling an `#rctx{}` to match against or generating the raw matchspecs
+themselves is warranted.
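+For example, a hypothetical ini fragment (database names and thresholds here
+are made up) might look like:
+
+```
+[csrt_logger.dbnames_io]
+; report any context on db "foo" exceeding 1000 on any one of
+; ioq_calls|get_kv_node|get_kp_node|docs_read|rows_read
+foo = 1000
+bar = 50000
+```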
+
+## -define(CSRT_INIT_P, "csrt.init_p").
+
+> config:get("csrt.init_p").
+
+Config toggles for tracking counters on spawning of RPC `fabric_rpc` workers
+by way of `rexi_server:init_p`. This allows us to conditionally enable new
+metrics for the desired RPC operations in an expandable manner, without
+having to add new stats for every single potential RPC operation. These are
+the individual metrics to track; the feature is enabled by way of the config
+toggle `config:get(?CSRT, "enable_init_p")`, and these configs can be left
+alone for the most part until new operations are tracked.
+
+# CSRT Code Markers
+
+## -define(CSRT_ETS, csrt_server).
+
+This is the reference to the CSRT ets table; it's managed by `csrt_server`,
+which is where the name originates from.
+
+## -define(MATCHERS_KEY, {csrt_logger, all_csrt_matchers}).
+
+This marker is where the active matchers are written to in `persistent_term`,
+for concurrent and parallel access to the logger matchers in the CSRT
+tracker processes for lifecycle reporting.
+
+# CSRT Process Dictionary Markers
+
+## -define(PID_REF, {csrt, pid_ref}).
+
+This marker stores the core `PidRef` identifier. The key idea here is that a
+context lifecycle is contained within the given `PidRef`, meaning that a
+`Pid` can instantiate different CSRT lifecycles and pass those to different
+workers.
+
+This is specifically necessary for long running processes that need to handle
+many CSRT context lifecycles, independent of that individual process's own
+lifecycle. In practice, this is immediately needed for the actual coordinator
+lifecycle tracking, as `chttpd` uses a worker pool of http request handlers
+that can be re-used, so we need a way to create a CSRT lifecycle
+corresponding to the given request currently being serviced. This is also
+intended to be used in other long running processes, like IOQ or `couch_js`
+pids, such that we can track the specific context inducing the operations on
+the `couch_file` pid or indexer or replicator or whatever.
+
+Worker processes have a more clear cut lifecycle, but either style of process
+can be exit'ed in a manner that skips the ability to do cleanup operations,
+so additionally there's a dedicated tracker process spawned to monitor the
+process that induced the CSRT context, such that we can do the dynamic logger
+matching directly in these tracker processes and can also properly clean up
+the ets entries even if the Pid crashes.
+
+## -define(TRACKER_PID, {csrt, tracker}).
+
+A handle to the spawned tracker process that does cleanup and logger matching
+reports at the end of the process lifecycle. We store a reference to the
+tracker pid so that for explicit context destruction, like in `chttpd`
+workers after a request has been serviced, we can stop the tracker and
+perform the expected cleanup directly.
+
+## -define(DELTA_TA, {csrt, delta_ta}).
+
+This stores our last delta snapshot to track progress since the last
+incremental streaming of stats back to the coordinator process, and is
+updated with the latest value after the next delta is made. E.g. this stores
+`T0` so that we can do `T1 = get_resource()`, `make_delta(T0, T1)`, and then
+save `T1` as the new `T0` for use in our next delta.
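+That snapshot dance can be sketched as follows (a hypothetical illustration:
+the pdict key matches the define above and the `csrt`/`csrt_util` calls are
+the exported functions, but the surrounding glue is made up):
+
+```erlang
+%% Take the previous snapshot stashed under ?DELTA_TA ...
+T0 = erlang:get({csrt, delta_ta}),
+%% ... grab a fresh snapshot of this process's rctx ...
+T1 = csrt:get_resource(),
+%% ... compute the incremental delta to ship back to the coordinator ...
+Delta = csrt_util:rctx_delta(T0, T1),
+%% ... and T1 becomes the T0 for the next round.
+erlang:put({csrt, delta_ta}, T1).
+```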
+
+## -define(LAST_UPDATED, {csrt, last_updated}).
+
+This stores the integer corresponding to the `erlang:monotonic_time()` value
+of the most recent `updated_at` value. Basically this lets us utilize a pdict
+value to turn `updated_at` tracking into an incremental operation that can be
+chained into the existing atomic `ets:update_counter` and
+`ets:update_element` calls.
+
+The issue being that our updates are of the form `+2 to ioq_calls for
+$pid_ref`, which ets does in a guaranteed `atomic` and `isolated` manner. The
+strict use of these atomic operations for tracking values is why this system
+works efficiently at scale. This means that we can increment counters on all
+of the stats counter fields in a batch, very quickly, but for tracking
+`updated_at` timestamps we'd need either an extra ets call to get the last
+`updated_at` value, or an extra ets call to `ets:update_element` to set the
+`updated_at` value to `csrt_util:tnow()`. The core problem is that the batch
+inc operation is essentially the only write operation performed after the
+initial context setting of dbname/handler/etc; this means that we'd literally
+double the number of ets calls induced to track CSRT updates, just for
+tracking `updated_at`. So instead, we rely on the fact that the local process
+corresponding to `$pid_ref` is the _only_ process doing updates, so we know
+the last `updated_at` value will be the last time this process updated the
+data. We track that value in the pdict, take a delta between `tnow()` and
+`updated_at`, and then `updated_at` becomes a value we can sneak into the
+other integer counter updates we're already performing!
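+In other words, `updated_at` is stored as something that can itself be
+incremented. A hypothetical sketch (the pdict key is from the define above;
+the field positions are invented for illustration):
+
+```erlang
+%% Compute how far monotonic time has advanced since our last update ...
+Now = csrt_util:tnow(),
+Dt = Now - erlang:get({csrt, last_updated}),
+erlang:put({csrt, last_updated}, Now),
+%% ... then ride Dt along in the same atomic batch as the stats counters,
+%% e.g. +2 ioq_calls and +Dt updated_at (positions purely illustrative):
+ets:update_counter(csrt_server, PidRef, [{2, 2}, {5, Dt}]).
+```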
+
+# Primary Config Toggles
+
+# CSRT (?CSRT="csrt") Config Settings
+
+## config:get(?CSRT, "enable", false).
+
+Core enablement toggle for CSRT, defaults to false. Enabling this setting
+initiates local CSRT stats collection as well as shipping deltas in RPC
+responses to accumulate in the coordinator.
+
+This does _not_ trigger the new RPC spawn metrics, and it does not enable
+reporting for any of the rctx types.
+
+*NOTE*: you *MUST* have all nodes in the cluster running a CSRT aware CouchDB
+_before_ you enable it on any node, otherwise the old version nodes won't
+know how to handle the new RPC formats including an embedded Delta payload.
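+Once every node is on a CSRT aware release, enablement is a normal config
+flip, e.g. via remsh (a sketch using the standard `config:set/3` call):
+
+```erlang
+config:set("csrt", "enable", "true").
+```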
+
+## config:get(?CSRT, "enable_init_p", false).
+
+Enablement of tracking new metric counters for different `fabric_rpc`
+operation types, to track spawn rates of RPC work induced across the cluster.
+There are corresponding config lookups in the `?CSRT_INIT_P` namespace for
+keys of the form `atom_to_list(Mod) ++ "__" ++ atom_to_list(Fun)`, eg
+`"fabric_rpc__open_doc"`, for enabling the specific RPC endpoints.
+
+However, those individual settings can be ignored and this top level config
+toggle is what should be used in general, as the function specific config
+toggles predominantly exist to enable tracking a subset of total RPC
+operations in the cluster, and new endpoints can be added here.
+
+## config:get(?CSRT, "enable_reporting", false).
+
+This is the primary toggle for enabling CSRT process lifetime reports
+containing detailed information about the quantity of work induced by the
+given request/worker/etc. This is the top level toggle for enabling _any_
+reporting, and there also exists `config:get(?CSRT, "enable_rpc_reporting",
+false).` to disable the reporting of any individual RPC workers, leaving the
+coordinator responsible for generating a report with the accumulated deltas.
+
+## config:get(?CSRT, "enable_rpc_reporting", false).
+
+This enables the possibility of RPC workers generating reports. They still
+need to hit the configured thresholds to induce a report, but this will
+generate CSRT process lifetime reports for individual RPC workers that
+trigger the configured logger thresholds. This allows for quantifying per
+node resource usage when desired, as otherwise the reports are at the http
+request level and don't provide per node stats.
+
+The key idea here is that having RPC level CSRT process lifetime reporting is
+incredibly useful, but can also generate large quantities of data. For
+example, a view query on a Q=64 database will stream results from 64 shard
+replicas, resulting in at least 64 RPC reports, plus any that might have been
+generated from RPC workers that "lost" the race for a shard replica. This is
+very useful, but a lot of data given the verbose nature of funneling it
+through the RSyslog reports; being able to write directly to something like
+ClickHouse or another columnar store would be great.
+
+Until there's an efficient storage mechanism to stream the results to, the
+rsyslog entries work great and are very practical, but care must be taken to
+not generate too much data, as aggregate queries generate at least `Qx` more
+reports than the individual report per http request from the coordinator.
+This setting exists as a way to either a) utilize the logger matchers'
+configured thresholds to allow _any_ rctx to be recorded when it induces
+heavy operations, whether Coordinator or RPC worker; or b) _only_ log
+workloads at the coordinator level.
+
+NOTE: this setting exists because we lack an expressive enough config
+declaration to easily chain the matchspec constructions, as `ets:fun2ms/1` is
+a special compile time parse transform macro that requires the full
+definition to be specified directly; it cannot be iteratively constructed.
+That said, you _can_ register matchers through remsh with more specific and
+fine grained pattern matching, and a more expressive system for defining
+matchers is being explored.
+
+## config:get_boolean(?CSRT, "should_truncate_reports", true)
+
+Enables truncation of the CSRT process lifetime reports to not include any
+fields that are zero at the end of the process lifetime, eg don't include
+`js_filter=0` in the report if the request did not induce Javascript
+filtering.
+
+This can be disabled if you really care about consistent fields in the report
+logs, but it is a log space saving mechanism, similar to disabling RPC
+reporting by default, as it's a simple way to reduce overall volume.
+
+## config:get(?CSRT, "randomize_testing", true).
+
+This is a `make eunit` only feature toggle that will induce randomness into
+the cluster's `csrt:is_enabled()` state, specifically to utilize the test
+suite to exercise edge case scenarios and failures when CSRT is only
+conditionally enabled, ensuring that it gracefully and robustly handles
+errors without fallout to the underlying http clients.
+
+The idea here is to introduce randomness into whether CSRT is enabled across
+all the nodes, to simulate clusters with heterogeneous CSRT enablement, and
+also to ensure that CSRT works properly when toggled on/off without causing
+any unexpected fallout to the client requests.
+
+This is a config toggle specifically so that the actual CSRT tests can
+disable it for making accurate assertions about resource usage tracking, and
+is not intended to be used directly.
+
+## config:get_integer(?CSRT, "query_limit", ?QUERY_LIMIT)
+
+Limit the quantity of rows that can be loaded in an http query.
+
+# CSRT_INIT_P (?CSRT_INIT_P="csrt.init_p") Config Settings
+
+## config:get(?CSRT_INIT_P, ModFunName, false).
+
+These config toggles conditionally enable additional tracking of RPC
+endpoints of interest; they're a way to selectively enable tracking for a
+subset of RPC operations, in a way we can extend later to add more. The
+`ModFunName` is of the form `atom_to_list(Mod) ++ "__" ++ atom_to_list(Fun)`,
+eg `"fabric_rpc__open_doc"`, and right now only exists for `fabric_rpc`
+modules.
+
+NOTE: this is a bit awkward and isn't meant to be used directly; instead,
+utilize `config:set(?CSRT, "enable_init_p", "true").` to enable or disable
+these as a whole.
+
+The current set of operations, as copied in from `default.ini`:
+
+```
+[csrt.init_p]
+fabric_rpc__all_docs = true
+fabric_rpc__changes = true
+fabric_rpc__get_all_security = true
+fabric_rpc__map_view = true
+fabric_rpc__open_doc = true
+fabric_rpc__open_shard = true
+fabric_rpc__reduce_view = true
+fabric_rpc__update_docs = true
+```
+
+# CSRT Logger Matcher Enablement and Thresholds
+
+There are currently six builtin default loggers designed to make it easy to
+filter on heavy resource usage and long running requests. These are designed
+as a simple baseline of useful matchers, declared in a manner amenable to
+`default.ini` based constructs. More expressive matcher declarations are
+being explored, and matchers of arbitrary complexity can be registered
+directly through remsh. The default matchers are all designed around an
+integer config `Threshold` that triggers on a specific field, eg docs read,
+or on a delta of fields for long requests and changes requests that process
+many rows but return few.
+
+The current default matchers are:
+
+  * docs_read: match all requests reading more than N docs
+  * rows_read: match all requests reading more than N rows
+  * docs_written: match all requests writing more than N docs
+  * long_reqs: match all requests lasting more than N milliseconds
+  * changes_processed: match all changes requests that returned at least N
+    fewer rows than were loaded to complete the request (eg find heavily
+    filtered changes requests reading many rows but returning few).
+  * ioq_calls: match all requests inducing more than N ioq_calls
+
+Each of the default matchers has an enablement setting in
+`config:get(?CSRT_MATCHERS_ENABLED, Name)` for toggling enablement of it, and a
+corresponding threshold value setting in `config:get(?CSRT_MATCHERS_THRESHOLD,
+Name)` that is an integer value corresponding to the specific nature of that
+matcher.
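+Putting both halves together, a hypothetical ini fragment (the threshold
+values here are illustrative, not the defaults) enabling two of the builtin
+matchers with custom thresholds:
+
+```
+[csrt_logger.matchers_enabled]
+docs_read = true
+long_reqs = true
+
+[csrt_logger.matchers_threshold]
+docs_read = 5000
+long_reqs = 30000
+```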
+
+## CSRT Logger Matcher Enablement (?CSRT_MATCHERS_ENABLED)
+
+> -define(CSRT_MATCHERS_ENABLED, "csrt_logger.matchers_enabled").
+
+### config:get_boolean(?CSRT_MATCHERS_ENABLED, "docs_read", false)
+
+Enable the `docs_read` builtin matcher, with a default `Threshold=1000`, such
+that any request that reads more than `Threshold` docs will generate a CSRT
+process lifetime report with a summary of its resource consumption.
+
+This is different from the `rows_read` filter in that a view with
+`?limit=1000` will read 1000 rows, but the same request with
+`?include_docs=true` will also induce an additional 1000 docs read.
+
+### config:get_boolean(?CSRT_MATCHERS_ENABLED, "rows_read", false)
+
+Enable the `rows_read` builtin matcher, with a default `Threshold=1000`, such
+that any request that reads more than `Threshold` rows will generate a CSRT
+process lifetime report with a summary of its resource consumption.
+
+This is different from the `docs_read` filter so that we can distinguish
+between heavy view requests with lots of rows and heavy requests with lots of
+docs.
+
+### config:get_boolean(?CSRT_MATCHERS_ENABLED, "docs_written", false)
+
+Enable the `docs_written` builtin matcher, with a default `Threshold=500`,
+such that any request that writes more than `Threshold` docs will generate a
+CSRT process lifetime report with a summary of its resource consumption.
+
+### config:get_boolean(?CSRT_MATCHERS_ENABLED, "ioq_calls", false)
+
+Enable the `ioq_calls` builtin matcher, with a default `Threshold=10000`, such
+that any request that induces more than `Threshold` IOQ calls will generate a
+CSRT process lifetime report with a summary of its resource consumption.
+
+### config:get_boolean(?CSRT_MATCHERS_ENABLED, "long_reqs", false)
+
+Enable the `long_reqs` builtin matcher, with a default `Threshold=60000`,
+such that any request where the last CSRT rctx `updated_at` timestamp is at
+least `Threshold` milliseconds greater than the `started_at` timestamp will
+generate a CSRT process lifetime report with a summary of its resource
+consumption.
+
+## CSRT Logger Matcher Threshold (?CSRT_MATCHERS_THRESHOLD)
+
+> -define(CSRT_MATCHERS_THRESHOLD, "csrt_logger.matchers_threshold").
+
+### config:get_integer(?CSRT_MATCHERS_THRESHOLD, "docs_read", 1000)
+
+Threshold for `docs_read` logger matcher, defaults to `1000` docs read.
+
+### config:get_integer(?CSRT_MATCHERS_THRESHOLD, "rows_read", 1000)
+
+Threshold for `rows_read` logger matcher, defaults to `1000` rows read.
+
+### config:get_integer(?CSRT_MATCHERS_THRESHOLD, "docs_written", 500)
+
+Threshold for `docs_written` logger matcher, defaults to `500` docs written.
+
+### config:get_integer(?CSRT_MATCHERS_THRESHOLD, "ioq_calls", 10000)
+
+Threshold for `ioq_calls` logger matcher, defaults to `10000` IOQ calls made.
+
+### config:get_integer(?CSRT_MATCHERS_THRESHOLD, "long_reqs", 60000)
+
+Threshold for `long_reqs` logger matcher, defaults to `60000` milliseconds.
+
+# Core CSRT API
+
+The `csrt(.erl)` module is the primary entry point into CSRT, containing API
+functionality for tracking the lifecycle of processes, inducing metric
+tracking over that lifecycle, and also a variety of functions for aggregate
+querying.
+
+It's worth noting that the CSRT context tracking functions are specifically
+designed to not `throw` and to be safe in the event of unexpected CSRT
+failures or edge cases. The aggregate query API has some callers that will
+actually throw, but aside from that, core CSRT operations will not bubble up
+exceptions, and will either return the error value, or catch the error and
+move on rather than chaining further errors.
+
+## PidRef API
+
+These functions are CRUD operations around creating and storing the CSRT
+`PidRef` handle.
+
+```
+-export([
+    destroy_pid_ref/0,
+    destroy_pid_ref/1,
+    create_pid_ref/0,
+    get_pid_ref/0,
+    get_pid_ref/1,
+    set_pid_ref/1
+]).
+```
+
+## Context Lifecycle API
+
+These are the CRUD functions for handling a CSRT context lifecycle, where a
+lifecycle context is created in a `chttpd` coordinator process by way of
+`csrt:create_coordinator_context/2`, or in `rexi_server:init_p` by way of
+`csrt:create_worker_context/3`. Additional functions are exposed for setting
+context specific info like username/dbname/handler. `get_resource` fetches the
+context being tracked corresponding to the given `PidRef`.
+
+```
+-export([
+    create_context/2,
+    create_coordinator_context/2,
+    create_worker_context/3,
+    destroy_context/0,
+    destroy_context/1,
+    get_resource/0,
+    get_resource/1,
+    set_context_dbname/1,
+    set_context_dbname/2,
+    set_context_handler_fun/1,
+    set_context_handler_fun/2,
+    set_context_username/1,
+    set_context_username/2
+]).
+```
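+A hypothetical coordinator flow using the exported functions above (the
+argument names and return shapes here are assumptions for illustration):
+
+```erlang
+%% Create a CSRT lifecycle for the request currently being serviced ...
+PidRef = csrt:create_coordinator_context(HttpReq, Handler),
+%% ... fill in context specific info as it becomes known ...
+csrt:set_context_dbname(DbName, PidRef),
+csrt:set_context_username(UserName, PidRef),
+%% ... service the request, inspecting the tracked rctx if desired ...
+Rctx = csrt:get_resource(PidRef),
+%% ... and explicitly destroy the context when the request completes.
+csrt:destroy_context(PidRef).
+```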
+
+## Public API
+
+The "Public" or miscellaneous API for lack of a better name. These are various
+functions exposed for wider use and/or testing purposes.
+
+```
+-export([
+    clear_pdict_markers/0,
+    do_report/2,
+    is_enabled/0,
+    is_enabled_init_p/0,
+    maybe_report/2,
+    to_json/1
+]).
+```
+
+## Stats Collection API
+
+This is the stats collection API utilized by way of
+`couch_stats:increment_counter` to do local process tracking, and also in
+`rexi` to add and extract delta contexts and then accumulate those values.
+
+NOTE: `make_delta/0` is a "destructive" operation that will induce a new
+delta by way of the last local pdict's rctx delta snapshot, and then update
+it to the most recent version. Two individual rctx snapshots for a PidRef can
+safely generate an actual delta by way of `csrt_util:rctx_delta/2`.
+
+```
+-export([
+    accumulate_delta/1,
+    add_delta/2,
+    docs_written/1,
+    extract_delta/1,
+    get_delta/0,
+    inc/1,
+    inc/2,
+    ioq_called/0,
+    js_filtered/1,
+    make_delta/0,
+    rctx_delta/2,
+    maybe_add_delta/1,
+    maybe_add_delta/2,
+    maybe_inc/2,
+    should_track_init_p/1
+]).
+```
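+For instance, a non-destructive delta between two snapshots might be taken as
+follows (a sketch; `get_resource/1` and `csrt_util:rctx_delta/2` are from the
+APIs above, while the surrounding glue is illustrative):
+
+```erlang
+A = csrt:get_resource(PidRef),
+%% ... some amount of work happens on the tracked process ...
+B = csrt:get_resource(PidRef),
+%% Delta contains only the stats induced between the two snapshots:
+Delta = csrt_util:rctx_delta(A, B).
+```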
+
+## TODO: RPC/QUERY DOCS
+
+```
+%% RPC API
+-export([

Review Comment:
   Thanks, fixed up.


