This is an automated email from the ASF dual-hosted git repository.

ocket8888 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/trafficcontrol.git


The following commit(s) were added to refs/heads/master by this push:
     new 61f4a79  Distributed Traffic Monitor blueprint (#6126)
61f4a79 is described below

commit 61f4a79749f3effe8c2aaf7d34f80188232b4b97
Author: Rawlin Peters <[email protected]>
AuthorDate: Tue Sep 7 15:17:35 2021 -0600

    Distributed Traffic Monitor blueprint (#6126)
    
    * Add blueprint for distributed TM
    
    * Incorporate review feedback
    
    Update TO impact to mention possibility of impact, and add a note about
    why the new config is not done via profile parameters.
    
    * Add note about no impact to the tm-health-client
---
 blueprints/distributed-traffic-monitor.md | 191 ++++++++++++++++++++++++++++++
 1 file changed, 191 insertions(+)

diff --git a/blueprints/distributed-traffic-monitor.md b/blueprints/distributed-traffic-monitor.md
new file mode 100644
index 0000000..46bde35
--- /dev/null
+++ b/blueprints/distributed-traffic-monitor.md
@@ -0,0 +1,191 @@
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+# Distributed Traffic Monitor
+
+## Problem Description
+Currently, TM polls all caches in a CDN. As CDNs grow, this becomes a major
+pain point: TM is limited by the bandwidth and CPU required to receive and
+process data from every cache in the CDN, and scaling vertically by running it
+on better hardware is only feasible up to a certain point. Additionally, the
+performance of a cache as observed by a TM that is very far away from it does
+not always reflect the performance observed by clients actually using the
+cache, because those clients are typically much closer to it.
+
+## Proposed Change
+TM should have the ability to poll only a subset of caches in a CDN and peer
+with other TMs which are monitoring other subsets in order to get a full view
+of the CDN's health. This would allow us to run TM in a more distributed manner
+across the CDN, giving us a view of cache health that is closer to what clients
+actually observe and enabling us to scale TM horizontally. Additionally, we
+would like to have the option to disable _stat polling_ in order for these
+distributed TMs to focus on _health polling_.
+
+### Traffic Portal Impact
+This proposal does not require any TP changes.
+
+### Traffic Ops Impact
+This proposal might have limited impact on TO. The existing TO API endpoints
+already provide the data that TM will need to run in a distributed manner, and
+any changes made to TM APIs that TO uses will remain backwards-compatible.
+However, if TO uses any stat-polling-related TM APIs, it may need to be
+updated to make those requests only to TMs that have stat polling enabled.
+
+### t3c Impact
+This proposal does not require `t3c` changes. Note: the `tm-health-client`
+periodically polls a random TM to get cache health states, and because
+distributed TMs will still serve the cache health states of all caches in a
+CDN, there will be no impact to the `tm-health-client`. It can continue to poll
+any random TM and still get all the cache health data for the entire CDN.
+
+### Traffic Monitor Impact
+TM will gain at least two more configuration options:
+- `distributed_polling_enabled` (default: false) - when set to true, TM will
+  run in _distributed mode_ (more details on this below). When set to false, TM
+  will run in its legacy, normal mode.
+- `stat_polling_disabled` (default: false) - when set to true, TM will _not_ do
+  stat polling for caches. When set to false, TM will do stat polling for
+  caches (legacy, normal behavior). Initially, this must be set to true if
+  `distributed_polling_enabled` is also set to true. In a later phase of
+  development, we will add the ability to enable stat polling in distributed
+  mode.
+
+Note: these are configuration options as opposed to profile parameters because
+we currently do not have the capability to have per-profile monitoring.json
+snapshots (or per-TM-server configuration in one snapshot).
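
As an illustrative sketch, setting these options in TM's JSON configuration
file might look like the following fragment (only the two new keys are shown;
TM's existing configuration options are omitted):

```json
{
    "distributed_polling_enabled": true,
    "stat_polling_disabled": true
}
```

Per the phasing described above, `stat_polling_disabled` would initially have
to be `true` whenever `distributed_polling_enabled` is `true`.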
+
+To use _distributed mode_, generally all TMs in the CDN need to be running in
+distributed mode (if they're taking part in the health protocol). It should
+still be possible to run TMs in the _legacy_ (non-distributed) mode in order to
+provide cache stat polling (which is important for Traffic Stats), but they
+should not be set to `ONLINE` in order to keep them from interfering with the
+health protocol.
+
+While in _distributed mode_, a TM instance will only monitor a subset of
+cachegroups in its given CDN. The number of cachegroups each TM will monitor
+depends on the number of cachegroups that contain TM servers for the CDN. These
+will be referred to as "TM groups." A TM group contains 1 to many TM servers,
+and a CDN can have 1 to many TM groups. If there are N TM groups, each TM group
+will monitor roughly 1/N of the cachegroups in the CDN. Each TM in the group
+will monitor all of the caches in the 1/N portion of cachegroups that its TM
+group is responsible for. For example, if there are 10 cachegroups and 3 TM
+groups:
+- TM group 1 monitors cachegroups 1-4
+- TM group 2 monitors cachegroups 5-7
+- TM group 3 monitors cachegroups 8-10
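
The 1/N assignment above could be sketched roughly as follows (an illustrative
Go example matching the 10-cachegroup/3-TM-group split shown; the function
name and the exact assignment logic are hypothetical, not TM's actual
implementation):

```go
package main

import "fmt"

// partitionCachegroups splits a CDN's cachegroups into numTMGroups contiguous
// chunks so that each TM group monitors roughly 1/N of the cachegroups.
// Earlier groups absorb the remainder, so 10 cachegroups across 3 TM groups
// yields chunks of 4, 3, and 3, as in the example above.
func partitionCachegroups(cachegroups []string, numTMGroups int) [][]string {
	chunks := make([][]string, numTMGroups)
	start := 0
	for g := 0; g < numTMGroups; g++ {
		size := len(cachegroups) / numTMGroups
		if g < len(cachegroups)%numTMGroups {
			size++
		}
		chunks[g] = cachegroups[start : start+size]
		start += size
	}
	return chunks
}

func main() {
	cgs := []string{"cg1", "cg2", "cg3", "cg4", "cg5",
		"cg6", "cg7", "cg8", "cg9", "cg10"}
	for g, chunk := range partitionCachegroups(cgs, 3) {
		fmt.Printf("TM group %d monitors %v\n", g+1, chunk)
	}
}
```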
+
+Because every TM can serve the health state of every cache, distributed TMs
+will need to peer not only with their own group members but also with other
+groups. However, instead of simultaneously requesting cache health states
+from all out-of-group peers, each distributed TM will simultaneously request
+cache health states from 1 TM in every other TM group, alternating between
+group members in a deterministic, round-robin fashion. For this out-of-group
+peering, a new TM API route will be added that returns only the cache health
+states for caches that the TM group is responsible for polling.
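
The deterministic round-robin selection of out-of-group peers might look
something like the following sketch (the function name, map structure, and
hostnames are hypothetical, not TM's actual API):

```go
package main

import "fmt"

// nextOutOfGroupPeers returns one peer to poll from each out-of-group TM
// group for the given poll cycle, rotating through each group's members
// round-robin. Because the choice depends only on the cycle number, the
// rotation is deterministic: the same cycle always picks the same member.
func nextOutOfGroupPeers(peerGroups map[string][]string, pollCycle int) map[string]string {
	selected := make(map[string]string)
	for group, members := range peerGroups {
		if len(members) == 0 {
			continue
		}
		selected[group] = members[pollCycle%len(members)]
	}
	return selected
}

func main() {
	groups := map[string][]string{
		"tm-group-2": {"tm2a", "tm2b"},
		"tm-group-3": {"tm3a", "tm3b", "tm3c"},
	}
	// Each cycle, this TM requests health states from exactly one member
	// of every other TM group.
	for cycle := 0; cycle < 3; cycle++ {
		fmt.Println(nextOutOfGroupPeers(groups, cycle))
	}
}
```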
+
+A safety feature will be added to TM (while running in distributed mode) to
+ensure that all cachegroups are polled by at least 1 TM group, and an
+additional profile parameter override will be available in order to manually
+assign cachegroups to TM groups for polling.
+
+### Traffic Router Impact
+This proposal should have no impact on TR.
+
+### Traffic Stats Impact
+Because we will be able to disable stats polling on TM, TS will need to poll
+only TMs that actually have stats polling enabled. TMs with stats polling
+enabled should be given a specific server status (other than `ONLINE`), which
+TS will be configured to poll; this might mean creating a new server status
+specifically for that purpose.
+
+### Traffic Vault Impact
+This proposal has no impact on Traffic Vault.
+
+### Documentation Impact
+Any new configuration options added to TM should be documented, and the steps
+necessary to run TM in a distributed manner as well as how it works should be
+described in some form of documentation (probably the TM admin docs).
+
+### Testing Impact
+New TM unit and integration tests should be added where applicable. It is
+also recommended to run both types of TMs in production (distributed and
+non-distributed) and compare the reported cache health states between the two
+types. This would help discover any issues with running TM in a distributed
+manner using data from a production environment. However, TR should still get
+health states from the non-distributed TMs until we are confident in the
+health states reported by distributed TMs.
+
+### Performance Impact
+This proposal allows TM to be scaled horizontally, so operators can increase
+the number of TM groups in order to get the desired amount of load per TM.
+
+### Security Impact
+This proposal does not have much impact on security, but allowing TM to scale
+horizontally means that there may be more firewall rules that will need to be
+applied to any new TM servers that are deployed. However, TM will not need any
+_new_ ports opened, assuming the same `httpListener` and `httpsListener`
+configuration is used.
+
+### Upgrade Impact
+TMs running in a distributed manner can be upgraded in the same way that
+non-distributed TMs are upgraded today. For instance, we would likely upgrade
+the `OFFLINE` TMs, then set the upgraded TMs to `ONLINE` while simultaneously
+setting the old TMs to `OFFLINE`.
+
+### Operations Impact
+There should be little impact on operations other than the effort necessary to
+provision and deploy new TM servers to run in a distributed manner. Existing
+automation can still be used for upgrades, configuration, etc., but automation
+may need a way to differentiate between non-distributed and distributed TMs
+within the same environment so that both types are configured differently.
+
+Troubleshooting distributed TMs might be more difficult than troubleshooting
+non-distributed TMs, as there will be more servers involved. However, the
+health of a cache should always be determined by the same TMs (assuming no new
+TM groups are added to the system), so it would be best to investigate the TM
+servers in the "authoritative" TM group for the cache under investigation. To
+aid this kind of troubleshooting, we may want TM to have an API that returns
+information about which TM groups it thinks are currently monitoring which
+cachegroups.
+
+### Developer Impact
+Developers should know that once this change is implemented, there will be two
+different "run modes" for TM -- distributed and non-distributed. TM will do
+certain things differently in the distributed mode compared to the
+non-distributed mode even though the vast majority of things will be the same.
+Therefore, developers will need to take care to ensure the proper behavior is
+followed depending on which "run mode" TM is in.
+
+Also, because this proposal will allow TMs to monitor only a subset of caches,
+it may make it easier to set up a development environment using production-like
+data and caches. It is somewhat infeasible for most TM development environments
+to poll an entire, large CDN, but with distributed TM groups, developers could
+essentially choose how many caches they want their local TM to poll.
+
+## Alternatives
+
+- Cache Self-Monitoring: Make caches monitor themselves by using remap rules,
+  essentially replacing TM's Cache Health Monitoring. The
+  [Proof-of-Concept](https://github.com/apache/trafficcontrol/pull/4529) has
+  more details.
+
+## Dependencies
+This proposal does not intend to add any new dependencies.
+
+## References
+The following mailing list threads were related to this blueprint:
+- [Proposal: Distributed Health Monitoring](https://lists.apache.org/thread.html/rf3307f824c0f82892cbb0fea74a5c6a274c8ea4f303d125e8f1212da%40%3Cdev.trafficcontrol.apache.org%3E)
+- [Distributed Traffic Monitor Feedback/Requirements](https://lists.apache.org/thread.html/rf985a2b9e8a440d396a0097a71882919bff5b3cb5f8d6c3a53143162%40%3Cdev.trafficcontrol.apache.org%3E)
