I'll do my best to rephrase as a potential requirement :-)

1) Traffic Monitor MUST ensure all caches are monitored upon failure of any
TM server(s) or physical location. (i.e. no SPoF of TMs for
polling/aggregation).
Number of TM failures to be tolerated before we stop polling some caches /
how we accomplish the above / maximum number of caches under supervision by
a TM are all TBD in design phase

--Eric

On Fri, Jun 25, 2021 at 10:36 AM Dave Neuman <[email protected]> wrote:

> Hey Eric,
> Thanks for the questions/feedback. My responses are inline below. Most of
> your questions will need to be addressed when we do design as right now I
> just want to make sure we are not missing any requirements. I hope to
> start design discussions in the next week or two.
>
> Thanks,
> Dave
>
> On Fri, Jun 25, 2021 at 7:26 AM Eric Friedrich <[email protected]> wrote:
>
> > Some comments and questions jointly compiled
> >
> > - How is TM configured to monitor a subset of a CDN, is it a static
> > allocation of caches to TMs?
> >
>
> DN: I think that is to be determined when we start to think about design,
> which is after we agree on the requirements. I think for our use case the
> most simple way to do this would be by cache group. A Traffic Monitor
> could be configured to monitor 1 to many cache groups. However, if there
> is a better way we could do this, I am all ears.
>
> >
> > - Can you describe how the primary + backup work. Do they both poll the
> > cache simultaneously
> >
>
> DN: Again, I think we can sort out the details when we talk about design.
> It actually might make more sense to just have multiple TMs monitor a cache
> group and treat them all as "live", this has the benefit of providing more
> than one view of a cache.
>
> >
> > - If a TM fails, how do the TMs heal / reallocate polling
> > responsibilities. Does another TM pick up the slack?
> >
>
> DN: You want to dive straight into design :). I think the easiest answer
> here is to ensure multiple TMs are polling each cache and that they are all
> treated as live, then we can just use the optimistic consensus that is
> already built into TM.
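
To make the "all treated as live" idea concrete, here is a minimal Go sketch of one possible optimistic merge: a cache counts as available if at least one TM that polled it reports it available. This is only an illustration under that assumption. The `Report` type, the `combine` function, and the host names are hypothetical and are not taken from the Traffic Monitor code base.

```go
package main

import "fmt"

// Report is one TM's view of the caches it polled (hypothetical type).
// The map value is true when that TM considers the cache available.
type Report struct {
	Monitor string
	Caches  map[string]bool
}

// combine merges reports optimistically: a cache is available if ANY
// monitor that polled it says so. Caches that no monitor polled are
// simply absent from the result.
func combine(reports []Report) map[string]bool {
	combined := make(map[string]bool)
	for _, r := range reports {
		for cache, available := range r.Caches {
			combined[cache] = combined[cache] || available
		}
	}
	return combined
}

func main() {
	reports := []Report{
		{Monitor: "tm-east", Caches: map[string]bool{"edge-01": true, "edge-02": false}},
		{Monitor: "tm-west", Caches: map[string]bool{"edge-02": true, "edge-03": false}},
	}
	state := combine(reports)
	for _, cache := range []string{"edge-01", "edge-02", "edge-03"} {
		fmt.Printf("%s available=%v\n", cache, state[cache])
	}
}
```

With overlapping subsets, a single TM losing its view of a cache does not pull that cache out of rotation, which is the property the answer above relies on. How disagreement should actually be resolved is still a design question.
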
>
> >
> > - What prevents a misconfiguration where some caches are not polled by
> > any TM?
> >
>
> DN: Great question. I don't think that is one I have considered, but I
> suppose we could add a requirement saying that TM must have a way to
> identify unpolled caches...what do you think?
>
> >
> > - Are there any minimums/maximums to how many TMs will poll a cache?
> >
>
> DN: Minimum is one, maximum is up to the operator, I don't know of a limit
> in TM.
>
> >
> > - What is the meaning of non-boolean 0-100 health? How is this computed
> > and how is it used?
> >
>
> DN: The health score stuff is going to be an entirely different topic
> because I don't think it needs to be conflated with distributed polling. I
> put that requirement in because I wanted to document that this is something
> we are thinking about so that we don't make it difficult on ourselves when
> we do this refactor.
> Right now a cache's health is boolean, it either gets traffic or it
> doesn't. The idea behind the health score is that we could assign
> different health scores for caches in a cache group and then TR can use
> that when determining which cache to choose. Maybe you have multiple
> caches that are getting close to the bandwidth limit, instead of pulling
> all traffic from them, we could simply weight them lower so the TR prefers
> other caches, but can still use them if needed. We have a bunch of other
> use cases that are probably best saved for when we are ready to formally
> present the idea.
>
> >
> > - What can we do to further harden TM<->TM communications and reduce
> > blast radius?
> >
>
> DN: Another topic for the design discussions, I think the basic idea is to
> not have a SPoF which means multiple TMs polling each cache and multiple
> TMs available to provide status to TRs, Caches, and TSs.
>
> >
> > Big thumbs up on decoupling TM from Traffic Ops. What does this
> > practically mean - no more monitoring.json?
> > Can we document specifically which APIs TM will use?
> > (Aside, we might want to think about this as an opportunity to move TM
> > into its own repository - assuming the community decides to go ahead
> > with separate repos per component).
> >
>
> DN: I think that is a stretch goal for now. TM will still have to get
> its configuration from somewhere, but ideally it does not have to come
> from TO. Ultimately I would like TO to just serve the basic data from the
> database and build services that can be used to generate configs using
> business logic. We sort of did this with t3c where it gets all of the
> information it needs from TO without relying on config file APIs
> that used to be in TO (maybe still are?). However, t3c is purely client
> side and I prefer a more centralized approach with something like a TM
> configuration service that can read from TO and use the data to populate
> APIs for TM to get its config. That way we could define just the data we
> need in TM and a user could choose to run the TM configuration service
> which talks to TO or provide the required data using a different backend
> system. I think this is probably a larger conversation we need to have
> when we start talking about how we are going to design the distributed TM.
>
> As for its own repo, that is a larger conversation. I am not sure what
> that means for all of the ancillary pieces like cdn-in-a-box, the pkg
> script, etc. If it is worth the trouble then I am all for it, but I don't
> think we should let this thread get bogged down with that conversation.
>
> >
> > On Thu, Jun 17, 2021 at 6:09 PM Dave Neuman <[email protected]> wrote:
> >
> > > Hey All,
> > > One of the things we have been talking about doing for a long time is
> > > making Traffic Monitor capable of monitoring a subset of the CDN so
> > > that it can be deployed in a distributed fashion. The time has come
> > > for us to get moving on this.
> > > We have had some discussions internally to understand what
> > > requirements we have for doing this, but I wanted to solicit feedback
> > > from the community to see if there are potentially other requirements
> > > that we may have missed. Please take a look at the requirements we
> > > have identified below and let me know what feedback you have. At this
> > > point in time I am trying to keep this conversation separate from the
> > > design conversation and just focus on the requirements. Once we all
> > > agree on the requirements we can start discussing the design. You will
> > > notice that this proposal also includes adding the ability to integrate
> > > with external monitoring systems. I figured now would be a good time
> > > to add that functionality in as well.
> > >
> > >
> > > *Abstract*
> > >
> > > Update Traffic Monitor so that it is capable of monitoring only part of
> > > the CDN while still providing a single API for clients to get cache
> > > stats, delivery stats, and cache availability for a whole CDN. Add the
> > > ability to integrate with other systems that perform additional health
> > > monitoring and consider the status of these systems when making health
> > > decisions for a cache. Ensure that the Traffic Monitor API is capable
> > > of serving thousands of simultaneous clients, such as all of the caches
> > > in a CDN.
> > >
> > >
> > > *Problem Statement*
> > >
> > > Currently Traffic Monitor can only monitor an entire CDN. This means
> > > that Traffic Monitor has to poll every single cache in a CDN before
> > > making cache health decisions and being able to provide statistics.
> > > This also means that Traffic Monitors need to be located in a
> > > centralized place where they can get to everything, which isn't exactly
> > > representative of what a client might see.
> > > While this has worked really well for us to date, we know that at some
> > > point we will run into scaling issues which prohibit us from polling
> > > caches faster. In order to solve our impending scaling issues as well
> > > as improve our ability to make better and faster health decisions,
> > > Traffic Monitor needs to run in a distributed fashion instead of an all
> > > or nothing fashion.
> > >
> > > Furthermore, there is a growing need to provide support for external
> > > monitoring systems in Traffic Monitor. Traffic Monitor needs to be able
> > > to use other monitoring systems to aid in the health decision process.
> > > While this could be solved in today's Traffic Monitor, it is best to
> > > solve this problem in conjunction with making the polling distributed.
> > >
> > > *Business Justification*
> > >
> > > In order to provide the best customer experience possible, we need to
> > > have a robust and timely health monitoring system. While Traffic
> > > Monitor has been sufficient to date, we need to make sure that we are
> > > adapting to meet the needs of the near future and we need to make sure
> > > that we are evolving to continue to meet customers' needs. These
> > > changes to Traffic Monitor are imperative to providing as near real
> > > time as possible cache health data for our CDN as it continues to
> > > increase in scale.
> > >
> > > *Business Requirements*
> > >
> > >    - Traffic Monitor MUST be capable of being configured to monitor a
> > >      portion of a CDN
> > >    - Traffic Monitor MUST be capable of being configured to monitor all
> > >      caches in a CDN
> > >    - Traffic Monitor MUST provide an API to get the health status of ALL
> > >      caches in the CDN
> > >    - Traffic Monitor MUST provide an API to get statistics (from e.g.
> > >      astats data) generated by ALL caches in the CDN. This does not
> > >      include any statistics generated by external monitoring systems.
> > >    - Traffic Monitor MUST log all requests to its API including AT LEAST
> > >      the following information: timestamp, client IP, resource requested,
> > >      response code, response reason, time to serve.
> > >    - Traffic Monitor MUST provide an API to get the status of caches it
> > >      monitors
> > >    - Traffic Monitor MUST log all health state changes for a cache
> > >      whether the decision is made internally or from an external system.
> > >    - Traffic Monitor MUST provide the ability to have more than 1 Traffic
> > >      Monitor monitor the same cache and come to consensus on the health
> > >      of the cache.
> > >    - Traffic Monitor SHOULD provide a way to configure more than one
> > >      subset of caches to monitor – e.g. as a primary and backup.
> > >    - Traffic Monitor SHOULD provide a way to integrate with external
> > >      services to provide additional cache health monitoring
> > >    - Traffic Monitor SHOULD have the capability to provide a non-boolean
> > >      health score for a cache - e.g. a number between 0 - 100
> > >    - Traffic Monitor MAY be decoupled from Traffic Ops for configuration
> > >      generation
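
As a rough illustration of how the "no unpolled caches" and "no SPoF" points in this thread could be checked, the Go sketch below counts how many TMs are assigned to each cache group and lists the groups that fall below a minimum. It assumes assignment by cache group, which is only the option floated above and not a decided design. The names `pollerCounts` and `underCovered`, and all host and cache group names, are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// pollerCounts returns, for every known cache group, how many monitors are
// assigned to poll it. assignments maps a monitor name to the cache groups
// it is configured to poll (an assumed configuration shape).
func pollerCounts(cacheGroups []string, assignments map[string][]string) map[string]int {
	counts := make(map[string]int, len(cacheGroups))
	for _, cg := range cacheGroups {
		counts[cg] = 0
	}
	for _, groups := range assignments {
		seen := make(map[string]bool) // ignore duplicates within one monitor
		for _, cg := range groups {
			if seen[cg] {
				continue
			}
			seen[cg] = true
			if _, known := counts[cg]; known {
				counts[cg]++
			}
		}
	}
	return counts
}

// underCovered lists, in sorted order, the cache groups polled by fewer
// than minPollers monitors.
func underCovered(counts map[string]int, minPollers int) []string {
	var out []string
	for cg, n := range counts {
		if n < minPollers {
			out = append(out, cg)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	cacheGroups := []string{"cg-denver", "cg-chicago", "cg-boston"}
	assignments := map[string][]string{
		"tm-01": {"cg-denver", "cg-chicago"},
		"tm-02": {"cg-denver"},
	}
	counts := pollerCounts(cacheGroups, assignments)
	fmt.Println("unpolled:", underCovered(counts, 1))
	fmt.Println("polled by a single TM:", underCovered(counts, 2))
}
```

A minimum of 1 flags cache groups nobody polls. A minimum of 2 flags groups that would go unmonitored if one TM failed. The right threshold, and whether the check runs in TM or in whatever generates its configuration, is left to the design phase.
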
