Hey All,

One of the things we have been talking about doing for a long time is making Traffic Monitor capable of monitoring a subset of the CDN so that it can be deployed in a distributed fashion. The time has come for us to get moving on this.

We have had some discussions internally to understand what requirements we have for doing this, but I wanted to solicit feedback from the community to see if there are potentially other requirements that we may have missed. Please take a look at the requirements we have identified below and let me know what feedback you have.

At this point in time I am trying to keep this conversation separate from the design conversation and just focus on the requirements. Once we all agree on the requirements we can start discussing the design.

You will notice that this proposal also includes adding the ability to integrate with external monitoring systems. I figured now would be a good time to add that functionality in as well.

*Abstract*

Update Traffic Monitor so that it is capable of monitoring only part of the CDN while still providing a single API for clients to get cache stats, delivery stats, and cache availability for a whole CDN. Add the ability to integrate with other systems that perform additional health monitoring, and consider the status of these systems when making health decisions for a cache. Ensure that the Traffic Monitor API is capable of serving thousands of simultaneous clients, such as all of the caches in a CDN.

*Problem Statement*

Currently Traffic Monitor can only monitor an entire CDN. This means that Traffic Monitor has to poll every single cache in a CDN before it can make cache health decisions and provide statistics. It also means that Traffic Monitor needs to be located in a centralized place where it can reach everything, which isn't exactly representative of what a client might see. While this has worked really well for us to date, we know that at some point we will run into scaling issues that prevent us from polling caches faster.

In order to solve our impending scaling issues, as well as to make better and faster health decisions, Traffic Monitor needs to run in a distributed fashion instead of an all-or-nothing fashion.

Furthermore, there is a growing need to support external monitoring systems in Traffic Monitor. Traffic Monitor needs to be able to use other monitoring systems to aid in the health decision process. While this could be solved in today's Traffic Monitor, it is best to solve this problem in conjunction with making the polling distributed.

*Business Justification*

In order to provide the best customer experience possible, we need to have a robust and timely health monitoring system. While Traffic Monitor has been sufficient to date, we need to make sure that we are adapting to the needs of the near future and evolving to continue to meet customers' needs.
These changes to Traffic Monitor are imperative to providing cache health data that is as close to real time as possible as the CDN continues to grow in scale.

*Business Requirements*

 - Traffic Monitor MUST be capable of being configured to monitor a portion of a CDN
 - Traffic Monitor MUST be capable of being configured to monitor all caches in a CDN
 - Traffic Monitor MUST provide an API to get the health status of ALL caches in the CDN
 - Traffic Monitor MUST provide an API to get statistics (from e.g. astats data) generated by ALL caches in the CDN. This does not include any statistics generated by external monitoring systems.
 - Traffic Monitor MUST log all requests to its API including AT LEAST the following information: timestamp, client IP, resource requested, response code, response reason, time to serve.
 - Traffic Monitor MUST provide an API to get the status of caches it monitors
 - Traffic Monitor MUST log all health state changes for a cache, whether the decision is made internally or from an external system.
 - Traffic Monitor MUST provide the ability to have more than 1 Traffic Monitor monitor the same cache and come to consensus on the health of the cache.
 - Traffic Monitor SHOULD provide a way to configure more than one subset of caches to monitor, e.g. as a primary and backup.
 - Traffic Monitor SHOULD provide a way to integrate with external services to provide additional cache health monitoring
 - Traffic Monitor SHOULD have the capability to provide a non-boolean health score for a cache, e.g. a number between 0 and 100
 - Traffic Monitor MAY be decoupled from Traffic Ops for configuration generation
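
To make the consensus and non-boolean health score requirements above a bit more concrete, here is a minimal Python sketch of one way they could fit together. This is not a design proposal. The names (`HealthReport`, `consensus_score`, `is_available`), the use of the median as the consensus rule, and the threshold of 50 are all assumptions made up for illustration; none of them come from Traffic Monitor today.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class HealthReport:
    """One monitor's opinion of one cache (hypothetical structure)."""
    monitor: str  # name of the reporting Traffic Monitor instance
    cache: str    # name of the cache being reported on
    score: int    # non-boolean health score, 0 (down) to 100 (healthy)


def consensus_score(reports):
    """Combine scores from several monitors for the same cache.

    Assumption: the median is used as the consensus rule so that a single
    monitor with a bad network path cannot mark a cache down on its own.
    """
    if not reports:
        raise ValueError("at least one report is required")
    if len({r.cache for r in reports}) != 1:
        raise ValueError("all reports must refer to the same cache")
    for r in reports:
        if not 0 <= r.score <= 100:
            raise ValueError(f"score out of range: {r.score}")
    return median(r.score for r in reports)


def is_available(reports, threshold=50):
    """Reduce the consensus score to a yes/no answer (threshold is assumed)."""
    return consensus_score(reports) >= threshold


reports = [
    HealthReport("tm-east", "cache-01", 90),
    HealthReport("tm-west", "cache-01", 80),
    HealthReport("tm-central", "cache-01", 10),  # one monitor sees trouble
]
print(consensus_score(reports), is_available(reports))
```

In this sketch a single outlier does not change the outcome, which is the kind of behavior the consensus requirement seems to be after. Whether the real rule should be a median, a quorum, or something weighted is a question for the design discussion.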

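
The API request logging requirement above names six fields that must appear at a minimum. As a rough illustration only, the following Python sketch shows a record carrying exactly those fields. The class name, field names, units (milliseconds), line layout, and the example resource path are assumptions for the sake of the example, not an existing Traffic Monitor log format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessLogEntry:
    """The minimum fields named in the logging requirement (hypothetical names)."""
    timestamp: datetime
    client_ip: str
    resource: str
    response_code: int
    response_reason: str
    time_to_serve_ms: float

    def to_line(self) -> str:
        # Space-separated layout is an assumption; the reason is quoted
        # because it may itself contain spaces (e.g. "Not Found").
        return " ".join([
            self.timestamp.isoformat(),
            self.client_ip,
            self.resource,
            str(self.response_code),
            f'"{self.response_reason}"',
            f"{self.time_to_serve_ms:.3f}ms",
        ])


entry = AccessLogEntry(
    timestamp=datetime(2017, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
    client_ip="192.0.2.10",        # documentation-range address
    resource="/example/resource",  # placeholder path, not a real endpoint
    response_code=200,
    response_reason="OK",
    time_to_serve_ms=1.5,
)
print(entry.to_line())
```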