Re: Distributed Traffic Monitor Feedback/Requirements

Eric Friedrich Fri, 25 Jun 2021 08:17:09 -0700

I'll do my best to rephrase as a potential requirement :-)

1) Traffic Monitor MUST ensure all caches are monitored upon failure of any
TM server(s) or physical location. (i.e. no SPoF of TMs for
polling/aggregation).


The number of TM failures to be tolerated before we stop polling some
caches, how we accomplish the above, and the maximum number of caches under
supervision by a TM are all TBD in the design phase.
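
To make the requirement concrete, here is a rough sketch (not a design
proposal) of how one might check it against a given allocation. It assumes
a static mapping of TM name to the set of caches it polls, which is only one
of the options discussed in this thread. The function name, TM names, and
cache names are all made up for illustration.

```python
from itertools import combinations


def unpolled_after_failures(assignments, all_caches, max_failures=1):
    """Map each failure scenario to the caches it would leave unpolled.

    assignments:  dict of TM name -> set of cache names it polls
                  (hypothetical static allocation)
    all_caches:   every cache that must be monitored
    max_failures: how many simultaneous TM failures to check
    """
    gaps = {}
    tms = sorted(assignments)
    # k = 0 is the "no failures" case, which catches plain misconfiguration
    # (caches that no TM polls at all).
    for k in range(max_failures + 1):
        for failed in combinations(tms, k):
            polled = set()
            for tm, caches in assignments.items():
                if tm not in failed:
                    polled.update(caches)
            missing = sorted(set(all_caches) - polled)
            if missing:
                gaps[failed] = missing
    return gaps


# Hypothetical allocation: every cache is polled by two TMs.
assignments = {
    "tm-east": {"cache-1", "cache-2"},
    "tm-west": {"cache-2", "cache-3"},
    "tm-central": {"cache-1", "cache-3"},
}
all_caches = {"cache-1", "cache-2", "cache-3"}

# An empty result means the allocation survives any single TM failure.
single_failure_gaps = unpolled_after_failures(assignments, all_caches, 1)
```

An empty result for max_failures=1 would satisfy the "no SPoF" wording
above; how many failures we actually need to tolerate is still TBD.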

--Eric

On Fri, Jun 25, 2021 at 10:36 AM Dave Neuman <[email protected]> wrote:

> Hey Eric,
> Thanks for the questions/feedback.  My responses are inline below.  Most of
> your questions will need to be addressed when we do design as right now I
> just want to make sure we are not missing any requirements.  I hope to
> start design discussions in the next week or two.
>
> Thanks,
> Dave
>
> On Fri, Jun 25, 2021 at 7:26 AM Eric Friedrich <[email protected]> wrote:
>
> > Some comments and questions jointly compiled
> >
> >   - How is TM configured to monitor a subset of a CDN? Is it a static
> > allocation of caches to TMs?
> >
>
> DN:  I think that is to be determined when we start to think about design,
> which is after we agree on the requirements.  I think for our use case the
> simplest way to do this would be by cache group.  A Traffic Monitor
> could be configured to monitor 1 to many cache groups.  However, if there
> is a better way we could do this, I am all ears.
>
> >
> >   - Can you describe how the primary + backup work? Do they both poll the
> > cache simultaneously?
> >
>
> DN: Again, I think we can sort out the details when we talk about design.
> It actually might make more sense to just have multiple TMs monitor a cache
> group and treat them all as "live"; this has the benefit of providing more
> than one view of a cache.
>
>
> >   - If a TM fails, how do the TMs heal / reallocate polling
> > responsibilities? Does another TM pick up the slack?
> >
>
> DN:  You want to dive straight into design :). I think the easiest answer
> here is to ensure multiple TMs are polling each cache and that they are all
> treated as live, then we can just use the optimistic consensus that is
> already built into TM.
>
>
> >
> >   - What prevents a misconfiguration where some caches are not polled by
> > any TM?
> >
>
> DN:  Great question.  I don't think that is one I have considered, but I
> suppose we could add a requirement saying that TM must have a way to
> identify unpolled caches...what do you think?
>
>
> >
> >   - Are there any minimums/maximums to how many TMs will poll a cache?
> >
> DN: Minimum is one; maximum is up to the operator.  I don't know of a limit
> in TM.
>
>
> >
> >   - What is the meaning of non-boolean 0-100 health? How is this computed
> > and how is it used?
> >
>
> DN:  The health score stuff is going to be an entirely different topic
> because I don't think it needs to be conflated with distributed polling.  I
> put that requirement in because I wanted to document that this is something
> we are thinking about so that we don't make it difficult on ourselves when
> we do this refactor.
> Right now a cache's health is boolean: it either gets traffic or it
> doesn't.  The idea behind the health score is that we could assign
> different health scores for caches in a cache group and then TR can use
> that when determining which cache to choose.  For example, if multiple
> caches are getting close to the bandwidth limit, then instead of pulling
> all traffic from them we could simply weight them lower so that TR prefers
> other caches but can still use them if needed. We have a bunch of other
> use cases that are probably best saved for when we are ready to formally
> present the idea.
>
>
> >
> >   - What can we do to further harden TM<->TM communications and reduce
> > blast radius?
> >
>
> DN:  Another topic for the design discussions.  I think the basic idea is to
> not have a SPoF, which means multiple TMs polling each cache and multiple
> TMs available to provide status to TRs, Caches, and TSs.
>
>
>
> > Big thumbs up on decoupling TM from Traffic Ops. What does this
> > practically mean - no more monitoring.json? Can we document specifically
> > which APIs TM will use?
> > (Aside, we might want to think about this as an opportunity to move TM
> > into its own repository - assuming the community decides to go ahead with
> > separate repos per component).
> >
>
> DN:  I think that is a stretch goal for now.  TM will still have to get
> its configuration from somewhere, but ideally it does not have to come
> from TO.  Ultimately I would like TO to just serve the basic data from the
> database and build services that can be used to generate configs using
> business logic.  We sort of did this with t3c where it gets all of the
> information it needs from TO without relying on config file APIs
> that used to be in TO (maybe still are?).  However, t3c is purely client
> side and I prefer a more centralized approach with something like a TM
> configuration service that can read from TO and use the data to populate
> APIs for TM to get its config.  That way we could define just the data we
> need in TM and a user could choose to run the TM configuration service
> which talks to TO or provide the required data using a different backend
> system.  I think this is probably a larger conversation we need to have
> when we start talking about how we are going to design the distributed TM.
>
> As for its own repo, that is a larger conversation.  I am not sure what
> that means for all of the ancillary pieces like cdn-in-a-box, the pkg
> script, etc. If it is worth the trouble then I am all for it, but I don't
> think we should let this thread get bogged down with that conversation.
>
> >
> >
> >
> > On Thu, Jun 17, 2021 at 6:09 PM Dave Neuman <[email protected]> wrote:
> >
> > > Hey All,
> > > One of the things we have been talking about doing for a long time is
> > > making Traffic Monitor capable of monitoring a subset of the CDN so that
> > > it can be deployed in a distributed fashion.  The time has come for us
> > > to get moving on this.  We have had some discussions internally to
> > > understand what requirements we have for doing this, but I wanted to
> > > solicit feedback from the community to see if there are potentially
> > > other requirements that we may have missed.  Please take a look at the
> > > requirements we have identified below and let me know what feedback you
> > > have.  At this point in time I am trying to keep this conversation
> > > separate from the design conversation and just focus on the
> > > requirements.  Once we all agree on the requirements we can start
> > > discussing the design.  You will notice that this proposal also includes
> > > adding the ability to integrate with external monitoring systems.  I
> > > figured now would be a good time to add that functionality in as well.
> > >
> > >
> > > *Abstract*
> > >
> > > Update Traffic Monitor so that it is capable of monitoring only part of
> > > the CDN while still providing a single API for clients to get cache
> > > stats, delivery stats, and cache availability for a whole CDN.  Add the
> > > ability to integrate with other systems that perform additional health
> > > monitoring and consider the status of these systems when making health
> > > decisions for a cache.  Ensure that the Traffic Monitor API is capable
> > > of serving thousands of simultaneous clients, such as all of the caches
> > > in a CDN.
> > >
> > >
> > > *Problem Statement*
> > >
> > > Currently Traffic Monitor can only monitor an entire CDN. This means
> > > that Traffic Monitor has to poll every single cache in a CDN before
> > > making cache health decisions and being able to provide statistics. This
> > > also means that Traffic Monitors need to be located in a centralized
> > > place where they can reach everything, which isn't exactly
> > > representative of what a client might see. While this has worked really
> > > well for us to date, we know that at some point we will run into scaling
> > > issues which prohibit us from polling caches faster.  In order to solve
> > > our impending scaling issues as well as improve our ability to make
> > > better and faster health decisions, Traffic Monitor needs to run in a
> > > distributed fashion instead of an all-or-nothing fashion.
> > >
> > > Furthermore, there is a growing need to provide support for external
> > > monitoring systems in Traffic Monitor.  Traffic Monitor needs to be able
> > > to use other monitoring systems to aid in the health decision process.
> > > While this could be solved in today's Traffic Monitor, it is best to
> > > solve this problem in conjunction with making the polling distributed.
> > >
> > > *Business Justification*
> > >
> > > In order to provide the best customer experience possible, we need to
> > > have a robust and timely health monitoring system.  While Traffic
> > > Monitor has been sufficient to date, we need to make sure that we are
> > > adapting to meet the needs of the near future and evolving to continue
> > > to meet customers' needs.  These changes to Traffic Monitor are
> > > imperative to providing cache health data as near to real time as
> > > possible for our ever-growing CDN.
> > >
> > > *Business Requirements*
> > >
> > >    - Traffic Monitor MUST be capable of being configured to monitor a
> > >    portion of a CDN
> > >    - Traffic Monitor MUST be capable of being configured to monitor all
> > >    caches in a CDN
> > >    - Traffic Monitor MUST provide an API to get the health status of ALL
> > >    caches in the CDN
> > >    - Traffic Monitor MUST provide an API to get statistics (from e.g.
> > >    astats data) generated by ALL caches in the CDN. This does not
> > >    include any statistics generated by external monitoring systems.
> > >    - Traffic Monitor MUST log all requests to its API including AT LEAST
> > >    the following information: timestamp, client IP, resource requested,
> > >    response code, response reason, time to serve.
> > >    - Traffic Monitor MUST provide an API to get the status of caches it
> > >    monitors
> > >    - Traffic Monitor MUST log all health state changes for a cache
> > >    whether the decision is made internally or from an external system.
> > >    - Traffic Monitor MUST provide the ability to have more than 1
> > >    Traffic Monitor monitor the same cache and come to consensus on the
> > >    health of the cache.
> > >    - Traffic Monitor SHOULD provide a way to configure more than one
> > >    subset of caches to monitor – e.g. as a primary and backup.
> > >    - Traffic Monitor SHOULD provide a way to integrate with external
> > >    services to provide additional cache health monitoring
> > >    - Traffic Monitor SHOULD have the capability to provide a non-boolean
> > >    health score for a cache - e.g. a number between 0 - 100
> > >    - Traffic Monitor MAY be decoupled from Traffic Ops for configuration
> > >    generation
> > >
> >
>
