Hey Eric,
Thanks for the questions/feedback.  My responses are inline below.  Most of
your questions will need to be addressed during design; right now I just
want to make sure we are not missing any requirements.  I hope to start
design discussions in the next week or two.

Thanks,
Dave

On Fri, Jun 25, 2021 at 7:26 AM Eric Friedrich <[email protected]> wrote:

> Some comments and questions jointly compiled
>
>   - How is TM configured to monitor a subset of a CDN, is it a static
> allocation of caches to TMs?
>

DN:  I think that is to be determined when we start to think about design,
which is after we agree on the requirements.  I think for our use case the
simplest way to do this would be by cache group.  A Traffic Monitor could be
configured to monitor one or more cache groups.  However, if there is a
better way we could do this, I am all ears.

>
>   - Can you describe how the primary + backup work. Do they both poll the
> cache simultaneously
>

DN: Again, I think we can sort out the details when we talk about design.
It actually might make more sense to just have multiple TMs monitor a cache
group and treat them all as "live", which has the benefit of providing more
than one view of a cache.


>   - If a TM fails, how do the TMs heal / reallocate polling
> responsibilities. Does another TM pick up the slack?
>

DN:  You want to dive straight into design :). I think the easiest answer
here is to ensure multiple TMs are polling each cache and that they are all
treated as live, then we can just use the optimistic consensus that is
already built into TM.
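
As a rough illustration only (this is not TM's actual code), the optimistic
idea could be sketched in Python as below.  The function name, the TM names,
and the cache names are all made up for the example.  The assumption is that
a cache counts as available if at least one TM polling it reports it as
available.

```python
def combine_optimistic(reports):
    """Merge per-TM availability reports into one view of the CDN.

    reports maps a (hypothetical) TM name to a dict of cache name -> bool.
    A cache is available in the combined view if ANY reporting TM says so.
    """
    combined = {}
    for cache_states in reports.values():
        for cache, available in cache_states.items():
            combined[cache] = combined.get(cache, False) or available
    return combined


# Two TMs with overlapping coverage that disagree about edge-01.
reports = {
    "tm-east": {"edge-01": True, "edge-02": False},
    "tm-west": {"edge-01": False, "edge-02": False, "edge-03": True},
}
combined = combine_optimistic(reports)
```

With this approach a single TM that loses its path to a cache cannot take the
cache out of rotation on its own, which is the property we would want to keep
when TMs only see part of the CDN.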


>
>   - What prevents a misconfiguration where some caches are not polled by
> any TM?
>

DN:  Great question.  I don't think that is one I have considered, but I
suppose we could add a requirement saying that TM must have a way to
identify unpolled caches...what do you think?
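
To show what I mean, here is a minimal sketch in Python of how such a check
might look.  It assumes TMs are assigned by cache group, which is not decided
yet, and every name in it (the function, the TMs, the cache groups) is
hypothetical.

```python
def find_unpolled_caches(cache_to_group, tm_to_groups):
    """Return the caches whose cache group is not assigned to any TM.

    cache_to_group: dict of cache name -> cache group name
    tm_to_groups:   dict of TM name -> list of cache group names it polls
    """
    polled_groups = set()
    for groups in tm_to_groups.values():
        polled_groups.update(groups)
    return sorted(
        cache for cache, group in cache_to_group.items()
        if group not in polled_groups
    )


cache_to_group = {
    "edge-01": "cg-denver",
    "edge-02": "cg-denver",
    "edge-03": "cg-boston",
    "edge-04": "cg-seattle",
}
tm_to_groups = {
    "tm-west": ["cg-denver"],
    "tm-east": ["cg-boston", "cg-denver"],
}
unpolled = find_unpolled_caches(cache_to_group, tm_to_groups)
```

In this example nobody polls cg-seattle, so edge-04 is the only cache
reported.  Whatever we end up designing, something needs the full list of
caches and the full list of assignments to be able to do this comparison.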


>
>   - Are there any minimums/maximums to how many TMs will poll a cache?
>
DN: The minimum is one and the maximum is up to the operator; I don't know
of a limit in TM.


>
>   - What is meaning of non-boolean 0-100 health? How is this computed and
> how is it used?
>

DN:  The health score stuff is going to be an entirely different topic
because I don't think it needs to be conflated with distributed polling.  I
put that requirement in because I wanted to document that this is something
we are thinking about so that we don't make it difficult on ourselves when
we do this refactor.
Right now a cache's health is boolean: it either gets traffic or it
doesn't.  The idea behind the health score is that we could assign
different health scores for caches in a cache group and then TR can use
that when determining which cache to choose.  Maybe you have multiple
caches that are getting close to the bandwidth limit.  Instead of pulling
all traffic from them, we could simply weight them lower so that TR prefers
other caches, but can still use them if needed. We have a bunch of other
use cases that are probably best saved for when we are ready to formally
present the idea.
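
Just to illustrate the weighting idea, here is a small Python sketch.  It is
not a proposal for how TR would actually do it.  It assumes a weighted random
choice where a score of 0 means "no traffic", and the function name, scores,
and cache names are invented for the example.

```python
import random


def pick_cache(health_scores, rng=random):
    """Pick one cache, favoring higher health scores (0-100).

    Caches with a score of 0 are never chosen.  Returns None when no
    cache has a positive score.
    """
    candidates = {c: s for c, s in health_scores.items() if s > 0}
    if not candidates:
        return None
    names = list(candidates)
    weights = [candidates[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


# edge-02 is near its bandwidth limit, so it is weighted lower but stays
# usable; edge-03 is unhealthy and gets nothing.
scores = {"edge-01": 100, "edge-02": 20, "edge-03": 0}
rng = random.Random(0)  # seeded only so the example is repeatable
picks = [pick_cache(scores, rng) for _ in range(1000)]
```

Over many requests edge-01 should be picked far more often than edge-02, and
edge-03 never is, which is the "prefer other caches but still use them if
needed" behavior described above.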


>
>   - What can we do to further harden TM<->TM communications and reduce
> blast radius?
>

DN:  Another topic for the design discussions.  I think the basic idea is to
not have a SPoF, which means multiple TMs polling each cache and multiple
TMs available to provide status to TRs, Caches, and TSs.



> Big thumbs up on decoupling TM from Traffic Ops. What does this practically
> mean - no more monitoring.json? Can we document specifically which APIs TM
> will use?
> (Aside, we might want to think about this as an opportunity to move TM into
> its own repository- assuming the community decides to go ahead with
> separate repos per component).
>

DN:  I think that is a stretch goal for now.  TM will still have to get
its configuration from somewhere, but ideally it does not have to come
from TO.  Ultimately I would like TO to just serve the basic data from the
database and build services that can be used to generate configs using
business logic.  We sort of did this with t3c where it gets all of the
information it needs from TO without relying on config file APIs
that used to be in TO (maybe still are?).  However, t3c is purely client
side and I prefer a more centralized approach with something like a TM
configuration service that can read from TO and use the data to populate
APIs for TM to get its config.  That way we could define just the data we
need in TM and a user could choose to run the TM configuration service
which talks to TO or provide the required data using a different backend
system.  I think this is probably a larger conversation we need to have
when we start talking about how we are going to design the distributed TM.

As for its own repo, that is a larger conversation.  I am not sure what
that means for all of the ancillary pieces like cdn-in-a-box, the pkg
script, etc. If it is worth the trouble then I am all for it, but I don't
think we should let this thread get bogged down with that conversation.

>
>
>
> On Thu, Jun 17, 2021 at 6:09 PM Dave Neuman <[email protected]> wrote:
>
> > Hey All,
> > One of the things we have been talking about doing for a long time is
> > making Traffic Monitor capable of monitoring a subset of the CDN so that
> it
> > can be deployed in a distributed fashion.  The time has come for us to
> get
> > moving on this.  We have had some discussions internally to understand
> what
> > requirements we have for doing this, but I wanted to solicit feedback
> from
> > the community to see if there are potentially other requirements that we
> > may have missed.  Please take a look at the requirements we have
> identified
> > below and let me know what feedback you have.  At this point in time I am
> > trying to keep this conversation separate from the design conversation
> and
> > just focus on the requirements.  Once we all agree on the requirements we
> > can start discussing the design.  You will notice that this proposal also
> > includes adding the ability to integrate with external monitoring
> systems.
> > I figured now would be a good time to add that functionality in as well.
> >
> >
> > *Abstract*
> >
> > Update Traffic Monitor so that it is capable of monitoring only part of
> the
> > CDN while still providing a single API for clients to get cache stats,
> > delivery stats, and cache availability for a whole CDN.  Add the ability
> to
> > integrate with other systems that perform additional health monitoring
> and
> > consider the status of these systems when making health decisions for a
> > cache.  Ensure that the Traffic Monitor API is capable of serving
> thousands
> > of simultaneous clients, such as all of the caches in a CDN.
> >
> >
> > *Problem Statement*
> >
> > Currently Traffic Monitor can only monitor an entire CDN. This means that
> > Traffic Monitor has to poll every single cache in a CDN before making
> cache
> > health decisions and being able to provide statistics. This also means
> that
> > Traffic Monitors need to be located in a centralized place where it can
> get
> > to everything, which isn't exactly representative of what a client might
> > see. While this has worked really well for us to date, we know that at
> some
> > point we will run into scaling issues which prohibit us from polling
> caches
> > faster.  In order to solve our impending scaling issues as well as
> improve
> > our ability to make better and faster health decisions, Traffic Monitor
> > needs to run in a distributed fashion instead of an all or nothing
> > fashion.
> >
> > Furthermore, there is a growing need to provide support for external
> > monitoring systems in Traffic Monitor.  Traffic Monitor needs to be able
> to
> > use other monitoring systems to aid in the health decision process. While
> > this could be solved in today's Traffic Monitor, it is best to solve this
> > problem in conjunction with making the polling distributed.
> >
> > *Business Justification*
> >
> > In order to provide the best customer experience possible, we need to
> have
> > a robust and timely health monitoring system.  While Traffic Monitor has
> > been sufficient to date, we need to make sure that we are adapting to
> meet
> > the needs of the near future and we need to make sure that we are
> evolving
> > to continue to meet customers' needs.  These changes to Traffic Monitor
> > are imperative to providing cache health data that is as near real time
> > as possible on our ever-growing CDN.
> >
> > *Business Requirements*
> >
> >    - Traffic Monitor MUST be capable of being configured to monitor a
> >    portion of a CDN
> >    - Traffic Monitor MUST be capable of being configured to monitor all
> >    caches in a CDN
> >    - Traffic Monitor MUST provide an API to get the health status of ALL
> >    caches in the CDN
> >    - Traffic Monitor MUST provide an API to get statistics (from e.g.
> >    astats data) generated by ALL caches in the CDN. This does not include
> >    any statistics generated by external monitoring systems.
> >    - Traffic Monitor MUST log all requests to its API including AT LEAST
> >    the following information: timestamp, client IP, resource requested,
> >    response code, response reason, time to serve.
> >    - Traffic Monitor MUST provide an API to get the status of caches it
> >    monitors
> >    - Traffic Monitor MUST log all health state changes for a cache,
> >    whether the decision is made internally or from an external system.
> >    - Traffic Monitor MUST provide the ability to have more than 1 Traffic
> >    Monitor monitor the same cache and come to consensus on the health of
> >    the cache.
> >    - Traffic Monitor SHOULD provide a way to configure more than one
> >    subset of caches to monitor – e.g. as a primary and backup.
> >    - Traffic Monitor SHOULD provide a way to integrate with external
> >    services to provide additional cache health monitoring
> >    - Traffic Monitor SHOULD have the capability to provide a non-boolean
> >    health score for a cache - e.g. a number between 0 - 100
> >    - Traffic Monitor MAY be decoupled from Traffic Ops for configuration
> >    generation
> >
>
