Hi All,

This thread gives a wider view of the two approaches on
the table for the "management and operations sequences streamlining"
discussion.

I would still greatly appreciate a high-level discussion of the issue
itself and the different approaches. I hope the preliminary example
algorithms below shed some more light on the differences between the
approaches and help the community decide which is preferable.

Thank you all,
Nir



============================================================
=============================================================
*"Simple" traffic-ops orchestrated solution highlights*

In a "simple" solution, traffic-ops follows the steps below when a delivery
service's list of servers is modified:

   1. Queue the delivery-service configuration additions to the traffic-servers:
   E.g. add the new remap rule to "remap.config" of each traffic-server
   newly assigned to a delivery-service.
   2. Wait for all [updated] servers to acknowledge that the new
   configuration was pulled.
   3. Update traffic-router with the new delivery-service cr-config
   4. Queue the delivery-service configuration removal from the
   traffic-servers:
   E.g. Remove the remap rule from the "remap.config" of each
   traffic-server no longer assigned to a delivery-service.
   5. Possibly wait for all [updated] servers to acknowledge that the
   latest configuration was deployed, before allowing a new configuration
   cycle.
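
The five steps above could be sketched roughly as follows. This is only an
illustration in Python, not existing traffic-ops code: FakeCdn, its methods
and reassign_servers are hypothetical names, and the acknowledgement wait is
reduced to a log entry where a real implementation would poll the servers.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCdn:
    """Hypothetical in-memory stand-in for what traffic-ops would drive."""
    remap: dict = field(default_factory=dict)   # server -> set of delivery services
    router: dict = field(default_factory=dict)  # delivery service -> set of servers
    log: list = field(default_factory=list)     # ordered record of operations

    def queue_add(self, server, ds):
        self.remap.setdefault(server, set()).add(ds)
        self.log.append(("add", server, ds))

    def queue_remove(self, server, ds):
        self.remap.get(server, set()).discard(ds)
        self.log.append(("remove", server, ds))

    def wait_for_ack(self, servers):
        # Assumption: a real implementation would block here until every
        # listed server reports that it pulled the queued configuration.
        self.log.append(("ack", tuple(sorted(servers))))

    def update_router(self, ds, servers):
        self.router[ds] = set(servers)
        self.log.append(("router", ds, tuple(sorted(servers))))


def reassign_servers(cdn, ds, old_servers, new_servers):
    """Apply a server-list change in the add / router / remove order."""
    added = set(new_servers) - set(old_servers)
    removed = set(old_servers) - set(new_servers)
    for server in sorted(added):          # step 1: add remap rules
        cdn.queue_add(server, ds)
    cdn.wait_for_ack(added)               # step 2: wait for the new servers
    cdn.update_router(ds, new_servers)    # step 3: update the router
    for server in sorted(removed):        # step 4: remove old remap rules
        cdn.queue_remove(server, ds)
    cdn.wait_for_ack(removed)             # step 5: optional final wait
```

The point of the ordering is that the router never points at a server that
lacks the remap rule: rules are added before the router learns about a
server, and removed only after the router has stopped using it.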


The same steps also hold for the "delivery service HOST_REGEXP change":
#1 - Add the new remap rule to each assigned traffic-server's "remap.config"
#4 - Remove the old remap rule from each assigned traffic-server's
"remap.config"

Many more details are probably missing, but basically, this algorithm is
relatively simple and clear.
Additionally, as a first step, the operation may be done in "global"
scope, and only later refined to work independently
per delivery-service.
Furthermore, most changes are likely to be limited to traffic-ops and
isolated from other flows in the system. Being centralized may make the
process more stable as well as easier to debug via proper log messages.

============================================================
===========================================================
*"Flexible" traffic-router based solution for delivery-service
configuration deployment.*

Let's define a delivery-service configuration "generation". Such a
"generation" would be an ordinal identifier for a delivery-service
configuration.
A "generation" changes whenever a new configuration is applied that changes
the remap rule on some of the servers, or the content-to-server assignment.
Mainly:

   1. Adding the delivery service
   2. Assigning new traffic servers to the existing delivery service
   (changing the "consistent hash" assignment done by traffic router)
   3. Removing the delivery service
   4. Removing assigned traffic-servers from the delivery service.
   5. More complicated scenarios to be discussed:
      1. Moving a server between cache groups.
      2. Changing the HOST_REGEXP of the delivery service.

Under this definition, the remap rules and crconfig.json will be
conceptually broken into "per delivery service" segments. These segments
can be managed independently, but this is not required in the first step.

At any given moment, each traffic-server holds a single generation of a
"remap rule configuration" for each relevant delivery service.
The traffic router, on the other hand, holds for each known HOST_REGEXP a
stack of the relevant "delivery-service cr-config" segments, allowing it to
maintain a short history.
Furthermore, the traffic router knows which configuration generation was
read by which traffic-server for each delivery service. This can be done
using traffic-monitor via astats.
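
As a rough illustration of the state described above, the router side could
keep something like the structure below. This is a sketch under assumptions:
Segment, RouterState, max_history and the method names are hypothetical, and
a cr-config segment is reduced to a generation number plus a server list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """One generation of a delivery service's cr-config segment (simplified)."""
    generation: int
    servers: tuple


@dataclass
class RouterState:
    # HOST_REGEXP -> list of segments, oldest first, top of the stack last
    stacks: dict = field(default_factory=dict)
    # (server, HOST_REGEXP) -> generation the server last reported reading
    reported: dict = field(default_factory=dict)
    # Assumption: how many generations the "short history" keeps (must be >= 1)
    max_history: int = 3

    def push(self, host_regexp, servers):
        """Record a new generation for a delivery service and return its number."""
        stack = self.stacks.setdefault(host_regexp, [])
        generation = stack[-1].generation + 1 if stack else 1
        stack.append(Segment(generation, tuple(servers)))
        del stack[:-self.max_history]     # drop generations beyond the history
        return generation

    def report(self, server, host_regexp, generation):
        """Store the generation a server reported (e.g. learned via monitoring)."""
        self.reported[(server, host_regexp)] = generation
```

Each traffic-server would only need the single number it currently holds per
delivery service; the stack and the per-server bookkeeping live in the router.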

The main logic of this solution is implemented in the traffic-router, which
has to implement some algorithm when redirecting requests to a
traffic-server, taking the "generation" into account.
For example, when a new GET request reaches the traffic router, it can
follow the algorithm below (optimizations are required):

   1. Identify the HOST_REGEXP and choose the "cr-config" stack
   accordingly.
   Point to the "top" of the stack.
   2. Based on the "cr-config", choose the traffic-server to redirect to.
   This is done exactly as it is done today, based on the delivery
   service as well as the servers' health*.
   3. If the chosen server has the proper configuration generation,
   redirect to it (and we are done).
   4. Otherwise, move to the next cr-config segment in the stack, and go to
   step 2.

* A server holding a newer remap configuration generation for the delivery
service (compared to the one pointed at in the traffic router stack) is
considered "down" in the content-to-server assignment calculation.
Otherwise, the algorithm might end up with no server to redirect to.
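
The selection loop and the footnote rule above could look roughly like the
sketch below. It is an illustration only: pick_server and its arguments are
hypothetical, and the hash-modulo choice is a simple stand-in for the
router's real consistent hash, which is not reproduced here.

```python
import hashlib


def pick_server(stack, reported, healthy, request_path):
    """Walk the cr-config stack from newest to oldest and pick a server.

    stack:    list of (generation, servers) tuples, newest first
    reported: server -> generation that server currently holds
    healthy:  set of servers considered healthy
    Returns (server, generation), or None if no segment yields a match.
    """
    digest = hashlib.sha256(request_path.encode()).digest()
    key = int.from_bytes(digest[:8], "big")
    for generation, servers in stack:            # steps 1 and 4
        # Footnote rule: a server already on a newer generation is treated
        # as "down" when evaluating this (older) segment.
        candidates = sorted(
            s for s in servers
            if s in healthy and reported.get(s, 0) <= generation
        )
        if not candidates:
            continue
        # Step 2: stand-in for the consistent hash (plain modulo, assumption).
        chosen = candidates[key % len(candidates)]
        if reported.get(chosen) == generation:   # step 3
            return chosen, generation
    return None
```

For example, while no server has yet pulled generation 2, every request
falls through to the generation 1 segment, so the mapping seen by clients
stays the same as before the change was queued.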

The above algorithm tries to minimize the changes to the system behavior
when no change is applied. It also tries to avoid instability / cache
thrashing, by limiting temporary "consistent hash" results during the
transition.

In order to provide

On Thu, Feb 2, 2017 at 2:39 PM, Nir Sopher <[email protected]> wrote:

> Hi Eric,
> Actually, as we imagined it, a "generation" is created only when a new
> configuration is applied - when the "consistent hash" is permanently
> modified.
>
> I'll open a separate thread to discuss the technical details further,
> including an algorithm we have in mind.
>
> I also opened TC-130 - Streamlining TC management and operations sequences
> <https://issues.apache.org/jira/browse/TC-130> to further monitor the
> issue.
>
> Would appreciate community inputs about the issue, especially discussing
> the PROs and CONs of the 2 different approaches:
> Traffic Ops orchestrated solution vs. A more flexible, traffic-router
> algorithm based, solution.
>
> Nir
>
>
>
>
> On Wed, Feb 1, 2017 at 3:33 PM, Eric Friedrich (efriedri) <
> [email protected]> wrote:
>
>> Hey Nir-
>>   Interesting thought for sure.
>>
>> Would TM “health changes” (loss of connectivity, BW/loadavg too high)
>> change the generation count? It seems like the answer is Yes, because the
>> health of a cache impacts the state of the consistent hash ring.
>>
>> If so, how do these generation changes get from the Traffic Monitor to
>> the caches, when config changes typically come only from Traffic Ops and
>> only when ORT is run?
>>
>> Or maybe the generation count is just an abstraction to conceptualize the
>> problem space and not a literal approach?
>>
>> —Eric
>>
>> > On Feb 1, 2017, at 4:14 AM, Nir Sopher <[email protected]> wrote:
>> >
>> > Hi Eric,
>> >
>> > Formalizing the approach you suggested, one may introduce the concept
>> of a
>> > delivery-service configuration "generation" which would be an ordinal
>> > identifier for a delivery service configuration. A "generation"
>> changes
>> > whenever the remap rule changes or the consistent hash mapping of
>> content
>> > to server changes (e.g. due to additional server assignment).
>> > In such a solution, each traffic-server may hold a single generation for
>> > each delivery service configuration, while traffic-router may hold a
>> > history of generations and know which server holds which configuration
>> > generation.
>> >
>> > This approach introduces a considerable flexibility. It allows
>> > configurations to be set one after the other with no need to wait
>> between
>> > them.
>> > It also fits well with Jeremy's suggestion for queue-update with a
>> delivery
>> > service granularity.
>> >
>> > On the other hand, complicated algorithms for solving the issue may
>> impose
>> > more risk to the network when applied, comparing to a simple
>> "traffic-ops"
>> > orchestrated solution.
>> >
>> > I'm not sure what is preferable from an operator point of view. I'm also
>> > not familiar with TC 3.0 configuration solution to validate the different
>> > approaches against.
>> >
>> > Please share your thoughts,
>> > Thanks,
>> > Nir
>> >
>> > On Tue, Jan 31, 2017 at 6:26 PM, Eric Friedrich (efriedri) <
>> > [email protected]> wrote:
>> >
>> >> What about an approach (apologies, still light on details), where TR
>> >> (perhaps still via TM) discovers the availability of delivery services
>> from
>> >> the cache itself, rather than from the CRConfig file? (Astats or its
>> >> remap_stats based replacement would publish its remap rules)
>> >>
>> >> Any changes to the set of servers (add/remove) or DS assignments would
>> not
>> >> require a specific step to push a changed config to the router. If a
>> cache
>> >> does not yet, or no longer has remap rules for a specific delivery
>> service,
>> >> then TR will not see that rule advertised by the cache and will not
>> send it
>> >> traffic. If adding or removing a server, TM still needs to be updated
>> to
>> >> learn about the new server.
>> >>
>> >> With current configuration, there's a race condition of a few seconds
>> where
>> >> a cache removes remap rule before TM polls and TR gets health info
>> from TM.
>> >> In these few seconds, TR would erroneously send traffic to a cache
>> without
>> >> a proper remap rule.
>> >>
>> >> We could fix this by
>> >>  a) advertising a state of the remap rule in astats to notify TR no
>> >> longer to send traffic on that DS for a short period before the rule is
>> >> actually removed - all handled inside of ORT).
>> >>    or
>> >>  b) prematurely removing the remap rule from astats, before the config
>> on
>> >> TS is actually updated (at the cost of missing the final few remap
>> stats
>> >> numbers). This is probably unacceptable.
>> >>
>> >> I’m sure there are other variants on this, but my main goal is for TR
>> to
>> >> directly learn from the caches which delivery services they actually
>> have
>> >> available. Rather than the TR learning what TO only thinks each cache
>> has
>> >> available.
>> >>
>> >> —Eric
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>> On Jan 31, 2017, at 8:10 AM, Nir Sopher <[email protected]> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> In order to further improve the simplicity and robustness of the
>> control
>> >>> path for provisioning infrastructure and delivery services, we are
>> >>> currently considering ways to streamline management and operations.
>> >>>
>> >>> Currently, when applying changes in traffic-control that require the
>> >>> synchronization between the traffic-router and traffic-servers, the
>> user
>> >>> should be conscious to do so in a certain order. Otherwise, "black
>> holes"
>> >>> may be created. Furthermore, in some of the scenarios the user have to
>> >> wait
>> >>> and verify that the configuration reached all traffic server before he
>> >> may
>> >>> apply it to the traffic-router.
>> >>>
>> >>> We have noticed that TC-3.0 is planned to include a "Config State
>> >> Machine",
>> >>> probably dealing with the issue thoroughly. We have no further
>> >> information
>> >>> about this bullet and would appreciate any additional info.
>> >>>
>> >>> We would like to start investing in making TC operations more
>> streamline,
>> >>> robust and user-friendly.
>> >>>
>> >>> The main use-cases we would like to address at this point are:
>> >>>
>> >>>  1. Assign servers to a Delivery-Service.
>> >>>  For this operation, the configuration must first be applied to the
>> >> added
>> >>>  traffic servers, propagate, and only then applied to the
>> >> traffic-router.
>> >>>  2. Remove servers assignment to a Delivery-Service.
>> >>>  For this operation, the configuration must first be applied to the
>> >>>  traffic-router, and only then to the traffic-servers.
>> >>>  3. Add a new delivery service.
>> >>>  This is practically a private case of servers assignment to a
>> >>>  delivery-service.
>> >>>  4. Delete a delivery service.
>> >>>  This is practically a private case of servers assignment removal
>> from a
>> >>>  delivery-service.
>> >>>  5. Update settings that must be applied together on the traffic
>> servers
>> >>>  and the router.
>> >>>
>> >>> We would like to simplify the procedure, allowing the deployment of
>> new
>> >>> configuration in a single operation, instead of doing it step by step.
>> >>>
>> >>> One solution can be based on the insight that deploying such
>> >> configuration
>> >>> changes may be done by initially updating the traffic server with
>> added
>> >>> functionality (e.g remap-rule), then updating the router, and lastly,
>> >>> removing old functionality from the traffic servers. Such a solution
>> can
>> >> be
>> >>> orchestrated by traffic-ops, probably without complicating other
>> >> components.
>> >>>
>> >>> Other solutions may provide more flexibility, but would probably
>> involve
>> >>> adding complexity to other components such as traffic-router.
>> >>>
>> >>> We would be glad to hear the community's thoughts on the matter, so we
>> >> can
>> >>> take this further.
>> >>>
>> >>> Thanks,
>> >>> Nir
>> >>
>> >>
>>
>>
>
