Hi Michael,

Thanks for your review.
And sorry for the delay: I was not too sure how to react to this review. Another review after WGLC, to be integrated into the IETF LC? A document shepherd review that needs to be addressed for the document to progress?

Anyway, see inline.
Attached you will find a diff with some new proposed text. Let us know if this addresses your concerns.

On 9/13/2022 12:45 AM, Michael Richardson wrote:
I have read draft-ietf-opsawg-service-assurance-architecture at the request
of a few people.  This is not part of any directorate review (that I
remember, or that shows up in my review list).  If it's useful for me to plug
this in somewhere, let me know.

I find the document well written, and to me rather ambitious.
That might be because my level of understanding of modern network management
is poor.

I found section 3.1.1. Circular Dependencies to be interesting, and I think
telling.   As soon as I saw "DAG" in the previous section, I was all, "yeah, 
but..."
I'm not convinced that the process described in 3.1.1 is something that a
computer program can do, versus that it (the service and the components that
build the service) has to be designed to be cycle-free from the beginning.
It seems to me that this document either has to constrain what services can
be built by deciding upon a canonical way to describe many things, or that
different vendors will create interoperable models only by chance.
Typically, it's only when assurance graphs are combined that we might have circular dependencies, so in practice we don't believe we are going to see many instances of those. Different vendors/controllers assuring different parts of the network don't have the exact same dependencies, even if that would be welcome. We believe that this circular dependency removal can be programmed, but we also fully agree that good design should avoid circular dependencies in the first place.
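To make the "can be programmed" claim concrete, here is a minimal sketch (not from the draft; function names, the example graphs, and the edge-removal strategy are all illustrative assumptions) of detecting a cycle that only appears once two vendors' assurance graphs are merged, and breaking it by dropping one edge of the detected cycle:

```python
# Hypothetical sketch: each vendor's assurance graph is a DAG on its own,
# but merging them may close a dependency loop that must then be broken.
from collections import defaultdict

def find_cycle(graph):
    """Return one dependency cycle as a list of nodes, or None if the graph is a DAG."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = defaultdict(int)
    stack = []

    def dfs(node):
        color[node] = GRAY
        stack.append(node)
        for dep in graph.get(node, []):
            if color[dep] == GRAY:                  # back edge: cycle found
                return stack[stack.index(dep):] + [dep]
            if color[dep] == WHITE:
                cycle = dfs(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in list(graph):
        if color[node] == WHITE:
            cycle = dfs(node)
            if cycle:
                return cycle
    return None

vendor_a = {"service": ["tunnel"], "tunnel": ["link"]}   # a DAG by itself
vendor_b = {"link": ["service"]}                         # also a DAG by itself
merged = {**vendor_a, **vendor_b}                        # ...but merging closes a loop

cycle = find_cycle(merged)
if cycle:
    # One simple removal strategy: drop the last edge of the detected cycle.
    merged[cycle[-2]].remove(cycle[-1])
```

After the removal, `find_cycle(merged)` returns None and the merged graph is a DAG again; a real implementation would of course pick the edge to drop based on domain knowledge rather than arbitrarily.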

section 3.6. Handling Maintenance Windows
seems a bit light to me.
I think that there are three aspects which need to be emphasized:
   a) maintenance windows where components are marked in maintenance, but
      that the service itself should continue to operate (with a lower score),
      because some redundancy takes over.
      A key issue here is that sometimes this results in a "boy-who-cried-wolf"
      situation, where the lower score and lack of resiliency is then
      overlooked later on.  The broken thing never gets repaired, and then
      some other fault or maintenance causes an actual failure.
Actually, it depends on the intent.
If the intent is to have a backup link at all times, then yes, the service continues to operate with a lower score.
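As an illustration of "continues to operate with a lower score", here is a hypothetical scoring sketch (the function, the scoring formula, and the link names are our own assumptions, not from the draft): a service backed by redundant links stays up while one link is under maintenance, but its score drops to reflect the lost redundancy, which is exactly the signal that must not be overlooked later on.

```python
# Hypothetical sketch: health score of a service backed by redundant links
# (0 = broken, 100 = fully healthy). Links under maintenance contribute
# nothing; the service survives as long as one link is up, but the score
# reflects the reduced resiliency.
def service_score(link_scores, maintenance):
    active = [score for link, score in link_scores.items()
              if link not in maintenance]
    if not active:
        return 0                     # no link left: the service is down
    return max(active) * len(active) // len(link_scores)

links = {"link-1": 100, "link-2": 100}
service_score(links, maintenance=set())        # -> 100: full redundancy
service_score(links, maintenance={"link-2"})   # -> 50: running on one link only
```

The degraded-but-nonzero score is what should keep the broken/under-maintenance component visible until it is actually repaired.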

   b) components are marked for maintenance, which have service impacting
      effects, but during which, other components fail.  To make analogy,
      you don't care so much if your car steering system does not operate
      while the starter motor is not operational.  But, as soon as you fix the
      starter motor (taking hours to day), you find that you still can not
      go.   You could have fixed both systems in parallel/concurrently, if only
      you'd known.
There are two cases here.
1. You knew (from the assurance graph) that the car steering system did not operate when going in for the starter motor maintenance. In that case, you could be solving both in parallel during the maintenance window.
2. You don't know, and you will learn about the broken-down car steering system when back from the starter motor maintenance, at the time of recomputing the assurance graph and looking at the health of each subservice.

   c) as the example gives about an update to an device OS.  This sometimes
      comes with unintended (or poorly documented) side effects which cause
      other failures, or knock-on updates.  For instance, you upgrade the
      OS and then TLS 1.1 is disabled in favour of TLS 1.2 and TLS 1.3, but
      other components are in critical use, and have not yet been updated,
      and only TLS 1.1 was supported.
Sure. This is similar to case 2 above.

(c) in many ways means that the DAG *itself* might need to be updated.
How do you transition from one dependency DAG to another dependency DAG?
I guess that section 3.9 gets into this, but it seems rather weak.
Proposal:
1. We need to add the concept that a service depending on under-maintenance subservices will receive the "under maintenance" symptom and has to take it into account in its health computation. How? We don't want to go into the specifics of health aggregation in this specification.
2. Add some text saying that the DAG might have to be recomputed after a subservice comes out of maintenance.
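A minimal sketch of the first point (the function, graph shape, and symptom encoding are illustrative assumptions, not from the draft): walk the assurance graph below a service and collect the "under maintenance" symptom from every reachable subservice, while deliberately leaving open how the symptom is aggregated into a health score.

```python
# Hypothetical sketch: propagate the "under maintenance" symptom upward so
# that a dependent service can take it into account in its health
# computation. The aggregation itself is intentionally left unspecified.
def collect_symptoms(service, dependencies, under_maintenance):
    """Gather (subservice, symptom) pairs from everything `service` depends on."""
    symptoms = []
    seen = set()
    stack = [service]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in under_maintenance:
            symptoms.append((node, "under maintenance"))
        stack.extend(dependencies.get(node, []))
    return symptoms

deps = {"vpn-service": ["tunnel", "peer"], "tunnel": ["physical-link"]}
collect_symptoms("vpn-service", deps, under_maintenance={"physical-link"})
# -> [("physical-link", "under maintenance")]
```

When `physical-link` comes out of maintenance, recomputing the graph (the second point of the proposal) would simply make this walk return no symptoms.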

3.8. Timing
Starts talking about NTP, and synchronization.
Then goes into garbage collection, and I think that maybe this transition in
the text could be better presented.
You are right.
We propose to move the following text (which is not substantial enough to deserve its own section) to just before Section 3.1:

       The SAIN architecture requires time synchronization, with Network
       Time Protocol (NTP) [RFC5905] as a candidate, between all elements:
       monitored entities, SAIN agents, the service orchestrator, the SAIN
       collector, as well as the SAIN orchestrator.  This guarantees that all
       symptoms in the system are correlated with the right assurance graph
       version.


And rename section 3.8 "Timing" to "Garbage Collection"


I feel that this SAIN architecture is quite ambitious, and I'm not sure that
there is enough here to actually create interoperable implementations.
My group created a prototype, and I know of another one.
And there is an open-source implementation (presented by Prof. Benoit Donnet in the past). The interoperability part will be in linking the YANG modules, which we addressed together with the circular dependencies.

Regards, Jean and Benoit



_______________________________________________
OPSAWG mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/opsawg

[Attachment: draft-ietf-opsawg-service-assurance-architecture-09-from-8.diff.html]
