Hi Benoit,

Thanks for the response.  By and large we are on the same page, and I support 
this work.  And, as you know, I am clearly of the school that believes in 
exception-driven management and in providing actionable information, not raw 
data.

Anyway, as mentioned, there should perhaps be greater emphasis on the value of 
maintaining a dependency graph in general, and on explaining how it can 
complement and aid operational tasks from troubleshooting to impact analysis.  
It would also be good to add some material on how and where to instrument this 
effectively (not necessarily all pushed onto device agents; controllers and 
other components will also have a role to play here).  I remain sceptical 
regarding the specific use case of continuously maintaining a synthetically 
derived health score, but I look forward to the progression of this work in 
further iterations of the drafts.

--- Alex

From: Benoit Claise <[email protected]>
Sent: Friday, July 31, 2020 3:42 AM
To: Alexander Clemm <[email protected]>; 
[email protected]
Cc: [email protected]; [email protected]
Subject: Re: Comments on Service Assurance for Intent-Based Networking 
Architecture (e.g. draft-claise-opsawg-service-assurance-architecture)

Hi Alex,

Thanks for engaging.
Hi Benoit,

I have seen your presentations on Service Assurance for Intent-Based Networking 
Architecture and read your drafts with interest 
(draft-claise-opsawg-service-assurance-yang-05 and 
draft-claise-opsawg-service-assurance-architecture-03).  Interesting stuff on 
which I do have a couple of comments.

The basis for the drafts is in essence a proposal for Model-Based Reasoning 
(MBR), in which you capture dependencies between objects and make inferences by 
traversing the corresponding graph.  MBR based on dependency graphs allows one 
to reason about the impact and propagation of the status or health of one 
object on the status or health of dependent objects "downstream" from it.  
Likewise, traversing the same graph in the opposite direction (from the 
"downstream" or dependent objects) allows one to identify potential root causes 
for symptoms observed at those objects, although this seems not to be your main 
focus.
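To make the two traversals concrete, here is a minimal Python sketch; the 
object names (l3vpn-service, pe1-interface, pe1-linecard) and the class shape 
are invented for illustration and are not taken from the drafts:

```python
# Minimal sketch of Model-Based Reasoning over a dependency graph.
# All names here are illustrative, not from the SAIN drafts.
from collections import defaultdict

class DependencyGraph:
    def __init__(self):
        self.deps = defaultdict(set)        # object -> objects it depends on
        self.dependents = defaultdict(set)  # reverse edges

    def add_dependency(self, obj, depends_on):
        self.deps[obj].add(depends_on)
        self.dependents[depends_on].add(obj)

    def impacted_by(self, obj):
        """Downstream traversal: which dependent objects may be affected
        if obj degrades?"""
        seen, stack = set(), [obj]
        while stack:
            for dep in self.dependents[stack.pop()]:
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

    def candidate_root_causes(self, obj):
        """Upstream traversal: which objects could explain a symptom
        observed at obj?"""
        seen, stack = set(), [obj]
        while stack:
            for dep in self.deps[stack.pop()]:
                if dep not in seen:
                    seen.add(dep)
                    stack.append(dep)
        return seen

g = DependencyGraph()
g.add_dependency("l3vpn-service", "pe1-interface")
g.add_dependency("pe1-interface", "pe1-linecard")
print(g.impacted_by("pe1-linecard"))          # impact analysis
print(g.candidate_root_causes("l3vpn-service"))  # root-cause candidates
```

The same graph serves both questions; only the direction of traversal changes.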

While MBR as a concept makes sense and has a long tradition in network 
management, there are also a number of considerable issues with it, and I was 
wondering about your perspective and mitigation strategies for these.  For one, 
its effectiveness depends on the model being "complete".  In most cases, there 
are myriad interdependencies that are difficult to capture comprehensively.  
The model is still useful for many applications as a starting point, but rarely 
captures the full reality.  As long as users are clear about that, this is not 
an issue.
Point taken about the myriad interdependencies and graph completeness.
As you observe, even an incomplete graph is useful, especially when we can 
assure (networking) components within the assurance graph.
That way, the graph will tell us where the problem is not, which is equally 
important as telling us where the problem is or might be ... assuming, 
obviously, that we have complete heuristics for that component's assurance, 
which implies that the heuristics need to improve over time.



However, the one thing in your model that concerns me a bit is that you use it 
to draw conclusions about the health of the dependent objects (for example, 
your end-to-end service).  A derived health score is no substitute for 
monitoring the actual health, and it should not lull users into a false sense 
of security that, as long as they monitor the components of a system or 
service, they need not be concerned with monitoring the system or service as a 
whole.  In reality I believe the value (although there still is value) is more 
limited than that.  I believe this should be clearly acknowledged and discussed 
in the drafts.
This is the exact reason why I wrote in the slides: "This complements the 
end-to-end synthetic testing".
Indeed, the way service assurance is usually done is with end-to-end probing: 
OWAMP/TWAMP/IP SLA with thresholds on delay, packet loss, jitter, etc.  When 
the SLA degrades, the end-to-end probing can't really tell which components in 
the network have degraded (granted, there are exceptions).  The network is 
viewed as a black box.  Combining the inferred health score from the assurance 
graph with the end-to-end probing provides the correlation required for a much 
clearer view into the network.

Point very well taken: the "This complements the end-to-end synthetic testing" 
concept is not mentioned in the draft.  I will add it.  Thanks.


A second set of issues concerns the effort of maintaining the graph and of 
continuously updating the dependencies.  In a realistic system you will have 
many objects with even more interdependencies.  Maintaining derived health 
state can become computationally very expensive, which suggests a number of 
mitigation strategies: for one, do not maintain this continuously but compute 
it only "on demand".
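As a rough sketch of the "on demand" option: compute the derived score only 
when queried, memoizing within the query rather than keeping it continuously 
up to date.  The min-aggregation rule and all names below are illustrative 
assumptions, not something the drafts prescribe:

```python
# Hedged sketch: on-demand derived health, memoized per query.
# Aggregation rule (min of own score and dependency scores) is assumed
# for illustration only; the drafts do not prescribe it.
def derived_health(obj, own_scores, deps, cache=None):
    """Health of obj = min(its own score, derived health of its dependencies)."""
    if cache is None:
        cache = {}
    if obj in cache:
        return cache[obj]
    score = own_scores.get(obj, 100)
    for dep in deps.get(obj, ()):
        score = min(score, derived_health(dep, own_scores, deps, cache))
    cache[obj] = score
    return score

# Hypothetical component scores and dependencies:
own_scores = {"service": 100, "interface": 60, "linecard": 100}
deps = {"service": ["interface"], "interface": ["linecard"]}
print(derived_health("service", own_scores, deps))  # 60
```

Nothing is recomputed or stored between queries; the cost is paid only when an 
operator (or a probe correlation step) actually asks for the score.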
Yes, that's one way.

Second, perhaps don't maintain this on the server at all, at least to the 
extent that you expect the server to be a networking device.  It seems much 
more feasible to perform these types of Model-Based Reasoning computations in 
an Operations Support System or application outside the network, not within the 
network.  However, it is not clear that YANG models and NETCONF/RESTCONF would 
be applied there.  It seems to me the drafts should clarify where those models 
would be expected to be deployed and how they would be kept updated.  As an OSS 
tool, your proposal makes sense, but trying to process this on networking 
devices strikes me as very heavy, in particular given the limitations as per 
the earlier point.  So, IMHO, you may want to consider adding a corresponding 
section that discusses these aspects in the draft, specifically the 
architecture draft.
The architecture, with the YANG module, is actually designed to cover 
distributed graphs.
We can stream all metrics (whether YANG leaves, MIB variables, CLI output, 
syslog, what have you) to an OSS, sure.
However, I believe in data aggregation, as we know that we will quickly reach 
the limitations of the streaming capabilities.
And I also believe in each component being responsible for its own assurance, 
to the best of its knowledge.
Hence the proposal to go via a SAIN agent, inside or outside a router, to send 
the inferred health score and symptoms to the OSS.
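To illustrate the aggregation idea, here is the rough shape of what an agent 
might export instead of raw metrics: a health score plus the symptoms that 
explain it.  The field names and values are invented for this sketch and are 
not taken from the SAIN YANG module:

```python
# Illustrative agent report: information (score + symptoms), not raw data.
# Field names below are made up for this sketch, not from the SAIN drafts.
import json

report = {
    "subservice": "interface pe1 GigabitEthernet0/0/0",  # hypothetical object
    "health-score": 60,
    "symptoms": [
        {"id": "input-errors-detected", "health-score-weight": 40},
    ],
}
print(json.dumps(report, indent=2))
```

A handful of such reports per component is far cheaper to stream and correlate 
than the full set of underlying counters.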
In the end, what do operational teams care about?
    1. knowing that an interface, a router, or part of the network works fine 
... until told otherwise
    2. collecting all the metrics in a big data lake to draw the same or better 
conclusions
Ideally we need both, but we face two schools here.  I'm more of the school of 
providing information, as opposed to masses of data.  This would reduce the 
cost of managing networks.

Regards, Benoit
_______________________________________________
OPSAWG mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/opsawg