Hi Alex,
Thanks for engaging.
Hi Benoit,
I have seen your presentations on Service Assurance for Intent-Based
Networking Architecture and read your drafts with interest
(draft-claise-opsawg-service-assurance-yang-05 and
draft-claise-opsawg-service-assurance-architecture-03). Interesting
stuff on which I do have a couple of comments.
The basis for the drafts is in essence a proposal for Model-Based
Reasoning (MBR), in which you capture dependencies between objects and
make inferences by traversing the corresponding graph. MBR based on
dependency graphs makes it possible to reason about how the status or
health of one object impacts, and propagates to, the status or health
of dependent objects “downstream” from it. Likewise, traversing the
same graph in the opposite direction (from the “downstream” or
dependent objects) allows you to identify potential root causes for
symptoms observed by those objects, although this seems to be less
your focus.
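For concreteness, the two traversal directions described above might be
sketched roughly as follows. This is only an illustration of the idea,
not of the drafts' actual mechanism: the component names, the 0..100
health scale, and the min() combination rule are all made-up
assumptions.

```python
# Hypothetical dependency graph: each object lists the objects it
# depends on. Leaf objects carry directly measured health (0..100).
DEPS = {
    "e2e-service": ["tunnel"],
    "tunnel": ["if-A", "if-B"],
    "if-A": [],
    "if-B": [],
}
MEASURED = {"if-A": 100, "if-B": 40}

def derived_health(node):
    """Propagate health 'downstream': here a node is only as healthy
    as its least healthy dependency (one possible rule among many)."""
    if not DEPS[node]:
        return MEASURED[node]
    return min(derived_health(d) for d in DEPS[node])

def root_causes(node):
    """Walk the same graph in the opposite direction to collect the
    unhealthy leaf components that explain a degraded node."""
    if not DEPS[node]:
        return [node] if MEASURED[node] < 100 else []
    causes = []
    for d in DEPS[node]:
        causes.extend(root_causes(d))
    return causes

print(derived_health("e2e-service"))  # 40
print(root_causes("e2e-service"))     # ['if-B']
```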
While MBR as a concept makes sense and has a long tradition in network
management, it also comes with a number of considerable issues, and I
was wondering about your perspective and mitigation strategies for
these. For one, its effectiveness depends on the model being
“complete”. In most cases, there is a myriad of interdependencies
which are difficult to capture comprehensively. The model is still
useful for many applications as a starting point, but rarely captures
the full reality. As long as users are clear about that, this is not
an issue.
Point taken about the myriad of interdependencies and graph
completeness. As you observe, the graph is useful even if it is not
complete, especially when we can assure the (networking) components
within the assurance graph. That way, the graph will tell us where the
problem is not, which is equally important as telling us where the
problem is or might be ... assuming, obviously, that we have complete
heuristics for assuring that component ... which implies that the
heuristics need to improve over time.
However, the one thing about your model that concerns me a bit is that
you use it to draw conclusions about the health of the dependent
objects (for example, your end-to-end service). A derived health score
is no substitute for monitoring the actual health, and should not lull
users into a false sense of security that, as long as they monitor the
components of a system or service, they don’t need to be concerned
with monitoring the system or service as a whole. In reality I believe
the value (although there still is value) is more limited than that.
I believe this should be clearly acknowledged and discussed in the
drafts.
This is the exact reason why I wrote in the slides: "This complements
the end-to-end synthetic testing"
Indeed, the way service assurance is usually done is with end-to-end
probing: OWAMP/TWAMP/IP SLA with thresholds on delay, packet loss,
jitter, etc. When the SLA degrades, the end-to-end probing can't
really tell which component in the network degraded (granted, there
are exceptions). The network is viewed as a black box. Combining the
inferred health scores from the assurance graph with the end-to-end
probing provides the correlation required to turn that black box into
more of a crystal ball.
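As a toy illustration of that correlation, one could join a degraded
end-to-end probe result with the per-component health scores from the
graph. The component names and the 80-point threshold below are
invented for the example; nothing here comes from the drafts or from
any TWAMP implementation.

```python
# Hypothetical inferred health scores for components on the probed path.
COMPONENT_HEALTH = {"PE1": 100, "P-core": 55, "PE2": 100}

def localize(e2e_ok, health, threshold=80):
    """Correlate an end-to-end probe verdict with component health:
    if the SLA degrades, the graph points at the suspect components;
    if the SLA is fine, degraded components are merely a watch-list."""
    suspects = [c for c, h in health.items() if h < threshold]
    if not e2e_ok:
        return ("SLA degraded; suspects:", suspects)
    return ("SLA fine; watch-list:", suspects)

print(localize(False, COMPONENT_HEALTH))
```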
Point very well taken, "This complements the end-to-end synthetic
testing" concept is not mentioned in the draft. I will add it. Thanks.
A second set of issues concerns the effort of maintaining the graph
and of continuously updating the dependencies. In a realistic system
you will have many objects with even more interdependencies.
Maintaining derived health state can become computationally very
expensive, which suggests a number of mitigation strategies: for one,
don’t maintain this continuously but compute it only “on demand”.
Yes, that's one way.
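One simple reading of the "on demand" option is lazy, memoized
evaluation: nothing is derived until a client asks, and shared
sub-graphs are evaluated once per request. A minimal sketch, with
made-up node names and, again, min() as an illustrative combination
rule:

```python
from functools import lru_cache

# Hypothetical graph and leaf measurements.
DEPS = {"service": ("link1", "link2"), "link1": (), "link2": ()}
MEASURED = {"link1": 90, "link2": 70}

@lru_cache(maxsize=None)
def health(node):
    """Derived health, computed only when queried and memoized so a
    node shared by several services is evaluated once."""
    deps = DEPS[node]
    if not deps:
        return MEASURED[node]
    return min(health(d) for d in deps)

# No derived state is maintained until a query arrives:
print(health("service"))  # 70
# On a metric update, the cache would simply be invalidated:
health.cache_clear()
```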
Second, perhaps don’t maintain this on the server at all, at least to
the extent that you expect the server to be a networking device. It
seems much more feasible to perform this type of Model-Based Reasoning
computation in an Operations Support System or application outside the
network, not within the network. However, it is not clear that YANG
models and NETCONF/RESTCONF would be applied there. It seems to me the
drafts should clarify where those models would be expected to be
deployed and how they would be kept updated. As an OSS tool, your
proposal makes sense, but trying to process this on networking devices
strikes me as very heavy, in particular given the limitations per the
earlier point. So, IMHO you may want to consider adding a section that
discusses these aspects, specifically in the architecture draft.
The architecture, with the YANG module, is actually designed to cover
distributed graphs.
We can stream all metrics (whether YANG leaves, MIB variables, CLI
output, syslog, what have you) to an OSS, sure.
However, I believe in data aggregation, as we know that we're quickly
going to reach the limits of the streaming capabilities.
And I also believe in each component being responsible for its own
assurance, to the best of its knowledge.
Hence the proposal to go via a SAIN agent, inside or outside a router,
to send the inferred health score and symptoms to the OSS.
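To make the aggregation point tangible: instead of streaming every raw
metric, the agent pushes one small, already-interpreted update per
subservice. The field names below are purely illustrative and are not
the actual leaves of the SAIN YANG module.

```python
import json

# Hypothetical shape of an agent's northbound update: the inferred
# health score plus the symptoms that explain it, rather than the
# raw counters those symptoms were derived from.
update = {
    "subservice": "interface:PE1/GigabitEthernet0/0/1",  # made-up id
    "health-score": 60,
    "symptoms": [
        {"id": "input-errors-rising", "weight": 40},  # made-up symptom
    ],
}

# One compact JSON object replaces a continuous stream of metrics.
print(json.dumps(update))
```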
In the end, what do operational teams care about?
1. knowing that an interface, a router, or part of the network works
fine ... until they are told otherwise
2. collecting all the metrics in a big data lake to draw the same
or better conclusions
Ideally we need both, but we face two schools of thought here. I'm
more in the school of providing information, as opposed to providing
that much data. This would reduce the cost of managing networks.
Regards, Benoit
_______________________________________________
OPSAWG mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/opsawg