Hi Benoit!
Thanks for bringing the issue of service assurance to the table. More work is 
needed on this topic.

Some high-level comments on the drafts:

The drafts present a YANG module for performing reasoning across a “generic” 
service tree.
This has existed in assurance systems for a long time: inventory-based systems 
as well as fault managers had modules for this, e.g. Micromuse Impact, OpenView 
Service Navigator, etc.
The overall idea is to feed KPIs, events, alarms, etc. into the tree and 
reason upwards to do service impact analysis and downwards to do root-cause 
analysis.
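To make that two-direction reasoning concrete, here is a toy sketch (my own illustration, not taken from the drafts — `Node`, `impact`, and `root_causes` are hypothetical names): symptoms attach to subservices, an upward walk answers "is the service impacted?", and a downward walk collects the deepest unhealthy subservices as root-cause candidates.

```python
# Illustrative sketch only: a minimal service tree with upward service impact
# analysis and downward root-cause analysis. All names are hypothetical.

class Node:
    def __init__(self, name, children=(), symptoms=()):
        self.name = name
        self.children = list(children)
        self.symptoms = list(symptoms)  # e.g. alarms, KPI violations, events

    def healthy(self):
        return not self.symptoms

def impact(node):
    """Upward reasoning: a service is impacted if it or any dependency is unhealthy."""
    return (not node.healthy()) or any(impact(c) for c in node.children)

def root_causes(node):
    """Downward reasoning: the deepest unhealthy subservices explain the fault."""
    causes = [c for child in node.children for c in root_causes(child)]
    if causes:
        return causes
    return [node.name] if not node.healthy() else []

# Example: a VPN service depending on a tunnel that depends on a faulty link.
link = Node("link-eth0", symptoms=["link-down alarm"])
tunnel = Node("tunnel-1", children=[link])
vpn = Node("l3vpn-acme", children=[tunnel])

print(impact(vpn))       # the service is impacted by the leaf symptom
print(root_causes(vpn))  # the link is the root-cause candidate
```

Even this toy version shows where the difficulty lives: the interesting knowledge is in how dependencies are drawn and weighted, not in the propagation itself.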

I have some concerns based on the above:
1) The draft should be renamed. Claiming that a service tree is *the* 
architecture for intent-based network assurance is maybe too ambitious. There 
are so many other things needed for service assurance in intent-based 
networking:
- how to represent service tests and service SLA monitoring as part of the 
intent
- how to monitor the data-plane as such
- how to represent closed-loop policies
- and more
So I think the draft should be renamed to reflect its relevant, more limited 
scope: a "YANG data model for representing generic service trees", or 
something similar.

2) Looking back at practical experience with the above-mentioned tools, it was 
very hard to get the approach to work (or someone else out there is smarter?). 
The dependencies between services/subservices are subtle. Either the 
dependencies get too coarse-grained, so that everything turns red, or too 
fine-grained, so that there are too few hits. Who is to express this knowledge 
in a large multi-vendor network? The classical service impact tools attached 
fairly advanced algorithms to the graph to try to capture this, not just a 
dependency link. It is more or less as complex as the knowledge acquisition 
problem for classical rule-based AI: how many dependencies do we need to 
express until we can do a good job in the service tree?

Since the service tree is configuration data in the model, I guess the 
assumption is that the orchestrator sets all of this up. But in many cases 
this knowledge is "hidden" in other domains, such as networking protocols and 
vendor-specific details. Also, with networks becoming more dynamic and 
virtual, it is hard to see how the service structures can be statically 
maintained as *configuration*.

Another underlying challenge is that most network problems are 
configuration-related and not detected by device instrumentation.
Suboptimal QoS configuration, firewall rules, etc. are not detected by the 
firewall or router; no alarms are raised.

This is hard for you to comment on, I know; it is just a reality check.

3) Relationship to other YANG service models.
In your approach the service tree is a separate tree from the concrete service 
trees, like the L3VPN service model.
Have you considered augmenting these concrete models with generic assurance 
state and dependency information instead of maintaining a separate tree? 
Maintaining parallel trees might result in inconsistencies in the end.

4) Relationship to the Alarm YANG module (RFC 8632).
There are several opportunities to reuse definitions and concepts from 
RFC 8632:
- You could add alarms to your module according to the service tree; see 
especially Section 3.6 (Root Cause, Impacted Resources).
- You could use alarm-types as one kind of symptom (there are many others, 
like active measurements with TWAMP, etc.).
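To illustrate that last bullet, a symptom attached to a subservice could be polymorphic over its source, with an RFC 8632 alarm-type identity as one kind alongside active-measurement results. This is a hypothetical sketch of mine, not something defined in the drafts or in RFC 8632; the class and field names are invented for illustration.

```python
# Illustrative only: two kinds of symptom feeding the same health evaluation.
# An RFC 8632 alarm type is one source; a TWAMP measurement is another.
from dataclasses import dataclass

@dataclass
class AlarmSymptom:
    alarm_type_id: str   # identity from the RFC 8632 alarm-type hierarchy
    severity: str

@dataclass
class MeasurementSymptom:
    metric: str          # e.g. TWAMP two-way delay
    value_ms: float
    threshold_ms: float

def is_degraded(symptom):
    """A subservice is degraded if any attached symptom fires."""
    if isinstance(symptom, AlarmSymptom):
        return symptom.severity in ("major", "critical")
    return symptom.value_ms > symptom.threshold_ms

syms = [AlarmSymptom("link-alarm", "major"),
        MeasurementSymptom("twamp-two-way-delay", 42.0, 100.0)]
print([is_degraded(s) for s in syms])  # only the alarm symptom fires here
```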

Hope this helps to flesh out more details in your work.
br Stefan

_______________________________________________
OPSAWG mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/opsawg
