Wow. There's a lot here. Some of my reactions:
- It's good to increase visibility.

- I don't know much about the importance of different kinds of visibility
  or what kinds of tools will consume them downstream. I don't know the
  ultimate goals.

- OVS doesn't currently implement D-Bus. I don't know anything about
  D-Bus, such as how much work it is to implement, whether OVS would need
  new library dependencies or how demanding those are, or whether it could
  be cross-platform (i.e. also support Windows).

- One can introduce new individual features for tracking different kinds
  of drops. One can also introduce different frameworks for reporting them
  and alerting/alarming on them. I guess that these can probably be
  separated.

- There are multiple levels at which drops can happen or be detected. I
  wonder whether all of these can be addressed by a single framework.

Do you have an idea for next steps? Sometimes it helps to have a specific
proposal to discuss, even if it is a straw man. Maybe writing something up
would help.

On Fri, Jun 28, 2019 at 11:10:14AM +0530, Gowrishankar Muthukrishnan wrote:
> Today (*), when a packet's journey in the data path is disrupted, leading
> toward a drop, we have OVS counters to detect it automatically and report
> it on request via user space commands. Some categories of drops are
> related to interfaces and can be queried from the OVS DB table for that
> interface [2], while others are available in real time from the data path
> through the respective OVS commands (e.g. ovs-appctl coverage/show as in
> [3] and ovs-appctl dpctl/show as in [4]). It is unavoidable that the drop
> stats are split across multiple sources, but at the end of the day the
> user has to query in different ways to figure out:
> (1) that there is a packet drop, and
> (2) the reason for the drop,
> and meanwhile (3) may miss a precious opportunity to adjust the available
> resources in the data path to prevent further drops.
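[As a concrete starting point for the stats collection discussed above, a
minimal sketch that scrapes drop-related counters out of "ovs-appctl
coverage/show"-style output. The exact output format differs between OVS
versions, so the sample layout below is an assumption for illustration,
not a definitive parser.]

```python
import re

def parse_drop_counters(coverage_output):
    """Return {counter_name: total} for counters whose name mentions 'drop'."""
    counters = {}
    for line in coverage_output.splitlines():
        # Assumed line shape: "<name>  <rate columns...>  total: <count>"
        m = re.match(r"\s*(\S+)\s+.*total:\s*(\d+)", line)
        if m and "drop" in m.group(1):
            counters[m.group(1)] = int(m.group(2))
    return counters

# Hypothetical sample output; real counter names and columns vary.
sample = """\
drop_action_of_pipeline    0.0/sec   0.000/sec   0.0000/sec   total: 12
upcall_flow_limit_hit      0.0/sec   0.000/sec   0.0000/sec   total: 0
datapath_drop_upcall_error 0.0/sec   0.000/sec   0.0000/sec   total: 3
"""
print(parse_drop_counters(sample))
```

Something like this is what a collectd-style agent would have to do today
by polling; the point of the proposal below is to avoid the polling.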
> To ease the difficulty of monitoring these data, we already have tools
> such as collectd [1] to record the events, but IMHO there is a slight
> mismatch between what we have today and what we develop upstream:
> collectd can learn about packet drops only in the context of the
> Interface table. The other categories of drops (related to QoS, metering,
> tunnels, upcalls, recirculation, MTU mismatch, and even invalid packets)
> cannot be monitored by collectd, because neither an association with the
> Interface table nor a separate table exists today.
>
> There is only an indirect association. For example, Flow_Table represents
> all the packet flow rules, and when a packet is dropped, one can check
> Flow_Table for a drop action, but that is not a unified way to quickly
> detect the problem and correct resources. Thanks to our developers, these
> drops are recorded in some form now, but in the field the time to recover
> from the drops easily elapses, given that these stats first have to be
> collected, then analyzed by experts, and only then can a recovery action
> be applied. Also, there can be a pressing need to keep packet drops to a
> very small number of packets per million (ppm).
>
> Hence, I would like to request suggestions from experts on how we can
> handle this situation in OVS. My humble ideas are below.
>
> (1) Unify the data collection into a common place:
> We can think of having a separate datapath table to record the necessary
> context for a packet (drop reason and its count, to start with). This
> would require only minimal changes for the ecosystem (such as collectd)
> to sync. A workaround until then is to continue using the existing tables
> wherever possible, adding a statistics row where one does not exist.
>
> (2) Notify drops very soon, or never!
> Instead of detecting DB record updates (even after (1) above), with the
> latency of DB transactions, to stay in sync with real-time data, why not
> have OVS generate events for the consuming ecosystem proactively?
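[The proactive-notification idea raises the rate-limiting question the
proposal itself mentions: don't broadcast per packet, but only when a
user-set drops-per-million budget is exceeded. A minimal sketch of such a
gate; the function name and threshold are hypothetical, and a real
implementation would hook this into whatever event mechanism is chosen.]

```python
def should_notify(dropped, total, allowed_ppm):
    """True if the drop rate exceeds allowed_ppm drops per million packets."""
    if total == 0:
        return False
    # Compare dropped/total against allowed_ppm/1e6 using integers only,
    # to avoid floating-point rounding at very small drop rates.
    return dropped * 1_000_000 > allowed_ppm * total

print(should_notify(5, 10_000_000, allowed_ppm=10))    # 0.5 ppm: under budget
print(should_notify(500, 10_000_000, allowed_ppm=10))  # 50 ppm: over budget
```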
> I can think of D-Bus, for instance, to broadcast packet drop
> notifications. As a disclaimer, I'm not a D-Bus expert :) but it is just
> an idea to brainstorm.
>
> An analogy in terms of the CLI (though using the library would be
> better):
>
> (Broadcasting an event for every packet may exhaust resources in the
> notification chain. Instead, follow guidelines set by the user, e.g. the
> allowable drop ppm above, or even wait for a signal from a registered
> monitoring agent on D-Bus before enabling the broadcast.)
>
> OVS:
>   dbus-send --system --dest=net.ovsmon /net/ovsmon/Datapath \
>       net.ovsmon.Datapath.SetProperty string:Qfull \
>       variant:string:<port_name_that_packet_arrived>
> Monitor:
>   dbus-monitor --system "type='signal',interface='net.ovsmon.Datapath'"
>   signal sender=net.ovsmon.Datapath -> dest=:1.102 \
>       path=/net/ovsmon/Datapath; interface=net.ovsmon.Datapath; member=Qfull
>       string "vhost-port-1"
>
> Monitor:
>   dbus-send --system --dest=net.ovsmon /net/ovsmon/Interface \
>       net.ovsmon.Interface.SetProperty string:<port_name> \
>       variant:string:"queue_size=<new value>"
> OVS: <monitor and apply the corrective action>
>
> If this sounds good, I can think further about prototyping it for a
> better demonstration; otherwise, please suggest a better approach.
>
> * The patches below are upstream, accepted or under review at present:
> [1] https://wiki.opnfv.org/display/fastpath/Open+vSwitch+plugins+High+Level+Design
> [2] https://patchwork.ozlabs.org/patch/1123287/
> [3] https://patchwork.ozlabs.org/patch/1111568/
> [4] https://patchwork.ozlabs.org/patch/1115978/
> [5] http://www.openvswitch.org//ovs-vswitchd.conf.db.5.pdf
>
> The respective developers from the above mail chains are CC'd, but
> others are also welcome. Also, I think ovs-dev is the appropriate ML for
> this discussion.
>
> Kind regards,
> Gowrishankar M

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
